
[4/4] vl: Prioritize realizations of devices

Message ID 20210818194318.110993-1-peterx@redhat.com
State New
Series vl: Prioritize device realizations

Commit Message

Peter Xu Aug. 18, 2021, 7:43 p.m. UTC
QEMU creates -device objects in the order specified on the user's cmdline.
However, that ordering may not be ideal.  For example, some platform devices
(vIOMMUs) may need to be created earlier than most other devices (e.g.,
vfio-pci, virtio).

This patch sorts the QemuOptsList of '-device' options before kicking off
device realization.  This allows the device realization code to use APIs like
pci_device_iommu_address_space() correctly, because those functions rely on the
platform devices being realized.

For now we rely on vmsd->priority, which is defined as MigrationPriority, to
provide the ordering, since both VM init and migration completion need such an
ordering.  In the future we can move that priority information out of vmsd.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 softmmu/vl.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)
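
For readers following along, a minimal, self-contained sketch of the idea the
patch implements: collect the -device options, sort them by a migration-style
priority, then realize them in that order.  The actual patch sorts the real
QemuOptsList in vl.c; the DeviceEntry structure and PRI_* values below are
assumptions for illustration only.

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative priorities, mirroring MigrationPriority in spirit. */
    enum { PRI_DEFAULT = 0, PRI_IOMMU = 3 };

    typedef struct {
        const char *driver;   /* value of -device <driver>,... */
        int priority;         /* taken from the device's vmsd in the real patch */
        size_t cmdline_pos;   /* position of the option on the command line */
    } DeviceEntry;

    /* Higher priority realizes first; ties are left to qsort(), which is
     * exactly the point Eduardo raises in the comments below. */
    static int compare_by_priority(const void *a, const void *b)
    {
        const DeviceEntry *da = a, *db = b;
        return db->priority - da->priority;
    }

    int main(void)
    {
        DeviceEntry devs[] = {
            { "vfio-pci",       PRI_DEFAULT, 0 },
            { "intel-iommu",    PRI_IOMMU,   1 },
            { "virtio-net-pci", PRI_DEFAULT, 2 },
        };
        size_t n = sizeof(devs) / sizeof(devs[0]);

        qsort(devs, n, sizeof(devs[0]), compare_by_priority);

        for (size_t i = 0; i < n; i++) {
            printf("realize %s\n", devs[i].driver);  /* intel-iommu comes first */
        }
        return 0;
    }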

Comments

Eduardo Habkost Aug. 23, 2021, 6:49 p.m. UTC | #1
On Wed, Aug 18, 2021 at 03:43:18PM -0400, Peter Xu wrote:
> QEMU creates -device objects in order as specified by the user's cmdline.
> However that ordering may not be the ideal order.  For example, some platform
> devices (vIOMMUs) may want to be created earlier than most of the rest
> devices (e.g., vfio-pci, virtio).
> 
> This patch orders the QemuOptsList of '-device's so they'll be sorted first
> before kicking off the device realizations.  This will allow the device
> realization code to be able to use APIs like pci_device_iommu_address_space()
> correctly, because those functions rely on the platfrom devices being realized.
> 
> Now we rely on vmsd->priority which is defined as MigrationPriority to provide
> the ordering, as either VM init and migration completes will need such an
> ordering.  In the future we can move that priority information out of vmsd.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Can we be 100% sure that changing the ordering of every single
device being created won't affect guest ABI?  (I don't think we can)

How many device types in QEMU have non-default vmsd priority?

Can we at least ensure devices with the same priority won't be
reordered, just to be safe?  (qsort() doesn't guarantee that)

If very few device types have non-default vmsd priority and
devices with the same priority aren't reordered, the risk of
compatibility breakage would be much smaller.
Peter Xu Aug. 23, 2021, 7:18 p.m. UTC | #2
On Mon, Aug 23, 2021 at 02:49:12PM -0400, Eduardo Habkost wrote:
> On Wed, Aug 18, 2021 at 03:43:18PM -0400, Peter Xu wrote:
> > QEMU creates -device objects in order as specified by the user's cmdline.
> > However that ordering may not be the ideal order.  For example, some platform
> > devices (vIOMMUs) may want to be created earlier than most of the rest
> > devices (e.g., vfio-pci, virtio).
> > 
> > This patch orders the QemuOptsList of '-device's so they'll be sorted first
> > before kicking off the device realizations.  This will allow the device
> > realization code to be able to use APIs like pci_device_iommu_address_space()
> > correctly, because those functions rely on the platfrom devices being realized.
> > 
> > Now we rely on vmsd->priority which is defined as MigrationPriority to provide
> > the ordering, as either VM init and migration completes will need such an
> > ordering.  In the future we can move that priority information out of vmsd.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Can we be 100% sure that changing the ordering of every single
> device being created won't affect guest ABI?  (I don't think we can)

That's a good question, but I doubt there's any real-world guest ABI that
depends on it.  As a developer, I normally specify cmdline parameters in an
ad-hoc way, so I assume most parameters are not sensitive to ordering and I can
tune the ordering as I wish.  I'm not sure whether that's common for qemu
users; I would expect so, but I may have missed something.

To my knowledge the only "guest ABI" change is e.g. when "vfio-pci" is
specified before "intel-iommu": that setup is consistently broken before this
patchset and works after it.  I just don't think that kind of "guest ABI" needs
to be kept, and that's exactly what I want to fix with the patchset..

> 
> How many device types in QEMU have non-default vmsd priority?

Not that many; here's the list of priorities and the devices using them:

       |--------------------+---------|
       | priority           | devices |
       |--------------------+---------|
       | MIG_PRI_IOMMU      |       3 |
       | MIG_PRI_PCI_BUS    |       7 |
       | MIG_PRI_VIRTIO_MEM |       1 |
       | MIG_PRI_GICV3_ITS  |       1 |
       | MIG_PRI_GICV3      |       1 |
       |--------------------+---------|

All the other devices use the default (0) priority.

> 
> Can we at least ensure devices with the same priority won't be
> reordered, just to be safe?  (qsort() doesn't guarantee that)
> 
> If very few device types have non-default vmsd priority and
> devices with the same priority aren't reordered, the risk of
> compatibility breakage would be much smaller.

I'm also wondering whether it might actually be a good thing if this change
breaks some guest ABI.

Let's imagine something breaks after this is applied; the only possible reason
would be that qsort() changed the order of some same-priority devices so it no
longer matches what the user specified.  Doesn't that also mean there's yet
another ordering requirement that we didn't even notice?

I doubt that will even happen (otherwise I'd expect reports already, as the
qemu man page states no requirement on parameter ordering).  In any case,
instead of "keeping same-priority devices in the order the user specified",
IMHO we should give the broken devices different priorities so the ordering is
guaranteed by qemu internals rather than by how the user specified it.

From that point of view, maybe this patchset would best be accepted and applied
early in a release cycle, so we can figure out what's missing and fix it within
the same release.  However, again, I still doubt whether any user will break in
a bad way.

Thanks,
Eduardo Habkost Aug. 23, 2021, 9:07 p.m. UTC | #3
On Mon, Aug 23, 2021 at 03:18:51PM -0400, Peter Xu wrote:
> On Mon, Aug 23, 2021 at 02:49:12PM -0400, Eduardo Habkost wrote:
> > On Wed, Aug 18, 2021 at 03:43:18PM -0400, Peter Xu wrote:
> > > QEMU creates -device objects in order as specified by the user's cmdline.
> > > However that ordering may not be the ideal order.  For example, some platform
> > > devices (vIOMMUs) may want to be created earlier than most of the rest
> > > devices (e.g., vfio-pci, virtio).
> > > 
> > > This patch orders the QemuOptsList of '-device's so they'll be sorted first
> > > before kicking off the device realizations.  This will allow the device
> > > realization code to be able to use APIs like pci_device_iommu_address_space()
> > > correctly, because those functions rely on the platfrom devices being realized.
> > > 
> > > Now we rely on vmsd->priority which is defined as MigrationPriority to provide
> > > the ordering, as either VM init and migration completes will need such an
> > > ordering.  In the future we can move that priority information out of vmsd.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > Can we be 100% sure that changing the ordering of every single
> > device being created won't affect guest ABI?  (I don't think we can)
> 
> That's a good question, however I doubt whether there's any real-world guest
> ABI for that.  As a developer, I normally specify cmdline parameter in an adhoc
> way, so that I assume most parameters are not sensitive to ordering and I can
> tune the ordering as wish.  I'm not sure whether that's common for qemu users,
> I would expect so, but I may have missed something that I'm not aware of.

To give just one example:

$ (echo 'info pci';echo quit;) | qemu-system-x86_64 -device virtio-net-pci -device e1000e -monitor stdio | tail -n 20
  Bus  0, device   4, function 0:
    Ethernet controller: PCI device 1af4:1000
      PCI subsystem 1af4:0001
      IRQ 0, pin A
      BAR0: I/O at 0xffffffffffffffff [0x001e].
      BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
      BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
      BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
      id ""
  Bus  0, device   5, function 0:
    Ethernet controller: PCI device 8086:10d3
      PCI subsystem 8086:0000
      IRQ 0, pin A
      BAR0: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
      BAR1: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
      BAR2: I/O at 0xffffffffffffffff [0x001e].
      BAR3: 32 bit memory at 0xffffffffffffffff [0x00003ffe].
      BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
      id ""
(qemu) quit
$ (echo 'info pci';echo quit;) | qemu-system-x86_64 -device e1000e -device virtio-net-pci -monitor stdio | tail -n 20
  Bus  0, device   4, function 0:
    Ethernet controller: PCI device 8086:10d3
      PCI subsystem 8086:0000
      IRQ 0, pin A
      BAR0: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
      BAR1: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
      BAR2: I/O at 0xffffffffffffffff [0x001e].
      BAR3: 32 bit memory at 0xffffffffffffffff [0x00003ffe].
      BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
      id ""
  Bus  0, device   5, function 0:
    Ethernet controller: PCI device 1af4:1000
      PCI subsystem 1af4:0001
      IRQ 0, pin A
      BAR0: I/O at 0xffffffffffffffff [0x001e].
      BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
      BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
      BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
      id ""
(qemu) quit


If the order of the -device arguments changes, the devices are assigned to
different PCI slots.


> 
> Per my knowledge the only "guest ABI" change is e.g. when we specify "vfio-pci"
> to be before "intel-iommu": it'll be constantly broken before this patchset,
> while after this series it'll be working.  It's just that I don't think those
> "guest ABI" is necessary to be kept, and that's exactly what I want to fix with
> the patchset..

If the only ordering changes caused by this patch were intentional and affected
only configurations that are known to be broken (like vfio-pci vs intel-iommu),
I would agree.

However, if we are reordering every single -device option in an unspecified way
(like qsort() does when elements compare as equal), we are probably breaking
guest ABI and creating a completely different machine (like in the PCI example above).


> 
> > 
> > How many device types in QEMU have non-default vmsd priority?
> 
> Not so much; here's the list of priorities and the devices using it:
> 
>        |--------------------+---------|
>        | priority           | devices |
>        |--------------------+---------|
>        | MIG_PRI_IOMMU      |       3 |
>        | MIG_PRI_PCI_BUS    |       7 |
>        | MIG_PRI_VIRTIO_MEM |       1 |
>        | MIG_PRI_GICV3_ITS  |       1 |
>        | MIG_PRI_GICV3      |       1 |
>        |--------------------+---------|
> 
> All the rest devices are using the default (0) priority.
> 
> > 
> > Can we at least ensure devices with the same priority won't be
> > reordered, just to be safe?  (qsort() doesn't guarantee that)
> > 
> > If very few device types have non-default vmsd priority and
> > devices with the same priority aren't reordered, the risk of
> > compatibility breakage would be much smaller.
> 
> I'm also wondering whether it's a good thing to break some guest ABI due to
> this change, if possible.
> 
> Let's imagine something breaks after applied, then the only reason should be
> that qsort() changed the order of some same-priority devices and it's not the
> same as user specified any more.  Then, does it also means there's yet another
> ordering requirement that we didn't even notice?

Does the PCI slot assignment example above demonstrate an ordering requirement?
I don't think it does.  It just demonstrates that different orderings are
expected to give different results, but both orderings are valid.


> 
> I doubt whether that'll even happen (or I think there'll be report already, as
> in qemu man page there's no requirement on parameter ordering).  In all cases,
> instead of "keeping the same priority devices in the same order as the user has
> specified", IMHO we should make the broken devices to have different priorities
> so the ordering will be guaranteed by qemu internal, rather than how user
> specified it.
> 
> From that pov, maybe this patchset would be great if it can be accepted and
> applied in early stage of a release? So we can figure out what's missing and
> fix them within the same release.  However again I still doubt whether there's
> any user that will break in a bad way.
> 
> Thanks,
> 
> -- 
> Peter Xu
>
Peter Xu Aug. 23, 2021, 9:31 p.m. UTC | #4
On Mon, Aug 23, 2021 at 05:07:03PM -0400, Eduardo Habkost wrote:
> To give just one example:
> 
> $ (echo 'info pci';echo quit;) | qemu-system-x86_64 -device virtio-net-pci -device e1000e -monitor stdio | tail -n 20
>   Bus  0, device   4, function 0:
>     Ethernet controller: PCI device 1af4:1000
>       PCI subsystem 1af4:0001
>       IRQ 0, pin A
>       BAR0: I/O at 0xffffffffffffffff [0x001e].
>       BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
>       BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
>       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
>       id ""
>   Bus  0, device   5, function 0:
>     Ethernet controller: PCI device 8086:10d3
>       PCI subsystem 8086:0000
>       IRQ 0, pin A
>       BAR0: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
>       BAR1: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
>       BAR2: I/O at 0xffffffffffffffff [0x001e].
>       BAR3: 32 bit memory at 0xffffffffffffffff [0x00003ffe].
>       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
>       id ""
> (qemu) quit
> $ (echo 'info pci';echo quit;) | qemu-system-x86_64 -device e1000e -device virtio-net-pci -monitor stdio | tail -n 20
>   Bus  0, device   4, function 0:
>     Ethernet controller: PCI device 8086:10d3
>       PCI subsystem 8086:0000
>       IRQ 0, pin A
>       BAR0: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
>       BAR1: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
>       BAR2: I/O at 0xffffffffffffffff [0x001e].
>       BAR3: 32 bit memory at 0xffffffffffffffff [0x00003ffe].
>       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
>       id ""
>   Bus  0, device   5, function 0:
>     Ethernet controller: PCI device 1af4:1000
>       PCI subsystem 1af4:0001
>       IRQ 0, pin A
>       BAR0: I/O at 0xffffffffffffffff [0x001e].
>       BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
>       BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
>       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
>       id ""
> (qemu) quit
> 
> 
> If the order of the -device arguments changes, the devices are assigned to
> different PCI slots.

Thanks for the example.

Initially I thought about this and didn't consider it an issue (because serious
users will always specify addr=XXX for -device; I thought libvirt always does
that), but I do remember that a guest OS can identify its hardware config by
devfn number, so e.g. nmcli may indeed mix up its config before/after this
change..

I can replace qsort() with a custom sort to guarantee that.

Do you have other examples in mind that I may have overlooked, especially ones
that I may not be able to fix with a custom sort that only moves priority>=1
devices?

Thanks,
Michael S. Tsirkin Aug. 23, 2021, 9:54 p.m. UTC | #5
On Mon, Aug 23, 2021 at 05:31:46PM -0400, Peter Xu wrote:
> On Mon, Aug 23, 2021 at 05:07:03PM -0400, Eduardo Habkost wrote:
> > To give just one example:
> > 
> > $ (echo 'info pci';echo quit;) | qemu-system-x86_64 -device virtio-net-pci -device e1000e -monitor stdio | tail -n 20
> >   Bus  0, device   4, function 0:
> >     Ethernet controller: PCI device 1af4:1000
> >       PCI subsystem 1af4:0001
> >       IRQ 0, pin A
> >       BAR0: I/O at 0xffffffffffffffff [0x001e].
> >       BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
> >       BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
> >       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
> >       id ""
> >   Bus  0, device   5, function 0:
> >     Ethernet controller: PCI device 8086:10d3
> >       PCI subsystem 8086:0000
> >       IRQ 0, pin A
> >       BAR0: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
> >       BAR1: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
> >       BAR2: I/O at 0xffffffffffffffff [0x001e].
> >       BAR3: 32 bit memory at 0xffffffffffffffff [0x00003ffe].
> >       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
> >       id ""
> > (qemu) quit
> > $ (echo 'info pci';echo quit;) | qemu-system-x86_64 -device e1000e -device virtio-net-pci -monitor stdio | tail -n 20
> >   Bus  0, device   4, function 0:
> >     Ethernet controller: PCI device 8086:10d3
> >       PCI subsystem 8086:0000
> >       IRQ 0, pin A
> >       BAR0: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
> >       BAR1: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
> >       BAR2: I/O at 0xffffffffffffffff [0x001e].
> >       BAR3: 32 bit memory at 0xffffffffffffffff [0x00003ffe].
> >       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
> >       id ""
> >   Bus  0, device   5, function 0:
> >     Ethernet controller: PCI device 1af4:1000
> >       PCI subsystem 1af4:0001
> >       IRQ 0, pin A
> >       BAR0: I/O at 0xffffffffffffffff [0x001e].
> >       BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
> >       BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
> >       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
> >       id ""
> > (qemu) quit
> > 
> > 
> > If the order of the -device arguments changes, the devices are assigned to
> > different PCI slots.
> 
> Thanks for the example.
> 
> Initially I thought about this and didn't think it an issue (because serious
> users will always specify addr=XXX for -device; I thought libvirt always does
> that), but I do remember that guest OS could identify its hardware config with
> devfn number, so nmcli may mess up its config with before/after this change
> indeed..
> 
> I can use a custom sort to replace qsort() to guarantee that.


You don't have to do that. Simply use the device position on the command
line for comparisons when priority is the same.


> Do you have other examples in mind that I may have overlooked, especially I may
> not be able to fix by a custom sort with only moving priority>=1 devices?
> 
> Thanks,

> -- 
> Peter Xu
Eduardo Habkost Aug. 23, 2021, 9:56 p.m. UTC | #6
On Mon, Aug 23, 2021 at 05:31:46PM -0400, Peter Xu wrote:
> On Mon, Aug 23, 2021 at 05:07:03PM -0400, Eduardo Habkost wrote:
> > To give just one example:
> > 
> > $ (echo 'info pci';echo quit;) | qemu-system-x86_64 -device virtio-net-pci -device e1000e -monitor stdio | tail -n 20
> >   Bus  0, device   4, function 0:
> >     Ethernet controller: PCI device 1af4:1000
> >       PCI subsystem 1af4:0001
> >       IRQ 0, pin A
> >       BAR0: I/O at 0xffffffffffffffff [0x001e].
> >       BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
> >       BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
> >       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
> >       id ""
> >   Bus  0, device   5, function 0:
> >     Ethernet controller: PCI device 8086:10d3
> >       PCI subsystem 8086:0000
> >       IRQ 0, pin A
> >       BAR0: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
> >       BAR1: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
> >       BAR2: I/O at 0xffffffffffffffff [0x001e].
> >       BAR3: 32 bit memory at 0xffffffffffffffff [0x00003ffe].
> >       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
> >       id ""
> > (qemu) quit
> > $ (echo 'info pci';echo quit;) | qemu-system-x86_64 -device e1000e -device virtio-net-pci -monitor stdio | tail -n 20
> >   Bus  0, device   4, function 0:
> >     Ethernet controller: PCI device 8086:10d3
> >       PCI subsystem 8086:0000
> >       IRQ 0, pin A
> >       BAR0: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
> >       BAR1: 32 bit memory at 0xffffffffffffffff [0x0001fffe].
> >       BAR2: I/O at 0xffffffffffffffff [0x001e].
> >       BAR3: 32 bit memory at 0xffffffffffffffff [0x00003ffe].
> >       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
> >       id ""
> >   Bus  0, device   5, function 0:
> >     Ethernet controller: PCI device 1af4:1000
> >       PCI subsystem 1af4:0001
> >       IRQ 0, pin A
> >       BAR0: I/O at 0xffffffffffffffff [0x001e].
> >       BAR1: 32 bit memory at 0xffffffffffffffff [0x00000ffe].
> >       BAR4: 64 bit prefetchable memory at 0xffffffffffffffff [0x00003ffe].
> >       BAR6: 32 bit memory at 0xffffffffffffffff [0x0003fffe].
> >       id ""
> > (qemu) quit
> > 
> > 
> > If the order of the -device arguments changes, the devices are assigned to
> > different PCI slots.
> 
> Thanks for the example.
> 
> Initially I thought about this and didn't think it an issue (because serious
> users will always specify addr=XXX for -device; I thought libvirt always does
> that), but I do remember that guest OS could identify its hardware config with
> devfn number, so nmcli may mess up its config with before/after this change
> indeed..
> 
> I can use a custom sort to replace qsort() to guarantee that.
> 
> Do you have other examples in mind that I may have overlooked, especially I may
> not be able to fix by a custom sort with only moving priority>=1 devices?

I don't have any other example, but I assume address assignment
based on ordering is a common pattern in device code.

I would take a very close and careful look at the devices with
non-default vmsd priority.  If you can prove that the 13 device
types with non-default priority are all order-insensitive, a
custom sort function as you describe might be safe.
Michael S. Tsirkin Aug. 23, 2021, 10:05 p.m. UTC | #7
On Mon, Aug 23, 2021 at 03:18:51PM -0400, Peter Xu wrote:
> On Mon, Aug 23, 2021 at 02:49:12PM -0400, Eduardo Habkost wrote:
> > On Wed, Aug 18, 2021 at 03:43:18PM -0400, Peter Xu wrote:
> > > QEMU creates -device objects in order as specified by the user's cmdline.
> > > However that ordering may not be the ideal order.  For example, some platform
> > > devices (vIOMMUs) may want to be created earlier than most of the rest
> > > devices (e.g., vfio-pci, virtio).
> > > 
> > > This patch orders the QemuOptsList of '-device's so they'll be sorted first
> > > before kicking off the device realizations.  This will allow the device
> > > realization code to be able to use APIs like pci_device_iommu_address_space()
> > > correctly, because those functions rely on the platfrom devices being realized.
> > > 
> > > Now we rely on vmsd->priority which is defined as MigrationPriority to provide
> > > the ordering, as either VM init and migration completes will need such an
> > > ordering.  In the future we can move that priority information out of vmsd.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > Can we be 100% sure that changing the ordering of every single
> > device being created won't affect guest ABI?  (I don't think we can)
> 
> That's a good question, however I doubt whether there's any real-world guest
> ABI for that.  As a developer, I normally specify cmdline parameter in an adhoc
> way, so that I assume most parameters are not sensitive to ordering and I can
> tune the ordering as wish.  I'm not sure whether that's common for qemu users,
> I would expect so, but I may have missed something that I'm not aware of.
> 
> Per my knowledge the only "guest ABI" change is e.g. when we specify "vfio-pci"
> to be before "intel-iommu": it'll be constantly broken before this patchset,
> while after this series it'll be working.  It's just that I don't think those
> "guest ABI" is necessary to be kept, and that's exactly what I want to fix with
> the patchset..
> 
> > 
> > How many device types in QEMU have non-default vmsd priority?
> 
> Not so much; here's the list of priorities and the devices using it:
> 
>        |--------------------+---------|
>        | priority           | devices |
>        |--------------------+---------|
>        | MIG_PRI_IOMMU      |       3 |
>        | MIG_PRI_PCI_BUS    |       7 |
>        | MIG_PRI_VIRTIO_MEM |       1 |
>        | MIG_PRI_GICV3_ITS  |       1 |
>        | MIG_PRI_GICV3      |       1 |
>        |--------------------+---------|

iommu is probably ok. I think virtio mem is ok too,
in that it is normally created by virtio-mem-pci ...



> All the rest devices are using the default (0) priority.
> 
> > 
> > Can we at least ensure devices with the same priority won't be
> > reordered, just to be safe?  (qsort() doesn't guarantee that)
> > 
> > If very few device types have non-default vmsd priority and
> > devices with the same priority aren't reordered, the risk of
> > compatibility breakage would be much smaller.
> 
> I'm also wondering whether it's a good thing to break some guest ABI due to
> this change, if possible.
> 
> Let's imagine something breaks after applied, then the only reason should be
> that qsort() changed the order of some same-priority devices and it's not the
> same as user specified any more.  Then, does it also means there's yet another
> ordering requirement that we didn't even notice?
> 
> I doubt whether that'll even happen (or I think there'll be report already, as
> in qemu man page there's no requirement on parameter ordering).  In all cases,
> instead of "keeping the same priority devices in the same order as the user has
> specified", IMHO we should make the broken devices to have different priorities
> so the ordering will be guaranteed by qemu internal, rather than how user
> specified it.

Well, giving the user control of the guest ABI is a reasonable thing to do;
it is the realize order that users do not really care about.

I guess we could move pci slot allocation out of realize
so it does not depend on realize order?


> >From that pov, maybe this patchset would be great if it can be accepted and
> applied in early stage of a release? So we can figure out what's missing and
> fix them within the same release.  However again I still doubt whether there's
> any user that will break in a bad way.
> 
> Thanks,
> 
> -- 
> Peter Xu
Peter Xu Aug. 23, 2021, 10:36 p.m. UTC | #8
On Mon, Aug 23, 2021 at 06:05:07PM -0400, Michael S. Tsirkin wrote:
> On Mon, Aug 23, 2021 at 03:18:51PM -0400, Peter Xu wrote:
> > On Mon, Aug 23, 2021 at 02:49:12PM -0400, Eduardo Habkost wrote:
> > > On Wed, Aug 18, 2021 at 03:43:18PM -0400, Peter Xu wrote:
> > > > QEMU creates -device objects in order as specified by the user's cmdline.
> > > > However that ordering may not be the ideal order.  For example, some platform
> > > > devices (vIOMMUs) may want to be created earlier than most of the rest
> > > > devices (e.g., vfio-pci, virtio).
> > > > 
> > > > This patch orders the QemuOptsList of '-device's so they'll be sorted first
> > > > before kicking off the device realizations.  This will allow the device
> > > > realization code to be able to use APIs like pci_device_iommu_address_space()
> > > > correctly, because those functions rely on the platfrom devices being realized.
> > > > 
> > > > Now we rely on vmsd->priority which is defined as MigrationPriority to provide
> > > > the ordering, as either VM init and migration completes will need such an
> > > > ordering.  In the future we can move that priority information out of vmsd.
> > > > 
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > 
> > > Can we be 100% sure that changing the ordering of every single
> > > device being created won't affect guest ABI?  (I don't think we can)
> > 
> > That's a good question, however I doubt whether there's any real-world guest
> > ABI for that.  As a developer, I normally specify cmdline parameter in an adhoc
> > way, so that I assume most parameters are not sensitive to ordering and I can
> > tune the ordering as wish.  I'm not sure whether that's common for qemu users,
> > I would expect so, but I may have missed something that I'm not aware of.
> > 
> > Per my knowledge the only "guest ABI" change is e.g. when we specify "vfio-pci"
> > to be before "intel-iommu": it'll be constantly broken before this patchset,
> > while after this series it'll be working.  It's just that I don't think those
> > "guest ABI" is necessary to be kept, and that's exactly what I want to fix with
> > the patchset..
> > 
> > > 
> > > How many device types in QEMU have non-default vmsd priority?
> > 
> > Not so much; here's the list of priorities and the devices using it:
> > 
> >        |--------------------+---------|
> >        | priority           | devices |
> >        |--------------------+---------|
> >        | MIG_PRI_IOMMU      |       3 |
> >        | MIG_PRI_PCI_BUS    |       7 |
> >        | MIG_PRI_VIRTIO_MEM |       1 |
> >        | MIG_PRI_GICV3_ITS  |       1 |
> >        | MIG_PRI_GICV3      |       1 |
> >        |--------------------+---------|
> 
> iommu is probably ok. I think virtio mem is ok too,
> in that it is normally created by virtio-mem-pci ...

Hmm, this reminds me that virtio-mem-pci could get a different devfn allocated
after being moved..

But frankly I still doubt whether we should guarantee that guest ABI when the
user doesn't specify addr=XXX in PCI device parameters - it feels like a burden
that we don't need to carry.

(Btw, trying to keep the order is one thing; declaring it guest ABI would be
 another thing to me)

> 
> 
> 
> > All the rest devices are using the default (0) priority.
> > 
> > > 
> > > Can we at least ensure devices with the same priority won't be
> > > reordered, just to be safe?  (qsort() doesn't guarantee that)
> > > 
> > > If very few device types have non-default vmsd priority and
> > > devices with the same priority aren't reordered, the risk of
> > > compatibility breakage would be much smaller.
> > 
> > I'm also wondering whether it's a good thing to break some guest ABI due to
> > this change, if possible.
> > 
> > Let's imagine something breaks after applied, then the only reason should be
> > that qsort() changed the order of some same-priority devices and it's not the
> > same as user specified any more.  Then, does it also means there's yet another
> > ordering requirement that we didn't even notice?
> > 
> > I doubt whether that'll even happen (or I think there'll be report already, as
> > in qemu man page there's no requirement on parameter ordering).  In all cases,
> > instead of "keeping the same priority devices in the same order as the user has
> > specified", IMHO we should make the broken devices to have different priorities
> > so the ordering will be guaranteed by qemu internal, rather than how user
> > specified it.
> 
> Well giving user control of guest ABI is a reasonable thing to do,
> it is realize order that users do not really care about.

Makes sense.

> 
> I guess we could move pci slot allocation out of realize
> so it does not depend on realize order?

Yes that sounds like another approach, but it seems to require more changes.

Thanks,
Peter Xu Aug. 23, 2021, 10:51 p.m. UTC | #9
On Mon, Aug 23, 2021 at 05:54:44PM -0400, Michael S. Tsirkin wrote:
> > I can use a custom sort to replace qsort() to guarantee that.
> You don't have to do that. Simply use the device position on the command
> line for comparisons when priority is the same.

Indeed. :) Thanks,
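
For illustration, the tie-break Michael suggests could look like the following,
reusing the hypothetical DeviceEntry sketch shown under the commit message
above (names are assumptions, not code from the patch):

    /* Higher priority realizes first; when priorities are equal, fall back to
     * the position of the -device option on the command line, so same-priority
     * devices keep the user-specified order even though qsort() is unstable. */
    static int compare_priority_then_cmdline(const void *a, const void *b)
    {
        const DeviceEntry *da = a, *db = b;

        if (da->priority != db->priority) {
            return db->priority - da->priority;
        }
        return (da->cmdline_pos > db->cmdline_pos) -
               (da->cmdline_pos < db->cmdline_pos);
    }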
Peter Xu Aug. 23, 2021, 11:05 p.m. UTC | #10
On Mon, Aug 23, 2021 at 05:56:23PM -0400, Eduardo Habkost wrote:
> I don't have any other example, but I assume address assignment
> based on ordering is a common pattern in device code.
> 
> I would take a very close and careful look at the devices with
> non-default vmsd priority.  If you can prove that the 13 device
> types with non-default priority are all order-insensitive, a
> custom sort function as you describe might be safe.

Besides virtio-mem-pci, there will also be a similar devfn issue with all
MIG_PRI_PCI_BUS devices, as they are allocated just like other PCI devices.
Say, the two cmdlines below will generate different PCI topologies too:

  $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
                       -device pcie-root-port,chassis=1 \
                       -device virtio-net-pci

And:

  $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
                       -device virtio-net-pci \
                       -device pcie-root-port,chassis=1

This cannot be solved by keeping priority==0 ordering.

On second thought, I think I was initially wrong to treat migration priority
and device realization ordering as the same problem.

For example, for live migration we require PCI buses to be migrated earlier
than MIG_PRI_IOMMU because the IOMMU relies on the bus number to find address
spaces.  However, that's definitely not a requirement for device realization;
say, realizing the vIOMMU after the PCI buses is fine (bus numbers are assigned
during BIOS).

I probably mixed up the two ideas (though they really do look alike!).  Sorry
about that.

Since the only ordering constraint so far is IOMMU vs. all the other devices,
I'll introduce a new priority mechanism and only make sure vIOMMUs are realized
earlier.  That'll also avoid other implications for PCI devfn allocation.

Will rework a new version tomorrow.  Thanks a lot for all the comments,
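
As a rough sketch of what that could look like - a realize-time priority
separate from MigrationPriority that bumps only vIOMMUs and leaves every other
device in command-line order.  The enum, function name and driver matching
below are illustrative assumptions, not the reworked series:

    #include <string.h>

    /* Hypothetical realize-time priority, decoupled from MigrationPriority. */
    typedef enum {
        REALIZE_PRI_DEFAULT = 0,
        REALIZE_PRI_IOMMU   = 1,   /* only vIOMMUs get realized early */
    } RealizePriority;

    static RealizePriority realize_priority(const char *driver)
    {
        /* A real implementation would query the device class rather than
         * match driver names; this is purely illustrative. */
        if (!strcmp(driver, "intel-iommu") ||
            !strcmp(driver, "virtio-iommu-pci")) {
            return REALIZE_PRI_IOMMU;
        }
        return REALIZE_PRI_DEFAULT;
    }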
Jason Wang Aug. 24, 2021, 2:51 a.m. UTC | #11
On Tue, Aug 24, 2021 at 3:18 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Aug 23, 2021 at 02:49:12PM -0400, Eduardo Habkost wrote:
> > On Wed, Aug 18, 2021 at 03:43:18PM -0400, Peter Xu wrote:
> > > QEMU creates -device objects in order as specified by the user's cmdline.
> > > However that ordering may not be the ideal order.  For example, some platform
> > > devices (vIOMMUs) may want to be created earlier than most of the rest
> > > devices (e.g., vfio-pci, virtio).
> > >
> > > This patch orders the QemuOptsList of '-device's so they'll be sorted first
> > > before kicking off the device realizations.  This will allow the device
> > > realization code to be able to use APIs like pci_device_iommu_address_space()
> > > correctly, because those functions rely on the platfrom devices being realized.
> > >
> > > Now we rely on vmsd->priority which is defined as MigrationPriority to provide
> > > the ordering, as either VM init and migration completes will need such an
> > > ordering.  In the future we can move that priority information out of vmsd.
> > >
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> >
> > Can we be 100% sure that changing the ordering of every single
> > device being created won't affect guest ABI?  (I don't think we can)
>
> That's a good question, however I doubt whether there's any real-world guest
> ABI for that.  As a developer, I normally specify cmdline parameter in an adhoc
> way, so that I assume most parameters are not sensitive to ordering and I can
> tune the ordering as wish.  I'm not sure whether that's common for qemu users,
> I would expect so, but I may have missed something that I'm not aware of.
>
> Per my knowledge the only "guest ABI" change is e.g. when we specify "vfio-pci"
> to be before "intel-iommu": it'll be constantly broken before this patchset,
> while after this series it'll be working.  It's just that I don't think those
> "guest ABI" is necessary to be kept, and that's exactly what I want to fix with
> the patchset..

Yes, and I wonder whether, if we limit this to new machine types, we even
need to care about the ABI stuff.

Thanks

>
> >
> > How many device types in QEMU have non-default vmsd priority?
>
> Not so much; here's the list of priorities and the devices using it:
>
>        |--------------------+---------|
>        | priority           | devices |
>        |--------------------+---------|
>        | MIG_PRI_IOMMU      |       3 |
>        | MIG_PRI_PCI_BUS    |       7 |
>        | MIG_PRI_VIRTIO_MEM |       1 |
>        | MIG_PRI_GICV3_ITS  |       1 |
>        | MIG_PRI_GICV3      |       1 |
>        |--------------------+---------|
>
> All the rest devices are using the default (0) priority.
>
> >
> > Can we at least ensure devices with the same priority won't be
> > reordered, just to be safe?  (qsort() doesn't guarantee that)
> >
> > If very few device types have non-default vmsd priority and
> > devices with the same priority aren't reordered, the risk of
> > compatibility breakage would be much smaller.
>
> I'm also wondering whether it's a good thing to break some guest ABI due to
> this change, if possible.
>
> Let's imagine something breaks after applied, then the only reason should be
> that qsort() changed the order of some same-priority devices and it's not the
> same as user specified any more.  Then, does it also means there's yet another
> ordering requirement that we didn't even notice?
>
> I doubt whether that'll even happen (or I think there'll be report already, as
> in qemu man page there's no requirement on parameter ordering).  In all cases,
> instead of "keeping the same priority devices in the same order as the user has
> specified", IMHO we should make the broken devices to have different priorities
> so the ordering will be guaranteed by qemu internal, rather than how user
> specified it.
>
> From that pov, maybe this patchset would be great if it can be accepted and
> applied in early stage of a release? So we can figure out what's missing and
> fix them within the same release.  However again I still doubt whether there's
> any user that will break in a bad way.
>
> Thanks,
>
> --
> Peter Xu
>
Jason Wang Aug. 24, 2021, 2:52 a.m. UTC | #12
On Tue, Aug 24, 2021 at 6:37 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Aug 23, 2021 at 06:05:07PM -0400, Michael S. Tsirkin wrote:
> > On Mon, Aug 23, 2021 at 03:18:51PM -0400, Peter Xu wrote:
> > > On Mon, Aug 23, 2021 at 02:49:12PM -0400, Eduardo Habkost wrote:
> > > > On Wed, Aug 18, 2021 at 03:43:18PM -0400, Peter Xu wrote:
> > > > > QEMU creates -device objects in order as specified by the user's cmdline.
> > > > > However that ordering may not be the ideal order.  For example, some platform
> > > > > devices (vIOMMUs) may want to be created earlier than most of the rest
> > > > > devices (e.g., vfio-pci, virtio).
> > > > >
> > > > > This patch orders the QemuOptsList of '-device's so they'll be sorted first
> > > > > before kicking off the device realizations.  This will allow the device
> > > > > realization code to be able to use APIs like pci_device_iommu_address_space()
> > > > > correctly, because those functions rely on the platfrom devices being realized.
> > > > >
> > > > > Now we rely on vmsd->priority which is defined as MigrationPriority to provide
> > > > > the ordering, as either VM init and migration completes will need such an
> > > > > ordering.  In the future we can move that priority information out of vmsd.
> > > > >
> > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > >
> > > > Can we be 100% sure that changing the ordering of every single
> > > > device being created won't affect guest ABI?  (I don't think we can)
> > >
> > > That's a good question, however I doubt whether there's any real-world guest
> > > ABI for that.  As a developer, I normally specify cmdline parameter in an adhoc
> > > way, so that I assume most parameters are not sensitive to ordering and I can
> > > tune the ordering as wish.  I'm not sure whether that's common for qemu users,
> > > I would expect so, but I may have missed something that I'm not aware of.
> > >
> > > Per my knowledge the only "guest ABI" change is e.g. when we specify "vfio-pci"
> > > to be before "intel-iommu": it'll be constantly broken before this patchset,
> > > while after this series it'll be working.  It's just that I don't think those
> > > "guest ABI" is necessary to be kept, and that's exactly what I want to fix with
> > > the patchset..
> > >
> > > >
> > > > How many device types in QEMU have non-default vmsd priority?
> > >
> > > Not so much; here's the list of priorities and the devices using it:
> > >
> > >        |--------------------+---------|
> > >        | priority           | devices |
> > >        |--------------------+---------|
> > >        | MIG_PRI_IOMMU      |       3 |
> > >        | MIG_PRI_PCI_BUS    |       7 |
> > >        | MIG_PRI_VIRTIO_MEM |       1 |
> > >        | MIG_PRI_GICV3_ITS  |       1 |
> > >        | MIG_PRI_GICV3      |       1 |
> > >        |--------------------+---------|
> >
> > iommu is probably ok. I think virtio mem is ok too,
> > in that it is normally created by virtio-mem-pci ...
>
> Hmm this reminded me whether virtio-mem-pci could have another devfn allocated
> after being moved..
>
> But frankly I still doubt whether we should guarantee that guest ABI on user
> not specifying addr=XXX in pci device parameters - I feel like it's a burden
> that we don't need to carry.
>
> (Btw, trying to keep the order is one thing; declare it guest ABI would be
>  another thing to me)
>
> >
> >
> >
> > > All the rest devices are using the default (0) priority.
> > >
> > > >
> > > > Can we at least ensure devices with the same priority won't be
> > > > reordered, just to be safe?  (qsort() doesn't guarantee that)
> > > >
> > > > If very few device types have non-default vmsd priority and
> > > > devices with the same priority aren't reordered, the risk of
> > > > compatibility breakage would be much smaller.
> > >
> > > I'm also wondering whether it's a good thing to break some guest ABI due to
> > > this change, if possible.
> > >
> > > Let's imagine something breaks after applied, then the only reason should be
> > > that qsort() changed the order of some same-priority devices and it's not the
> > > same as user specified any more.  Then, does it also means there's yet another
> > > ordering requirement that we didn't even notice?
> > >
> > > I doubt whether that'll even happen (or I think there'll be report already, as
> > > in qemu man page there's no requirement on parameter ordering).  In all cases,
> > > instead of "keeping the same priority devices in the same order as the user has
> > > specified", IMHO we should make the broken devices to have different priorities
> > > so the ordering will be guaranteed by qemu internal, rather than how user
> > > specified it.
> >
> > Well giving user control of guest ABI is a reasonable thing to do,
> > it is realize order that users do not really care about.
>
> Makes sense.
>
> >
> > I guess we could move pci slot allocation out of realize
> > so it does not depend on realize order?
>
> Yes that sounds like another approach, but it seems to require more changes.

It looks to me like this doesn't solve the issue of using virtio-mmio with vhost?

Thanks

>
> Thanks,
>
> --
> Peter Xu
>
Peter Xu Aug. 24, 2021, 3:50 p.m. UTC | #13
On Tue, Aug 24, 2021 at 10:52:24AM +0800, Jason Wang wrote:
> It looks to me this doesn't solve the issue of using virtio-mmio with vhost?

No IOMMU is supported for any of the MMIO devices, right?  Or am I wrong?  Thanks,
David Hildenbrand Aug. 24, 2021, 4:24 p.m. UTC | #14
On 24.08.21 00:05, Michael S. Tsirkin wrote:
> On Mon, Aug 23, 2021 at 03:18:51PM -0400, Peter Xu wrote:
>> On Mon, Aug 23, 2021 at 02:49:12PM -0400, Eduardo Habkost wrote:
>>> On Wed, Aug 18, 2021 at 03:43:18PM -0400, Peter Xu wrote:
>>>> QEMU creates -device objects in order as specified by the user's cmdline.
>>>> However that ordering may not be the ideal order.  For example, some platform
>>>> devices (vIOMMUs) may want to be created earlier than most of the rest
>>>> devices (e.g., vfio-pci, virtio).
>>>>
>>>> This patch orders the QemuOptsList of '-device's so they'll be sorted first
>>>> before kicking off the device realizations.  This will allow the device
>>>> realization code to be able to use APIs like pci_device_iommu_address_space()
>>>> correctly, because those functions rely on the platfrom devices being realized.
>>>>
>>>> Now we rely on vmsd->priority which is defined as MigrationPriority to provide
>>>> the ordering, as either VM init and migration completes will need such an
>>>> ordering.  In the future we can move that priority information out of vmsd.
>>>>
>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>
>>> Can we be 100% sure that changing the ordering of every single
>>> device being created won't affect guest ABI?  (I don't think we can)
>>
>> That's a good question, however I doubt whether there's any real-world guest
>> ABI for that.  As a developer, I normally specify cmdline parameter in an adhoc
>> way, so that I assume most parameters are not sensitive to ordering and I can
>> tune the ordering as wish.  I'm not sure whether that's common for qemu users,
>> I would expect so, but I may have missed something that I'm not aware of.
>>
>> Per my knowledge the only "guest ABI" change is e.g. when we specify "vfio-pci"
>> to be before "intel-iommu": it'll be constantly broken before this patchset,
>> while after this series it'll be working.  It's just that I don't think those
>> "guest ABI" is necessary to be kept, and that's exactly what I want to fix with
>> the patchset..
>>
>>>
>>> How many device types in QEMU have non-default vmsd priority?
>>
>> Not so much; here's the list of priorities and the devices using it:
>>
>>         |--------------------+---------|
>>         | priority           | devices |
>>         |--------------------+---------|
>>         | MIG_PRI_IOMMU      |       3 |
>>         | MIG_PRI_PCI_BUS    |       7 |
>>         | MIG_PRI_VIRTIO_MEM |       1 |
>>         | MIG_PRI_GICV3_ITS  |       1 |
>>         | MIG_PRI_GICV3      |       1 |
>>         |--------------------+---------|
> 
> iommu is probably ok. I think virtio mem is ok too,
> in that it is normally created by virtio-mem-pci ...

IIRC:

intel-iommu has to be created on the QEMU cmdline before creating 
virtio-mem-pci.

-device intel-iommu,caching-mode=on,intremap=on,device-iotlb=on \
...
-device virtio-mem-pci,disable-legacy=on,disable-modern=off,iommu_platform=on,ats=on,id=vm0,...

Creating virtio-mem-pci will implicitly create virtio-mem. virtio-mem 
device state has to be migrated before migrating intel-iommu state.

I do wonder if migration priorities are really what we want to reuse 
here. I guess it works right, but just by pure luck (because we ignore 
the implicit dependency regarding priorities)
Peter Xu Aug. 24, 2021, 7:52 p.m. UTC | #15
On Tue, Aug 24, 2021 at 06:24:27PM +0200, David Hildenbrand wrote:
> > > Not so much; here's the list of priorities and the devices using it:
> > > 
> > >         |--------------------+---------|
> > >         | priority           | devices |
> > >         |--------------------+---------|
> > >         | MIG_PRI_IOMMU      |       3 |
> > >         | MIG_PRI_PCI_BUS    |       7 |
> > >         | MIG_PRI_VIRTIO_MEM |       1 |
> > >         | MIG_PRI_GICV3_ITS  |       1 |
> > >         | MIG_PRI_GICV3      |       1 |
> > >         |--------------------+---------|
> > 
> > iommu is probably ok. I think virtio mem is ok too,
> > in that it is normally created by virtio-mem-pci ...
> 
> IIRC:
> 
> intel-iommu has to be created on the QEMU cmdline before creating
> virtio-mem-pci.
> 
> -device intel-iommu,caching-mode=on,intremap=on,device-iotlb=on \
> ...
> -device virtio-mem-pci,disable-legacy=on,disable-modern=off,iommu_platform=on,ats=on,id=vm0,...
> 
> Creating virtio-mem-pci will implicitly create virtio-mem. virtio-mem device
> state has to be migrated before migrating intel-iommu state.

Since we're at it.. frankly I didn't fully digest why virtio-mem needs to be
migrated earlier when reading commit 0fd7616e0f1171b.  It gives me the feeling
that virtio-mem has an ordering dependency with vfio-pci rather than with the
vIOMMU, but I could be wrong.

E.g., the IOMMU unit shouldn't be walking page tables without a device using
it, so it won't fail until the device walks them in the region_add() hooks when
memory replay happens.

> 
> I do wonder if migration priorities are really what we want to reuse here. I
> guess it works right, but just by pure luck (because we ignore the implicit
> dependency regarding priorities)

Yep, that's probably not what we want.  I'll make it a new list of priorities
as I've stated in the other thread.

Thanks,
Jason Wang Aug. 25, 2021, 4:23 a.m. UTC | #16
On Tue, Aug 24, 2021 at 11:50 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Aug 24, 2021 at 10:52:24AM +0800, Jason Wang wrote:
> > It looks to me this doesn't solve the issue of using virtio-mmio with vhost?
>
> No IOMMU supported for any of the MMIO devices, right?  Or am I wrong?  Thanks,

I guess virtio IOMMU has that support, but I might be wrong.

Thanks

>
> --
> Peter Xu
>
David Hildenbrand Aug. 25, 2021, 8:08 a.m. UTC | #17
On 24.08.21 21:52, Peter Xu wrote:
> On Tue, Aug 24, 2021 at 06:24:27PM +0200, David Hildenbrand wrote:
>>>> Not so much; here's the list of priorities and the devices using it:
>>>>
>>>>          |--------------------+---------|
>>>>          | priority           | devices |
>>>>          |--------------------+---------|
>>>>          | MIG_PRI_IOMMU      |       3 |
>>>>          | MIG_PRI_PCI_BUS    |       7 |
>>>>          | MIG_PRI_VIRTIO_MEM |       1 |
>>>>          | MIG_PRI_GICV3_ITS  |       1 |
>>>>          | MIG_PRI_GICV3      |       1 |
>>>>          |--------------------+---------|
>>>
>>> iommu is probably ok. I think virtio mem is ok too,
>>> in that it is normally created by virtio-mem-pci ...
>>
>> IIRC:
>>
>> intel-iommu has to be created on the QEMU cmdline before creating
>> virtio-mem-pci.
>>
>> -device intel-iommu,caching-mode=on,intremap=on,device-iotlb=on \
>> ...
>> -device virtio-mem-pci,disable-legacy=on,disable-modern=off,iommu_platform=on,ats=on,id=vm0,...
>>
>> Creating virtio-mem-pci will implicitly create virtio-mem. virtio-mem device
>> state has to be migrated before migrating intel-iommu state.
> 
> Since we're at it.. frankly I didn't fully digest why virtio-mem needs to be
> migrated before when reading commit 0fd7616e0f1171b.  It gives me the feeling
> more like that virtio-mem has a ordering dependency with vfio-pci not
> virtio-mem, but I could be wrong.

"We have to take care of incoming migration: at the point the
  IOMMUs get restored and start creating mappings in vfio,
  RamDiscardManager implementations might not be back up and running yet:
  let's add runstate priorities to enforce the order when restoring."

s/runstate/vmstate/ :|

I recall that we can see IOMMU_NOTIFIER_MAP events when restoring an
IOMMU device. vfio_get_xlat_addr() would be called and could reject
mappings targeting virtio-mem memory if the virtio-mem device has not yet
restored its bitmap from the migration stream, in which case
ram_discard_manager_is_populated() would not be reliable. Consequently, we
have to restore the virtio-mem device state (not the virtio-mem-pci
device state!) before restoring an IOMMU.



> 
> E.g., the IOMMU unit shouldn't be walking page table if without a device using
> it, then it won't fail until the device walks it in region_add() hooks when
> memory replay happens.

I recall it happened when switching to the iommu region e.g., in 
vtd_post_load()->vtd_switch_address_space(). But I forgot the exact call 
path.
Markus Armbruster Aug. 25, 2021, 9:39 a.m. UTC | #18
Peter Xu <peterx@redhat.com> writes:

> On Mon, Aug 23, 2021 at 05:56:23PM -0400, Eduardo Habkost wrote:
>> I don't have any other example, but I assume address assignment
>> based on ordering is a common pattern in device code.
>> 
>> I would take a very close and careful look at the devices with
>> non-default vmsd priority.  If you can prove that the 13 device
>> types with non-default priority are all order-insensitive, a
>> custom sort function as you describe might be safe.
>
> Besides virtio-mem-pci, there'll also similar devfn issue with all
> MIG_PRI_PCI_BUS, as they'll be allocated just like other pci devices.  Say,
> below two cmdlines will generate different pci topology too:
>
>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
>                        -device pcie-root-port,chassis=1 \
>                        -device virtio-net-pci
>
> And:
>
>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
>                        -device virtio-net-pci
>                        -device pcie-root-port,chassis=1 \
>
> This cannot be solved by keeping priority==0 ordering.
>
> After a second thought, I think I was initially wrong on seeing migration
> priority and device realization the same problem.
>
> For example, for live migration we have a requirement on PCI_BUS being migrated
> earlier than MIG_PRI_IOMMU because there's bus number information required
> because IOMMU relies on the bus number to find address spaces.  However that's
> definitely not a requirement for device realizations, say, realizing vIOMMU
> after pci buses are fine (bus assigned during bios).
>
> I've probably messed up with the ideas (though they really look alike!).  Sorry
> about that.
>
> Since the only ordering constraint so far is IOMMU vs all the rest of devices,
> I'll introduce a new priority mechanism and only make sure vIOMMUs are realized
> earlier.  That'll also avoid other implications on pci devfn allocations.
>
> Will rework a new version tomorrow.  Thanks a lot for all the comments,

Is it really a good idea to magically reorder device realization just to
make a non-working command line work?  Why can't we just fail the
non-working command line in a way that tells users how to get a working
one?  We have way too much ordering magic already...

If we decide we want more magic, then I'd argue for *dependencies*
instead of priorities.  Dependencies are specific and local: $this needs
to go after $that because $reasons.  Priorities are unspecific and
global.
Markus Armbruster Aug. 25, 2021, 12:28 p.m. UTC | #19
Markus Armbruster <armbru@redhat.com> writes:

> Peter Xu <peterx@redhat.com> writes:
>
>> On Mon, Aug 23, 2021 at 05:56:23PM -0400, Eduardo Habkost wrote:
>>> I don't have any other example, but I assume address assignment
>>> based on ordering is a common pattern in device code.
>>> 
>>> I would take a very close and careful look at the devices with
>>> non-default vmsd priority.  If you can prove that the 13 device
>>> types with non-default priority are all order-insensitive, a
>>> custom sort function as you describe might be safe.
>>
>> Besides virtio-mem-pci, there'll also similar devfn issue with all
>> MIG_PRI_PCI_BUS, as they'll be allocated just like other pci devices.  Say,
>> below two cmdlines will generate different pci topology too:
>>
>>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
>>                        -device pcie-root-port,chassis=1 \
>>                        -device virtio-net-pci
>>
>> And:
>>
>>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
>>                        -device virtio-net-pci
>>                        -device pcie-root-port,chassis=1 \
>>
>> This cannot be solved by keeping priority==0 ordering.
>>
>> After a second thought, I think I was initially wrong on seeing migration
>> priority and device realization the same problem.
>>
>> For example, for live migration we have a requirement on PCI_BUS being migrated
>> earlier than MIG_PRI_IOMMU because there's bus number information required
>> because IOMMU relies on the bus number to find address spaces.  However that's
>> definitely not a requirement for device realizations, say, realizing vIOMMU
>> after pci buses are fine (bus assigned during bios).
>>
>> I've probably messed up with the ideas (though they really look alike!).  Sorry
>> about that.
>>
>> Since the only ordering constraint so far is IOMMU vs all the rest of devices,
>> I'll introduce a new priority mechanism and only make sure vIOMMUs are realized
>> earlier.  That'll also avoid other implications on pci devfn allocations.
>>
>> Will rework a new version tomorrow.  Thanks a lot for all the comments,
>
> Is it really a good idea to magically reorder device realization just to
> make a non-working command line work?  Why can't we just fail the
> non-working command line in a way that tells users how to get a working
> one?  We have way too much ordering magic already...
>
> If we decide we want more magic, then I'd argue for *dependencies*
> instead of priorities.  Dependencies are specific and local: $this needs
> to go after $that because $reasons.  Priorities are unspecific and
> global.

Having thought about this a bit more...

Constraints on realize order are nothing new.  For instance, when a
device plugs into a bus, it needs to be realized after the device
providing the bus.

We ensure this by having the device refer to the bus, e.g. bus=pci.0.
The reference may be implicit, but it's there.  It must resolve for
device creation to succeed, and if it resolves, the device providing the
bus will be realized in time.
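
For instance, with the default i440FX machine both of these should end up on
pci.0, the second one implicitly:

    $ qemu-system-x86_64 -device e1000,bus=pci.0
    $ qemu-system-x86_64 -device e1000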

I believe what's getting us into trouble with IOMMU is not having such a
reference.  Or in other words, keeping the dependence between the IOMMU
and the devices relying on it *implicit*, and thus hidden from the
existing realize-ordering machinery.

Instead of inventing another such machinery, let's try to use the one we
already have.
Peter Xu Aug. 25, 2021, 9:50 p.m. UTC | #20
On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote:
> Markus Armbruster <armbru@redhat.com> writes:
> 
> > Peter Xu <peterx@redhat.com> writes:
> >
> >> On Mon, Aug 23, 2021 at 05:56:23PM -0400, Eduardo Habkost wrote:
> >>> I don't have any other example, but I assume address assignment
> >>> based on ordering is a common pattern in device code.
> >>> 
> >>> I would take a very close and careful look at the devices with
> >>> non-default vmsd priority.  If you can prove that the 13 device
> >>> types with non-default priority are all order-insensitive, a
> >>> custom sort function as you describe might be safe.
> >>
> >> Besides virtio-mem-pci, there'll also similar devfn issue with all
> >> MIG_PRI_PCI_BUS, as they'll be allocated just like other pci devices.  Say,
> >> below two cmdlines will generate different pci topology too:
> >>
> >>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
> >>                        -device pcie-root-port,chassis=1 \
> >>                        -device virtio-net-pci
> >>
> >> And:
> >>
> >>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
> >>                        -device virtio-net-pci
> >>                        -device pcie-root-port,chassis=1 \
> >>
> >> This cannot be solved by keeping priority==0 ordering.
> >>
> >> After a second thought, I think I was initially wrong on seeing migration
> >> priority and device realization the same problem.
> >>
> >> For example, for live migration we have a requirement on PCI_BUS being migrated
> >> earlier than MIG_PRI_IOMMU because there's bus number information required
> >> because IOMMU relies on the bus number to find address spaces.  However that's
> >> definitely not a requirement for device realizations, say, realizing vIOMMU
> >> after pci buses are fine (bus assigned during bios).
> >>
> >> I've probably messed up with the ideas (though they really look alike!).  Sorry
> >> about that.
> >>
> >> Since the only ordering constraint so far is IOMMU vs all the rest of devices,
> >> I'll introduce a new priority mechanism and only make sure vIOMMUs are realized
> >> earlier.  That'll also avoid other implications on pci devfn allocations.
> >>
> >> Will rework a new version tomorrow.  Thanks a lot for all the comments,
> >
> > Is it really a good idea to magically reorder device realization just to
> > make a non-working command line work?  Why can't we just fail the
> > non-working command line in a way that tells users how to get a working
> > one?  We have way too much ordering magic already...
> >
> > If we decide we want more magic, then I'd argue for *dependencies*
> > instead of priorities.  Dependencies are specific and local: $this needs
> > to go after $that because $reasons.  Priorities are unspecific and
> > global.
> 
> Having thought about this a bit more...
> 
> Constraints on realize order are nothing new.  For instance, when a
> device plugs into a bus, it needs to be realized after the device
> providing the bus.
> 
> We ensure this by having the device refer to the bus, e.g. bus=pci.0.
> The reference may be implicit, but it's there.  It must resolve for
> device creation to succeed, and if it resolves, the device providing the
> bus will be realized in time.
> 
> I believe what's getting us into trouble with IOMMU is not having such a
> reference.  Or in other words, keeping the dependence between the IOMMU
> and the devices relying on it *implicit*, and thus hidden from the
> existing realize-ordering machinery.
> 
> Instead of inventing another such machinery, let's try to use the one we
> already have.

Hmm... I just realized we don't have such machinery, do we?

This does not really work:

$ ./qemu-system-x86_64 -M q35 -device virtio-net-pci,bus=pcie.1 \
                       -device pcie-root-port,id=pcie.1,bus=pcie.0
qemu-system-x86_64: -device virtio-net-pci,bus=pcie.1: Bus 'pcie.1' not found

While this will:

$ ./qemu-system-x86_64 -M q35 -device pcie-root-port,id=pcie.1,bus=pcie.0 \
                       -device virtio-net-pci,bus=pcie.1

Thanks,
Peter Xu Aug. 26, 2021, 3:50 a.m. UTC | #21
On Wed, Aug 25, 2021 at 05:50:23PM -0400, Peter Xu wrote:
> On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote:
> > Markus Armbruster <armbru@redhat.com> writes:
> > 
> > > Peter Xu <peterx@redhat.com> writes:
> > >
> > >> On Mon, Aug 23, 2021 at 05:56:23PM -0400, Eduardo Habkost wrote:
> > >>> I don't have any other example, but I assume address assignment
> > >>> based on ordering is a common pattern in device code.
> > >>> 
> > >>> I would take a very close and careful look at the devices with
> > >>> non-default vmsd priority.  If you can prove that the 13 device
> > >>> types with non-default priority are all order-insensitive, a
> > >>> custom sort function as you describe might be safe.
> > >>
> > >> Besides virtio-mem-pci, there'll also similar devfn issue with all
> > >> MIG_PRI_PCI_BUS, as they'll be allocated just like other pci devices.  Say,
> > >> below two cmdlines will generate different pci topology too:
> > >>
> > >>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
> > >>                        -device pcie-root-port,chassis=1 \
> > >>                        -device virtio-net-pci
> > >>
> > >> And:
> > >>
> > >>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
> > >>                        -device virtio-net-pci
> > >>                        -device pcie-root-port,chassis=1 \
> > >>
> > >> This cannot be solved by keeping priority==0 ordering.
> > >>
> > >> After a second thought, I think I was initially wrong on seeing migration
> > >> priority and device realization the same problem.
> > >>
> > >> For example, for live migration we have a requirement on PCI_BUS being migrated
> > >> earlier than MIG_PRI_IOMMU because there's bus number information required
> > >> because IOMMU relies on the bus number to find address spaces.  However that's
> > >> definitely not a requirement for device realizations, say, realizing vIOMMU
> > >> after pci buses are fine (bus assigned during bios).
> > >>
> > >> I've probably messed up with the ideas (though they really look alike!).  Sorry
> > >> about that.
> > >>
> > >> Since the only ordering constraint so far is IOMMU vs all the rest of devices,
> > >> I'll introduce a new priority mechanism and only make sure vIOMMUs are realized
> > >> earlier.  That'll also avoid other implications on pci devfn allocations.
> > >>
> > >> Will rework a new version tomorrow.  Thanks a lot for all the comments,
> > >
> > > Is it really a good idea to magically reorder device realization just to
> > > make a non-working command line work?  Why can't we just fail the
> > > non-working command line in a way that tells users how to get a working
> > > one?  We have way too much ordering magic already...
> > >
> > > If we decide we want more magic, then I'd argue for *dependencies*
> > > instead of priorities.  Dependencies are specific and local: $this needs
> > > to go after $that because $reasons.  Priorities are unspecific and
> > > global.
> > 
> > Having thought about this a bit more...
> > 
> > Constraints on realize order are nothing new.  For instance, when a
> > device plugs into a bus, it needs to be realized after the device
> > providing the bus.
> > 
> > We ensure this by having the device refer to the bus, e.g. bus=pci.0.
> > The reference may be implicit, but it's there.  It must resolve for
> > device creation to succeed, and if it resolves, the device providing the
> > bus will be realized in time.
> > 
> > I believe what's getting us into trouble with IOMMU is not having such a
> > reference.  Or in other words, keeping the dependence between the IOMMU
> > and the devices relying on it *implicit*, and thus hidden from the
> > existing realize-ordering machinery.

[1]

> > 
> > Instead of inventing another such machinery, let's try to use the one we
> > already have.
> 
> Hmm... I just found that we don't have such machinery, do we?
> 
> This does not really work:
> 
> $ ./qemu-system-x86_64 -M q35 -device virtio-net-pci,bus=pcie.1 \
>                        -device pcie-root-port,id=pcie.1,bus=pcie.0
> qemu-system-x86_64: -device virtio-net-pci,bus=pcie.1: Bus 'pcie.1' not found
> 
> While this will:
> 
> $ ./qemu-system-x86_64 -M q35 -device pcie-root-port,id=pcie.1,bus=pcie.0 \
>                        -device virtio-net-pci,bus=pcie.1

I think I fully agree with [1]: the IOMMU is special in that it's not only
implicit, but also does not have a default value.  PCI buses have default
values (the root PCI bus, e.g. pcie.0 on q35), so they're actually easier.

When parsing the "-device" entry for a pci device, we'll 100% sure what's the
pci bus for the device, either specified or it's just the default.  We don't
look at the rest of "-device"s to be sure of it.  We just try to look up the
pci bus, if it's there we continue, otherwise we abort.

But the IOMMU is different: a device can run without a vIOMMU (in which case
there's no dependency), but when a vIOMMU is specified, the dependency is there.

That's why I am not sure whether the old machinery and "early failure" solution
would work trivially for IOMMUs.

We can add an "-iommu" parameter, then we simply parse it before "-device".
However I forgot why Marcel (IIRC) made it a "-device" parameter, also "-iommu"
is not ideal in that the IOMMU unit is indeed implemented as a device for the
archs I'm aware of, so it's kind of some extra glue that seems to just work
around the ordering problem we have right now.  Meanwhile that solution won't
help the ordering issue of pci bus/device.

That's why I still think the idea of having a global priority for device
realization (or describing dependencies between devices) makes sense.  We
could actually fix both the IOMMU and the PCI bus case, allowing a PCI bus to
be specified later than a PCI device that belongs to it, alongside fixing the
IOMMU.

IMHO a global priority list (maybe a realize_priority field in the device
class) is just a simpler version of device dependencies.  For one thing, we
don't need to worry about dependency cycles.  So far the ordering requirements
are still simple, so defining them globally should fix most of the issues
already, in a straightforward way and with fewer LOCs.  If it gets complicated
one day, we can always switch the global priority list to per-device
dependencies easily, because that's not guest ABI.
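
A rough sketch of what I mean, just to illustrate (the realize_priority knob
is made up, nothing like it exists today; written as standalone C rather than
against QEMU's types):

    #include <stdlib.h>

    /* Invented knob: higher value means realize earlier. */
    typedef struct {
        const char *driver;
        int realize_priority;
        int cmdline_index;      /* position of the -device option */
    } DevEntry;

    /* Sort by priority, falling back to command line position so that
     * equal-priority devices keep their relative order (qsort() alone
     * is not stable). */
    static int realize_prio_cmp(const void *pa, const void *pb)
    {
        const DevEntry *a = pa, *b = pb;

        if (a->realize_priority != b->realize_priority) {
            return b->realize_priority - a->realize_priority;
        }
        return a->cmdline_index - b->cmdline_index;
    }

    /* usage: qsort(entries, n_entries, sizeof(DevEntry), realize_prio_cmp); */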

Any further thoughts will be greatly welcomed.

Thanks,
Markus Armbruster Aug. 26, 2021, 4:57 a.m. UTC | #22
Peter Xu <peterx@redhat.com> writes:

> On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote:
>> Markus Armbruster <armbru@redhat.com> writes:
>> 
>> > Peter Xu <peterx@redhat.com> writes:
>> >
>> >> On Mon, Aug 23, 2021 at 05:56:23PM -0400, Eduardo Habkost wrote:
>> >>> I don't have any other example, but I assume address assignment
>> >>> based on ordering is a common pattern in device code.
>> >>> 
>> >>> I would take a very close and careful look at the devices with
>> >>> non-default vmsd priority.  If you can prove that the 13 device
>> >>> types with non-default priority are all order-insensitive, a
>> >>> custom sort function as you describe might be safe.
>> >>
>> >> Besides virtio-mem-pci, there'll also similar devfn issue with all
>> >> MIG_PRI_PCI_BUS, as they'll be allocated just like other pci devices.  Say,
>> >> below two cmdlines will generate different pci topology too:
>> >>
>> >>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
>> >>                        -device pcie-root-port,chassis=1 \
>> >>                        -device virtio-net-pci
>> >>
>> >> And:
>> >>
>> >>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
>> >>                        -device virtio-net-pci
>> >>                        -device pcie-root-port,chassis=1 \
>> >>
>> >> This cannot be solved by keeping priority==0 ordering.
>> >>
>> >> After a second thought, I think I was initially wrong on seeing migration
>> >> priority and device realization the same problem.
>> >>
>> >> For example, for live migration we have a requirement on PCI_BUS being migrated
>> >> earlier than MIG_PRI_IOMMU because there's bus number information required
>> >> because IOMMU relies on the bus number to find address spaces.  However that's
>> >> definitely not a requirement for device realizations, say, realizing vIOMMU
>> >> after pci buses are fine (bus assigned during bios).
>> >>
>> >> I've probably messed up with the ideas (though they really look alike!).  Sorry
>> >> about that.
>> >>
>> >> Since the only ordering constraint so far is IOMMU vs all the rest of devices,
>> >> I'll introduce a new priority mechanism and only make sure vIOMMUs are realized
>> >> earlier.  That'll also avoid other implications on pci devfn allocations.
>> >>
>> >> Will rework a new version tomorrow.  Thanks a lot for all the comments,
>> >
>> > Is it really a good idea to magically reorder device realization just to
>> > make a non-working command line work?  Why can't we just fail the
>> > non-working command line in a way that tells users how to get a working
>> > one?  We have way too much ordering magic already...
>> >
>> > If we decide we want more magic, then I'd argue for *dependencies*
>> > instead of priorities.  Dependencies are specific and local: $this needs
>> > to go after $that because $reasons.  Priorities are unspecific and
>> > global.
>> 
>> Having thought about this a bit more...
>> 
>> Constraints on realize order are nothing new.  For instance, when a
>> device plugs into a bus, it needs to be realized after the device
>> providing the bus.
>> 
>> We ensure this by having the device refer to the bus, e.g. bus=pci.0.
>> The reference may be implicit, but it's there.  It must resolve for
>> device creation to succeed, and if it resolves, the device providing the
>> bus will be realized in time.
>> 
>> I believe what's getting us into trouble with IOMMU is not having such a
>> reference.  Or in other words, keeping the dependence between the IOMMU
>> and the devices relying on it *implicit*, and thus hidden from the
>> existing realize-ordering machinery.
>> 
>> Instead of inventing another such machinery, let's try to use the one we
>> already have.
>
> Hmm... I just found that we don't have such machinery, do we?
>
> This does not really work:
>
> $ ./qemu-system-x86_64 -M q35 -device virtio-net-pci,bus=pcie.1 \
>                        -device pcie-root-port,id=pcie.1,bus=pcie.0
> qemu-system-x86_64: -device virtio-net-pci,bus=pcie.1: Bus 'pcie.1' not found
>
> While this will:
>
> $ ./qemu-system-x86_64 -M q35 -device pcie-root-port,id=pcie.1,bus=pcie.0 \
>                        -device virtio-net-pci,bus=pcie.1

This is exactly what I described.  bus=pcie.0 is the explicit reference.
It must resolve for device creation to succeed, and if it resolves, the
device providing the bus will be realized in time.  It resolves in the
second example, but not the first.

Look ma, no magic!  Instead, stupid & predictable.
Markus Armbruster Aug. 26, 2021, 8:01 a.m. UTC | #23
Peter Xu <peterx@redhat.com> writes:

> On Wed, Aug 25, 2021 at 05:50:23PM -0400, Peter Xu wrote:
>> On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote:
>> > Having thought about this a bit more...
>> > 
>> > Constraints on realize order are nothing new.  For instance, when a
>> > device plugs into a bus, it needs to be realized after the device
>> > providing the bus.
>> > 
>> > We ensure this by having the device refer to the bus, e.g. bus=pci.0.
>> > The reference may be implicit, but it's there.  It must resolve for
>> > device creation to succeed, and if it resolves, the device providing the
>> > bus will be realized in time.
>> > 
>> > I believe what's getting us into trouble with IOMMU is not having such a
>> > reference.  Or in other words, keeping the dependence between the IOMMU
>> > and the devices relying on it *implicit*, and thus hidden from the
>> > existing realize-ordering machinery.
>
> [1]
>
>> > 
>> > Instead of inventing another such machinery, let's try to use the one we
>> > already have.
>> 
>> Hmm... I just found that we don't have such machinery, do we?
>> 
>> This does not really work:
>> 
>> $ ./qemu-system-x86_64 -M q35 -device virtio-net-pci,bus=pcie.1 \
>>                        -device pcie-root-port,id=pcie.1,bus=pcie.0
>> qemu-system-x86_64: -device virtio-net-pci,bus=pcie.1: Bus 'pcie.1' not found
>> 
>> While this will:
>> 
>> $ ./qemu-system-x86_64 -M q35 -device pcie-root-port,id=pcie.1,bus=pcie.0 \
>>                        -device virtio-net-pci,bus=pcie.1
>
> I think I fully agree at [1], the iommu is special in that it's not only
> implicit, but also does not have a default value.  Pci buses have default
> values (the root pci bus; e.g. pcie.0 on q35), so it's actually easier.

To be precise: when device property "bus" is absent, and the device
plugs into a bus, QEMU picks a suitable bus.  If it can't, device
creation fails.

It can't when no bus of the required type exists, or the existing buses
are all full.

Sometimes, the board provides a bus of the required type.  If not, you
have to plug one, and you have to do it before you refer to it, whether
the reference is explicit (bus=...) or implicit (bus absent).

Example 1: no bus of the required type

    $ qemu-system-x86_64 -S -display none -device usb-mouse
    qemu-system-x86_64: -device usb-mouse: No 'usb-bus' bus found for device 'usb-mouse'

Example 2: no bus of the required type, yet

    $ qemu-system-x86_64 -S -display none -device usb-mouse -device qemu-xhci
    qemu-system-x86_64: -device usb-mouse: No 'usb-bus' bus found for device 'usb-mouse'

Example 3: reordered to the bus is there in time

    $ qemu-system-x86_64 -S -display none -device qemu-xhci -device usb-mouse

Example 4: got a bus, but it's full

    $ qemu-system-x86_64 -S -display none -monitor stdio -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000 -device e1000,id=noslot
    QEMU 6.1.0 monitor - type 'help' for more information
    (qemu) qemu-system-x86_64: -device e1000,id=noslot: PCI: no slot/function available for e1000, all in use or reserved

> When parsing the "-device" entry for a pci device, we'll 100% sure what's the
> pci bus for the device, either specified or it's just the default.  We don't
> look at the rest of "-device"s to be sure of it.  We just try to look up the
> pci bus, if it's there we continue, otherwise we abort.

Yes.  It's a mandatory dependency, optionally explicit.

> But IOMMU is different, the device can run without a vIOMMU (in which case
> there's no dependency), but when vIOMMU is specified there's the dependency.
>
> That's why I am not sure whether the old machinery and "early failure" solution
> would work trivially for IOMMUs.

Yes.  It's an optional, implicit dependency.

> We can add an "-iommu" parameter, then we simply parse it before "-device".
> However I forgot why Marcel (IIRC) made it a "-device" parameter, also "-iommu"
> is not ideal in that the IOMMU unit is indeed implemented as a device for the
> archs I'm aware of, so it's kind of some extra glue that seems to just work
> around the ordering problem we have right now.  Meanwhile that solution won't
> help the ordering issue of pci bus/device.

The "device providing bus before device plugging into bus" dependency is
not actually a problem: -device and device_add ensure by design it's
satisfied.

As discussed before, there are two workable ways to process the command
line: strictly left to right (leave ordering to the user), and "do the
right thing" (order of options doesn't matter).

Of course, we do neither.  We kind of try to do the right thing, by
adding special cases whenever we get bitten.  Order doesn't matter,
except when it does, and things work, except when they don't.
Reordering your command line may or may not get it to work.

Fails the basic interface taste test: would explaining it in plain
English be impractical and/or embarrassing?

How to best get out of this self-dug hole isn't obvious.  Switching to
strictly left to right will break some command lines.  Making order
truly not matter looks hard, because the dependencies are complex and
not well understood.  But this isn't the problem at hand here.  It's
whether to add more magic, or do without.

I'm wary about piling more magic onto a heap of magic that's already
wobbling under its weight.

So, can we do without?

The issue, as I understand it, is that certain devices need to know at
realize time whether the machine has an IOMMU.  When you add an IOMMU
with -device, you morph the machine from one that doesn't have one into
one that does.  Therefore, it needs to be added before any of the
devices that need to know whether the machine has one.
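
The pattern I understand to be at issue is a realize method sampling the DMA
address space once, roughly like this (simplified, not taken from any real
device; only pci_device_iommu_address_space() is real):

    static void somedev_realize(PCIDevice *pdev, Error **errp)
    {
        SomeDevState *s = SOMEDEV(pdev);    /* made-up device */

        /* Returns the vIOMMU's address space if a vIOMMU is already wired
         * up for this device's bus, else plain system memory.  Whatever it
         * returns gets baked in here. */
        s->dma_as = pci_device_iommu_address_space(pdev);

        /* ... set up DMA against s->dma_as ... */
    }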

> That's why I still think the idea of having a global priority for device
> realization (or describe the dependency of devices) makes sense.

The proposed solution is to magically reorder -device so that IOMMU get
created before devices that need to know.

This solution falls apart when we use QMP device_add instead of -device,
because we can't reorder QMP commands.

Right now we can't use device_add for everything, but that's fixable.
Maybe it's been fixed already.

To keep the proposed solution working, we'd need even more magic.

>                                                                   We can
> actually fix both IOMMU and pci bus so we can allow pci bus to be put latter
> than the pci device that belongs to the bus alongside of fixing the IOMMU.

To "fix" the bus issue, we'd need to delay resolution of property bus
until realize time.  This might break .instance_init() methods.  Of
which we have several hundred.

> IMHO a list of global priority (maybe a realize_priority field in the class of
> devices) is just a simpler version of device dependencies.  Say, at least we
> don't need to worry about cycle-dependency issues.  So far the ordering
> requirement is still simple, so globally define them should fix most of the
> issues already and in a straightforward way, also less LOCs.  If it goes
> complicated one day, we can always switch the global priority list into device
> dependency easily, because that's not guest ABI.
>
> Any further thoughts will be greatly welcomed.

I'd like to propose a more modest solution: detect the problem and fail.

A simple state machine can track "has IOMMU" state.  It has three states
"no so far", "yes", and "no", and two events "add IOMMU" and "add device
that needs to know".  State diagram:

                          no so far
                   +--- (start state) ---+
                   |                     |
         add IOMMU |                     | add device that
                   |                     |  needs to know
                   v                     v
             +--> yes                    no <--+
             |     |   add device that   |     |
             +-----+    needs to know    +-----+

"Add IOMMU" in state "no" is an error.

"Add IOMMU" in state "yes" stays in state "yes" if multiple IOMMU make
sense.  Else it's an error.

The state machine could be owned by the machine type.
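
In code it's tiny; roughly (names made up, untested):

    #include <stdbool.h>

    typedef enum {
        IOMMU_NO_SO_FAR,    /* start state */
        IOMMU_YES,
        IOMMU_NO,
    } IommuTrack;

    static IommuTrack iommu_track = IOMMU_NO_SO_FAR;

    /* "add IOMMU" event; false means reject the IOMMU device */
    static bool track_add_iommu(void)
    {
        if (iommu_track == IOMMU_NO) {
            return false;           /* a device that needs to know came first */
        }
        iommu_track = IOMMU_YES;    /* assuming multiple IOMMUs make sense */
        return true;
    }

    /* "add device that needs to know" event */
    static void track_add_dependent(void)
    {
        if (iommu_track == IOMMU_NO_SO_FAR) {
            iommu_track = IOMMU_NO;
        }
        /* in YES and NO we stay put */
    }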
Igor Mammedov Aug. 26, 2021, 11:36 a.m. UTC | #24
On Thu, 26 Aug 2021 10:01:01 +0200
Markus Armbruster <armbru@redhat.com> wrote:

> Peter Xu <peterx@redhat.com> writes:
> 
> > On Wed, Aug 25, 2021 at 05:50:23PM -0400, Peter Xu wrote:  
> >> On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote:  
> >> > Having thought about this a bit more...
...
> > Any further thoughts will be greatly welcomed.  
> 
> I'd like to propose a more modest solution: detect the problem and fail.
That, or proper dependency tracking (which stands a chance of working with
QMP, but as was said, it's complex).

> A simple state machine can track "has IOMMU" state.  It has three states
> "no so far", "yes", and "no", and two events "add IOMMU" and "add device
> that needs to know".  State diagram:
> 
>                           no so far
>                    +--- (start state) ---+
>                    |                     |
>          add IOMMU |                     | add device that
>                    |                     |  needs to know
>                    v                     v
>              +--> yes                    no <--+
>              |     |   add device that   |     |
>              +-----+    needs to know    +-----+
> 
> "Add IOMMU" in state "no" is an error.

The question is how we distinguish a "device that needs to know" from a
device that doesn't need to know, and then the recently added 'bypass IOMMU'
feature adds more fun to this.
 
> "Add IOMMU" in state "yes" stays in state "yes" if multiple IOMMU make
> sense.  Else it's an error.
> 
> The state machine could be owned by the machine type.
> 
>
Peter Xu Aug. 26, 2021, 1:43 p.m. UTC | #25
On Thu, Aug 26, 2021 at 01:36:29PM +0200, Igor Mammedov wrote:
> On Thu, 26 Aug 2021 10:01:01 +0200
> Markus Armbruster <armbru@redhat.com> wrote:
> 
> > Peter Xu <peterx@redhat.com> writes:
> > 
> > > On Wed, Aug 25, 2021 at 05:50:23PM -0400, Peter Xu wrote:  
> > >> On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote:  
> > >> > Having thought about this a bit more...
> ...
> > > Any further thoughts will be greatly welcomed.  
> > 
> > I'd like to propose a more modest solution: detect the problem and fail.
> That's or proper dependency tracking (which stands chance to work with QMP,
> but like it was said it's complex)
> 
> > A simple state machine can track "has IOMMU" state.  It has three states
> > "no so far", "yes", and "no", and two events "add IOMMU" and "add device
> > that needs to know".  State diagram:
> > 
> >                           no so far
> >                    +--- (start state) ---+
> >                    |                     |
> >          add IOMMU |                     | add device that
> >                    |                     |  needs to know
> >                    v                     v
> >              +--> yes                    no <--+
> >              |     |   add device that   |     |
> >              +-----+    needs to know    +-----+
> > 
> > "Add IOMMU" in state "no" is an error.
> 
> question is how we distinguish "device that needs to know"
> from device that doesn't need to know, and then recently
> added feature 'bypass IOMMU' adds more fun to this.

Maybe we can start from "no device needs to know"? Then add more into it when
the list expands.

vfio-pci should be a natural fit because we're sure it won't break any valid
old configuration.  However, we may need to be careful when adding more
devices, e.g. some old configuration might have worked on old binaries - but
I'm not sure.

Off-topic: I'm wondering whether bypass_iommu is just a work-around for not
using iommu=pt on the guest cmdline?  Is there any performance comparison of
bypass_iommu against having an IOMMU but with iommu=pt?  At least the Intel
IOMMU handles pt explicitly, i.e. it shouldn't even need a shadow IOMMU page
table in the guest but only a single bit in the device context entry saying
"this device wants to pass through the IOMMU", so I would expect the
performance to be similar to explicitly bypassing the IOMMU on the QEMU
cmdline.
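
To be concrete, this is the comparison I have in mind (the machine property
name is from memory, so it may be off):

    # bypass on the QEMU side: the default root bus is not translated at all
    $ qemu-system-x86_64 -M q35,default_bus_bypass_iommu=on -device intel-iommu ...

    # vs. translation present, but passed through inside the guest
    $ qemu-system-x86_64 -M q35 -device intel-iommu ...
      (guest kernel cmdline: intel_iommu=on iommu=pt)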

Thanks,
Peter Xu Aug. 30, 2021, 7:02 p.m. UTC | #26
On Thu, Aug 26, 2021 at 09:43:59AM -0400, Peter Xu wrote:
> > > A simple state machine can track "has IOMMU" state.  It has three states
> > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device
> > > that needs to know".  State diagram:
> > > 
> > >                           no so far
> > >                    +--- (start state) ---+
> > >                    |                     |
> > >          add IOMMU |                     | add device that
> > >                    |                     |  needs to know
> > >                    v                     v
> > >              +--> yes                    no <--+
> > >              |     |   add device that   |     |
> > >              +-----+    needs to know    +-----+
> > > 
> > > "Add IOMMU" in state "no" is an error.
> > 
> > question is how we distinguish "device that needs to know"
> > from device that doesn't need to know, and then recently
> > added feature 'bypass IOMMU' adds more fun to this.
> 
> Maybe we can start from "no device needs to know"? Then add more into it when
> the list expands.
> 
> vfio-pci should be a natural fit because we're sure it won't break any valid
> old configurations.  However we may need to be careful on adding more devices,
> e.g. when some old configuration might work on old binaries, but I'm not sure.

Btw, I think this state machine does bring some complexity even in
understanding it - it definitely works, but it's not obvious to anyone at
first glance, and it only solves the problem for the vIOMMU.  E.g., do we need
yet another state machine for some other ordering constraint?

From that POV, I don't like this solution more than the simple "assign a
priority for device realization" idea.
Markus Armbruster Aug. 31, 2021, 11:35 a.m. UTC | #27
Peter Xu <peterx@redhat.com> writes:

> On Thu, Aug 26, 2021 at 09:43:59AM -0400, Peter Xu wrote:
>> > > A simple state machine can track "has IOMMU" state.  It has three states
>> > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device
>> > > that needs to know".  State diagram:
>> > > 
>> > >                           no so far
>> > >                    +--- (start state) ---+
>> > >                    |                     |
>> > >          add IOMMU |                     | add device that
>> > >                    |                     |  needs to know
>> > >                    v                     v
>> > >              +--> yes                    no <--+
>> > >              |     |   add device that   |     |
>> > >              +-----+    needs to know    +-----+
>> > > 
>> > > "Add IOMMU" in state "no" is an error.
>> > 
>> > question is how we distinguish "device that needs to know"
>> > from device that doesn't need to know, and then recently
>> > added feature 'bypass IOMMU' adds more fun to this.
>> 
>> Maybe we can start from "no device needs to know"? Then add more into it when
>> the list expands.
>> 
>> vfio-pci should be a natural fit because we're sure it won't break any valid
>> old configurations.  However we may need to be careful on adding more devices,
>> e.g. when some old configuration might work on old binaries, but I'm not sure.
>
> Btw, I think this state machine is indeed bringing some complexity on even
> understanding it - it is definitely working but it's not obvious to anyone at
> the first glance, and it's only solving problem for vIOMMU.  E.g., do we need
> yet another state machine for some other ordering constraints?
>
> From that POV, I don't like this solution more than the simple "assign priority
> for device realization" idea..

I wouldn't worry about other ordering constraints until we have them.
If you do, please tell!

I'm hoping you can't, because such implicit constraints are commonly
signs of oversimplified / screwed up machine modeling.
Igor Mammedov Sept. 2, 2021, 7:46 a.m. UTC | #28
On Thu, 26 Aug 2021 09:43:59 -0400
Peter Xu <peterx@redhat.com> wrote:

> On Thu, Aug 26, 2021 at 01:36:29PM +0200, Igor Mammedov wrote:
> > On Thu, 26 Aug 2021 10:01:01 +0200
> > Markus Armbruster <armbru@redhat.com> wrote:
> >   
> > > Peter Xu <peterx@redhat.com> writes:
> > >   
> > > > On Wed, Aug 25, 2021 at 05:50:23PM -0400, Peter Xu wrote:    
> > > >> On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote:    
> > > >> > Having thought about this a bit more...  
> > ...  
> > > > Any further thoughts will be greatly welcomed.    
> > > 
> > > I'd like to propose a more modest solution: detect the problem and fail.  
> > That's or proper dependency tracking (which stands chance to work with QMP,
> > but like it was said it's complex)
> >   
> > > A simple state machine can track "has IOMMU" state.  It has three states
> > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device
> > > that needs to know".  State diagram:
> > > 
> > >                           no so far
> > >                    +--- (start state) ---+
> > >                    |                     |
> > >          add IOMMU |                     | add device that
> > >                    |                     |  needs to know
> > >                    v                     v
> > >              +--> yes                    no <--+
> > >              |     |   add device that   |     |
> > >              +-----+    needs to know    +-----+
> > > 
> > > "Add IOMMU" in state "no" is an error.  
> > 
> > question is how we distinguish "device that needs to know"
> > from device that doesn't need to know, and then recently
> > added feature 'bypass IOMMU' adds more fun to this.  
> 
> Maybe we can start from "no device needs to know"? Then add more into it when
> the list expands.
> 
> vfio-pci should be a natural fit because we're sure it won't break any valid
> old configurations.  However we may need to be careful on adding more devices,
> e.g. when some old configuration might work on old binaries, but I'm not sure.

 
> Off-topic: I'm wondering whether bypass_iommu is just a work-around of not
> using iommu=pt in the guest cmdlines?  Is there any performance comparison of
> using bypass_iommu against having iommu but with iommu=pt?  At least intel
> iommu has pt enabled explicitly, i.e. it shouldn't even need a shadow iommu
> pgtable in the guest but only a single bit in device context entry showing that
> "this device wants to pass-through iommu", so I would expect the perf can be
> similar to explicitly bypass iommu in the qemu cmdline.
They wanted to have a mix of IOMMU and non-IOMMU devices in a VM (the last
merged revision was: [PATCH v5 0/9] IOMMU: Add support for IOMMU Bypass
Feature), but the 'why' reasoning was lost somewhere; CCing the author.


> 
> Thanks,
>
Igor Mammedov Sept. 2, 2021, 8:26 a.m. UTC | #29
On Mon, 30 Aug 2021 15:02:53 -0400
Peter Xu <peterx@redhat.com> wrote:

> On Thu, Aug 26, 2021 at 09:43:59AM -0400, Peter Xu wrote:
> > > > A simple state machine can track "has IOMMU" state.  It has three states
> > > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device
> > > > that needs to know".  State diagram:
> > > > 
> > > >                           no so far
> > > >                    +--- (start state) ---+
> > > >                    |                     |
> > > >          add IOMMU |                     | add device that
> > > >                    |                     |  needs to know
> > > >                    v                     v
> > > >              +--> yes                    no <--+
> > > >              |     |   add device that   |     |
> > > >              +-----+    needs to know    +-----+
> > > > 
> > > > "Add IOMMU" in state "no" is an error.  
> > > 
> > > question is how we distinguish "device that needs to know"
> > > from device that doesn't need to know, and then recently
> > > added feature 'bypass IOMMU' adds more fun to this.  
> > 
> > Maybe we can start from "no device needs to know"? Then add more into it when
> > the list expands.
> > 
> > vfio-pci should be a natural fit because we're sure it won't break any valid
> > old configurations.  However we may need to be careful on adding more devices,
> > e.g. when some old configuration might work on old binaries, but I'm not sure.  
> 
> Btw, I think this state machine is indeed bringing some complexity on even
> understanding it - it is definitely working but it's not obvious to anyone at
> the first glance, and it's only solving problem for vIOMMU.  E.g., do we need
> yet another state machine for some other ordering constraints?
>
> From that POV, I don't like this solution more than the simple "assign priority
> for device realization" idea..
It seems simple, but implicit dependencies are fragile (a good - or I'd
rather say bad - example of implicit dependencies is the vl.c magic code,
where changing the order of initialization can easily break QEMU, which
happens almost every time it's refactored), and as Markus already mentioned it
won't work in the QMP case.

What could work for both cases is explicit dependencies; however, it would be
hard to describe such a dependency in this case, where a device can work both
with and without an IOMMU depending on the settings of the bus it's attached
to.

Have you considered another approach, i.e. instead of reordering, changing
the vfio-pci device model to reconfigure its DMA address space after realize
time (e.g. at reset time)?
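
Roughly this kind of thing (hand-waving, not meant literally; only
pci_device_iommu_address_space() is a real function):

    static void somedev_reset(DeviceState *dev)
    {
        SomeDevState *s = SOMEDEV(dev);     /* made-up device */
        AddressSpace *as = pci_device_iommu_address_space(PCI_DEVICE(dev));

        /* By reset time any vIOMMU given on the command line has been
         * realized, so re-query the DMA address space instead of keeping
         * the one sampled at realize time. */
        if (as != s->dma_as) {
            s->dma_as = as;
            /* re-register mappings/listeners against the new space */
        }
    }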
Peter Xu Sept. 2, 2021, 1:45 p.m. UTC | #30
On Thu, Sep 02, 2021 at 10:26:16AM +0200, Igor Mammedov wrote:
> On Mon, 30 Aug 2021 15:02:53 -0400
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Thu, Aug 26, 2021 at 09:43:59AM -0400, Peter Xu wrote:
> > > > > A simple state machine can track "has IOMMU" state.  It has three states
> > > > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device
> > > > > that needs to know".  State diagram:
> > > > > 
> > > > >                           no so far
> > > > >                    +--- (start state) ---+
> > > > >                    |                     |
> > > > >          add IOMMU |                     | add device that
> > > > >                    |                     |  needs to know
> > > > >                    v                     v
> > > > >              +--> yes                    no <--+
> > > > >              |     |   add device that   |     |
> > > > >              +-----+    needs to know    +-----+
> > > > > 
> > > > > "Add IOMMU" in state "no" is an error.  
> > > > 
> > > > question is how we distinguish "device that needs to know"
> > > > from device that doesn't need to know, and then recently
> > > > added feature 'bypass IOMMU' adds more fun to this.  
> > > 
> > > Maybe we can start from "no device needs to know"? Then add more into it when
> > > the list expands.
> > > 
> > > vfio-pci should be a natural fit because we're sure it won't break any valid
> > > old configurations.  However we may need to be careful on adding more devices,
> > > e.g. when some old configuration might work on old binaries, but I'm not sure.  
> > 
> > Btw, I think this state machine is indeed bringing some complexity on even
> > understanding it - it is definitely working but it's not obvious to anyone at
> > the first glance, and it's only solving problem for vIOMMU.  E.g., do we need
> > yet another state machine for some other ordering constraints?
> >
> > From that POV, I don't like this solution more than the simple "assign priority
> > for device realization" idea..
> It seems simple but implicit dependencies are fragile (good or
> I'd rather say bad example implicit dependencies is vl.c magic code,
> where changing order of initialization can easily break QEMU,
> which happens almost every time it's refactored),

True.  I was never able to tell whether that comes from the natural
complexity of this piece of software or whether there are indeed smarter
moves.

> and as Markus already mentioned it won't work in QMP case.

I didn't ask before, but I do have a question on whether that's a real problem.

When the QMP interface is ready, we should have a final "preconfig done"
command, right?  I think it's "x-exit-preconfig" now, but I'm not sure, so I'm
guessing.  If so, could we do the reordering there and postpone all device
creation until then?  Then the same ordering could be applied there too.
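
Something like the flow below is what I'm imagining (device_add during
preconfig is hypothetical - AFAIK it's not allowed there today):

    $ qemu-system-x86_64 --preconfig -qmp stdio ...
    -> { "execute": "qmp_capabilities" }
    <- { "return": {} }
       ... queue up the device_add requests here (hypothetical) ...
    -> { "execute": "x-exit-preconfig" }
    <- { "return": {} }
       (reorder and realize everything at this point)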

The other thing is that I think only libvirt will use the QMP path even when
it's ready, and libvirt already needs to know the ordering of things like
vIOMMU, vfio and PCI buses - even if a new QEMU supported reordering all of
these, libvirt would still need to support old QEMU binaries (and the code is
already there anyway), so it's fairly natural for it to issue the QMP commands
in the same order.

IMHO that means ordering for QMP is not as important as for the cmdline, if
QMP is always driven by other software.

> 
> What could work for both cases is explicit dependencies,
> however it would be hard to describe such dependency in this case,
> where device can work both with and without IOMMU depending
> on the bus settings it's attached to.
> 
> Have you considered another approach, i.e. instead of reordering,
> change vfio-pci device model to reconfigure DMA address space
> after realize time (ex: at reset time)?

Yes, I agree this seems to be the cleanest approach.  It may just need more
changes, and they'll be in vfio; I'm not aware of what the rest would need
(Jason should know better for virtio/vdpa).

The other thing is that some dependencies cannot be resolved by this, e.g.
the PCI bus issue where the PCI bus is specified after a device that plugs
into it.  But indeed we already early-fail on that now, so it seems ok.

Side note: my gut feeling is still that at some point we shouldn't do the
early failure for the PCI bus/device case - the cmdline obviously contains
enough information about the dependency "this PCI device needs that PCI bus",
and early-fail does seem like a work-around to me, just because the options
were specified in the wrong order.

Thanks,
Daniel P. Berrangé Sept. 2, 2021, 1:53 p.m. UTC | #31
On Thu, Sep 02, 2021 at 09:45:43AM -0400, Peter Xu wrote:
> On Thu, Sep 02, 2021 at 10:26:16AM +0200, Igor Mammedov wrote:
> > On Mon, 30 Aug 2021 15:02:53 -0400
> > Peter Xu <peterx@redhat.com> wrote:
> > 
> > > On Thu, Aug 26, 2021 at 09:43:59AM -0400, Peter Xu wrote:
> > > > > > A simple state machine can track "has IOMMU" state.  It has three states
> > > > > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device
> > > > > > that needs to know".  State diagram:
> > > > > > 
> > > > > >                           no so far
> > > > > >                    +--- (start state) ---+
> > > > > >                    |                     |
> > > > > >          add IOMMU |                     | add device that
> > > > > >                    |                     |  needs to know
> > > > > >                    v                     v
> > > > > >              +--> yes                    no <--+
> > > > > >              |     |   add device that   |     |
> > > > > >              +-----+    needs to know    +-----+
> > > > > > 
> > > > > > "Add IOMMU" in state "no" is an error.  
> > > > > 
> > > > > question is how we distinguish "device that needs to know"
> > > > > from device that doesn't need to know, and then recently
> > > > > added feature 'bypass IOMMU' adds more fun to this.  
> > > > 
> > > > Maybe we can start from "no device needs to know"? Then add more into it when
> > > > the list expands.
> > > > 
> > > > vfio-pci should be a natural fit because we're sure it won't break any valid
> > > > old configurations.  However we may need to be careful on adding more devices,
> > > > e.g. when some old configuration might work on old binaries, but I'm not sure.  
> > > 
> > > Btw, I think this state machine is indeed bringing some complexity on even
> > > understanding it - it is definitely working but it's not obvious to anyone at
> > > the first glance, and it's only solving problem for vIOMMU.  E.g., do we need
> > > yet another state machine for some other ordering constraints?
> > >
> > > From that POV, I don't like this solution more than the simple "assign priority
> > > for device realization" idea..
> > It seems simple but implicit dependencies are fragile (good or
> > I'd rather say bad example implicit dependencies is vl.c magic code,
> > where changing order of initialization can easily break QEMU,
> > which happens almost every time it's refactored),
> 
> True.  I was never able of telling whether that comes from natural complexity
> of the piece of software or there's indeed some smarter moves.
> 
> > and as Markus already mentioned it won't work in QMP case.
> 
> I didn't ask before, but I do have a question on whether that's a real problem.
> 
> When QMP interface is ready, we should have a last phase of "preconfig done"
> command, right?  I thought it was "x-exit-preconfig" now, but I'm not sure so
> am guessing.  If so, can we do the reorder there and postpone all device
> creations, maybe?  Then the ordering properties can be applied there too.
> 
> The other thing is I think only libvirt will use the QMP case even if it'll be
> ready, and libvirt will need to know the ordering already like vIOMMU and vfio
> and pci buses - even if a new qemu supported the ordering of all these, libvirt
> still need to support old qemu binaries (and the code is already there anyway)
> and it's fairly natural they initiate the QMP commands using the same ordering.
> 
> IMHO it means ordering for QMP is not as important as cmdline if that's always
> initiated by another software.

The correct ordering of devices/backends is generally pretty obvious
for libvirt to determine. Most of the problems we've had related to
ordering are on the QEMU side, because the ARGV given to QEMU made
correct sense if parsed left-to-right, but QEMU didn't actually process
them in that order. We've patched QEMU to hack around its inability to
honour the CLI order repeatedly.

Being completely self-ordering on the QEMU side using a topological
sort would be neat from a conceptual purity POV, but that is quite a
challenge to implement and I'm not convinced it is worth it, compared
to other problems we want to spend time on.

If we don't want to keep hacking up the current QEMU system binaries'
parsing, then we either need new binaries with sane ordering, or we need to
willingly break compat in some staged manner.

Regards,
Daniel
Peter Xu Sept. 2, 2021, 2:21 p.m. UTC | #32
Hi, Dan,

On Thu, Sep 02, 2021 at 02:53:00PM +0100, Daniel P. Berrangé wrote:
> The correct ordering of devices/backends is generally pretty obvious
> for libvirt to determine. Most of the problems we've had related to
> ordering are on the QEMU side, because the ARGV given to QEMU made
> correct sense if parsed left-to-right, but QEMU didn't actually process
> them in that order. We've patched QEMU to hack around its inability to
> honour the CLI order repeatedly.

Is there a pointer to the problem?

> 
> Being completely self-ordering on the QEMU side using a topological
> sort would be neat from a conceptual purity POV, but that is quite a
> challenge to implement and I'm not convinced it is worth it, compared
> to other problems we want to spend time on.

I just noticed there can also be dependencies between buses; those indeed
cannot be fixed by ordering classes, whether as proposed in this series or by
introducing a new priority.
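
E.g. two bridges of the same class chained together - no per-class priority
can order these two, only the explicit bus= references can (untested):

    $ qemu-system-x86_64 -device pci-bridge,id=br1,chassis_nr=1 \
                         -device pci-bridge,id=br2,chassis_nr=2,bus=br1 \
                         -device e1000,bus=br2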

Thanks,
Markus Armbruster Sept. 2, 2021, 2:57 p.m. UTC | #33
Peter Xu <peterx@redhat.com> writes:

> Hi, Dan,
>
> On Thu, Sep 02, 2021 at 02:53:00PM +0100, Daniel P. Berrangé wrote:
>> The correct ordering of devices/backends is generally pretty obvious
>> for libvirt to determine. Most of the problems we've had related to
>> ordering are on the QEMU side, because the ARGV given to QEMU made
>> correct sense if parsed left-to-right, but QEMU didn't actually process
>> them in that order. We've patched QEMU to hack around its inability to
>> honour the CLI order repeatedly.
>
> Is there a pointer to the problem?

Just an example:

9ea18ed25a "vl: Fix -drive / -blockdev persistent reservation management
cda4aa9a5a "vl: Create block backends before setting machine properties"

>> Being completely self-ordering on the QEMU side using a topological
>> sort would be neat from a conceptual purity POV, but that is quite a
>> challenge to implement and I'm not convinced it is worth it, compared
>> to other problems we want to spend time on.
>
> I just noticed there can also be dependency between the buses; that cannot be
> fixed by ordering of classes indeed as either proposed in this series, or
> introduce a new priority.

--verbose?
Daniel P. Berrangé Sept. 2, 2021, 3:06 p.m. UTC | #34
On Thu, Sep 02, 2021 at 10:21:19AM -0400, Peter Xu wrote:
> Hi, Dan,
> 
> On Thu, Sep 02, 2021 at 02:53:00PM +0100, Daniel P. Berrangé wrote:
> > The correct ordering of devices/backends is generally pretty obvious
> > for libvirt to determine. Most of the problems we've had related to
> > ordering are on the QEMU side, because the ARGV given to QEMU made
> > correct sense if parsed left-to-right, but QEMU didn't actually process
> > them in that order. We've patched QEMU to hack around its inability to
> > honour the CLI order repeatedly.
> 
> Is there a pointer to the problem?

The biggest problem was with -object, where some objects need to be created
before chardevs and some after chardevs.  See the object_create_early() method
in softmmu/vl.c for the gross hardcoded hacks.


Regards,
Daniel
Markus Armbruster Sept. 2, 2021, 3:26 p.m. UTC | #35
Peter Xu <peterx@redhat.com> writes:

> On Thu, Sep 02, 2021 at 10:26:16AM +0200, Igor Mammedov wrote:
>> On Mon, 30 Aug 2021 15:02:53 -0400
>> Peter Xu <peterx@redhat.com> wrote:
>> 
>> > On Thu, Aug 26, 2021 at 09:43:59AM -0400, Peter Xu wrote:
>> > > > > A simple state machine can track "has IOMMU" state.  It has three states
>> > > > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device
>> > > > > that needs to know".  State diagram:
>> > > > > 
>> > > > >                           no so far
>> > > > >                    +--- (start state) ---+
>> > > > >                    |                     |
>> > > > >          add IOMMU |                     | add device that
>> > > > >                    |                     |  needs to know
>> > > > >                    v                     v
>> > > > >              +--> yes                    no <--+
>> > > > >              |     |   add device that   |     |
>> > > > >              +-----+    needs to know    +-----+
>> > > > > 
>> > > > > "Add IOMMU" in state "no" is an error.  
>> > > > 
>> > > > question is how we distinguish "device that needs to know"
>> > > > from device that doesn't need to know, and then recently
>> > > > added feature 'bypass IOMMU' adds more fun to this.  
>> > > 
>> > > Maybe we can start from "no device needs to know"? Then add more into it when
>> > > the list expands.
>> > > 
>> > > vfio-pci should be a natural fit because we're sure it won't break any valid
>> > > old configurations.  However we may need to be careful on adding more devices,
>> > > e.g. when some old configuration might work on old binaries, but I'm not sure.  
>> > 
>> > Btw, I think this state machine is indeed bringing some complexity on even
>> > understanding it - it is definitely working but it's not obvious to anyone at
>> > the first glance, and it's only solving problem for vIOMMU.  E.g., do we need
>> > yet another state machine for some other ordering constraints?
>> >
>> > From that POV, I don't like this solution more than the simple "assign priority
>> > for device realization" idea..
>> It seems simple but implicit dependencies are fragile (good or
>> I'd rather say bad example implicit dependencies is vl.c magic code,
>> where changing order of initialization can easily break QEMU,
>> which happens almost every time it's refactored),
>
> True.  I was never able of telling whether that comes from natural complexity
> of the piece of software or there's indeed some smarter moves.
>
>> and as Markus already mentioned it won't work in QMP case.
>
> I didn't ask before, but I do have a question on whether that's a real problem.
>
> When QMP interface is ready, we should have a last phase of "preconfig done"
> command, right?  I thought it was "x-exit-preconfig" now, but I'm not sure so
> am guessing.  If so, can we do the reorder there and postpone all device
> creations, maybe?  Then the ordering properties can be applied there too.

This would regress error reporting.

A significant advantage of doing configuration in basic steps is the
relative ease of pinpointing what exactly fails.

If you change device_add to merely record the request for later, you
delay the non-trivial failures until later.

Compare:

    -> {"execute": "device_add", "arguments": {"driver": "virtio-blk", "drive": "foo"}}
    <- {"error": {"class": "GenericError", "desc": "Device needs media, but drive is empty"}}

to getting the same error in reply to x-exit-preconfig, with a dozen or
two device_add queued up.

The error comes from virtio_blk_device_realize(), which you propose to
delay until x-exit-preconfig.

In theory, we can rework all the errors in question to provide
sufficient context, and rework libvirt to pick the context apart, so it
can pinpoint what's wrong in the user's configuration.  In practice,
thanks, but no thanks.

> The other thing is I think only libvirt will use the QMP case even if it'll be
> ready, and libvirt will need to know the ordering already like vIOMMU and vfio
> and pci buses - even if a new qemu supported the ordering of all these, libvirt
> still need to support old qemu binaries (and the code is already there anyway)
> and it's fairly natural they initiate the QMP commands using the same ordering.

This is actually a pretty strong argument for not having QEMU reorder at
all.

> IMHO it means ordering for QMP is not as important as cmdline if that's always
> initiated by another software.
>
>> 
>> What could work for both cases is explicit dependencies,
>> however it would be hard to describe such dependency in this case,
>> where device can work both with and without IOMMU depending
>> on the bus settings it's attached to.
>> 
>> Have you considered another approach, i.e. instead of reordering,
>> change vfio-pci device model to reconfigure DMA address space
>> after realize time (ex: at reset time)?
>
> Yes, I agree this seems to be the cleanest appproach.  It may just need more
> changes and they'll be on vfio, and I'm not aware of the rest (Jason should
> know better on virtio/vdpa).
>
> The other thing is some dependency cannot be resolved by this, e.g., the pci
> bus issue where we put the pci bus to be after the device that it owns.  But
> indeed we're already early-fail that now so it seems ok.

Yes, that's a complete non-problem :)

> Side note: from my gut feeling I still think at some point we shouldn't do the
> early-failure for cases in the case of pci bus and device - the cmdline
> obviously contains enough information on the dependency on "this pci device
> needs that pci bus", and early-fail does seem to be an work-around to me just
> because they specified in the wrong order.

Again, there are two sane command line evaluation orders: (1) strictly
left to right, and (2) order doesn't matter.  QEMU of course does (3)
unpredictable for humans without referring back to the source code.

The difficulty with getting to (1) is mostly compatibility and putting
in the work.  Reporting when things don't get created in time nicely and
reliably is yet more work.  Not particularly challenging, just work.

Getting to a reliable version of (2) feels like a pipe dream to me.
Igor Mammedov Sept. 3, 2021, 1 p.m. UTC | #36
On Thu, 2 Sep 2021 09:45:43 -0400
Peter Xu <peterx@redhat.com> wrote:

> On Thu, Sep 02, 2021 at 10:26:16AM +0200, Igor Mammedov wrote:
> > On Mon, 30 Aug 2021 15:02:53 -0400
> > Peter Xu <peterx@redhat.com> wrote:
> >   
> > > On Thu, Aug 26, 2021 at 09:43:59AM -0400, Peter Xu wrote:  
> > > > > > A simple state machine can track "has IOMMU" state.  It has three states
> > > > > > "no so far", "yes", and "no", and two events "add IOMMU" and "add device
> > > > > > that needs to know".  State diagram:
> > > > > > 
> > > > > >                           no so far
> > > > > >                    +--- (start state) ---+
> > > > > >                    |                     |
> > > > > >          add IOMMU |                     | add device that
> > > > > >                    |                     |  needs to know
> > > > > >                    v                     v
> > > > > >              +--> yes                    no <--+
> > > > > >              |     |   add device that   |     |
> > > > > >              +-----+    needs to know    +-----+
> > > > > > 
> > > > > > "Add IOMMU" in state "no" is an error.    
> > > > > 
> > > > > The question is how we distinguish a "device that needs to know"
> > > > > from a device that doesn't need to know, and the recently
> > > > > added 'bypass IOMMU' feature adds more fun to this.
> > > > 
> > > > Maybe we can start from "no device needs to know"? Then add more into it when
> > > > the list expands.
> > > > 
> > > > vfio-pci should be a natural fit because we're sure it won't break any valid
> > > > old configurations.  However we may need to be careful when adding more devices,
> > > > e.g. where some old configuration might work on old binaries, but I'm not sure.
> > > 
> > > Btw, I think this state machine indeed brings some complexity even in
> > > understanding it - it definitely works but it's not obvious to anyone at
> > > first glance, and it only solves the problem for vIOMMU.  E.g., do we need
> > > yet another state machine for some other ordering constraints?
> > >
> > > From that POV, I don't like this solution more than the simple "assign priority
> > > for device realization" idea..
> > It seems simple, but implicit dependencies are fragile (a good, or
> > I'd rather say bad, example of implicit dependencies is the vl.c magic code,
> > where changing the order of initialization can easily break QEMU,
> > which happens almost every time it's refactored),
> 
> True.  I was never able to tell whether that comes from the natural complexity
> of the piece of software or whether there are indeed some smarter moves.
> 
> > and as Markus already mentioned, it won't work in the QMP case.
> 
> I didn't ask before, but I do have a question on whether that's a real problem.
> 
> When QMP interface is ready, we should have a last phase of "preconfig done"
> command, right?  I thought it was "x-exit-preconfig" now, but I'm not sure so
> am guessing.  If so, can we do the reorder there and postpone all device
> creations, maybe?  Then the ordering properties can be applied there too.

I don't think "x-exit-preconfig" would help here, ideally devices would be
created before this stage, interactively via QMP.
So whoever is calling device_add, would need to know the proper ordering
or get error as result.

And even if we had explicit dependencies, they would be easier to use
with the current error-out policy: if something is out of the expected order,
tell the user to fix the CLI/QMP command order (and replace the ad-hoc checks
used currently).


> The other thing is I think only libvirt will use the QMP case even if it'll be
> ready, and libvirt will need to know the ordering already, like vIOMMU vs. vfio
> and pci buses - even if a new qemu supported the ordering of all these, libvirt
> still needs to support old qemu binaries (and the code is already there anyway),
> and it's fairly natural that it initiates the QMP commands using the same ordering.
> 
> IMHO it means ordering for QMP is not as important as for the cmdline if that's
> always initiated by other software.

PS:
If I'm not wrong, we are slowly working towards composing the machine
from QMP, with the CLI ideally becoming a shim on top of that. Hence QMP is
mentioned in this thread, and the desire to avoid more magic.


> > What could work for both cases is explicit dependencies,
> > however it would be hard to describe such dependency in this case,
> > where device can work both with and without IOMMU depending
> > on the bus settings it's attached to.
> > 
> > Have you considered another approach, i.e. instead of reordering,
> > change vfio-pci device model to reconfigure DMA address space
> > after realize time (ex: at reset time)?  
> 
> Yes, I agree this seems to be the cleanest approach.  It may just need more
> changes and they'll be on vfio, and I'm not aware of the rest (Jason should
> know better on virtio/vdpa).

It's also a bit closer "to the real world", where the IOMMU dependency is
configured by firmware/OS. In addition it doesn't require any machine-specific
code: a device checks if its parent bus is managed by an IOMMU and is
configured accordingly.
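
A minimal sketch of what that reset-time approach could look like in a PCI
device model (only pci_device_iommu_address_space() is an existing API here;
the MyDevState type, the MYDEV() cast and the DMA setup/teardown helpers are
made up for illustration, and real vfio-pci would have to rework its
container/listener setup instead):

    #include "qemu/osdep.h"
    #include "hw/pci/pci.h"

    /* Hypothetical device state; the real device keeps more than this. */
    typedef struct MyDevState {
        PCIDevice parent_obj;
        AddressSpace *dma_as;
    } MyDevState;

    static void mydev_reset(DeviceState *dev)
    {
        MyDevState *s = MYDEV(dev);            /* hypothetical QOM cast */
        PCIDevice *pdev = PCI_DEVICE(dev);
        /*
         * By reset time the vIOMMU (if any) has been realized, so this
         * returns the IOMMU-managed address space instead of the plain
         * address_space_memory that an early realize would have seen.
         */
        AddressSpace *as = pci_device_iommu_address_space(pdev);

        if (as != s->dma_as) {
            mydev_teardown_dma(s);             /* hypothetical helper */
            s->dma_as = as;
            mydev_setup_dma(s, s->dma_as);     /* hypothetical helper */
        }
    }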
 
> The other thing is some dependencies cannot be resolved by this, e.g., the pci
> bus issue where the pci bus is put after the device that it owns.  But
> indeed we already early-fail that now, so it seems ok.
> 
> Side note: from my gut feeling I still think at some point we shouldn't do the
> early-failure in the case of pci bus and device - the cmdline
> obviously contains enough information on the dependency "this pci device
> needs that pci bus", and early-fail does seem to be a work-around to me just
> because they were specified in the wrong order.
That's how the QEMU CLI works most of the time (i.e. left-to-right order of
initialization): it errors out on the offending CLI option,
and tells the user to fix it instead of trying to fix up the CLI order.
So I think the current pci_bus behavior does align with that.

PS:
Another, albeit machine-dependent approach to resolve the IOMMU ordering problem
can be adding IOMMU handling to a machine-specific pre_plug hook,
which is called at IOMMU realize time and checks if existing buses
without bypass enabled (iommu managed) have any children. And if they
have devices attached, error out telling the user to reorder '-device iommu'
before the affected devices/buses.
It should cover the mixed IOMMU+bypass case and doesn't require fixing
the vfio-pci address space initialization nor defining any priorities
for PCI devices.
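
Roughly, the check itself could look like the sketch below, whether it runs
from such a pre_plug hook or from the vIOMMU's realize function.
pci_for_each_device(), pci_bus_num() and error_setg() are existing APIs;
pci_bus_bypass_iommu() should exist since the bypass_iommu feature, and the
wrapper function itself is made up (a full version would also recurse into
subordinate buses):

    #include "qemu/osdep.h"
    #include "hw/pci/pci.h"
    #include "qapi/error.h"

    /* pci_for_each_device() callback: remember that the bus has a device. */
    static void iommu_check_dev(PCIBus *bus, PCIDevice *d, void *opaque)
    {
        bool *found = opaque;
        *found = true;
    }

    /* Refuse the vIOMMU if a non-bypass bus already has devices on it. */
    static bool iommu_check_no_devices(PCIBus *root, Error **errp)
    {
        bool found = false;

        if (!pci_bus_bypass_iommu(root)) {
            pci_for_each_device(root, pci_bus_num(root),
                                iommu_check_dev, &found);
        }
        if (found) {
            error_setg(errp, "the IOMMU must be specified before other PCI "
                       "devices; please reorder the '-device' options");
            return false;
        }
        return true;
    }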

(but I think it's more of a hack compared to the earlier suggested
address space initialization at reset time, and it would need to be
done for every affected machine)

> Thanks,
>
Peter Xu Sept. 3, 2021, 3:48 p.m. UTC | #37
On Thu, Sep 02, 2021 at 04:57:04PM +0200, Markus Armbruster wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > Hi, Dan,
> >
> > On Thu, Sep 02, 2021 at 02:53:00PM +0100, Daniel P. Berrangé wrote:
> >> The correct ordering of devices/backends is generally pretty obvious
> >> for libvirt to determine. Most of the problems we've had related to
> >> ordering are on the QEMU side, because the ARGV given to QEMU made
> >> correct sense if parsed left-to-right, but QEMU didn't actually process
> >> them in that order. We've patched QEMU to hack around its inability to
> >> honour the CLI order repeatedly.
> >
> > Is there a pointer to the problem?
> 
> Just an example:
> 
> 9ea18ed25a "vl: Fix -drive / -blockdev persistent reservation management"
> cda4aa9a5a "vl: Create block backends before setting machine properties"

Thanks, same to Dan.

> 
> >> Being completely self-ordering on the QEMU side using a topological
> >> sort would be neat from a conceptual purity POV, but that is quite a
> >> challenge to implement and I'm not convinced it is worth it, compared
> >> to other problems we want to spend time on.
> >
> > I just noticed there can also be dependencies between the buses; that indeed
> > cannot be fixed by ordering of classes, either as proposed in this series or
> > by introducing a new priority.
> 
> --verbose?

Please ignore it - I just started to realize what kind of a rabbit hole I'm
digging, and I am already prepared to run away. :)
Peter Xu Sept. 3, 2021, 4:03 p.m. UTC | #38
On Fri, Sep 03, 2021 at 03:00:05PM +0200, Igor Mammedov wrote:
> PS:
> Another, albeit machine-dependent approach to resolve the IOMMU ordering problem
> can be adding IOMMU handling to a machine-specific pre_plug hook,
> which is called at IOMMU realize time and checks if existing buses
> without bypass enabled (iommu managed) have any children. And if they
> have devices attached, error out telling the user to reorder '-device iommu'
> before the affected devices/buses.
> It should cover the mixed IOMMU+bypass case and doesn't require fixing
> the vfio-pci address space initialization nor defining any priorities
> for PCI devices.

This sounds appealing among the approaches.

Does it need to be a pre_plug hook?  I thought we might just need a flag in the
pci device classes showing that it should be realized after vIOMMUs; then in the
vIOMMU realize functions we walk the pci bus to make sure no such device exists?

We could have a base vIOMMU class, and that check could then be in the realize()
of the common class.

> 
> (but I think it's more of a hack compared to the earlier suggested
> address space initialization at reset time, and it would need to be
> done for every affected machine)

Agreed.
Igor Mammedov Sept. 6, 2021, 8:49 a.m. UTC | #39
On Fri, 3 Sep 2021 12:03:06 -0400
Peter Xu <peterx@redhat.com> wrote:

> On Fri, Sep 03, 2021 at 03:00:05PM +0200, Igor Mammedov wrote:
> > PS:
> > Another, albeit machine-dependent approach to resolve the IOMMU ordering problem
> > can be adding IOMMU handling to a machine-specific pre_plug hook,
> > which is called at IOMMU realize time and checks if existing buses
> > without bypass enabled (iommu managed) have any children. And if they
> > have devices attached, error out telling the user to reorder '-device iommu'
> > before the affected devices/buses.
> > It should cover the mixed IOMMU+bypass case and doesn't require fixing
> > the vfio-pci address space initialization nor defining any priorities
> > for PCI devices.
> 
> This sounds appealing among the approaches.

That's the easy one, compared to moving the address space (re)initialization
to reset time (at least to me, since vfio realize looks intimidating at
first glance, but its maintainer(s) probably know enough to
implement the change properly).

 
> Does it need to be a pre_plug hook?  I thought we might just need a flag in the
> pci device classes showing that it should be realized after vIOMMUs; then in the
> vIOMMU realize functions we walk the pci bus to make sure no such device exists?
> 
> We could have a base vIOMMU class, and that check could then be in the realize()
> of the common class.

We basically don't know if a device needs an IOMMU or not; it can work
with/without it just fine. In this case I'd think of the IOMMU as a board
feature that morphs (some of the) PCI buses (address space, bus numbers, ...).
So I don't perceive any iommu flag as a device property at all.

As for realize vs pre_plug, the latter is part of abstract realize
(see: device_set_realized) and is already used by some PCI infrastructure:
  ex: pcie_cap_slot_pre_plug_cb/spapr_pci_pre_plug

Its purpose is to check pre-conditions and possibly pre-configure
some wiring on behalf of the device's parent hot-plug handler (bus owner/machine),
and fail cleanly if something is wrong, without leaving side effects.

See 0ed48fd32eb8 for the boilerplate required to set up a custom hot-plug handler.
You might need only parts of it, but it's still something that has to be done
for each affected machine type to implement the error checking at the proper
layer.
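
For reference, the general shape of that boilerplate is sketched below (this is
not the code from 0ed48fd32eb8; the machine's TypeInfo must additionally list
TYPE_HOTPLUG_HANDLER in its interfaces, and the actual IOMMU ordering check is
elided):

    #include "qemu/osdep.h"
    #include "hw/boards.h"
    #include "hw/hotplug.h"
    #include "hw/pci/pci.h"
    #include "qapi/error.h"

    /* Route PCI devices to the machine's own hotplug handler... */
    static HotplugHandler *my_machine_get_hotplug_handler(MachineState *machine,
                                                          DeviceState *dev)
    {
        if (object_dynamic_cast(OBJECT(dev), TYPE_PCI_DEVICE)) {
            return HOTPLUG_HANDLER(machine);
        }
        return NULL;
    }

    /* ...so this runs before the device's realize and can fail cleanly. */
    static void my_machine_device_pre_plug(HotplugHandler *hotplug_dev,
                                           DeviceState *dev, Error **errp)
    {
        /*
         * Check pre-conditions for 'dev' here; error_setg(errp, ...) makes
         * the -device / device_add fail without leaving side effects.
         */
    }

    static void my_machine_class_init(ObjectClass *oc, void *data)
    {
        MachineClass *mc = MACHINE_CLASS(oc);
        HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(oc);

        mc->get_hotplug_handler = my_machine_get_hotplug_handler;
        hc->pre_plug = my_machine_device_pre_plug;
    }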

So I'd rather look into the 'reset' approach and, only if that doesn't
look possible, resort to adding the pre_plug/error check.


PS:
 your d2321d31ff98b & c6cbc29d36f look to me like another candidate
 for pre_plug for pci devices instead of adding a dedicated hook
 just for vfio-pci to the generic machine.
 
> > (but I think it's more of a hack compared to the earlier suggested
> > address space initialization at reset time, and it would need to be
> > done for every affected machine)  
> 
> Agreed.
>
Eric Auger Sept. 6, 2021, 9:22 a.m. UTC | #40
Hi,

On 8/25/21 6:23 AM, Jason Wang wrote:
> On Tue, Aug 24, 2021 at 11:50 PM Peter Xu <peterx@redhat.com> wrote:
>> On Tue, Aug 24, 2021 at 10:52:24AM +0800, Jason Wang wrote:
>>> It looks to me this doesn't solve the issue of using virtio-mmio with vhost?
>> No IOMMU supported for any of the MMIO devices, right?  Or am I wrong?  Thanks,
> I guess virtio IOMMU has that support, but I might be wrong.

No, that's not supported yet in QEMU. There is an in-flight contribution
attempting to support vSMMUv3 protection for sysbus devices, though:
[PATCH v6 0/4] hw/arm/smmuv3: Support non PCI/PCIe devices

Thanks

Eric
>
> Thanks
>
>> --
>> Peter Xu
>>
diff mbox series

Patch

diff --git a/softmmu/vl.c b/softmmu/vl.c
index 5ca11e7469..3a30dfe27d 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -126,6 +126,7 @@ 
 #include "qapi/qmp/qerror.h"
 #include "sysemu/iothread.h"
 #include "qemu/guest-random.h"
+#include "migration/vmstate.h"
 
 #include "config-host.h"
 
@@ -2627,6 +2628,35 @@  static void qemu_init_board(void)
     }
 }
 
+/* Return the priority of the device option; zero is the default priority */
+static int qemu_opts_device_priority(QemuOpts *opts)
+{
+    const char *driver;
+    DeviceClass *dc;
+
+    driver = qemu_opt_get(opts, "driver");
+    if (!driver) {
+        return 0;
+    }
+
+    dc = qdev_get_device_class(&driver, NULL);
+    if (!dc) {
+        return 0;
+    }
+
+    if (!dc->vmsd) {
+        return 0;
+    }
+
+    /*
+     * Currently we rely on vmsd priority because that's solving the same
+     * problem for device realization ordering but just for migration.  In the
+     * future, we can move it out of vmsd, but that's not urgently needed.
+     * Return the negative of it so it'll be sorted with descendant order.
+     */
+    return -dc->vmsd->priority;
+}
+
 static void qemu_create_cli_devices(void)
 {
     soundhw_init();
@@ -2642,6 +2672,11 @@  static void qemu_create_cli_devices(void)
 
     /* init generic devices */
     rom_set_order_override(FW_CFG_ORDER_OVERRIDE_DEVICE);
+    /*
+     * Sort all the -device parameters; e.g., platform devices like vIOMMU
+     * should be initialized earlier.
+     */
+    qemu_sort_opts("device", qemu_opts_device_priority);
     qemu_opts_foreach(qemu_find_opts("device"),
                       device_init_func, NULL, &error_fatal);
     rom_reset_order_override();