Patchwork pci-assign: Enable MSIX on device to match guest

login
register
mail settings
Submitter Alex Williamson
Date Dec. 20, 2012, 4:05 p.m.
Message ID <20121220160107.3975.53264.stgit@bling.home>
Download mbox | patch
Permalink /patch/207681/
State New
Headers show

Comments

Alex Williamson - Dec. 20, 2012, 4:05 p.m.
When a guest enables MSIX on a device we evaluate the MSIX vector
table, typically find no unmasked vectors and don't switch the device
to MSIX mode.  This generally works fine and the device will be
switched once the guest enables and therefore unmasks a vector.
Unfortunately some drivers enable MSIX, then use interfaces to send
commands between VF & PF or PF & firmware that act based on the host
state of the device.  These therefore break when MSIX is managed
lazily.  This change re-enables the previous test used to enable MSIX
(see qemu-kvm a6b402c9), which basically guesses whether a vector
will be used based on the data field of the vector table.

Cc: qemu-stable@nongnu.org
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/kvm/pci-assign.c |   17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

I think we might be able to do a little better than this, but I think
this is the right fix for stable and we can build on it to perhaps only
enable a single vector.
Michael S. Tsirkin - Dec. 20, 2012, 4:38 p.m.
On Thu, Dec 20, 2012 at 09:05:50AM -0700, Alex Williamson wrote:
> When a guest enables MSIX on a device we evaluate the MSIX vector
> table, typically find no unmasked vectors and don't switch the device
> to MSIX mode.  This generally works fine and the device will be
> switched once the guest enables and therefore unmasks a vector.
> Unfortunately some drivers enable MSIX, then use interfaces to send
> commands between VF & PF or PF & firmware that act based on the host
> state of the device.  These therefore break when MSIX is managed
> lazily.  This change re-enables the previous test used to enable MSIX
> (see qemu-kvm a6b402c9), which basically guesses whether a vector
> will be used based on the data field of the vector table.
> 
> Cc: qemu-stable@nongnu.org
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Same question: can't we enable and mask MSIX through config sysfs?
In this case it can be done in userspace ...

> ---
>  hw/kvm/pci-assign.c |   17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> I think we might be able to do a little better than this, but I think
> this is the right fix for stable and we can build on it to perhaps only
> enable a single vector.
> 
> diff --git a/hw/kvm/pci-assign.c b/hw/kvm/pci-assign.c
> index e80dad0..12a219b 100644
> --- a/hw/kvm/pci-assign.c
> +++ b/hw/kvm/pci-assign.c
> @@ -1025,6 +1025,19 @@ static bool assigned_dev_msix_masked(MSIXTableEntry *entry)
>      return (entry->ctrl & cpu_to_le32(0x1)) != 0;
>  }
>  
> +/*
> + * When MSI-X is first enabled the vector table typically has all the
> + * vectors masked, so we can't use that as the obvious test to figure out
> + * how many vectors to initially enable.  Instead we look at the data field
> + * because this is what worked for pci-assign for a long time.  This makes
> + * sure the physical MSI-X state tracks the guest's view, which is important
> + * for some VF/PF and PF/fw communication channels.
> + */
> +static bool assigned_dev_msix_skipped(MSIXTableEntry *entry)
> +{
> +    return !entry->data;
> +}
> +
>  static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
>  {
>      AssignedDevice *adev = DO_UPCAST(AssignedDevice, dev, pci_dev);
> @@ -1035,7 +1048,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
>  
>      /* Get the usable entry number for allocating */
>      for (i = 0; i < adev->msix_max; i++, entry++) {
> -        if (assigned_dev_msix_masked(entry)) {
> +        if (assigned_dev_msix_skipped(entry)) {
>              continue;
>          }
>          entries_nr++;
> @@ -1064,7 +1077,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
>      for (i = 0; i < adev->msix_max; i++, entry++) {
>          adev->msi_virq[i] = -1;
>  
> -        if (assigned_dev_msix_masked(entry)) {
> +        if (assigned_dev_msix_skipped(entry)) {
>              continue;
>          }
>
Alex Williamson - Dec. 20, 2012, 10:15 p.m.
On Thu, 2012-12-20 at 18:38 +0200, Michael S. Tsirkin wrote:
> On Thu, Dec 20, 2012 at 09:05:50AM -0700, Alex Williamson wrote:
> > When a guest enables MSIX on a device we evaluate the MSIX vector
> > table, typically find no unmasked vectors and don't switch the device
> > to MSIX mode.  This generally works fine and the device will be
> > switched once the guest enables and therefore unmasks a vector.
> > Unfortunately some drivers enable MSIX, then use interfaces to send
> > commands between VF & PF or PF & firmware that act based on the host
> > state of the device.  These therefore break when MSIX is managed
> > lazily.  This change re-enables the previous test used to enable MSIX
> > (see qemu-kvm a6b402c9), which basically guesses whether a vector
> > will be used based on the data field of the vector table.
> > 
> > Cc: qemu-stable@nongnu.org
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> 
> Same question: can't we enable and mask MSIX through config sysfs?
> In this case it can be done in userspace ...

In this case userspace could do this, but I think it's still incredibly
dangerous.  Kernel space drivers can also directly enable MSI-X on a
device, but you might get shot for writing one that did.  We should
follow the rules, play be the existing kernel interfaces, and work to
eventually improve those interfaces.  Thanks,

Alex

> > ---
> >  hw/kvm/pci-assign.c |   17 +++++++++++++++--
> >  1 file changed, 15 insertions(+), 2 deletions(-)
> > 
> > I think we might be able to do a little better than this, but I think
> > this is the right fix for stable and we can build on it to perhaps only
> > enable a single vector.
> > 
> > diff --git a/hw/kvm/pci-assign.c b/hw/kvm/pci-assign.c
> > index e80dad0..12a219b 100644
> > --- a/hw/kvm/pci-assign.c
> > +++ b/hw/kvm/pci-assign.c
> > @@ -1025,6 +1025,19 @@ static bool assigned_dev_msix_masked(MSIXTableEntry *entry)
> >      return (entry->ctrl & cpu_to_le32(0x1)) != 0;
> >  }
> >  
> > +/*
> > + * When MSI-X is first enabled the vector table typically has all the
> > + * vectors masked, so we can't use that as the obvious test to figure out
> > + * how many vectors to initially enable.  Instead we look at the data field
> > + * because this is what worked for pci-assign for a long time.  This makes
> > + * sure the physical MSI-X state tracks the guest's view, which is important
> > + * for some VF/PF and PF/fw communication channels.
> > + */
> > +static bool assigned_dev_msix_skipped(MSIXTableEntry *entry)
> > +{
> > +    return !entry->data;
> > +}
> > +
> >  static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> >  {
> >      AssignedDevice *adev = DO_UPCAST(AssignedDevice, dev, pci_dev);
> > @@ -1035,7 +1048,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> >  
> >      /* Get the usable entry number for allocating */
> >      for (i = 0; i < adev->msix_max; i++, entry++) {
> > -        if (assigned_dev_msix_masked(entry)) {
> > +        if (assigned_dev_msix_skipped(entry)) {
> >              continue;
> >          }
> >          entries_nr++;
> > @@ -1064,7 +1077,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> >      for (i = 0; i < adev->msix_max; i++, entry++) {
> >          adev->msi_virq[i] = -1;
> >  
> > -        if (assigned_dev_msix_masked(entry)) {
> > +        if (assigned_dev_msix_skipped(entry)) {
> >              continue;
> >          }
> >
Michael S. Tsirkin - Dec. 21, 2012, 12:17 p.m.
On Thu, Dec 20, 2012 at 03:15:38PM -0700, Alex Williamson wrote:
> On Thu, 2012-12-20 at 18:38 +0200, Michael S. Tsirkin wrote:
> > On Thu, Dec 20, 2012 at 09:05:50AM -0700, Alex Williamson wrote:
> > > When a guest enables MSIX on a device we evaluate the MSIX vector
> > > table, typically find no unmasked vectors and don't switch the device
> > > to MSIX mode.  This generally works fine and the device will be
> > > switched once the guest enables and therefore unmasks a vector.
> > > Unfortunately some drivers enable MSIX, then use interfaces to send
> > > commands between VF & PF or PF & firmware that act based on the host
> > > state of the device.  These therefore break when MSIX is managed
> > > lazily.  This change re-enables the previous test used to enable MSIX
> > > (see qemu-kvm a6b402c9), which basically guesses whether a vector
> > > will be used based on the data field of the vector table.
> > > 
> > > Cc: qemu-stable@nongnu.org
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > 
> > Same question: can't we enable and mask MSIX through config sysfs?
> > In this case it can be done in userspace ...
> 
> In this case userspace could do this, but I think it's still incredibly
> dangerous.  Kernel space drivers can also directly enable MSI-X on a
> device, but you might get shot for writing one that did.

What would be the reason for the kernel driver to do this?

>  We should
> follow the rules, play be the existing kernel interfaces, and work to
> eventually improve those interfaces.  Thanks,
> 
> Alex

I'm not against adding an interface for this long term but we have
existing kernels to support too.  IMHO it would be nicer than
the data hack which relies on non-documented guest behaviour
that might change without warning in the future.

> > > ---
> > >  hw/kvm/pci-assign.c |   17 +++++++++++++++--
> > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > 
> > > I think we might be able to do a little better than this, but I think
> > > this is the right fix for stable and we can build on it to perhaps only
> > > enable a single vector.
> > > 
> > > diff --git a/hw/kvm/pci-assign.c b/hw/kvm/pci-assign.c
> > > index e80dad0..12a219b 100644
> > > --- a/hw/kvm/pci-assign.c
> > > +++ b/hw/kvm/pci-assign.c
> > > @@ -1025,6 +1025,19 @@ static bool assigned_dev_msix_masked(MSIXTableEntry *entry)
> > >      return (entry->ctrl & cpu_to_le32(0x1)) != 0;
> > >  }
> > >  
> > > +/*
> > > + * When MSI-X is first enabled the vector table typically has all the
> > > + * vectors masked, so we can't use that as the obvious test to figure out
> > > + * how many vectors to initially enable.  Instead we look at the data field
> > > + * because this is what worked for pci-assign for a long time.  This makes
> > > + * sure the physical MSI-X state tracks the guest's view, which is important
> > > + * for some VF/PF and PF/fw communication channels.
> > > + */
> > > +static bool assigned_dev_msix_skipped(MSIXTableEntry *entry)
> > > +{
> > > +    return !entry->data;
> > > +}
> > > +
> > >  static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> > >  {
> > >      AssignedDevice *adev = DO_UPCAST(AssignedDevice, dev, pci_dev);
> > > @@ -1035,7 +1048,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> > >  
> > >      /* Get the usable entry number for allocating */
> > >      for (i = 0; i < adev->msix_max; i++, entry++) {
> > > -        if (assigned_dev_msix_masked(entry)) {
> > > +        if (assigned_dev_msix_skipped(entry)) {
> > >              continue;
> > >          }
> > >          entries_nr++;
> > > @@ -1064,7 +1077,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> > >      for (i = 0; i < adev->msix_max; i++, entry++) {
> > >          adev->msi_virq[i] = -1;
> > >  
> > > -        if (assigned_dev_msix_masked(entry)) {
> > > +        if (assigned_dev_msix_skipped(entry)) {
> > >              continue;
> > >          }
> > >  
> 
>
Alex Williamson - Dec. 21, 2012, 3:46 p.m.
On Fri, 2012-12-21 at 14:17 +0200, Michael S. Tsirkin wrote:
> On Thu, Dec 20, 2012 at 03:15:38PM -0700, Alex Williamson wrote:
> > On Thu, 2012-12-20 at 18:38 +0200, Michael S. Tsirkin wrote:
> > > On Thu, Dec 20, 2012 at 09:05:50AM -0700, Alex Williamson wrote:
> > > > When a guest enables MSIX on a device we evaluate the MSIX vector
> > > > table, typically find no unmasked vectors and don't switch the device
> > > > to MSIX mode.  This generally works fine and the device will be
> > > > switched once the guest enables and therefore unmasks a vector.
> > > > Unfortunately some drivers enable MSIX, then use interfaces to send
> > > > commands between VF & PF or PF & firmware that act based on the host
> > > > state of the device.  These therefore break when MSIX is managed
> > > > lazily.  This change re-enables the previous test used to enable MSIX
> > > > (see qemu-kvm a6b402c9), which basically guesses whether a vector
> > > > will be used based on the data field of the vector table.
> > > > 
> > > > Cc: qemu-stable@nongnu.org
> > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > 
> > > Same question: can't we enable and mask MSIX through config sysfs?
> > > In this case it can be done in userspace ...
> > 
> > In this case userspace could do this, but I think it's still incredibly
> > dangerous.  Kernel space drivers can also directly enable MSI-X on a
> > device, but you might get shot for writing one that did.
> 
> What would be the reason for the kernel driver to do this?

Maybe they don't know how many vectors to use until they enable MSI-X
and query some firmware interface.  It's a hypothetical situation, I'm
just trying to illustrate that if a kernel driver did want to do this,
they'd have to develop interfaces to allow it, not just manually poke
their MSI-X enable bit.

> >  We should
> > follow the rules, play be the existing kernel interfaces, and work to
> > eventually improve those interfaces.  Thanks,
> > 
> > Alex
> 
> I'm not against adding an interface for this long term but we have
> existing kernels to support too.  IMHO it would be nicer than
> the data hack which relies on non-documented guest behaviour
> that might change without warning in the future.

We've unwittingly used the data hack for years and only ripped it out
because it was undocumented.  The patch below adds documentation for it,
so at least we have a more clear understanding of why it was there if we
want to try to rip it out again.  This fully supports existing kernels
and as I mention below, we might be able to do better with limiting how
many vectors we enabled, but I think this is the right initial fix and
right fix for stable and we can continue to experiment from here.
Thanks,

Alex

> > > > ---
> > > >  hw/kvm/pci-assign.c |   17 +++++++++++++++--
> > > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > > 
> > > > I think we might be able to do a little better than this, but I think
> > > > this is the right fix for stable and we can build on it to perhaps only
> > > > enable a single vector.
> > > > 
> > > > diff --git a/hw/kvm/pci-assign.c b/hw/kvm/pci-assign.c
> > > > index e80dad0..12a219b 100644
> > > > --- a/hw/kvm/pci-assign.c
> > > > +++ b/hw/kvm/pci-assign.c
> > > > @@ -1025,6 +1025,19 @@ static bool assigned_dev_msix_masked(MSIXTableEntry *entry)
> > > >      return (entry->ctrl & cpu_to_le32(0x1)) != 0;
> > > >  }
> > > >  
> > > > +/*
> > > > + * When MSI-X is first enabled the vector table typically has all the
> > > > + * vectors masked, so we can't use that as the obvious test to figure out
> > > > + * how many vectors to initially enable.  Instead we look at the data field
> > > > + * because this is what worked for pci-assign for a long time.  This makes
> > > > + * sure the physical MSI-X state tracks the guest's view, which is important
> > > > + * for some VF/PF and PF/fw communication channels.
> > > > + */
> > > > +static bool assigned_dev_msix_skipped(MSIXTableEntry *entry)
> > > > +{
> > > > +    return !entry->data;
> > > > +}
> > > > +
> > > >  static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> > > >  {
> > > >      AssignedDevice *adev = DO_UPCAST(AssignedDevice, dev, pci_dev);
> > > > @@ -1035,7 +1048,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> > > >  
> > > >      /* Get the usable entry number for allocating */
> > > >      for (i = 0; i < adev->msix_max; i++, entry++) {
> > > > -        if (assigned_dev_msix_masked(entry)) {
> > > > +        if (assigned_dev_msix_skipped(entry)) {
> > > >              continue;
> > > >          }
> > > >          entries_nr++;
> > > > @@ -1064,7 +1077,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> > > >      for (i = 0; i < adev->msix_max; i++, entry++) {
> > > >          adev->msi_virq[i] = -1;
> > > >  
> > > > -        if (assigned_dev_msix_masked(entry)) {
> > > > +        if (assigned_dev_msix_skipped(entry)) {
> > > >              continue;
> > > >          }
> > > >  
> > 
> >
Alex Williamson - Jan. 2, 2013, 3:49 p.m.
On Fri, 2012-12-21 at 08:46 -0700, Alex Williamson wrote:
> On Fri, 2012-12-21 at 14:17 +0200, Michael S. Tsirkin wrote:
> > On Thu, Dec 20, 2012 at 03:15:38PM -0700, Alex Williamson wrote:
> > > On Thu, 2012-12-20 at 18:38 +0200, Michael S. Tsirkin wrote:
> > > > On Thu, Dec 20, 2012 at 09:05:50AM -0700, Alex Williamson wrote:
> > > > > When a guest enables MSIX on a device we evaluate the MSIX vector
> > > > > table, typically find no unmasked vectors and don't switch the device
> > > > > to MSIX mode.  This generally works fine and the device will be
> > > > > switched once the guest enables and therefore unmasks a vector.
> > > > > Unfortunately some drivers enable MSIX, then use interfaces to send
> > > > > commands between VF & PF or PF & firmware that act based on the host
> > > > > state of the device.  These therefore break when MSIX is managed
> > > > > lazily.  This change re-enables the previous test used to enable MSIX
> > > > > (see qemu-kvm a6b402c9), which basically guesses whether a vector
> > > > > will be used based on the data field of the vector table.
> > > > > 
> > > > > Cc: qemu-stable@nongnu.org
> > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > 
> > > > Same question: can't we enable and mask MSIX through config sysfs?
> > > > In this case it can be done in userspace ...
> > > 
> > > In this case userspace could do this, but I think it's still incredibly
> > > dangerous.  Kernel space drivers can also directly enable MSI-X on a
> > > device, but you might get shot for writing one that did.
> > 
> > What would be the reason for the kernel driver to do this?
> 
> Maybe they don't know how many vectors to use until they enable MSI-X
> and query some firmware interface.  It's a hypothetical situation, I'm
> just trying to illustrate that if a kernel driver did want to do this,
> they'd have to develop interfaces to allow it, not just manually poke
> their MSI-X enable bit.
> 
> > >  We should
> > > follow the rules, play be the existing kernel interfaces, and work to
> > > eventually improve those interfaces.  Thanks,
> > > 
> > > Alex
> > 
> > I'm not against adding an interface for this long term but we have
> > existing kernels to support too.  IMHO it would be nicer than
> > the data hack which relies on non-documented guest behaviour
> > that might change without warning in the future.
> 
> We've unwittingly used the data hack for years and only ripped it out
> because it was undocumented.  The patch below adds documentation for it,
> so at least we have a more clear understanding of why it was there if we
> want to try to rip it out again.  This fully supports existing kernels
> and as I mention below, we might be able to do better with limiting how
> many vectors we enabled, but I think this is the right initial fix and
> right fix for stable and we can continue to experiment from here.

Happy new year.  I'd like to close on this as we do currently have a
regression for devices that cannot handle MSI-X being lazily enabled.
The option here is to document and revert to the old style
initialization behavior where we look at the data field of the vector to
get a hint whether the guest intends to make use of the vector.  This
gives us the same behavior as we had previously, but still allows
vectors to be added, so we maintain the current FreeBSD support.  This
much needs to go to stable.

For the development tree, I think we can do better.  Using the data
field is not 100% reliable in giving us the number of vectors the guest
actually intends to use.  Instead we'd like to enable MSI-X with no
vectors and add vectors as the guest unmasks them.  The host Linux MSI
API currently doesn't allow this, so I think the next best thing is to
enable MSI-X with a single vector in the case where MSI-X is enabled but
no vectors are unmasked.  This conserves vectors on the host though we
do potentially allow spurious interrupts through the enabled vector
(though we previously enabled multiple vectors using the above data
method without problems).

The alternative that you're proposing to this longer term solution is to
manually mask all vectors in the physical MSI-X vector table from
userspace then manually enable MSI-X on the physical device (through
pci-sysfs resource and config access respectively).  This puts the
physical device is a state that better matches the guest view of the
devices, but I'm doubtful that the risk is worth the reward.  This adds
a new state to the qemu MSI-X model where we have entirely host kernel
managed physical IRQ state, except for this.  It also creates a
synchronization problem that the physical device moves to a new
interrupt state outside of the control of the host kernel, possibly
bypassing any quirks for the host platform.

Another option is to modify the host MSI API to allow the interface we
want, splitting enabling MSI-X from vector allocation.  That of course
has a much longer lead time. 

We can certainly continue the discussion on this, but we need a fix for
stable and I don't think either of these longer term methods are known
to have the reliability or simplicity of reverting to the previous
initialization criteria.  Please ack if you agree.  Thanks,

Alex

> > > > > ---
> > > > >  hw/kvm/pci-assign.c |   17 +++++++++++++++--
> > > > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > > > 
> > > > > I think we might be able to do a little better than this, but I think
> > > > > this is the right fix for stable and we can build on it to perhaps only
> > > > > enable a single vector.
> > > > > 
> > > > > diff --git a/hw/kvm/pci-assign.c b/hw/kvm/pci-assign.c
> > > > > index e80dad0..12a219b 100644
> > > > > --- a/hw/kvm/pci-assign.c
> > > > > +++ b/hw/kvm/pci-assign.c
> > > > > @@ -1025,6 +1025,19 @@ static bool assigned_dev_msix_masked(MSIXTableEntry *entry)
> > > > >      return (entry->ctrl & cpu_to_le32(0x1)) != 0;
> > > > >  }
> > > > >  
> > > > > +/*
> > > > > + * When MSI-X is first enabled the vector table typically has all the
> > > > > + * vectors masked, so we can't use that as the obvious test to figure out
> > > > > + * how many vectors to initially enable.  Instead we look at the data field
> > > > > + * because this is what worked for pci-assign for a long time.  This makes
> > > > > + * sure the physical MSI-X state tracks the guest's view, which is important
> > > > > + * for some VF/PF and PF/fw communication channels.
> > > > > + */
> > > > > +static bool assigned_dev_msix_skipped(MSIXTableEntry *entry)
> > > > > +{
> > > > > +    return !entry->data;
> > > > > +}
> > > > > +
> > > > >  static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> > > > >  {
> > > > >      AssignedDevice *adev = DO_UPCAST(AssignedDevice, dev, pci_dev);
> > > > > @@ -1035,7 +1048,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> > > > >  
> > > > >      /* Get the usable entry number for allocating */
> > > > >      for (i = 0; i < adev->msix_max; i++, entry++) {
> > > > > -        if (assigned_dev_msix_masked(entry)) {
> > > > > +        if (assigned_dev_msix_skipped(entry)) {
> > > > >              continue;
> > > > >          }
> > > > >          entries_nr++;
> > > > > @@ -1064,7 +1077,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> > > > >      for (i = 0; i < adev->msix_max; i++, entry++) {
> > > > >          adev->msi_virq[i] = -1;
> > > > >  
> > > > > -        if (assigned_dev_msix_masked(entry)) {
> > > > > +        if (assigned_dev_msix_skipped(entry)) {
> > > > >              continue;
> > > > >          }
> > > > >  
> > > 
> > > 
> 
>
Michael S. Tsirkin - Jan. 6, 2013, 1:23 p.m.
On Wed, Jan 02, 2013 at 08:49:42AM -0700, Alex Williamson wrote:
> On Fri, 2012-12-21 at 08:46 -0700, Alex Williamson wrote:
> > On Fri, 2012-12-21 at 14:17 +0200, Michael S. Tsirkin wrote:
> > > On Thu, Dec 20, 2012 at 03:15:38PM -0700, Alex Williamson wrote:
> > > > On Thu, 2012-12-20 at 18:38 +0200, Michael S. Tsirkin wrote:
> > > > > On Thu, Dec 20, 2012 at 09:05:50AM -0700, Alex Williamson wrote:
> > > > > > When a guest enables MSIX on a device we evaluate the MSIX vector
> > > > > > table, typically find no unmasked vectors and don't switch the device
> > > > > > to MSIX mode.  This generally works fine and the device will be
> > > > > > switched once the guest enables and therefore unmasks a vector.
> > > > > > Unfortunately some drivers enable MSIX, then use interfaces to send
> > > > > > commands between VF & PF or PF & firmware that act based on the host
> > > > > > state of the device.  These therefore break when MSIX is managed
> > > > > > lazily.  This change re-enables the previous test used to enable MSIX
> > > > > > (see qemu-kvm a6b402c9), which basically guesses whether a vector
> > > > > > will be used based on the data field of the vector table.
> > > > > > 
> > > > > > Cc: qemu-stable@nongnu.org
> > > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > 
> > > > > Same question: can't we enable and mask MSIX through config sysfs?
> > > > > In this case it can be done in userspace ...
> > > > 
> > > > In this case userspace could do this, but I think it's still incredibly
> > > > dangerous.  Kernel space drivers can also directly enable MSI-X on a
> > > > device, but you might get shot for writing one that did.
> > > 
> > > What would be the reason for the kernel driver to do this?
> > 
> > Maybe they don't know how many vectors to use until they enable MSI-X
> > and query some firmware interface.  It's a hypothetical situation, I'm
> > just trying to illustrate that if a kernel driver did want to do this,
> > they'd have to develop interfaces to allow it, not just manually poke
> > their MSI-X enable bit.
> > 
> > > >  We should
> > > > follow the rules, play be the existing kernel interfaces, and work to
> > > > eventually improve those interfaces.  Thanks,
> > > > 
> > > > Alex
> > > 
> > > I'm not against adding an interface for this long term but we have
> > > existing kernels to support too.  IMHO it would be nicer than
> > > the data hack which relies on non-documented guest behaviour
> > > that might change without warning in the future.
> > 
> > We've unwittingly used the data hack for years and only ripped it out
> > because it was undocumented.  The patch below adds documentation for it,
> > so at least we have a more clear understanding of why it was there if we
> > want to try to rip it out again.  This fully supports existing kernels
> > and as I mention below, we might be able to do better with limiting how
> > many vectors we enabled, but I think this is the right initial fix and
> > right fix for stable and we can continue to experiment from here.
> 
> Happy new year.  I'd like to close on this as we do currently have a
> regression for devices that cannot handle MSI-X being lazily enabled.
> The option here is to document and revert to the old style
> initialization behavior where we look at the data field of the vector to
> get a hint whether the guest intends to make use of the vector.  This
> gives us the same behavior as we had previously, but still allows
> vectors to be added, so we maintain the current FreeBSD support.  This
> much needs to go to stable.

By the way, could you clarify what exactly happens with FreeBSD?


> For the development tree, I think we can do better.  Using the data
> field is not 100% reliable in giving us the number of vectors the guest
> actually intends to use.  Instead we'd like to enable MSI-X with no
> vectors and add vectors as the guest unmasks them.  The host Linux MSI
> API currently doesn't allow this, so I think the next best thing is to
> enable MSI-X with a single vector in the case where MSI-X is enabled but
> no vectors are unmasked.  This conserves vectors on the host though we
> do potentially allow spurious interrupts through the enabled vector
> (though we previously enabled multiple vectors using the above data
> method without problems).

Hmm I think this can lose interrupts if they are sent before
there is a handler. The data hack only sends interrupts when
there's a handler so it's safe.

> The alternative that you're proposing to this longer term solution is to
> manually mask all vectors in the physical MSI-X vector table from
> userspace

This masking is not necessary I think: all vectors are masked
after reset IIRC.

> then manually enable MSI-X on the physical device (through
> pci-sysfs resource and config access respectively).  This puts the
> physical device is a state that better matches the guest view of the
> devices, but I'm doubtful that the risk is worth the reward.  This adds
> a new state to the qemu MSI-X model where we have entirely host kernel
> managed physical IRQ state, except for this.  It also creates a
> synchronization problem that the physical device moves to a new
> interrupt state outside of the control of the host kernel, possibly
> bypassing any quirks for the host platform.

Hmm as long as all vectors are masked, MSIX is a per device state so I
don't see why any quirks would be needed.

> Another option is to modify the host MSI API to allow the interface we
> want, splitting enabling MSI-X from vector allocation.  That of course
> has a much longer lead time. 

It's a requirement to make it really robust though, isn't it?

> We can certainly continue the discussion on this, but we need a fix for
> stable and I don't think either of these longer term methods are known
> to have the reliability or simplicity of reverting to the previous
> initialization criteria.  Please ack if you agree.  Thanks,
> 
> Alex

I agree with your approach to stable, so ACK.

For long term, I think we should start with fixing up
things in the kernel. Then for old kernels, qemu can do
the userspace hack (or some other hack if we find a better one).

> > > > > > ---
> > > > > >  hw/kvm/pci-assign.c |   17 +++++++++++++++--
> > > > > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > I think we might be able to do a little better than this, but I think
> > > > > > this is the right fix for stable and we can build on it to perhaps only
> > > > > > enable a single vector.
> > > > > > 
> > > > > > diff --git a/hw/kvm/pci-assign.c b/hw/kvm/pci-assign.c
> > > > > > index e80dad0..12a219b 100644
> > > > > > --- a/hw/kvm/pci-assign.c
> > > > > > +++ b/hw/kvm/pci-assign.c
> > > > > > @@ -1025,6 +1025,19 @@ static bool assigned_dev_msix_masked(MSIXTableEntry *entry)
> > > > > >      return (entry->ctrl & cpu_to_le32(0x1)) != 0;
> > > > > >  }
> > > > > >  
> > > > > > +/*
> > > > > > + * When MSI-X is first enabled the vector table typically has all the
> > > > > > + * vectors masked, so we can't use that as the obvious test to figure out
> > > > > > + * how many vectors to initially enable.  Instead we look at the data field
> > > > > > + * because this is what worked for pci-assign for a long time.  This makes
> > > > > > + * sure the physical MSI-X state tracks the guest's view, which is important
> > > > > > + * for some VF/PF and PF/fw communication channels.
> > > > > > + */
> > > > > > +static bool assigned_dev_msix_skipped(MSIXTableEntry *entry)
> > > > > > +{
> > > > > > +    return !entry->data;
> > > > > > +}
> > > > > > +
> > > > > >  static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> > > > > >  {
> > > > > >      AssignedDevice *adev = DO_UPCAST(AssignedDevice, dev, pci_dev);
> > > > > > @@ -1035,7 +1048,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> > > > > >  
> > > > > >      /* Get the usable entry number for allocating */
> > > > > >      for (i = 0; i < adev->msix_max; i++, entry++) {
> > > > > > -        if (assigned_dev_msix_masked(entry)) {
> > > > > > +        if (assigned_dev_msix_skipped(entry)) {
> > > > > >              continue;
> > > > > >          }
> > > > > >          entries_nr++;
> > > > > > @@ -1064,7 +1077,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> > > > > >      for (i = 0; i < adev->msix_max; i++, entry++) {
> > > > > >          adev->msi_virq[i] = -1;
> > > > > >  
> > > > > > -        if (assigned_dev_msix_masked(entry)) {
> > > > > > +        if (assigned_dev_msix_skipped(entry)) {
> > > > > >              continue;
> > > > > >          }
> > > > > >  
> > > > 
> > > > 
> > 
> > 
> 
>
Alex Williamson - Jan. 6, 2013, 3:57 p.m.
On Sun, 2013-01-06 at 15:23 +0200, Michael S. Tsirkin wrote:
> On Wed, Jan 02, 2013 at 08:49:42AM -0700, Alex Williamson wrote:
> > On Fri, 2012-12-21 at 08:46 -0700, Alex Williamson wrote:
> > > On Fri, 2012-12-21 at 14:17 +0200, Michael S. Tsirkin wrote:
> > > > On Thu, Dec 20, 2012 at 03:15:38PM -0700, Alex Williamson wrote:
> > > > > On Thu, 2012-12-20 at 18:38 +0200, Michael S. Tsirkin wrote:
> > > > > > On Thu, Dec 20, 2012 at 09:05:50AM -0700, Alex Williamson wrote:
> > > > > > > When a guest enables MSIX on a device we evaluate the MSIX vector
> > > > > > > table, typically find no unmasked vectors and don't switch the device
> > > > > > > to MSIX mode.  This generally works fine and the device will be
> > > > > > > switched once the guest enables and therefore unmasks a vector.
> > > > > > > Unfortunately some drivers enable MSIX, then use interfaces to send
> > > > > > > commands between VF & PF or PF & firmware that act based on the host
> > > > > > > state of the device.  These therefore break when MSIX is managed
> > > > > > > lazily.  This change re-enables the previous test used to enable MSIX
> > > > > > > (see qemu-kvm a6b402c9), which basically guesses whether a vector
> > > > > > > will be used based on the data field of the vector table.
> > > > > > > 
> > > > > > > Cc: qemu-stable@nongnu.org
> > > > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > > 
> > > > > > Same question: can't we enable and mask MSIX through config sysfs?
> > > > > > In this case it can be done in userspace ...
> > > > > 
> > > > > In this case userspace could do this, but I think it's still incredibly
> > > > > dangerous.  Kernel space drivers can also directly enable MSI-X on a
> > > > > device, but you might get shot for writing one that did.
> > > > 
> > > > What would be the reason for the kernel driver to do this?
> > > 
> > > Maybe they don't know how many vectors to use until they enable MSI-X
> > > and query some firmware interface.  It's a hypothetical situation, I'm
> > > just trying to illustrate that if a kernel driver did want to do this,
> > > they'd have to develop interfaces to allow it, not just manually poke
> > > their MSI-X enable bit.
> > > 
> > > > >  We should
> > > > > follow the rules, play be the existing kernel interfaces, and work to
> > > > > eventually improve those interfaces.  Thanks,
> > > > > 
> > > > > Alex
> > > > 
> > > > I'm not against adding an interface for this long term but we have
> > > > existing kernels to support too.  IMHO it would be nicer than
> > > > the data hack which relies on non-documented guest behaviour
> > > > that might change without warning in the future.
> > > 
> > > We've unwittingly used the data hack for years and only ripped it out
> > > because it was undocumented.  The patch below adds documentation for it,
> > > so at least we have a more clear understanding of why it was there if we
> > > want to try to rip it out again.  This fully supports existing kernels
> > > and as I mention below, we might be able to do better with limiting how
> > > many vectors we enabled, but I think this is the right initial fix and
> > > right fix for stable and we can continue to experiment from here.
> > 
> > Happy new year.  I'd like to close on this as we do currently have a
> > regression for devices that cannot handle MSI-X being lazily enabled.
> > The option here is to document and revert to the old style
> > initialization behavior where we look at the data field of the vector to
> > get a hint whether the guest intends to make use of the vector.  This
> > gives us the same behavior as we had previously, but still allows
> > vectors to be added, so we maintain the current FreeBSD support.  This
> > much needs to go to stable.
> 
> By the way, could you clarify what exactly happens with FreeBSD?

FreeBSD enables MSIX in config space without setting anything in the
MSIX table address/data fields.  Thus previously, when we only looked at
the data field and did not add vectors dynamically, FreeBSD wouldn't get
any vectors setup.  This change therefore does not effect FreeBSD since
it will still lazily enable vectors.  This would be addressed by a
follow-on patch since we don't know of any drivers that care about the
lazy setup on FreeBSD.

> > For the development tree, I think we can do better.  Using the data
> > field is not 100% reliable in giving us the number of vectors the guest
> > actually intends to use.  Instead we'd like to enable MSI-X with no
> > vectors and add vectors as the guest unmasks them.  The host Linux MSI
> > API currently doesn't allow this, so I think the next best thing is to
> > enable MSI-X with a single vector in the case where MSI-X is enabled but
> > no vectors are unmasked.  This conserves vectors on the host though we
> > do potentially allow spurious interrupts through the enabled vector
> > (though we previously enabled multiple vectors using the above data
> > method without problems).
> 
> Hmm I think this can lose interrupts if they are sent before
> there is a handler. The data hack only sends interrupts when
> there's a handler so it's safe.

The data hack is looking at the result of pci_enable_msix, in the case
of a Linux guest.  The data fields are setup, but interrupt handlers are
not setup until the guest calls request_irq, so those would also be
spurious interrupts.

> > The alternative that you're proposing to this longer term solution is to
> > manually mask all vectors in the physical MSI-X vector table from
> > userspace
> 
> This masking is not necessary I think: all vectors are masked
> after reset IIRC.

Assuming the device supports a reset and the host doesn't restore
anything in the vector table.

> > then manually enable MSI-X on the physical device (through
> > pci-sysfs resource and config access respectively).  This puts the
> > physical device is a state that better matches the guest view of the
> > devices, but I'm doubtful that the risk is worth the reward.  This adds
> > a new state to the qemu MSI-X model where we have entirely host kernel
> > managed physical IRQ state, except for this.  It also creates a
> > synchronization problem that the physical device moves to a new
> > interrupt state outside of the control of the host kernel, possibly
> > bypassing any quirks for the host platform.
> 
> Hmm as long as all vectors are masked, MSIX is a per device state so I
> don't see why any quirks would be needed.
> 
> > Another option is to modify the host MSI API to allow the interface we
> > want, splitting enabling MSI-X from vector allocation.  That of course
> > has a much longer lead time. 
> 
> It's a requirement to make it really robust though, isn't it?

It's required to get the kind of control we'd like, including the
ability to add vectors without tearing down MSIX and starting over.

> > We can certainly continue the discussion on this, but we need a fix for
> > stable and I don't think either of these longer term methods are known
> > to have the reliability or simplicity of reverting to the previous
> > initialization criteria.  Please ack if you agree.  Thanks,
> > 
> > Alex
> 
> I agree with your approach to stable, so ACK.
> 
> For long term, I think we should start with fixing up
> things in the kernel. Then for old kernels, qemu can do
> the userspace hack (or some other hack if we find a better one).

Ok, thanks.  Logically it seems easy to split pci_enable_msix into
enable + add vectors, but I expect it gets a lot more complicated across
numerous architectures and things like interrupt remappers.  I'll try to
start looking into this.  Thanks,

Alex

> > > > > > > ---
> > > > > > >  hw/kvm/pci-assign.c |   17 +++++++++++++++--
> > > > > > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > > > > > 
> > > > > > > I think we might be able to do a little better than this, but I think
> > > > > > > this is the right fix for stable and we can build on it to perhaps only
> > > > > > > enable a single vector.
> > > > > > > 
> > > > > > > diff --git a/hw/kvm/pci-assign.c b/hw/kvm/pci-assign.c
> > > > > > > index e80dad0..12a219b 100644
> > > > > > > --- a/hw/kvm/pci-assign.c
> > > > > > > +++ b/hw/kvm/pci-assign.c
> > > > > > > @@ -1025,6 +1025,19 @@ static bool assigned_dev_msix_masked(MSIXTableEntry *entry)
> > > > > > >      return (entry->ctrl & cpu_to_le32(0x1)) != 0;
> > > > > > >  }
> > > > > > >  
> > > > > > > +/*
> > > > > > > + * When MSI-X is first enabled the vector table typically has all the
> > > > > > > + * vectors masked, so we can't use that as the obvious test to figure out
> > > > > > > + * how many vectors to initially enable.  Instead we look at the data field
> > > > > > > + * because this is what worked for pci-assign for a long time.  This makes
> > > > > > > + * sure the physical MSI-X state tracks the guest's view, which is important
> > > > > > > + * for some VF/PF and PF/fw communication channels.
> > > > > > > + */
> > > > > > > +static bool assigned_dev_msix_skipped(MSIXTableEntry *entry)
> > > > > > > +{
> > > > > > > +    return !entry->data;
> > > > > > > +}
> > > > > > > +
> > > > > > >  static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> > > > > > >  {
> > > > > > >      AssignedDevice *adev = DO_UPCAST(AssignedDevice, dev, pci_dev);
> > > > > > > @@ -1035,7 +1048,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> > > > > > >  
> > > > > > >      /* Get the usable entry number for allocating */
> > > > > > >      for (i = 0; i < adev->msix_max; i++, entry++) {
> > > > > > > -        if (assigned_dev_msix_masked(entry)) {
> > > > > > > +        if (assigned_dev_msix_skipped(entry)) {
> > > > > > >              continue;
> > > > > > >          }
> > > > > > >          entries_nr++;
> > > > > > > @@ -1064,7 +1077,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> > > > > > >      for (i = 0; i < adev->msix_max; i++, entry++) {
> > > > > > >          adev->msi_virq[i] = -1;
> > > > > > >  
> > > > > > > -        if (assigned_dev_msix_masked(entry)) {
> > > > > > > +        if (assigned_dev_msix_skipped(entry)) {
> > > > > > >              continue;
> > > > > > >          }
> > > > > > >  
> > > > > 
> > > > > 
> > > 
> > > 
> > 
> >
Michael S. Tsirkin - Jan. 7, 2013, 3:02 p.m.
On Sun, Jan 06, 2013 at 08:57:50AM -0700, Alex Williamson wrote:
> On Sun, 2013-01-06 at 15:23 +0200, Michael S. Tsirkin wrote:
> > On Wed, Jan 02, 2013 at 08:49:42AM -0700, Alex Williamson wrote:
> > > On Fri, 2012-12-21 at 08:46 -0700, Alex Williamson wrote:
> > > > On Fri, 2012-12-21 at 14:17 +0200, Michael S. Tsirkin wrote:
> > > > > On Thu, Dec 20, 2012 at 03:15:38PM -0700, Alex Williamson wrote:
> > > > > > On Thu, 2012-12-20 at 18:38 +0200, Michael S. Tsirkin wrote:
> > > > > > > On Thu, Dec 20, 2012 at 09:05:50AM -0700, Alex Williamson wrote:
> > > > > > > > When a guest enables MSIX on a device we evaluate the MSIX vector
> > > > > > > > table, typically find no unmasked vectors and don't switch the device
> > > > > > > > to MSIX mode.  This generally works fine and the device will be
> > > > > > > > switched once the guest enables and therefore unmasks a vector.
> > > > > > > > Unfortunately some drivers enable MSIX, then use interfaces to send
> > > > > > > > commands between VF & PF or PF & firmware that act based on the host
> > > > > > > > state of the device.  These therefore break when MSIX is managed
> > > > > > > > lazily.  This change re-enables the previous test used to enable MSIX
> > > > > > > > (see qemu-kvm a6b402c9), which basically guesses whether a vector
> > > > > > > > will be used based on the data field of the vector table.
> > > > > > > > 
> > > > > > > > Cc: qemu-stable@nongnu.org
> > > > > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > 
> > > > > > > Same question: can't we enable and mask MSIX through config sysfs?
> > > > > > > In this case it can be done in userspace ...
> > > > > > 
> > > > > > In this case userspace could do this, but I think it's still incredibly
> > > > > > dangerous.  Kernel space drivers can also directly enable MSI-X on a
> > > > > > device, but you might get shot for writing one that did.
> > > > > 
> > > > > What would be the reason for the kernel driver to do this?
> > > > 
> > > > Maybe they don't know how many vectors to use until they enable MSI-X
> > > > and query some firmware interface.  It's a hypothetical situation, I'm
> > > > just trying to illustrate that if a kernel driver did want to do this,
> > > > they'd have to develop interfaces to allow it, not just manually poke
> > > > their MSI-X enable bit.
> > > > 
> > > > > >  We should
> > > > > > follow the rules, play be the existing kernel interfaces, and work to
> > > > > > eventually improve those interfaces.  Thanks,
> > > > > > 
> > > > > > Alex
> > > > > 
> > > > > I'm not against adding an interface for this long term but we have
> > > > > existing kernels to support too.  IMHO it would be nicer than
> > > > > the data hack which relies on non-documented guest behaviour
> > > > > that might change without warning in the future.
> > > > 
> > > > We've unwittingly used the data hack for years and only ripped it out
> > > > because it was undocumented.  The patch below adds documentation for it,
> > > > so at least we have a more clear understanding of why it was there if we
> > > > want to try to rip it out again.  This fully supports existing kernels
> > > > and as I mention below, we might be able to do better with limiting how
> > > > many vectors we enabled, but I think this is the right initial fix and
> > > > right fix for stable and we can continue to experiment from here.
> > > 
> > > Happy new year.  I'd like to close on this as we do currently have a
> > > regression for devices that cannot handle MSI-X being lazily enabled.
> > > The option here is to document and revert to the old style
> > > initialization behavior where we look at the data field of the vector to
> > > get a hint whether the guest intends to make use of the vector.  This
> > > gives us the same behavior as we had previously, but still allows
> > > vectors to be added, so we maintain the current FreeBSD support.  This
> > > much needs to go to stable.
> > 
> > By the way, could you clarify what exactly happens with FreeBSD?
> 
> FreeBSD enables MSIX in config space without setting anything in the
> MSIX table address/data fields.  Thus previously, when we only looked at
> the data field and did not add vectors dynamically, FreeBSD wouldn't get
> any vectors setup.  This change therefore does not effect FreeBSD since
> it will still lazily enable vectors.  This would be addressed by a
> follow-on patch since we don't know of any drivers that care about the
> lazy setup on FreeBSD.

Hopefully this can be addressed by the long term solution,
would be nice to have behaviour match spec instead of
relying on guest-specific behaviour.

> > > field is not 100% reliable in giving us the number of vectors the guest
> > > actually intends to use.  Instead we'd like to enable MSI-X with no
> > > vectors and add vectors as the guest unmasks them.  The host Linux MSI
> > > API currently doesn't allow this, so I think the next best thing is to
> > > enable MSI-X with a single vector in the case where MSI-X is enabled but
> > > no vectors are unmasked.  This conserves vectors on the host though we
> > > do potentially allow spurious interrupts through the enabled vector
> > > (though we previously enabled multiple vectors using the above data
> > > method without problems).
> > 
> > Hmm I think this can lose interrupts if they are sent before
> > there is a handler. The data hack only sends interrupts when
> > there's a handler so it's safe.
> 
> The data hack is looking at the result of pci_enable_msix, in the case
> of a Linux guest.  The data fields are setup, but interrupt handlers are
> not setup until the guest calls request_irq, so those would also be
> spurious interrupts.

But are the vectors unmasked? I need to look at the code - if not
they get queued in the hardware.

> > > The alternative that you're proposing to this longer term solution is to
> > > manually mask all vectors in the physical MSI-X vector table from
> > > userspace
> > 
> > This masking is not necessary I think: all vectors are masked
> > after reset IIRC.
> 
> Assuming the device supports a reset and the host doesn't restore
> anything in the vector table.

We more or less require function level reset for assignment, don't we?

> > > then manually enable MSI-X on the physical device (through
> > > pci-sysfs resource and config access respectively).  This puts the
> > > physical device is a state that better matches the guest view of the
> > > devices, but I'm doubtful that the risk is worth the reward.  This adds
> > > a new state to the qemu MSI-X model where we have entirely host kernel
> > > managed physical IRQ state, except for this.  It also creates a
> > > synchronization problem that the physical device moves to a new
> > > interrupt state outside of the control of the host kernel, possibly
> > > bypassing any quirks for the host platform.
> > 
> > Hmm as long as all vectors are masked, MSIX is a per device state so I
> > don't see why any quirks would be needed.
> > 
> > > Another option is to modify the host MSI API to allow the interface we
> > > want, splitting enabling MSI-X from vector allocation.  That of course
> > > has a much longer lead time. 
> > 
> > It's a requirement to make it really robust though, isn't it?
> 
> It's required to get the kind of control we'd like, including the
> ability to add vectors without tearing down MSIX and starting over.
> 
> > > We can certainly continue the discussion on this, but we need a fix for
> > > stable and I don't think either of these longer term methods are known
> > > to have the reliability or simplicity of reverting to the previous
> > > initialization criteria.  Please ack if you agree.  Thanks,
> > > 
> > > Alex
> > 
> > I agree with your approach to stable, so ACK.
> > 
> > For long term, I think we should start with fixing up
> > things in the kernel. Then for old kernels, qemu can do
> > the userspace hack (or some other hack if we find a better one).
> 
> Ok, thanks.  Logically it seems easy to split pci_enable_msix into
> enable + add vectors, but I expect it gets a lot more complicated across
> numerous architectures and things like interrupt remappers.  I'll try to
> start looking into this.  Thanks,
> 
> Alex
> 
> > > > > > > > ---
> > > > > > > >  hw/kvm/pci-assign.c |   17 +++++++++++++++--
> > > > > > > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > > > > > > 
> > > > > > > > I think we might be able to do a little better than this, but I think
> > > > > > > > this is the right fix for stable and we can build on it to perhaps only
> > > > > > > > enable a single vector.
> > > > > > > > 
> > > > > > > > diff --git a/hw/kvm/pci-assign.c b/hw/kvm/pci-assign.c
> > > > > > > > index e80dad0..12a219b 100644
> > > > > > > > --- a/hw/kvm/pci-assign.c
> > > > > > > > +++ b/hw/kvm/pci-assign.c
> > > > > > > > @@ -1025,6 +1025,19 @@ static bool assigned_dev_msix_masked(MSIXTableEntry *entry)
> > > > > > > >      return (entry->ctrl & cpu_to_le32(0x1)) != 0;
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > +/*
> > > > > > > > + * When MSI-X is first enabled the vector table typically has all the
> > > > > > > > + * vectors masked, so we can't use that as the obvious test to figure out
> > > > > > > > + * how many vectors to initially enable.  Instead we look at the data field
> > > > > > > > + * because this is what worked for pci-assign for a long time.  This makes
> > > > > > > > + * sure the physical MSI-X state tracks the guest's view, which is important
> > > > > > > > + * for some VF/PF and PF/fw communication channels.
> > > > > > > > + */
> > > > > > > > +static bool assigned_dev_msix_skipped(MSIXTableEntry *entry)
> > > > > > > > +{
> > > > > > > > +    return !entry->data;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > >  static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> > > > > > > >  {
> > > > > > > >      AssignedDevice *adev = DO_UPCAST(AssignedDevice, dev, pci_dev);
> > > > > > > > @@ -1035,7 +1048,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> > > > > > > >  
> > > > > > > >      /* Get the usable entry number for allocating */
> > > > > > > >      for (i = 0; i < adev->msix_max; i++, entry++) {
> > > > > > > > -        if (assigned_dev_msix_masked(entry)) {
> > > > > > > > +        if (assigned_dev_msix_skipped(entry)) {
> > > > > > > >              continue;
> > > > > > > >          }
> > > > > > > >          entries_nr++;
> > > > > > > > @@ -1064,7 +1077,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> > > > > > > >      for (i = 0; i < adev->msix_max; i++, entry++) {
> > > > > > > >          adev->msi_virq[i] = -1;
> > > > > > > >  
> > > > > > > > -        if (assigned_dev_msix_masked(entry)) {
> > > > > > > > +        if (assigned_dev_msix_skipped(entry)) {
> > > > > > > >              continue;
> > > > > > > >          }
> > > > > > > >  
> > > > > > 
> > > > > > 
> > > > 
> > > > 
> > > 
> > > 
> 
>
Alex Williamson - Jan. 7, 2013, 3:09 p.m.
On Mon, 2013-01-07 at 17:02 +0200, Michael S. Tsirkin wrote:
> On Sun, Jan 06, 2013 at 08:57:50AM -0700, Alex Williamson wrote:
> > On Sun, 2013-01-06 at 15:23 +0200, Michael S. Tsirkin wrote:
> > > On Wed, Jan 02, 2013 at 08:49:42AM -0700, Alex Williamson wrote:
> > > > On Fri, 2012-12-21 at 08:46 -0700, Alex Williamson wrote:
> > > > > On Fri, 2012-12-21 at 14:17 +0200, Michael S. Tsirkin wrote:
> > > > > > On Thu, Dec 20, 2012 at 03:15:38PM -0700, Alex Williamson wrote:
> > > > > > > On Thu, 2012-12-20 at 18:38 +0200, Michael S. Tsirkin wrote:
> > > > > > > > On Thu, Dec 20, 2012 at 09:05:50AM -0700, Alex Williamson wrote:
> > > > > > > > > When a guest enables MSIX on a device we evaluate the MSIX vector
> > > > > > > > > table, typically find no unmasked vectors and don't switch the device
> > > > > > > > > to MSIX mode.  This generally works fine and the device will be
> > > > > > > > > switched once the guest enables and therefore unmasks a vector.
> > > > > > > > > Unfortunately some drivers enable MSIX, then use interfaces to send
> > > > > > > > > commands between VF & PF or PF & firmware that act based on the host
> > > > > > > > > state of the device.  These therefore break when MSIX is managed
> > > > > > > > > lazily.  This change re-enables the previous test used to enable MSIX
> > > > > > > > > (see qemu-kvm a6b402c9), which basically guesses whether a vector
> > > > > > > > > will be used based on the data field of the vector table.
> > > > > > > > > 
> > > > > > > > > Cc: qemu-stable@nongnu.org
> > > > > > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > 
> > > > > > > > Same question: can't we enable and mask MSIX through config sysfs?
> > > > > > > > In this case it can be done in userspace ...
> > > > > > > 
> > > > > > > In this case userspace could do this, but I think it's still incredibly
> > > > > > > dangerous.  Kernel space drivers can also directly enable MSI-X on a
> > > > > > > device, but you might get shot for writing one that did.
> > > > > > 
> > > > > > What would be the reason for the kernel driver to do this?
> > > > > 
> > > > > Maybe they don't know how many vectors to use until they enable MSI-X
> > > > > and query some firmware interface.  It's a hypothetical situation, I'm
> > > > > just trying to illustrate that if a kernel driver did want to do this,
> > > > > they'd have to develop interfaces to allow it, not just manually poke
> > > > > their MSI-X enable bit.
> > > > > 
> > > > > > >  We should
> > > > > > > follow the rules, play be the existing kernel interfaces, and work to
> > > > > > > eventually improve those interfaces.  Thanks,
> > > > > > > 
> > > > > > > Alex
> > > > > > 
> > > > > > I'm not against adding an interface for this long term but we have
> > > > > > existing kernels to support too.  IMHO it would be nicer than
> > > > > > the data hack which relies on non-documented guest behaviour
> > > > > > that might change without warning in the future.
> > > > > 
> > > > > We've unwittingly used the data hack for years and only ripped it out
> > > > > because it was undocumented.  The patch below adds documentation for it,
> > > > > so at least we have a more clear understanding of why it was there if we
> > > > > want to try to rip it out again.  This fully supports existing kernels
> > > > > and as I mention below, we might be able to do better with limiting how
> > > > > many vectors we enabled, but I think this is the right initial fix and
> > > > > right fix for stable and we can continue to experiment from here.
> > > > 
> > > > Happy new year.  I'd like to close on this as we do currently have a
> > > > regression for devices that cannot handle MSI-X being lazily enabled.
> > > > The option here is to document and revert to the old style
> > > > initialization behavior where we look at the data field of the vector to
> > > > get a hint whether the guest intends to make use of the vector.  This
> > > > gives us the same behavior as we had previously, but still allows
> > > > vectors to be added, so we maintain the current FreeBSD support.  This
> > > > much needs to go to stable.
> > > 
> > > By the way, could you clarify what exactly happens with FreeBSD?
> > 
> > FreeBSD enables MSIX in config space without setting anything in the
> > MSIX table address/data fields.  Thus previously, when we only looked at
> > the data field and did not add vectors dynamically, FreeBSD wouldn't get
> > any vectors setup.  This change therefore does not effect FreeBSD since
> > it will still lazily enable vectors.  This would be addressed by a
> > follow-on patch since we don't know of any drivers that care about the
> > lazy setup on FreeBSD.
> 
> Hopefully this can be addressed by the long term solution,
> would be nice to have behaviour match spec instead of
> relying on guest-specific behaviour.
> 
> > > > field is not 100% reliable in giving us the number of vectors the guest
> > > > actually intends to use.  Instead we'd like to enable MSI-X with no
> > > > vectors and add vectors as the guest unmasks them.  The host Linux MSI
> > > > API currently doesn't allow this, so I think the next best thing is to
> > > > enable MSI-X with a single vector in the case where MSI-X is enabled but
> > > > no vectors are unmasked.  This conserves vectors on the host though we
> > > > do potentially allow spurious interrupts through the enabled vector
> > > > (though we previously enabled multiple vectors using the above data
> > > > method without problems).
> > > 
> > > Hmm I think this can lose interrupts if they are sent before
> > > there is a handler. The data hack only sends interrupts when
> > > there's a handler so it's safe.
> > 
> > The data hack is looking at the result of pci_enable_msix, in the case
> > of a Linux guest.  The data fields are setup, but interrupt handlers are
> > not setup until the guest calls request_irq, so those would also be
> > spurious interrupts.
> 
> But are the vectors unmasked? I need to look at the code - if not
> they get queued in the hardware.

No, the kvm call does both pci_enable_msix and pci_request_irq.

> > > > The alternative that you're proposing to this longer term solution is to
> > > > manually mask all vectors in the physical MSI-X vector table from
> > > > userspace
> > > 
> > > This masking is not necessary I think: all vectors are masked
> > > after reset IIRC.
> > 
> > Assuming the device supports a reset and the host doesn't restore
> > anything in the vector table.
> 
> We more or less require function level reset for assignment, don't we?

No, qemu doesn't require any kind of reset but will make use of flr, pm,
or secondary bus reset when available.  libvirt imposes a reset
mechanism requirement.  Thanks,

Alex

> > > > then manually enable MSI-X on the physical device (through
> > > > pci-sysfs resource and config access respectively).  This puts the
> > > > physical device is a state that better matches the guest view of the
> > > > devices, but I'm doubtful that the risk is worth the reward.  This adds
> > > > a new state to the qemu MSI-X model where we have entirely host kernel
> > > > managed physical IRQ state, except for this.  It also creates a
> > > > synchronization problem that the physical device moves to a new
> > > > interrupt state outside of the control of the host kernel, possibly
> > > > bypassing any quirks for the host platform.
> > > 
> > > Hmm as long as all vectors are masked, MSIX is a per device state so I
> > > don't see why any quirks would be needed.
> > > 
> > > > Another option is to modify the host MSI API to allow the interface we
> > > > want, splitting enabling MSI-X from vector allocation.  That of course
> > > > has a much longer lead time. 
> > > 
> > > It's a requirement to make it really robust though, isn't it?
> > 
> > It's required to get the kind of control we'd like, including the
> > ability to add vectors without tearing down MSIX and starting over.
> > 
> > > > We can certainly continue the discussion on this, but we need a fix for
> > > > stable and I don't think either of these longer term methods are known
> > > > to have the reliability or simplicity of reverting to the previous
> > > > initialization criteria.  Please ack if you agree.  Thanks,
> > > > 
> > > > Alex
> > > 
> > > I agree with your approach to stable, so ACK.
> > > 
> > > For long term, I think we should start with fixing up
> > > things in the kernel. Then for old kernels, qemu can do
> > > the userspace hack (or some other hack if we find a better one).
> > 
> > Ok, thanks.  Logically it seems easy to split pci_enable_msix into
> > enable + add vectors, but I expect it gets a lot more complicated across
> > numerous architectures and things like interrupt remappers.  I'll try to
> > start looking into this.  Thanks,
> > 
> > Alex
> > 
> > > > > > > > > ---
> > > > > > > > >  hw/kvm/pci-assign.c |   17 +++++++++++++++--
> > > > > > > > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > > > > > > > 
> > > > > > > > > I think we might be able to do a little better than this, but I think
> > > > > > > > > this is the right fix for stable and we can build on it to perhaps only
> > > > > > > > > enable a single vector.
> > > > > > > > > 
> > > > > > > > > diff --git a/hw/kvm/pci-assign.c b/hw/kvm/pci-assign.c
> > > > > > > > > index e80dad0..12a219b 100644
> > > > > > > > > --- a/hw/kvm/pci-assign.c
> > > > > > > > > +++ b/hw/kvm/pci-assign.c
> > > > > > > > > @@ -1025,6 +1025,19 @@ static bool assigned_dev_msix_masked(MSIXTableEntry *entry)
> > > > > > > > >      return (entry->ctrl & cpu_to_le32(0x1)) != 0;
> > > > > > > > >  }
> > > > > > > > >  
> > > > > > > > > +/*
> > > > > > > > > + * When MSI-X is first enabled the vector table typically has all the
> > > > > > > > > + * vectors masked, so we can't use that as the obvious test to figure out
> > > > > > > > > + * how many vectors to initially enable.  Instead we look at the data field
> > > > > > > > > + * because this is what worked for pci-assign for a long time.  This makes
> > > > > > > > > + * sure the physical MSI-X state tracks the guest's view, which is important
> > > > > > > > > + * for some VF/PF and PF/fw communication channels.
> > > > > > > > > + */
> > > > > > > > > +static bool assigned_dev_msix_skipped(MSIXTableEntry *entry)
> > > > > > > > > +{
> > > > > > > > > +    return !entry->data;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > >  static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> > > > > > > > >  {
> > > > > > > > >      AssignedDevice *adev = DO_UPCAST(AssignedDevice, dev, pci_dev);
> > > > > > > > > @@ -1035,7 +1048,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> > > > > > > > >  
> > > > > > > > >      /* Get the usable entry number for allocating */
> > > > > > > > >      for (i = 0; i < adev->msix_max; i++, entry++) {
> > > > > > > > > -        if (assigned_dev_msix_masked(entry)) {
> > > > > > > > > +        if (assigned_dev_msix_skipped(entry)) {
> > > > > > > > >              continue;
> > > > > > > > >          }
> > > > > > > > >          entries_nr++;
> > > > > > > > > @@ -1064,7 +1077,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
> > > > > > > > >      for (i = 0; i < adev->msix_max; i++, entry++) {
> > > > > > > > >          adev->msi_virq[i] = -1;
> > > > > > > > >  
> > > > > > > > > -        if (assigned_dev_msix_masked(entry)) {
> > > > > > > > > +        if (assigned_dev_msix_skipped(entry)) {
> > > > > > > > >              continue;
> > > > > > > > >          }
> > > > > > > > >  
> > > > > > > 
> > > > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > 
> > 
> >

Patch

diff --git a/hw/kvm/pci-assign.c b/hw/kvm/pci-assign.c
index e80dad0..12a219b 100644
--- a/hw/kvm/pci-assign.c
+++ b/hw/kvm/pci-assign.c
@@ -1025,6 +1025,19 @@  static bool assigned_dev_msix_masked(MSIXTableEntry *entry)
     return (entry->ctrl & cpu_to_le32(0x1)) != 0;
 }
 
+/*
+ * When MSI-X is first enabled the vector table typically has all the
+ * vectors masked, so we can't use that as the obvious test to figure out
+ * how many vectors to initially enable.  Instead we look at the data field
+ * because this is what worked for pci-assign for a long time.  This makes
+ * sure the physical MSI-X state tracks the guest's view, which is important
+ * for some VF/PF and PF/fw communication channels.
+ */
+static bool assigned_dev_msix_skipped(MSIXTableEntry *entry)
+{
+    return !entry->data;
+}
+
 static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
 {
     AssignedDevice *adev = DO_UPCAST(AssignedDevice, dev, pci_dev);
@@ -1035,7 +1048,7 @@  static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
 
     /* Get the usable entry number for allocating */
     for (i = 0; i < adev->msix_max; i++, entry++) {
-        if (assigned_dev_msix_masked(entry)) {
+        if (assigned_dev_msix_skipped(entry)) {
             continue;
         }
         entries_nr++;
@@ -1064,7 +1077,7 @@  static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
     for (i = 0; i < adev->msix_max; i++, entry++) {
         adev->msi_virq[i] = -1;
 
-        if (assigned_dev_msix_masked(entry)) {
+        if (assigned_dev_msix_skipped(entry)) {
             continue;
         }