diff mbox series

[2/2] Do not access /dev/mem in MSI-X PCI passthrough on Xen

Message ID 20221114192011.1539233-2-marmarek@invisiblethingslab.com
State New
Headers show
Series [1/2] hw/xen/xen_pt: Call default handler only if no custom one is set | expand

Commit Message

Marek Marczykowski-Górecki Nov. 14, 2022, 7:20 p.m. UTC
The /dev/mem is used for two purposes:
 - reading PCI_MSIX_ENTRY_CTRL_MASKBIT
 - reading Pending Bit Array (PBA)

The first one was originally done because when Xen did not send all
vector ctrl writes to the device model, so QEMU might have outdated old
register value. This has been changed in Xen, so QEMU can now use its
cached value of the register instead.

The Pending Bit Array (PBA) handling is for the case where it lives on
the same page as the MSI-X table itself. Xen has been extended to handle
this case too (as well as other registers that may live on those pages),
so QEMU handling is not necessary anymore.

Removing /dev/mem access is useful to work within stubdomain, and
necessary when dom0 kernel runs in lockdown mode.

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
---
 hw/xen/xen_pt.h     |  1 -
 hw/xen/xen_pt_msi.c | 51 ++++-----------------------------------------
 2 files changed, 4 insertions(+), 48 deletions(-)

Comments

Andrew Cooper Nov. 14, 2022, 7:39 p.m. UTC | #1
On 14/11/2022 19:20, Marek Marczykowski-Górecki wrote:
> The /dev/mem is used for two purposes:
>  - reading PCI_MSIX_ENTRY_CTRL_MASKBIT
>  - reading Pending Bit Array (PBA)
>
> The first one was originally done because when Xen did not send all
> vector ctrl writes to the device model, so QEMU might have outdated old
> register value. This has been changed in Xen, so QEMU can now use its
> cached value of the register instead.
>
> The Pending Bit Array (PBA) handling is for the case where it lives on
> the same page as the MSI-X table itself. Xen has been extended to handle
> this case too (as well as other registers that may live on those pages),
> so QEMU handling is not necessary anymore.
>
> Removing /dev/mem access is useful to work within stubdomain, and
> necessary when dom0 kernel runs in lockdown mode.
>
> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>

The commit message ought to go further.  Using /dev/mem like this is
buggy anyway, because it is trapped and emulated by Xen in whatever
context Qemu is running.  Qemu cannot get the actual hardware value, and
even if it could, it would be racy with transient operations needing to
mask the vector.

i.e. it's not just nice-to-remote - it's fixing real corner cases.

~Andrew
Jan Beulich Nov. 15, 2022, 8:14 a.m. UTC | #2
On 14.11.2022 20:20, Marek Marczykowski-Górecki wrote:
> The /dev/mem is used for two purposes:
>  - reading PCI_MSIX_ENTRY_CTRL_MASKBIT
>  - reading Pending Bit Array (PBA)
> 
> The first one was originally done because when Xen did not send all
> vector ctrl writes to the device model, so QEMU might have outdated old
> register value. This has been changed in Xen, so QEMU can now use its
> cached value of the register instead.
> 
> The Pending Bit Array (PBA) handling is for the case where it lives on
> the same page as the MSI-X table itself. Xen has been extended to handle
> this case too (as well as other registers that may live on those pages),
> so QEMU handling is not necessary anymore.

Don't you need to check for new enough Xen for both aspects?

Jan
Marek Marczykowski-Górecki Nov. 15, 2022, 11:38 a.m. UTC | #3
On Tue, Nov 15, 2022 at 09:14:07AM +0100, Jan Beulich wrote:
> On 14.11.2022 20:20, Marek Marczykowski-Górecki wrote:
> > The /dev/mem is used for two purposes:
> >  - reading PCI_MSIX_ENTRY_CTRL_MASKBIT
> >  - reading Pending Bit Array (PBA)
> > 
> > The first one was originally done because when Xen did not send all
> > vector ctrl writes to the device model, so QEMU might have outdated old
> > register value. This has been changed in Xen, so QEMU can now use its
> > cached value of the register instead.
> > 
> > The Pending Bit Array (PBA) handling is for the case where it lives on
> > the same page as the MSI-X table itself. Xen has been extended to handle
> > this case too (as well as other registers that may live on those pages),
> > so QEMU handling is not necessary anymore.
> 
> Don't you need to check for new enough Xen for both aspects?

Yes, see my response to Andrew in the thread. I'm open for suggestions
what to check specifically (Xen version directly?).
Jan Beulich Nov. 15, 2022, 2:05 p.m. UTC | #4
On 15.11.2022 12:38, Marek Marczykowski-Górecki wrote:
> On Tue, Nov 15, 2022 at 09:14:07AM +0100, Jan Beulich wrote:
>> On 14.11.2022 20:20, Marek Marczykowski-Górecki wrote:
>>> The /dev/mem is used for two purposes:
>>>  - reading PCI_MSIX_ENTRY_CTRL_MASKBIT
>>>  - reading Pending Bit Array (PBA)
>>>
>>> The first one was originally done because when Xen did not send all
>>> vector ctrl writes to the device model, so QEMU might have outdated old
>>> register value. This has been changed in Xen, so QEMU can now use its
>>> cached value of the register instead.
>>>
>>> The Pending Bit Array (PBA) handling is for the case where it lives on
>>> the same page as the MSI-X table itself. Xen has been extended to handle
>>> this case too (as well as other registers that may live on those pages),
>>> so QEMU handling is not necessary anymore.
>>
>> Don't you need to check for new enough Xen for both aspects?
> 
> Yes, see my response to Andrew in the thread. I'm open for suggestions
> what to check specifically (Xen version directly?). 

I guess we should first see what changes we (you) end up doing in Xen
itself, and then decide whether to surface the new functionality in
some kind of feature indicator. Generally I'd prefer to avoid version
checks, because they don't fit very well with backports nor the (rare)
need to revert something. But sometimes a feature can be probed easily
without requiring an explicit feature indicator.

Jan
Jason Andryuk Nov. 16, 2022, 7:15 p.m. UTC | #5
On Mon, Nov 14, 2022 at 2:21 PM Marek Marczykowski-Górecki
<marmarek@invisiblethingslab.com> wrote:
>
> The /dev/mem is used for two purposes:
>  - reading PCI_MSIX_ENTRY_CTRL_MASKBIT
>  - reading Pending Bit Array (PBA)
>
> The first one was originally done because when Xen did not send all
> vector ctrl writes to the device model, so QEMU might have outdated old
> register value. This has been changed in Xen, so QEMU can now use its
> cached value of the register instead.
>
> The Pending Bit Array (PBA) handling is for the case where it lives on
> the same page as the MSI-X table itself. Xen has been extended to handle
> this case too (as well as other registers that may live on those pages),
> so QEMU handling is not necessary anymore.
>
> Removing /dev/mem access is useful to work within stubdomain, and
> necessary when dom0 kernel runs in lockdown mode.
>
> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>

I put the Xen, QEMU, and xen-pciback patches into OpenXT and gave a
little test.  When pci_permissive=0, iwlwifi fails to load its
firmware.  With pci_permissive=1, it looks like MSI-X is enabled. (I
previously included your libxl allow_interrupt_control patch - that
seemed to get regular MSIs working prior to the MSI-X patches.)  I
also removed the OpenXT equivalent of 0005-Disable-MSI-X-caps.patch.
I am testing with Linux 5.4.y, so that could be another factor.

One strange thing is the lspci output.  Dom0 shows MSI-X enabled.
Meanwhile NDVM (sys-net) does not show the MSI-X capability.  If you
`hexdump -C /sys/bus/pci/devices/$dev/config` you can see MSI-X
enabled, but you also see that the MSI capability has 00 as the next
pointer, so lspci stops parsing.

MSI cap stubdom:
00000040  10 00 92 00 c0 0e 00 00  10 0c 10 00 00 00 00 00  |................|
0x41 -> next 0x00
MSI cap dom0:
00000040  10 80 92 00 c0 0e 00 10  10 0c 10 00 00 00 00 00  |................|
0x41 -> next 0x80

MSI-X:
00000080  11 00 0f 80 00 20 00 00  00 30 00 00 00 00 00 00

AFAIU, the value 0x80 at offset 0x83 is MSI-X Enabled.

I had a boot where assignment failed with the hypervisor printing:
d12: assign (0000:00:14.3) failed (-16)
Rebooting the laptop seemed to clear that.

Regards,
Jason
Marek Marczykowski-Górecki Nov. 16, 2022, 9:40 p.m. UTC | #6
On Wed, Nov 16, 2022 at 02:15:22PM -0500, Jason Andryuk wrote:
> On Mon, Nov 14, 2022 at 2:21 PM Marek Marczykowski-Górecki
> <marmarek@invisiblethingslab.com> wrote:
> >
> > The /dev/mem is used for two purposes:
> >  - reading PCI_MSIX_ENTRY_CTRL_MASKBIT
> >  - reading Pending Bit Array (PBA)
> >
> > The first one was originally done because when Xen did not send all
> > vector ctrl writes to the device model, so QEMU might have outdated old
> > register value. This has been changed in Xen, so QEMU can now use its
> > cached value of the register instead.
> >
> > The Pending Bit Array (PBA) handling is for the case where it lives on
> > the same page as the MSI-X table itself. Xen has been extended to handle
> > this case too (as well as other registers that may live on those pages),
> > so QEMU handling is not necessary anymore.
> >
> > Removing /dev/mem access is useful to work within stubdomain, and
> > necessary when dom0 kernel runs in lockdown mode.
> >
> > Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> 
> I put the Xen, QEMU, and xen-pciback patches into OpenXT and gave a
> little test.  When pci_permissive=0, iwlwifi fails to load its
> firmware.  With pci_permissive=1, it looks like MSI-X is enabled. (I
> previously included your libxl allow_interrupt_control patch - that
> seemed to get regular MSIs working prior to the MSI-X patches.)  I
> also removed the OpenXT equivalent of 0005-Disable-MSI-X-caps.patch.
> I am testing with Linux 5.4.y, so that could be another factor.

Can you confirm the allow_interrupt_control is set by libxl? Also,
vanilla 5.4 doesn't have the allow_interrupt_control patch at all, and you
may have an earlier version that had "allow_msi_enable" as the sysfs
file name.

> One strange thing is the lspci output.  Dom0 shows MSI-X enabled.
> Meanwhile NDVM (sys-net) does not show the MSI-X capability.  If you
> `hexdump -C /sys/bus/pci/devices/$dev/config` you can see MSI-X
> enabled, but you also see that the MSI capability has 00 as the next
> pointer, so lspci stops parsing.

This 00 value is written by Linux[1] (sic!) and then qemu incorrectly
allowing the write and happily emulating that zero. The other qemu patch
in this series ought to fix that (as in: properly refuse the write), do
you have it included?

[1] https://github.com/torvalds/linux/blob/master/drivers/net/wireless/intel/iwlwifi/pcie/drv.c#L1721

> MSI cap stubdom:
> 00000040  10 00 92 00 c0 0e 00 00  10 0c 10 00 00 00 00 00  |................|
> 0x41 -> next 0x00
> MSI cap dom0:
> 00000040  10 80 92 00 c0 0e 00 10  10 0c 10 00 00 00 00 00  |................|
> 0x41 -> next 0x80
> 
> MSI-X:
> 00000080  11 00 0f 80 00 20 00 00  00 30 00 00 00 00 00 00
> 
> AFAIU, the value 0x80 at offset 0x83 is MSI-X Enabled.
> 
> I had a boot where assignment failed with the hypervisor printing:
> d12: assign (0000:00:14.3) failed (-16)
> Rebooting the laptop seemed to clear that.

Zombie of previous domain? Not set as "assignable" first?
Marek Marczykowski-Górecki Nov. 17, 2022, 3:34 a.m. UTC | #7
On Wed, Nov 16, 2022 at 10:40:02PM +0100, Marek Marczykowski-Górecki wrote:
> On Wed, Nov 16, 2022 at 02:15:22PM -0500, Jason Andryuk wrote:
> > On Mon, Nov 14, 2022 at 2:21 PM Marek Marczykowski-Górecki
> > <marmarek@invisiblethingslab.com> wrote:
> > >
> > > The /dev/mem is used for two purposes:
> > >  - reading PCI_MSIX_ENTRY_CTRL_MASKBIT
> > >  - reading Pending Bit Array (PBA)
> > >
> > > The first one was originally done because when Xen did not send all
> > > vector ctrl writes to the device model, so QEMU might have outdated old
> > > register value. This has been changed in Xen, so QEMU can now use its
> > > cached value of the register instead.
> > >
> > > The Pending Bit Array (PBA) handling is for the case where it lives on
> > > the same page as the MSI-X table itself. Xen has been extended to handle
> > > this case too (as well as other registers that may live on those pages),
> > > so QEMU handling is not necessary anymore.
> > >
> > > Removing /dev/mem access is useful to work within stubdomain, and
> > > necessary when dom0 kernel runs in lockdown mode.
> > >
> > > Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > 
> > I put the Xen, QEMU, and xen-pciback patches into OpenXT and gave a
> > little test.  When pci_permissive=0, iwlwifi fails to load its
> > firmware.  With pci_permissive=1, it looks like MSI-X is enabled. (I
> > previously included your libxl allow_interrupt_control patch - that
> > seemed to get regular MSIs working prior to the MSI-X patches.)  I
> > also removed the OpenXT equivalent of 0005-Disable-MSI-X-caps.patch.
> > I am testing with Linux 5.4.y, so that could be another factor.
> 
> Can you confirm the allow_interrupt_control is set by libxl? Also,
> vanilla 5.4 doesn't have the allow_interrupt_control patch at all, and you
> may have an earlier version that had "allow_msi_enable" as the sysfs
> file name.

Ok, I found what is wrong. Enabling MSI-X is refused, because INTx isn't
disabled at this point yet. And apparently I was testing this with
permissive=1...

Linux does this:
https://github.com/torvalds/linux/blob/master/drivers/pci/msi/msi.c#L611
In short:
1. Enable MSI-X with MASKALL=1
2. Setup MSI-X table
3. Disable INTx
4. Set MASKALL=0

This patch on top should fix this:
----8<----
diff --git a/drivers/xen/xen-pciback/conf_space_capability.c b/drivers/xen/xen-pciback/conf_space_capability.c
index 097316a74126..f4c4381de76e 100644
--- a/drivers/xen/xen-pciback/conf_space_capability.c
+++ b/drivers/xen/xen-pciback/conf_space_capability.c
@@ -235,7 +235,7 @@ static int msi_msix_flags_write(struct pci_dev *dev, int offset, u16 new_value,
 	    (new_value ^ old_value) & ~field_config->allowed_bits)
 		return PCIBIOS_SET_FAILED;
 
-	if (new_value & field_config->enable_bit) {
+	if ((new_value & field_config->allowed_bits) == field_config->enable_bit) {
 		/* don't allow enabling together with other interrupt types */
 		int int_type = xen_pcibk_get_interrupt_type(dev);
 
----8<----

Jan, is the above safe? It should prevent clearing MASKALL if INTx isn't
disabled, unless I missed something? But also, it will allow enabling
MSI-X with MASKALL=1 together with MSI, which I'm not sure if should be
allowed.
Alternatively to the above patch, I could allow specifically setting
MSIX_FLAGS_ENABLE + MSIX_FLAGS_MASKALL while INTx isn't disabled as a
single exception, but at this point I'm not sure if some other driver or
OS wouldn't approach this in yet another way.
Jan Beulich Nov. 17, 2022, 8:04 a.m. UTC | #8
On 17.11.2022 04:34, Marek Marczykowski-Górecki wrote:
> Ok, I found what is wrong. Enabling MSI-X is refused, because INTx isn't
> disabled at this point yet. And apparently I was testing this with
> permissive=1...
> 
> Linux does this:
> https://github.com/torvalds/linux/blob/master/drivers/pci/msi/msi.c#L611
> In short:
> 1. Enable MSI-X with MASKALL=1
> 2. Setup MSI-X table
> 3. Disable INTx
> 4. Set MASKALL=0
> 
> This patch on top should fix this:
> ----8<----
> diff --git a/drivers/xen/xen-pciback/conf_space_capability.c b/drivers/xen/xen-pciback/conf_space_capability.c
> index 097316a74126..f4c4381de76e 100644
> --- a/drivers/xen/xen-pciback/conf_space_capability.c
> +++ b/drivers/xen/xen-pciback/conf_space_capability.c
> @@ -235,7 +235,7 @@ static int msi_msix_flags_write(struct pci_dev *dev, int offset, u16 new_value,
>  	    (new_value ^ old_value) & ~field_config->allowed_bits)
>  		return PCIBIOS_SET_FAILED;
>  
> -	if (new_value & field_config->enable_bit) {
> +	if ((new_value & field_config->allowed_bits) == field_config->enable_bit) {
>  		/* don't allow enabling together with other interrupt types */
>  		int int_type = xen_pcibk_get_interrupt_type(dev);
>  
> ----8<----
> 
> Jan, is the above safe? It should prevent clearing MASKALL if INTx isn't
> disabled, unless I missed something? But also, it will allow enabling
> MSI-X with MASKALL=1 together with MSI, which I'm not sure if should be
> allowed.

While it would probably be okay with what we have now (after your earlier
patch introducing allowed_bits), it's likely not going to be correct once
further bits would be added to allowed_bits (which clearly is going to be
wanted sooner or later, e.g. for multi-vector MSI). Hence I think ...

> Alternatively to the above patch, I could allow specifically setting
> MSIX_FLAGS_ENABLE + MSIX_FLAGS_MASKALL while INTx isn't disabled as a
> single exception,

... this is the way to go, and ...

> but at this point I'm not sure if some other driver or
> OS wouldn't approach this in yet another way.

... I guess we need to further add exceptions as needed - the one further
approach I could see is to keep all entry's mask bits set until setting
"INTx disable", without using MASKALL.

I'd like to note though that the PCI spec has no such exception. It,
however, also doesn't mandate setting "INTx disable" - that's merely a
workaround for flawed hardware afaik. (In Xen we also only cross check
MSI and MSI-X. IOW we rely on Dom0 anyway when it comes to driving
"INTx disable".) Therefore in the end pciback looks to be going too far
in enforcing such dependencies.

Thinking of this - what about making the change in
xen_pcibk_get_interrupt_type() instead, setting INTERRUPT_TYPE_MSIX only
when MASKALL is clear (alongside ENABLE being set)? This would then also
cover command_write().

Jan
Marek Marczykowski-Górecki Nov. 17, 2022, 11:15 a.m. UTC | #9
On Thu, Nov 17, 2022 at 09:04:40AM +0100, Jan Beulich wrote:
> On 17.11.2022 04:34, Marek Marczykowski-Górecki wrote:
> > Ok, I found what is wrong. Enabling MSI-X is refused, because INTx isn't
> > disabled at this point yet. And apparently I was testing this with
> > permissive=1...
> > 
> > Linux does this:
> > https://github.com/torvalds/linux/blob/master/drivers/pci/msi/msi.c#L611
> > In short:
> > 1. Enable MSI-X with MASKALL=1
> > 2. Setup MSI-X table
> > 3. Disable INTx
> > 4. Set MASKALL=0
> > 
> > This patch on top should fix this:
> > ----8<----
> > diff --git a/drivers/xen/xen-pciback/conf_space_capability.c b/drivers/xen/xen-pciback/conf_space_capability.c
> > index 097316a74126..f4c4381de76e 100644
> > --- a/drivers/xen/xen-pciback/conf_space_capability.c
> > +++ b/drivers/xen/xen-pciback/conf_space_capability.c
> > @@ -235,7 +235,7 @@ static int msi_msix_flags_write(struct pci_dev *dev, int offset, u16 new_value,
> >  	    (new_value ^ old_value) & ~field_config->allowed_bits)
> >  		return PCIBIOS_SET_FAILED;
> >  
> > -	if (new_value & field_config->enable_bit) {
> > +	if ((new_value & field_config->allowed_bits) == field_config->enable_bit) {
> >  		/* don't allow enabling together with other interrupt types */
> >  		int int_type = xen_pcibk_get_interrupt_type(dev);
> >  
> > ----8<----
> > 
> > Jan, is the above safe? It should prevent clearing MASKALL if INTx isn't
> > disabled, unless I missed something? But also, it will allow enabling
> > MSI-X with MASKALL=1 together with MSI, which I'm not sure if should be
> > allowed.
> 
> While it would probably be okay with what we have now (after your earlier
> patch introducing allowed_bits), it's likely not going to be correct once
> further bits would be added to allowed_bits (which clearly is going to be
> wanted sooner or later, e.g. for multi-vector MSI). Hence I think ...
> 
> > Alternatively to the above patch, I could allow specifically setting
> > MSIX_FLAGS_ENABLE + MSIX_FLAGS_MASKALL while INTx isn't disabled as a
> > single exception,
> 
> ... this is the way to go, and ...

Ok, I'll go this way then.

> > but at this point I'm not sure if some other driver or
> > OS wouldn't approach this in yet another way.
> 
> ... I guess we need to further add exceptions as needed - the one further
> approach I could see is to keep all entry's mask bits set until setting
> "INTx disable", without using MASKALL.
> 
> I'd like to note though that the PCI spec has no such exception. It,
> however, also doesn't mandate setting "INTx disable" - that's merely a
> workaround for flawed hardware afaik. (In Xen we also only cross check
> MSI and MSI-X. IOW we rely on Dom0 anyway when it comes to driving
> "INTx disable".) Therefore in the end pciback looks to be going too far
> in enforcing such dependencies.
> 
> Thinking of this - what about making the change in
> xen_pcibk_get_interrupt_type() instead, setting INTERRUPT_TYPE_MSIX only
> when MASKALL is clear (alongside ENABLE being set)? This would then also
> cover command_write().

That may be a good idea, but wouldn't cover the case here:
xen_pcibk_get_interrupt_type() at this time returns INTERRUPT_TYPE_INTX
only, and setting MSIX_FLAGS_ENABLE is prevented.
Jason Andryuk Nov. 17, 2022, 5:29 p.m. UTC | #10
On Wed, Nov 16, 2022 at 10:34 PM Marek Marczykowski-Górecki
<marmarek@invisiblethingslab.com> wrote:
>
> On Wed, Nov 16, 2022 at 10:40:02PM +0100, Marek Marczykowski-Górecki wrote:
> > On Wed, Nov 16, 2022 at 02:15:22PM -0500, Jason Andryuk wrote:
> > > On Mon, Nov 14, 2022 at 2:21 PM Marek Marczykowski-Górecki
> > > <marmarek@invisiblethingslab.com> wrote:
> > > >
> > > > The /dev/mem is used for two purposes:
> > > >  - reading PCI_MSIX_ENTRY_CTRL_MASKBIT
> > > >  - reading Pending Bit Array (PBA)
> > > >
> > > > The first one was originally done because when Xen did not send all
> > > > vector ctrl writes to the device model, so QEMU might have outdated old
> > > > register value. This has been changed in Xen, so QEMU can now use its
> > > > cached value of the register instead.
> > > >
> > > > The Pending Bit Array (PBA) handling is for the case where it lives on
> > > > the same page as the MSI-X table itself. Xen has been extended to handle
> > > > this case too (as well as other registers that may live on those pages),
> > > > so QEMU handling is not necessary anymore.
> > > >
> > > > Removing /dev/mem access is useful to work within stubdomain, and
> > > > necessary when dom0 kernel runs in lockdown mode.
> > > >
> > > > Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > >
> > > I put the Xen, QEMU, and xen-pciback patches into OpenXT and gave a
> > > little test.  When pci_permissive=0, iwlwifi fails to load its
> > > firmware.  With pci_permissive=1, it looks like MSI-X is enabled. (I
> > > previously included your libxl allow_interrupt_control patch - that
> > > seemed to get regular MSIs working prior to the MSI-X patches.)  I
> > > also removed the OpenXT equivalent of 0005-Disable-MSI-X-caps.patch.
> > > I am testing with Linux 5.4.y, so that could be another factor.
> >
> > Can you confirm the allow_interrupt_control is set by libxl? Also,
> > vanilla 5.4 doesn't have the allow_interrupt_control patch at all, and you
> > may have an earlier version that had "allow_msi_enable" as the sysfs
> > file name.

I backported allow_interrupt_control to 5.4 and that is set properly.

> Ok, I found what is wrong. Enabling MSI-X is refused, because INTx isn't
> disabled at this point yet. And apparently I was testing this with
> permissive=1...
>
> Linux does this:
> https://github.com/torvalds/linux/blob/master/drivers/pci/msi/msi.c#L611
> In short:
> 1. Enable MSI-X with MASKALL=1
> 2. Setup MSI-X table
> 3. Disable INTx
> 4. Set MASKALL=0
>
> This patch on top should fix this:
> ----8<----
> diff --git a/drivers/xen/xen-pciback/conf_space_capability.c b/drivers/xen/xen-pciback/conf_space_capability.c
> index 097316a74126..f4c4381de76e 100644
> --- a/drivers/xen/xen-pciback/conf_space_capability.c
> +++ b/drivers/xen/xen-pciback/conf_space_capability.c
> @@ -235,7 +235,7 @@ static int msi_msix_flags_write(struct pci_dev *dev, int offset, u16 new_value,
>             (new_value ^ old_value) & ~field_config->allowed_bits)
>                 return PCIBIOS_SET_FAILED;
>
> -       if (new_value & field_config->enable_bit) {
> +       if ((new_value & field_config->allowed_bits) == field_config->enable_bit) {
>                 /* don't allow enabling together with other interrupt types */
>                 int int_type = xen_pcibk_get_interrupt_type(dev);
>
> ----8<----

FWIW, I can confirm this allows enabling MSI-X with permissive=0 for me.

Thanks,
Jason
diff mbox series

Patch

diff --git a/hw/xen/xen_pt.h b/hw/xen/xen_pt.h
index e7c4316a7d..de4094e7ec 100644
--- a/hw/xen/xen_pt.h
+++ b/hw/xen/xen_pt.h
@@ -206,7 +206,6 @@  typedef struct XenPTMSIX {
     uint32_t table_offset_adjust; /* page align mmap */
     uint64_t mmio_base_addr;
     MemoryRegion mmio;
-    void *phys_iomem_base;
     XenPTMSIXEntry msix_entry[];
 } XenPTMSIX;
 
diff --git a/hw/xen/xen_pt_msi.c b/hw/xen/xen_pt_msi.c
index b71563f98a..a8a75dff66 100644
--- a/hw/xen/xen_pt_msi.c
+++ b/hw/xen/xen_pt_msi.c
@@ -460,15 +460,7 @@  static void pci_msix_write(void *opaque, hwaddr addr,
         entry->updated = true;
     } else if (msix->enabled && entry->updated &&
                !(val & PCI_MSIX_ENTRY_CTRL_MASKBIT)) {
-        const volatile uint32_t *vec_ctrl;
-
-        /*
-         * If Xen intercepts the mask bit access, entry->vec_ctrl may not be
-         * up-to-date. Read from hardware directly.
-         */
-        vec_ctrl = s->msix->phys_iomem_base + entry_nr * PCI_MSIX_ENTRY_SIZE
-            + PCI_MSIX_ENTRY_VECTOR_CTRL;
-        xen_pt_msix_update_one(s, entry_nr, *vec_ctrl);
+        xen_pt_msix_update_one(s, entry_nr, entry->latch(VECTOR_CTRL));
     }
 
     set_entry_value(entry, offset, val);
@@ -493,7 +485,9 @@  static uint64_t pci_msix_read(void *opaque, hwaddr addr,
         return get_entry_value(&msix->msix_entry[entry_nr], offset);
     } else {
         /* Pending Bit Array (PBA) */
-        return *(uint32_t *)(msix->phys_iomem_base + addr);
+        XEN_PT_LOG(&s->dev, "reading PBA, addr %#lx, offset %#lx\n",
+                   addr, addr - msix->total_entries * PCI_MSIX_ENTRY_SIZE);
+        return 0xFFFFFFFF;
     }
 }
 
@@ -529,7 +523,6 @@  int xen_pt_msix_init(XenPCIPassthroughState *s, uint32_t base)
     int i, total_entries, bar_index;
     XenHostPCIDevice *hd = &s->real_device;
     PCIDevice *d = &s->dev;
-    int fd = -1;
     XenPTMSIX *msix = NULL;
     int rc = 0;
 
@@ -576,34 +569,6 @@  int xen_pt_msix_init(XenPCIPassthroughState *s, uint32_t base)
     msix->table_base = s->real_device.io_regions[bar_index].base_addr;
     XEN_PT_LOG(d, "get MSI-X table BAR base 0x%"PRIx64"\n", msix->table_base);
 
-    fd = open("/dev/mem", O_RDWR);
-    if (fd == -1) {
-        rc = -errno;
-        XEN_PT_ERR(d, "Can't open /dev/mem: %s\n", strerror(errno));
-        goto error_out;
-    }
-    XEN_PT_LOG(d, "table_off = 0x%x, total_entries = %d\n",
-               table_off, total_entries);
-    msix->table_offset_adjust = table_off & 0x0fff;
-    msix->phys_iomem_base =
-        mmap(NULL,
-             total_entries * PCI_MSIX_ENTRY_SIZE + msix->table_offset_adjust,
-             PROT_READ,
-             MAP_SHARED | MAP_LOCKED,
-             fd,
-             msix->table_base + table_off - msix->table_offset_adjust);
-    close(fd);
-    if (msix->phys_iomem_base == MAP_FAILED) {
-        rc = -errno;
-        XEN_PT_ERR(d, "Can't map physical MSI-X table: %s\n", strerror(errno));
-        goto error_out;
-    }
-    msix->phys_iomem_base = (char *)msix->phys_iomem_base
-        + msix->table_offset_adjust;
-
-    XEN_PT_LOG(d, "mapping physical MSI-X table to %p\n",
-               msix->phys_iomem_base);
-
     memory_region_add_subregion_overlap(&s->bar[bar_index], table_off,
                                         &msix->mmio,
                                         2); /* Priority: pci default + 1 */
@@ -624,14 +589,6 @@  void xen_pt_msix_unmap(XenPCIPassthroughState *s)
         return;
     }
 
-    /* unmap the MSI-X memory mapped register area */
-    if (msix->phys_iomem_base) {
-        XEN_PT_LOG(&s->dev, "unmapping physical MSI-X table from %p\n",
-                   msix->phys_iomem_base);
-        munmap(msix->phys_iomem_base, msix->total_entries * PCI_MSIX_ENTRY_SIZE
-               + msix->table_offset_adjust);
-    }
-
     memory_region_del_subregion(&s->bar[msix->bar_index], &msix->mmio);
 }