
[RFC,v2,08/18] iommu/riscv: Use MSI table to enable IMSIC access

Message ID 20250920203851.2205115-28-ajones@ventanamicro.com
State New
Series iommu/riscv: Add irqbypass support

Commit Message

Andrew Jones Sept. 20, 2025, 8:38 p.m. UTC
When setting irq affinity extract the IMSIC address the device
needs to access and add it to the MSI table. If the device no
longer needs access to an IMSIC then remove it from the table
to prohibit access. This allows isolating device MSIs to a set
of harts so we can now add the IRQ_DOMAIN_FLAG_ISOLATED_MSI IRQ
domain flag.

Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
---
 drivers/iommu/riscv/iommu-ir.c | 143 ++++++++++++++++++++++++++++++++-
 drivers/iommu/riscv/iommu.h    |   2 +
 2 files changed, 143 insertions(+), 2 deletions(-)

Comments

Jason Gunthorpe Sept. 22, 2025, 6:43 p.m. UTC | #1
On Sat, Sep 20, 2025 at 03:38:58PM -0500, Andrew Jones wrote:
> When setting irq affinity extract the IMSIC address the device
> needs to access and add it to the MSI table. If the device no
> longer needs access to an IMSIC then remove it from the table
> to prohibit access. This allows isolating device MSIs to a set
> of harts so we can now add the IRQ_DOMAIN_FLAG_ISOLATED_MSI IRQ
> domain flag.

IRQ_DOMAIN_FLAG_ISOLATED_MSI has nothing to do with HARTs.

 * Isolated MSI means that HW modeled by an irq_domain on the path from the
 * initiating device to the CPU will validate that the MSI message specifies an
 * interrupt number that the device is authorized to trigger. This must block
 * devices from triggering interrupts they are not authorized to trigger.
 * Currently authorization means the MSI vector is one assigned to the device.

It has to do with each PCI BDF having a unique set of
validation/mapping tables for MSIs that are granular to the interrupt
number.

As I understand the spec this is only possible with msiptp? As
discussed previously this has to be a static property and the SW stack
doesn't expect it to change. So if the IR driver sets
IRQ_DOMAIN_FLAG_ISOLATED_MSI it has to always use msiptp?

Further, since the interrupt tables have to be per BDF they cannot be
linked to an iommu_domain! Storing the msiptp in an iommu_domain is
totally wrong?? It needs to somehow be stored in the interrupt layer
per-struct device, check how AMD and Intel have stored their IR tables
programmed into their versions of DC.

It looks like there is something in here to support HW that doesn't
have msiptp? That's different, and also looks very confused. The IR
driver should never be touching the iommu domain or calling iommu_map!
Instead it probably has to use the SW_MSI mechanism to request mapping
the interrupt controller aperture. You don't get
IRQ_DOMAIN_FLAG_ISOLATED_MSI with something like this though. Look at
how ARM GIC works for this mechanism.

Finally, please split this series up, if there are two different ways
to manage the MSI aperture then please split it into two series with a
clear description how the HW actually works.

Maybe start with the simpler case of no msiptp??

Jason
Andrew Jones Sept. 22, 2025, 9:20 p.m. UTC | #2
On Mon, Sep 22, 2025 at 03:43:36PM -0300, Jason Gunthorpe wrote:
> On Sat, Sep 20, 2025 at 03:38:58PM -0500, Andrew Jones wrote:
> > When setting irq affinity extract the IMSIC address the device
> > needs to access and add it to the MSI table. If the device no
> > longer needs access to an IMSIC then remove it from the table
> > to prohibit access. This allows isolating device MSIs to a set
> > of harts so we can now add the IRQ_DOMAIN_FLAG_ISOLATED_MSI IRQ
> > domain flag.
> 
> IRQ_DOMAIN_FLAG_ISOLATED_MSI has nothing to do with HARTs.
> 
>  * Isolated MSI means that HW modeled by an irq_domain on the path from the
>  * initiating device to the CPU will validate that the MSI message specifies an
>  * interrupt number that the device is authorized to trigger. This must block
>  * devices from triggering interrupts they are not authorized to trigger.
>  * Currently authorization means the MSI vector is one assigned to the device.

Unfortunately the RISC-V IOMMU doesn't have support for this. I've raised
the lack of MSI data validation to the spec writers and I'll try to raise
it again, but I was hoping we could still get IRQ_DOMAIN_FLAG_ISOLATED_MSI
by simply ensuring the MSI addresses only include the affined harts (and
also with the NOTE comment I've put in this patch to point out the
deficiency).

> 
> It has to do with each PCI BDF having a unique set of
> validation/mapping tables for MSIs that are granular to the interrupt
> number.

Interrupt numbers (MSI data) aren't used by the RISC-V IOMMU in any way.

> 
> As I understand the spec this is only possible with msiptp? As
> discussed previously this has to be a static property and the SW stack
> doesn't expect it to change. So if the IR driver sets
> IRQ_DOMAIN_FLAG_ISOLATED_MSI it has to always use msiptp?

Yes, the patch only sets IRQ_DOMAIN_FLAG_ISOLATED_MSI when the IOMMU
has RISCV_IOMMU_CAPABILITIES_MSI_FLAT and it will remain set for the
lifetime of the irqdomain, no matter how the IOMMU is being applied.

> 
> Further, since the interrupt tables have to be per BDF they cannot be
> linked to an iommu_domain! Storing the msiptp in an iommu_domain is
> totally wrong?? It needs to somehow be stored in the interrupt layer
> per-struct device, check how AMD and Intel have stored their IR tables
> programmed into their versions of DC.

The RISC-V IOMMU MSI table is simply a flat address remapping table,
which also has support for MRIFs. The table indices come from an
address matching mechanism used to filter out invalid addresses and
to convert valid addresses into MSI table indices. IOW, the RISC-V
MSI table is a simple translation table, and even needs to be tied to
a particular DMA table in order to work. Here are some examples:

1. stage1 not BARE
------------------

      stage1     MSI table
 IOVA ------> A  ---------> host-MSI-address

2. stage1 is BARE, for example if only stage2 is in use
-------------------------------------------------------

           MSI table
 IOVA == A ---------> host-MSI-address

When used by the host A == host-MSI-address, but at least we can block
the write when an IRQ has been affined to a set of harts that doesn't
include what it's targeting. When used for irqbypass A == guest-MSI-
address and the host-MSI-address will be that of a guest interrupt file.
This ensures a device assigned to a guest can only reach its own vcpus
when sending MSIs.

In the first example, where stage1 is not BARE, the stage1 page tables
must have some IOVA->A mapping, otherwise the MSI table will not get
a chance to do a translation, as the stage1 DMA will fault. This
series ensures stage1 gets an identity mapping for all possible MSI
targets and then leaves it be, using the MSI tables instead for the
isolation.

I don't think we can apply a lot of AMD's and Intel's model to RISC-V.

> 
> It looks like there is something in here to support HW that doesn't
> have msiptp? That's different, and also looks very confused.

The only support is to ensure all the host IMSICs are mapped, otherwise
we can't turn on IOMMU_DMA since all MSI writes will cause faults. We
don't set IRQ_DOMAIN_FLAG_ISOLATED_MSI in this case, though, since we
don't bother unmapping MSI addresses of harts that IRQs have been
un-affined from.

> The IR
> driver should never be touching the iommu domain or calling iommu_map!

As pointed out above, the RISC-V IR is quite a different beast than AMD
and Intel. Whether or not the IOMMU has MSI table support, the IMSICs
must be mapped in stage1, when stage1 is not BARE. So, in both cases we
roll that mapping into the IR code since there isn't really any better
place for it for the host case and it's necessary for the IR code to
manage it for the virt case. Since IR (or MSI delivery in general) is
dependent upon the stage1 page tables, it's necessary to tie it to the
same IOMMU domain that those page tables are tied to. Patch4's changes
to riscv_iommu_attach_paging_domain() and riscv_iommu_iodir_update() show
how they're tied together.

> Instead it probably has to use the SW_MSI mechanism to request mapping
> the interrupt controller aperture. You don't get
> IRQ_DOMAIN_FLAG_ISOLATED_MSI with something like this though. Look at
> how ARM GIC works for this mechanism.

I'm not seeing how SW_MSI will help here, but so far I've just done some
quick grepping and code skimming.

> 
> Finally, please split this series up, if there are two different ways
> to manage the MSI aperture then please split it into two series with a
> clear description how the HW actually works.
> 
> Maybe start with the simpler case of no msiptp??

The first five patches plus the "enable IOMMU_DMA" will allow paging
domains to be used by default, while paving the way for patches 6-8 to
allow host IRQs to be isolated to the best of our ability (only able to
access IMSICs to which they are affined). So we could have

series1: irqdomain + map all imsics + enable IOMMU_DMA
series2: actually apply irqdomain in order to implement map/unmap of MSI
         ptes based on IRQ affinity - set IRQ_DOMAIN_FLAG_ISOLATED_MSI,
         because that's the best we've got...
series3: the rest of the patches of this series which introduce irqbypass
         support for the virt use case

Would that be better? Or do you see some need for some patch splits as
well?

Thanks,
drew
Jason Gunthorpe Sept. 22, 2025, 11:56 p.m. UTC | #3
On Mon, Sep 22, 2025 at 04:20:43PM -0500, Andrew Jones wrote:
> On Mon, Sep 22, 2025 at 03:43:36PM -0300, Jason Gunthorpe wrote:
> > On Sat, Sep 20, 2025 at 03:38:58PM -0500, Andrew Jones wrote:
> > > When setting irq affinity extract the IMSIC address the device
> > > needs to access and add it to the MSI table. If the device no
> > > longer needs access to an IMSIC then remove it from the table
> > > to prohibit access. This allows isolating device MSIs to a set
> > > of harts so we can now add the IRQ_DOMAIN_FLAG_ISOLATED_MSI IRQ
> > > domain flag.
> > 
> > IRQ_DOMAIN_FLAG_ISOLATED_MSI has nothing to do with HARTs.
> > 
> >  * Isolated MSI means that HW modeled by an irq_domain on the path from the
> >  * initiating device to the CPU will validate that the MSI message specifies an
> >  * interrupt number that the device is authorized to trigger. This must block
> >  * devices from triggering interrupts they are not authorized to trigger.
> >  * Currently authorization means the MSI vector is one assigned to the device.
> 
> Unfortunately the RISC-V IOMMU doesn't have support for this. I've raised
> the lack of MSI data validation to the spec writers and I'll try to raise
> it again, but I was hoping we could still get IRQ_DOMAIN_FLAG_ISOLATED_MSI
> by simply ensuring the MSI addresses only include the affined harts (and
> also with the NOTE comment I've put in this patch to point out the
> deficiency).

Nope, sorry, it has to validate the incoming target Linux interrupt
vector in some way - that is the whole point and the required
security.

We cannot allow devices assigned to a VFIO to trigger vectors that
belong to devices outside that VFIO. HARTS should have nothing to do
with that security statement.

> > It has to do with each PCI BDF having a unique set of
> > validation/mapping tables for MSIs that are granular to the interrupt
> > number.
> 
> Interrupt numbers (MSI data) aren't used by the RISC-V IOMMU in any way.

Interrupt number is a Linux concept, HW decodes the addr/data pair and
delivers it to some Linux interrupt. Linux doesn't care how the HW
treats the addr/data pair, it can ignore data if it wants.

> 1. stage1 not BARE
> ------------------
> 
>       stage1     MSI table
>  IOVA ------> A  ---------> host-MSI-address

If you run the driver like this then it should use IOMMU_RESV_SW_MSI
to make an aperture. Follow how ARM GIC works for this. The Linux
architecture is that the iommu_domain owner is responsible to provide
a range for SW_MSI and connect it to the interrupt driver.

> 2. stage1 is BARE, for example if only stage2 is in use
> -------------------------------------------------------
> 
>            MSI table
>  IOVA == A ---------> host-MSI-address

And here you'd use HW_MSI to set a fixed aperture that matches
DC.msi_addr_* (?)

> I don't think we can apply a lot of AMD's and Intel's model to RISC-V.

AMD and Intel have the interrupt translation as part of the "DC". That
corresponds to the DC.msiptp related stuff.

ARM has a SW_MSI that applies the S1/S2 translation first and then
puts the interrupt translation in the GIC IP block. That corresponds
to what you just explained above.

Somehow RISCV have the most challenging properties of both
architectures, and mis-understood what security Linux expects out of
interrupt remapping :|

Use SW_MSI to open the aperture like ARM, use techniques like
AMD/Intel to manage the translation table in the shared DC

Don't touch the iommu_domain from the interrupt driver, it simply
cannot work.
 
> > It looks like there is something in here to support HW that doesn't
> > have msiptp? That's different, and also looks very confused.
> 
> The only support is to ensure all the host IMSICs are mapped, otherwise
> we can't turn on IOMMU_DMA since all MSI writes will cause faults. We
> don't set IRQ_DOMAIN_FLAG_ISOLATED_MSI in this case, though, since we
> don't bother unmapping MSI addresses of harts that IRQs have been
> un-affined from.

OK, that seems like a good first series. Implement just the SW_MSI to
expose all the IMSIC's.

> > Instead it probably has to use the SW_MSI mechanism to request mapping
> > the interrupt controller aperture. You don't get
> > IRQ_DOMAIN_FLAG_ISOLATED_MSI with something like this though. Look at
> > how ARM GIC works for this mechanism.
> 
> I'm not seeing how SW_MSI will help here, but so far I've just done some
> quick grepping and code skimming.

SW_MSI is intended to deal with exactly what you are describing - the
IOMMU translates the MSI aperture.

1) The driver declares an IOMMU_RESV_SW_MSI in its reserved_regions
2) When a subsystem wishes to use an iommu_domain it must detect the
   SW_MSI and setup some IOVA for it to use
   a) dma-iommu.c allocates IOVA dynamically per-page via
      iommu_dma_get_msi_page()
   b) Classic VFIO uses iommu_get_msi_cookie() with the SW_MSI range.
   c) iommufd has its own logic in and around iommufd_sw_msi()
3) When the IRQ driver wants to establish a MSI it computes the
   untranslated address it wants and then calls iommu_dma_prepare_msi()

   iommu_dma_prepare_msi() will return a msi_desc with the address
   adjusted

   For instance dma-iommu.c has iommu_dma_prepare_msi() call into
   iommu_dma_get_msi_page() which will establish any required mapping
   for phys at some IOVA it picks.

The riscv mess has the additional obnoxious complexity that the S1/S2
are inconsistent.

Right now the only way to deal with this would be to only use one of
the S1 or S2 and make that decision when the iommu driver starts. You
can return the right SW_MSI/HW_MSI based on which PAGING domain style
the driver is going to use.

I recommend if the S2 is available you make the driver always use the
S2 for everything and ignore the S1 except for explicit two stage
translation setup by a hypervisor. Thus always return HW_MSI.

If the iommu does not support S2 then always use S1 and always return
SW_MSI.

Disallow SW from creating a PAGING domain that mismatches the expected
interrupt setup.

There is still quite some mess to enhance the iommu_dma_prepare_msi()
path to support all the modes, that will need new code as ARM doesn't
have the difference between S1/S2 (sigh)

But broadly this is the sketch for where the touchpoints should be to
setup a MSI aperture that has to go through the page table.

> The first five patches plus the "enable IOMMU_DMA" will allow paging
> domains to be used by default, while paving the way for patches 6-8 to
> allow host IRQs to be isolated to the best of our ability (only able to
> access IMSICs to which they are affined). So we could have
> 
> series1: irqdomain + map all imsics + enable IOMMU_DMA

Yes, please lets try to get through this first..

> series2: actually apply irqdomain in order to implement map/unmap of MSI
>          ptes based on IRQ affinity - set IRQ_DOMAIN_FLAG_ISOLATED_MSI,
>          because that's the best we've got...

Makes sense, but as above, need to do something to make it actually
implement IRQ_DOMAIN_FLAG_ISOLATED_MSI. Limited to HARTS is not the
same thing.

This also needs some kind of new iommufd support, and I don't know how
it will work, but setting the DC.msi_addr* should come from userspace
and create a new reserved region.. Maybe we need to create per-domain
reserved regions..

> series3: the rest of the patches of this series which introduce irqbypass
>          support for the virt use case

Yes

> Would that be better? Or do you see some need for some patch splits as
> well?

I did not try to look at the patches for this, but I think your three
themes make a lot of sense.

Please include in the cover letter some of the explanations from these
two RFC postings about how the HW works.

Jason
Thomas Gleixner Sept. 23, 2025, 10:12 a.m. UTC | #4
On Mon, Sep 22 2025 at 20:56, Jason Gunthorpe wrote:
> On Mon, Sep 22, 2025 at 04:20:43PM -0500, Andrew Jones wrote:
>> > It has to do with each PCI BDF having a unique set of
>> > validation/mapping tables for MSIs that are granular to the interrupt
>> > number.
>> 
>> Interrupt numbers (MSI data) aren't used by the RISC-V IOMMU in any way.
>
> Interrupt number is a Linux concept, HW decodes the addr/data pair and
> delivers it to some Linux interrupt. Linux doesn't care how the HW
> treats the addr/data pair, it can ignore data if it wants.

Let me explain this a bit deeper.

As you said, the interrupt number is a pure kernel software construct,
which is mapped to a hardware interrupt source.

The interrupt domain, which is associated to a hardware interrupt
source, creates the mapping and supplies the resulting configuration to
the hardware, so that the hardware is able to raise an interrupt in the
CPU.

In case of MSI, this configuration is the MSI message (address,
data). That's composed by the domain according to the requirements of
the underlying CPU hardware resource. This underlying hardware resource
can be the CPUs interrupt controller itself or some intermediary
hardware entity.

The kernel reflects this in the interrupt domain hierarchy. The simplest
case for MSI is:

     [ CPU domain ] --- [ MSI domain ] -- device

The flow is as follows:

   device driver allocates an MSI interrupt in the MSI domain

   MSI domain allocates an interrupt in the CPU domain

   CPU domain allocates an interrupt vector and composes the
   address/data pair. If @data is written to @address, the interrupt is
   raised in the CPU

   MSI domain converts the address/data pair into device format and
   writes it into the device.

   When the device fires an interrupt it writes @data to @address, which
   raises the interrupt in the CPU at the allocated CPU vector.  That
   vector is then translated to the Linux interrupt number in the
   interrupt handling entry code by looking it up in the CPU domain.

With a remapping domain intermediary this looks like this:

     [ CPU domain ] --- [ Remap domain] --- [ MSI domain ] -- device
 
   device driver allocates an MSI interrupt in the MSI domain

   MSI domain allocates an interrupt in the Remap domain

   Remap domain allocates a resource in the remap space, e.g. an entry
   in the remap translation table and then allocates an interrupt in the
   CPU domain.

   CPU domain allocates an interrupt vector and composes the
   address/data pair. If @data is written to @address, the interrupt is
   raised in the CPU

   Remap domain converts the CPU address/data pair to remap table format
   and writes it to the allocated entry in that table. It then composes
   a new address/data pair, which points at the remap table entry.

   MSI domain converts the remap address/data pair into device format
   and writes it into the device.

   So when the device fires an interrupt it writes @data to @address,
   which triggers the remap unit. The remap unit validates that the
   address/data pair is valid for the device and if so it writes the CPU
   address/data pair, which raises the interrupt in the CPU at the
   allocated vector. That vector is then translated to the Linux
   interrupt number in the interrupt handling entry code by looking it
   up in the CPU domain.

So from a kernel POV, the address/data pairs are just opaque
configuration values, which are written into the remap table and the
device. Whether the content of @data is relevant or not, is a hardware
implementation detail. That implementation detail is only relevant for
the interrupt domain code, which handles a specific part of the
hierarchy.

The MSI domain does not need to know anything about the content and the
meaning of @address and @data. It just cares about converting that into
the device specific storage format.

The Remap domain does not need to know anything about the content and
the meaning of the CPU domain provided @address and @data. It just cares
about converting that into the remap table specific format.

The hardware entities do not know about the Linux interrupt number at
all. That relationship is purely software managed as a mapping from the
allocated CPU vector to the Linux interrupt number.

Hope that helps.

     tglx
Jason Gunthorpe Sept. 23, 2025, 2:06 p.m. UTC | #5
On Tue, Sep 23, 2025 at 12:12:52PM +0200, Thomas Gleixner wrote:
> With a remapping domain intermediary this looks like this:
> 
>      [ CPU domain ] --- [ Remap domain] --- [ MSI domain ] -- device
>  
>    device driver allocates an MSI interrupt in the MSI domain
> 
>    MSI domain allocates an interrupt in the Remap domain
> 
>    Remap domain allocates a resource in the remap space, e.g. an entry
>    in the remap translation table and then allocates an interrupt in the
>    CPU domain.

Thanks!

And to be very crystal clear here, the meaning of
IRQ_DOMAIN_FLAG_ISOLATED_MSI is that the remap domain has a security
feature such that the device can only trigger CPU domain interrupts
that have been explicitly allocated in the remap domain for that
device. The device can never go through the remap domain and trigger
some other device's interrupt.

This is usually done by having the remap domain's HW take in the
Addr/Data pair, do a per-BDF table lookup and then completely replace
the Addr/Data pair with the "remapped" version. By fully replacing the
pair, the remap domain prevents the device from generating a disallowed
addr/data pair toward the CPU domain.

It fundamentally must be done by having the HW do a per-RID/BDF table
lookup based on the incoming MSI addr/data and fully sanitize the
resulting output.

There is some legacy history here. When MSI was first invented the
goal was to make interrupts scalable by removing any state from the
CPU side. The device would be told what Addr/Data to send to the CPU
and the CPU would just take some encoded information in that pair as a
delivery instruction. No state on the CPU side per interrupt.

In the world of virtualization it was realized this is not secure, so
the archs undid the core principle of MSI and the CPU HW has some kind
of state/table entry for every single device interrupt source.

x86/AMD did this by having per-device remapping tables in their IOMMU
device context that are selected by incomming RID and effectively
completely rewrite the addr/data pair before it reaches the APIC. The
remap table alone now basically specifies where the interrupt is
delivered.

ARM doesn't do remapping, instead the interrupt controller itself has
a table that converts (BDF,Data) into a delivery instruction. It is
inherently secure.

That flag has nothing to do with affinity.

Jason
Andrew Jones Sept. 23, 2025, 2:37 p.m. UTC | #6
On Tue, Sep 23, 2025 at 12:12:52PM +0200, Thomas Gleixner wrote:
> On Mon, Sep 22 2025 at 20:56, Jason Gunthorpe wrote:
> > On Mon, Sep 22, 2025 at 04:20:43PM -0500, Andrew Jones wrote:
> >> > It has to do with each PCI BDF having a unique set of
> >> > validation/mapping tables for MSIs that are granular to the interrupt
> >> > number.
> >> 
> >> Interrupt numbers (MSI data) aren't used by the RISC-V IOMMU in any way.
> >
> > Interrupt number is a Linux concept, HW decodes the addr/data pair and
> > delivers it to some Linux interrupt. Linux doesn't care how the HW
> > treats the addr/data pair, it can ignore data if it wants.
> 
> Let me explain this a bit deeper.
> 
> As you said, the interrupt number is a pure kernel software construct,
> which is mapped to a hardware interrupt source.
> 
> The interrupt domain, which is associated to a hardware interrupt
> source, creates the mapping and supplies the resulting configuration to
> the hardware, so that the hardware is able to raise an interrupt in the
> CPU.
> 
> In case of MSI, this configuration is the MSI message (address,
> data). That's composed by the domain according to the requirements of
> the underlying CPU hardware resource. This underlying hardware resource
> can be the CPUs interrupt controller itself or some intermediary
> hardware entity.
> 
> The kernel reflects this in the interrupt domain hierarchy. The simplest
> case for MSI is:
> 
>      [ CPU domain ] --- [ MSI domain ] -- device
> 
> The flow is as follows:
> 
>    device driver allocates an MSI interrupt in the MSI domain
> 
>    MSI domain allocates an interrupt in the CPU domain
> 
>    CPU domain allocates an interrupt vector and composes the
>    address/data pair. If @data is written to @address, the interrupt is
>    raised in the CPU
> 
>    MSI domain converts the address/data pair into device format and
>    writes it into the device.
> 
>    When the device fires an interrupt it writes @data to @address, which
>    raises the interrupt in the CPU at the allocated CPU vector.  That
>    vector is then translated to the Linux interrupt number in the
>    interrupt handling entry code by looking it up in the CPU domain.
> 
> With a remapping domain intermediary this looks like this:
> 
>      [ CPU domain ] --- [ Remap domain] --- [ MSI domain ] -- device
>  
>    device driver allocates an MSI interrupt in the MSI domain
> 
>    MSI domain allocates an interrupt in the Remap domain
> 
>    Remap domain allocates a resource in the remap space, e.g. an entry
>    in the remap translation table and then allocates an interrupt in the
>    CPU domain.
> 
>    CPU domain allocates an interrupt vector and composes the
>    address/data pair. If @data is written to @address, the interrupt is
>    raised in the CPU
> 
>    Remap domain converts the CPU address/data pair to remap table format
>    and writes it to the allocated entry in that table. It then composes
>    a new address/data pair, which points at the remap table entry.
> 
>    MSI domain converts the remap address/data pair into device format
>    and writes it into the device.
> 
>    So when the device fires an interrupt it writes @data to @address,
>    which triggers the remap unit. The remap unit validates that the
>    address/data pair is valid for the device and if so it writes the CPU
>    address/data pair, which raises the interrupt in the CPU at the
>    allocated vector. That vector is then translated to the Linux
>    interrupt number in the interrupt handling entry code by looking it
>    up in the CPU domain.
> 
> So from a kernel POV, the address/data pairs are just opaque
> configuration values, which are written into the remap table and the
> device. Whether the content of @data is relevant or not, is a hardware
> implementation detail. That implementation detail is only relevant for
> the interrupt domain code, which handles a specific part of the
> hierarchy.
> 
> The MSI domain does not need to know anything about the content and the
> meaning of @address and @data. It just cares about converting that into
> the device specific storage format.
> 
> The Remap domain does not need to know anything about the content and
> the meaning of the CPU domain provided @address and @data. It just cares
> about converting that into the remap table specific format.
> 
> The hardware entities do not know about the Linux interrupt number at
> all. That relationship is purely software managed as a mapping from the
> allocated CPU vector to the Linux interrupt number.
> 
> Hope that helps.
>

Thanks, Thomas! I always appreciate these types of detailed design
descriptions which certainly help pull all the pieces together.

So, I think I got this right, as Patch4 adds the Remap domain, creating
this hierarchy

name:   IR-PCI-MSIX-0000:00:01.0-12
 size:   0
 mapped: 3
 flags:  0x00000213
            IRQ_DOMAIN_FLAG_HIERARCHY
            IRQ_DOMAIN_NAME_ALLOCATED
            IRQ_DOMAIN_FLAG_MSI
            IRQ_DOMAIN_FLAG_MSI_DEVICE
 parent: IOMMU-IR-0000:00:01.0-17
    name:   IOMMU-IR-0000:00:01.0-17
     size:   0
     mapped: 3
     flags:  0x00000123
                IRQ_DOMAIN_FLAG_HIERARCHY
                IRQ_DOMAIN_NAME_ALLOCATED
                IRQ_DOMAIN_FLAG_ISOLATED_MSI
                IRQ_DOMAIN_FLAG_MSI_PARENT
     parent: :soc:interrupt-controller@28000000-5
        name:   :soc:interrupt-controller@28000000-5
         size:   0
         mapped: 16
         flags:  0x00000103
                    IRQ_DOMAIN_FLAG_HIERARCHY
                    IRQ_DOMAIN_NAME_ALLOCATED
                    IRQ_DOMAIN_FLAG_MSI_PARENT


But Patch4 only introduces the irqdomain; the functionality is added with
Patch8. Patch8 introduces riscv_iommu_ir_get_msipte_idx_from_target()
which "converts the CPU address/data pair to remap table format". For the
RISC-V IOMMU, the data part of the pair is not used and the address
undergoes a specified translation into an index of the MSI table. For the
non-virt use case we skip the "composes a new address/data pair, which
points at the remap table entry" step since we just forward the original
with an identity mapping. For the virt case we do write a new addr,data
pair (Patch15) since we need to map guest addresses to host addresses (but
data is still just forwarded since the RISC-V IOMMU doesn't support data
remapping). The lack of data remapping is unfortunate, since the part of
the design where "The remap unit validates that the address/data pair is
valid for the device and if so it writes the CPU address/data pair" is
only half true for riscv (since the remap unit always forwards data so we
can't change it in order to implement validation of it). If we can't set
IRQ_DOMAIN_FLAG_ISOLATED_MSI without data validation, then we'll need to
try to fast-track an IOMMU extension for it before we can use VFIO without
having to set allow_unsafe_interrupts.

Thanks,
drew
Jason Gunthorpe Sept. 23, 2025, 2:52 p.m. UTC | #7
On Tue, Sep 23, 2025 at 09:37:31AM -0500, Andrew Jones wrote:
> undergoes a specified translation into an index of the MSI table. For the
> non-virt use case we skip the "composes a new address/data pair, which
> points at the remap table entry" step since we just forward the original
> with an identity mapping. For the virt case we do write a new addr,data
> pair (Patch15) since we need to map guest addresses to host addresses (but
> data is still just forwarded since the RISC-V IOMMU doesn't support data
> remapping). 

You should banish thinking of non-virt/virt from your lexicon. Linux
doesn't work that way, and trying to force it to is a losing battle.

If you have a remap domain then it should always be remapping. There
is no such idea in Linux as a conditional IRQ domain dependent on
external factors (like how the IOMMU is configured, if the device is
"virt" or not, etc).

Be specific what you mean.

Jason
Andrew Jones Sept. 23, 2025, 3:12 p.m. UTC | #8
On Tue, Sep 23, 2025 at 11:06:46AM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 23, 2025 at 12:12:52PM +0200, Thomas Gleixner wrote:
> > With a remapping domain intermediary this looks like this:
> > 
> >      [ CPU domain ] --- [ Remap domain] --- [ MSI domain ] -- device
> >  
> >    device driver allocates an MSI interrupt in the MSI domain
> > 
> >    MSI domain allocates an interrupt in the Remap domain
> > 
> >    Remap domain allocates a resource in the remap space, e.g. an entry
> >    in the remap translation table and then allocates an interrupt in the
> >    CPU domain.
> 
> Thanks!
> 
> And to be very crystal clear here, the meaning of
> IRQ_DOMAIN_FLAG_ISOLATED_MSI is that the remap domain has a security
> feature such that the device can only trigger CPU domain interrupts
> that have been explicitly allocated in the remap domain for that
> device. The device can never go through the remap domain and trigger
> some other device's interrupt.
> 
> This is usually done by having the remap domain's HW take in the
> Addr/Data pair, do a per-BDF table lookup and then completely replace
> the Addr/Data pair with the "remapped" version. By fully replacing the
> pair, the remap domain prevents the device from generating a disallowed
> addr/data pair toward the CPU domain.
> 
> It fundamentally must be done by having the HW do a per-RID/BDF table
> lookup based on the incoming MSI addr/data and fully sanitize the
> resulting output.
> 
> There is some legacy history here. When MSI was first invented the
> goal was to make interrupts scalable by removing any state from the
> CPU side. The device would be told what Addr/Data to send to the CPU
> and the CPU would just take some encoded information in that pair as a
> delivery instruction. No state on the CPU side per interrupt.
> 
> In the world of virtualization it was realized this is not secure, so
> the archs undid the core principle of MSI and the CPU HW has some kind
> of state/table entry for every single device interrupt source.
> 
> x86/AMD did this by having per-device remapping tables in their IOMMU
> device context that are selected by incoming RID and effectively
> completely rewrite the addr/data pair before it reaches the APIC. The
> remap table alone now basically specifies where the interrupt is
> delivered.
> 
> ARM doesn't do remapping, instead the interrupt controller itself has
> a table that converts (BDF,Data) into a delivery instruction. It is
> inherently secure.
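
The per-BDF replacement scheme described above can be modeled in a few lines of C (the table shape and all names are invented for illustration, not real hardware or kernel interfaces): the BDF selects a per-device table, the incoming data selects an entry, and the entry fully replaces the addr/data pair.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative model of the x86/AMD-style remapping described above.
 * All names and the table shape are invented for the sketch.
 */
#define REMAP_ENTRIES 256

struct remap_entry {
	uint64_t cpu_addr;
	uint32_t cpu_data;
	int valid;
};

struct remap_table {
	struct remap_entry entries[REMAP_ENTRIES];	/* indexed by incoming data */
};

static int remap_msi(struct remap_table **per_bdf, uint16_t bdf,
		     uint32_t in_data, uint64_t *out_addr, uint32_t *out_data)
{
	const struct remap_table *tbl = per_bdf[bdf];
	const struct remap_entry *e;

	if (!tbl || in_data >= REMAP_ENTRIES)
		return -1;	/* no table for this BDF: everything blocked */

	e = &tbl->entries[in_data];
	if (!e->valid)
		return -1;	/* not an interrupt this device may trigger */

	/* Full replacement: nothing device-supplied reaches the CPU. */
	*out_addr = e->cpu_addr;
	*out_data = e->cpu_data;
	return 0;
}
```

A device can only ever hit entries allocated in its own table, which is the isolation property the flag requires.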

Thanks, Jason. All the above information is very much appreciated,
particularly the history.

> 
> That flag has nothing to do with affinity.
>

So the reason I keep bringing affinity into the context of isolation is
that, for MSI-capable RISC-V, each CPU has its own MSI controller (IMSIC).
As riscv is missing data validation, it's closer to the legacy, insecure
description above, but the "The device would be told what Addr/Data to
send to the CPU and the CPU would just take some encoded information in
that pair as a delivery instruction" part becomes "Addr is used to select
a CPU and then the CPU would take some encoded information in Data as the
delivery instruction". Since setting irq affinity is a way to set Addr
to one of a particular set of CPUs, then a device cannot raise interrupts
on CPUs outside that set. And, only interrupts that the allowed set of
CPUs are aware of may be raised. As a device's irqs move around from
irqbalance or a user's selection we can ensure only the CPU an irq should
be able to reach be reachable by managing the IOMMU MSI table. This gives
us some level of isolation, but there is still the possibility a device
may raise an interrupt it should not be able to when its irqs are affined
to the same CPU as another device's and the malicious/broken device uses
the wrong MSI data. For the non-virt case it's fair to say that's no where
near isolated enough. However, for the virt case, Addr is set to guest
interrupt files (something like virtual IMSICs) which means there will be
no other host device or other guest device irqs sharing those Addrs.
Interrupts for devices assigned to guests are truly isolated (not within
the guest, but we need nested support to fully isolate within the guest
anyway).
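
A minimal model of that granularity argument (the base address and stride are made-up example values, not a real platform layout): each CPU's IMSIC sits at a distinct address, MSI data is the interrupt identity within that file, and an address-only filter, which is what the IOMMU MSI table provides, restricts the reachable CPUs but not the identities raised on a shared CPU.

```c
#include <stdint.h>

/*
 * Sketch of the granularity argument above. EX_IMSIC_BASE/STRIDE are
 * invented example values, not a real platform's IMSIC layout.
 */
#define EX_IMSIC_BASE   0x28000000ULL
#define EX_IMSIC_STRIDE 0x1000ULL

static uint64_t imsic_addr_for_cpu(unsigned int cpu)
{
	return EX_IMSIC_BASE + (uint64_t)cpu * EX_IMSIC_STRIDE;
}

/* Address-only check, mirroring what the MSI table can enforce. */
static int msi_addr_allowed(const uint64_t *allowed, int n, uint64_t addr)
{
	for (int i = 0; i < n; i++)
		if (allowed[i] == addr)
			return 1;
	return 0;
}
```

Note the data never enters the check: any identity on an allowed CPU passes, which is exactly the residual hole described above.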

In v1, I tried to only turn IRQ_DOMAIN_FLAG_ISOLATED_MSI on for the virt
case, but, as you pointed out, that wasn't a good idea. For v2, I was
hoping the comment above the flag was enough, but thinking about it some
more, I agree it's not. I'm not sure what we can do for this other than
an IOMMU spec change at this point.

Thanks,
drew
Jason Gunthorpe Sept. 23, 2025, 3:27 p.m. UTC | #9
On Tue, Sep 23, 2025 at 10:12:42AM -0500, Andrew Jones wrote:
> be able to reach remains reachable by managing the IOMMU MSI table. This gives
> us some level of isolation, but there is still the possibility a device
> may raise an interrupt it should not be able to when its irqs are affined
> to the same CPU as another device's

Yes, exactly, this is the problem with basic VFIO support as there is
no general idea of a virtualization context..

> and the malicious/broken device uses the wrong MSI data.

And to be clear it is not a malicious/broken device at issue here. In
PCI, MSI is simply a DMA to a magic address. *ANY* device can be
commanded by system software to generate *ANY* address/data on PCIe.

So any VFIO user can effectively generate any MSI it wants. It isn't a
matter of device brokenness.

> near isolated enough. However, for the virt case, Addr is set to guest
> interrupt files (something like virtual IMSICs) which means there will be
> no other host device or other guest device irqs sharing those Addrs.
> Interrupts for devices assigned to guests are truly isolated (not within
> the guest, but we need nested support to fully isolate within the guest
> anyway).

At least this is something, and I do think this is enough security to
be a useful solution. However, Linux has no existing support for the
idea of a VFIO device that only has access to "guest" interrupt HW.

Presumably this is direct injection only now?

I'm not even sure I could give you a sketch what that would look like,
it involves co-operation between so many orthogonal layers it is hard
to imagine :\

kvm provides the virt context, iommufd controls the MSI aperture, irq
remapping controls the remap table, vfio sets interrupts..

VFIO needs to say 'irq layer only establish an interrupt on this KVM'
as some enforced mode?

Jason
Andrew Jones Sept. 23, 2025, 3:37 p.m. UTC | #10
On Tue, Sep 23, 2025 at 11:52:51AM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 23, 2025 at 09:37:31AM -0500, Andrew Jones wrote:
> > undergoes a specified translation into an index of the MSI table. For the
> > non-virt use case we skip the "composes a new address/data pair, which
> > points at the remap table entry" step since we just forward the original
> > with an identity mapping. For the virt case we do write a new addr,data
> > pair (Patch15) since we need to map guest addresses to host addresses (but
> > data is still just forwarded since the RISC-V IOMMU doesn't support data
> > remapping). 
> 
> You should banish thinking of non-virt/virt from your lexicon. Linux
> doesn't work that way, and trying to force it to is a losing battle.

Well, we need to consider virt when the hardware has virt-specific
features that we want to control. We also need to consider virt when
additional address translations from guest space to host space
are needed, as in this case.

> 
> If you have a remap domain then it should always be remapping. There
> is no such idea in Linux as a conditional IRQ domain dependent on
> external factors (like how the IOMMU is configured, if the device is
> "virt" or not, etc).

The remap domain is created when the platform supports MSIs and always
does remapping when the IOMMU supports the MSI table. It could even do
remapping when the IOMMU doesn't support the MSI table since it could
use the DMA table instead, but that's left for a later patch series if
hardware actually shows up like that.

The difference between virt and non-virt is which addresses get remapped.
For virt, guest addresses get remapped; for non-virt,
we don't remap guest addresses. And, since we don't remap guest addresses
for non-virt, then, rather than invent some new, arbitrary address, we
just use the host address. Remapping is still in use, but, as I said
above, it's an identity mapping.

(Note, for the current riscv iommu, "remapping" for the non-virt case just
means keeping the set of IMSICs that a device may reach limited to just
what it should be allowed to reach.)
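
The map/unmap refcounting the patch uses for its MSI table entries, reduced to a user-space model (names invented for the sketch): an entry stays valid while at least one irq in the domain targets that IMSIC and is torn down when the last irq moves away, keeping the reachable set limited to what is in use.

```c
/*
 * Simplified model of the patch's msitbl map/unmap refcounting; the
 * names and flat arrays are invented for the sketch. The real driver
 * uses refcount_t under a raw spinlock and invalidates the IOMMU's
 * cached PTE on each transition.
 */
#define NR_PTES 64

struct msi_table_model {
	int refs[NR_PTES];
	int valid[NR_PTES];
};

static void msitbl_map(struct msi_table_model *t, unsigned int idx)
{
	if (t->refs[idx]++ == 0)
		t->valid[idx] = 1;	/* first user: install the PTE */
}

static void msitbl_unmap(struct msi_table_model *t, unsigned int idx)
{
	if (--t->refs[idx] == 0)
		t->valid[idx] = 0;	/* last user gone: revoke access */
}
```

Moving an irq's affinity is then an unmap of the old index followed by a map of the new one, as the patch's set_affinity callback does.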

> 
> Be specific what you mean.

I'm always happy to clarify when asked. I'm not sure what I said that
would lead to thinking remapping was disabled for the non-virt case,
but hopefully what I wrote now clarifies that it is not.

Thanks,
drew
Andrew Jones Sept. 23, 2025, 3:50 p.m. UTC | #11
On Tue, Sep 23, 2025 at 12:27:02PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 23, 2025 at 10:12:42AM -0500, Andrew Jones wrote:
> > be able to reach remains reachable by managing the IOMMU MSI table. This gives
> > us some level of isolation, but there is still the possibility a device
> > may raise an interrupt it should not be able to when its irqs are affined
> > to the same CPU as another device's
> 
> Yes, exactly, this is the problem with basic VFIO support as there is
> no general idea of a virtualization context..
> 
> > and the malicious/broken device uses the wrong MSI data.
> 
> And to be clear it is not a malicious/broken device at issue here. In
> PCI, MSI is simply a DMA to a magic address. *ANY* device can be
> commanded by system software to generate *ANY* address/data on PCIe.
> 
> So any VFIO user can effectively generate any MSI it wants. It isn't a
> matter of device brokenness.
> 
> > near isolated enough. However, for the virt case, Addr is set to guest
> > interrupt files (something like virtual IMSICs) which means there will be
> > no other host device or other guest device irqs sharing those Addrs.
> > Interrupts for devices assigned to guests are truly isolated (not within
> > the guest, but we need nested support to fully isolate within the guest
> > anyway).
> 
> At least this is something, and I do think this is enough security to
> be a useful solution. However, Linux has no existing support for the
> idea of a VFIO device that only has access to "guest" interrupt HW.
> 
> Presumably this is direct injection only now?

Yup

> 
> I'm not even sure I could give you a sketch what that would look like,
> it involves co-operation between so many orthogonal layers it is hard
> to imagine :\
> 
> kvm provides the virt context, iommufd controls the MSI aperture, irq
> remapping controls the remap table, vfio sets interrupts..
> 
> VFIO needs to say 'irq layer only establish an interrupt on this KVM'
> as some enforced mode?
>

Yes, this is the part that I'd like to lean on you for, since I understand
we want to avoid too much KVM/virt special casing for VFIO/IOMMUFD. I was
thinking that if I bit the bullet and implemented nested support then when
nesting was selected it would be apparent we're in virt context. However,
I was hoping to pull together a solution that works with current QEMU and
VFIO too.

Thanks,
drew
Jason Gunthorpe Sept. 23, 2025, 4:23 p.m. UTC | #12
On Tue, Sep 23, 2025 at 10:50:56AM -0500, Andrew Jones wrote:
> Yes, this is the part that I'd like to lean on you for, since I understand
> we want to avoid too much KVM/virt special casing for VFIO/IOMMUFD. I was
> thinking that if I bit the bullet and implemented nested support then when
> nesting was selected it would be apparent we're in virt context. However,
> I was hoping to pull together a solution that works with current QEMU and
> VFIO too.

You probably do have to make nested part of this, but I don't have a
clear picture how you'd tie all the parts together through the nested
API..

Somehow you have to load a msiptp that is effectively linked to the
KVM reliably into the DC for the iommufd controlled devices that are
linked to that KVM. Then synchronize with VFIO that this is done and
it can set up KVM-only interrupts somehow. This kvm entanglement is the
"virt" you have mentioned many times.

The direct injection interrupt path is already quite a confusing
thing..

Jason
Andrew Jones Sept. 23, 2025, 4:33 p.m. UTC | #13
On Tue, Sep 23, 2025 at 01:23:02PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 23, 2025 at 10:50:56AM -0500, Andrew Jones wrote:
> > Yes, this is the part that I'd like to lean on you for, since I understand
> > we want to avoid too much KVM/virt special casing for VFIO/IOMMUFD. I was
> > thinking that if I bit the bullet and implemented nested support than when
> > nesting was selected it would be apparent we're in virt context. However,
> > I was hoping to pull together a solution that works with current QEMU and
> > VFIO too.
> 
> You probably do have to make nested part of this, but I don't have a
> clear picture how you'd tie all the parts together through the nested
> API..
> 
> Somehow you have to load a msiptp that is effectively linked to the
> KVM reliably into the DC for the iommufd controlled devices that are
> linked to that KVM. Then synchronize with VFIO that this is done and
> it can set up KVM-only interrupts somehow. This kvm entanglement is the
> "virt" you have mentioned many times.

Yes, the VFIO+KVM irqbypass framework currently provides much of this
support (which patches 10-17 of this series leverage). But, if using
nested is necessary in order to "signal" to the relevant subsystems that
the device's irqs will truly be isolated, then I'll need to start figuring
out how iommufd can fit into the mix (or if we'll need an iommufd-specific
implementation for a new mix).

> 
> The direct injection interrupt path is already quite a confusing
> thing..

No argument there :-)

Thanks,
drew
Jinvas Oct. 23, 2025, 1:47 p.m. UTC | #14
On Mon, 22 Sep 2025 20:56:51 -0300, Jason Gunthorpe wrote:
> We no longer need to support 32 bit builds and we missed this while
> cleaning up.
> Right now the only way to deal with this would be to only use one of
> the S1 or S2 and make that decision when the iommu driver starts. You
> can return the right SW_MSI/HW_MSI based on which PAGING domain style
> the driver is going to use.

> I recommend if the S2 is available you make the driver always use the
> S2 for everything and ignore the S1 except for explicit two stage
> translation setup by a hypervisor. Thus always return HW_MSI.
> If the iommu does not support S2 then always use S1 and always return
> SW_MSI.

I strongly agree with this suggestion because, on one hand, the
confusing design of RISC-V, particularly the translation rules of the
msix_table, leads to different translation behaviors in S1 and S2 modes;
on the other hand, designing a proper caching mechanism for the
msix_table in both S1 and S2 modes is quite challenging.

> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Thanks for the patch!

Best regards,
jinvas
diff mbox series

Patch

diff --git a/drivers/iommu/riscv/iommu-ir.c b/drivers/iommu/riscv/iommu-ir.c
index 290d91a6c6cd..b97768cac4be 100644
--- a/drivers/iommu/riscv/iommu-ir.c
+++ b/drivers/iommu/riscv/iommu-ir.c
@@ -4,6 +4,7 @@ 
  *
  * Copyright © 2025 Ventana Micro Systems Inc.
  */
+#include <linux/cleanup.h>
 #include <linux/irqchip/riscv-imsic.h>
 #include <linux/irqdomain.h>
 #include <linux/msi.h>
@@ -106,6 +107,20 @@  static size_t riscv_iommu_ir_nr_msiptes(struct riscv_iommu_domain *domain)
 	return max_idx + 1;
 }
 
+static void riscv_iommu_ir_set_pte(struct riscv_iommu_msipte *pte, u64 addr)
+{
+	pte->pte = FIELD_PREP(RISCV_IOMMU_MSIPTE_M, 3) |
+		   riscv_iommu_phys_to_ppn(addr) |
+		   FIELD_PREP(RISCV_IOMMU_MSIPTE_V, 1);
+	pte->mrif_info = 0;
+}
+
+static void riscv_iommu_ir_clear_pte(struct riscv_iommu_msipte *pte)
+{
+	pte->pte = 0;
+	pte->mrif_info = 0;
+}
+
 static void riscv_iommu_ir_msitbl_inval(struct riscv_iommu_domain *domain,
 					struct riscv_iommu_msipte *pte)
 {
@@ -149,19 +164,99 @@  static void riscv_iommu_ir_msitbl_inval(struct riscv_iommu_domain *domain,
 	rcu_read_unlock();
 }
 
+static void riscv_iommu_ir_msitbl_map(struct riscv_iommu_domain *domain, size_t idx,
+				      phys_addr_t addr)
+{
+	struct riscv_iommu_msipte *pte;
+
+	if (!domain->msi_root)
+		return;
+
+	if (!refcount_inc_not_zero(&domain->msi_pte_counts[idx])) {
+		scoped_guard(raw_spinlock_irqsave, &domain->msi_lock) {
+			if (refcount_read(&domain->msi_pte_counts[idx]) == 0) {
+				pte = &domain->msi_root[idx];
+				riscv_iommu_ir_set_pte(pte, addr);
+				riscv_iommu_ir_msitbl_inval(domain, pte);
+				refcount_set(&domain->msi_pte_counts[idx], 1);
+			} else {
+				refcount_inc(&domain->msi_pte_counts[idx]);
+			}
+		}
+	}
+}
+
+static void riscv_iommu_ir_msitbl_unmap(struct riscv_iommu_domain *domain, size_t idx)
+{
+	struct riscv_iommu_msipte *pte;
+
+	if (!domain->msi_root)
+		return;
+
+	scoped_guard(raw_spinlock_irqsave, &domain->msi_lock) {
+		if (refcount_dec_and_test(&domain->msi_pte_counts[idx])) {
+			pte = &domain->msi_root[idx];
+			riscv_iommu_ir_clear_pte(pte);
+			riscv_iommu_ir_msitbl_inval(domain, pte);
+		}
+	}
+}
+
+static size_t riscv_iommu_ir_get_msipte_idx_from_target(struct riscv_iommu_domain *domain,
+							struct irq_data *data, phys_addr_t *addr)
+{
+	struct msi_msg msg;
+
+	BUG_ON(irq_chip_compose_msi_msg(data, &msg));
+
+	*addr = ((phys_addr_t)msg.address_hi << 32) | msg.address_lo;
+
+	return riscv_iommu_ir_compute_msipte_idx(domain, *addr);
+}
+
+static int riscv_iommu_ir_irq_set_affinity(struct irq_data *data,
+					   const struct cpumask *dest, bool force)
+{
+	struct riscv_iommu_info *info = data->domain->host_data;
+	struct riscv_iommu_domain *domain = info->domain;
+	phys_addr_t old_addr, new_addr;
+	size_t old_idx, new_idx;
+	int ret;
+
+	old_idx = riscv_iommu_ir_get_msipte_idx_from_target(domain, data, &old_addr);
+
+	ret = irq_chip_set_affinity_parent(data, dest, force);
+	if (ret < 0)
+		return ret;
+
+	new_idx = riscv_iommu_ir_get_msipte_idx_from_target(domain, data, &new_addr);
+
+	if (new_idx == old_idx)
+		return ret;
+
+	riscv_iommu_ir_msitbl_unmap(domain, old_idx);
+	riscv_iommu_ir_msitbl_map(domain, new_idx, new_addr);
+
+	return ret;
+}
+
 static struct irq_chip riscv_iommu_ir_irq_chip = {
 	.name			= "IOMMU-IR",
 	.irq_ack		= irq_chip_ack_parent,
 	.irq_mask		= irq_chip_mask_parent,
 	.irq_unmask		= irq_chip_unmask_parent,
-	.irq_set_affinity	= irq_chip_set_affinity_parent,
+	.irq_set_affinity	= riscv_iommu_ir_irq_set_affinity,
 };
 
 static int riscv_iommu_ir_irq_domain_alloc_irqs(struct irq_domain *irqdomain,
 						unsigned int irq_base, unsigned int nr_irqs,
 						void *arg)
 {
+	struct riscv_iommu_info *info = irqdomain->host_data;
+	struct riscv_iommu_domain *domain = info->domain;
 	struct irq_data *data;
+	phys_addr_t addr;
+	size_t idx;
 	int i, ret;
 
 	ret = irq_domain_alloc_irqs_parent(irqdomain, irq_base, nr_irqs, arg);
@@ -171,14 +266,36 @@  static int riscv_iommu_ir_irq_domain_alloc_irqs(struct irq_domain *irqdomain,
 	for (i = 0; i < nr_irqs; i++) {
 		data = irq_domain_get_irq_data(irqdomain, irq_base + i);
 		data->chip = &riscv_iommu_ir_irq_chip;
+		idx = riscv_iommu_ir_get_msipte_idx_from_target(domain, data, &addr);
+		riscv_iommu_ir_msitbl_map(domain, idx, addr);
 	}
 
 	return 0;
 }
 
+static void riscv_iommu_ir_irq_domain_free_irqs(struct irq_domain *irqdomain,
+						unsigned int irq_base,
+						unsigned int nr_irqs)
+{
+	struct riscv_iommu_info *info = irqdomain->host_data;
+	struct riscv_iommu_domain *domain = info->domain;
+	struct irq_data *data;
+	phys_addr_t addr;
+	size_t idx;
+	int i;
+
+	for (i = 0; i < nr_irqs; i++) {
+		data = irq_domain_get_irq_data(irqdomain, irq_base + i);
+		idx = riscv_iommu_ir_get_msipte_idx_from_target(domain, data, &addr);
+		riscv_iommu_ir_msitbl_unmap(domain, idx);
+	}
+
+	irq_domain_free_irqs_parent(irqdomain, irq_base, nr_irqs);
+}
+
 static const struct irq_domain_ops riscv_iommu_ir_irq_domain_ops = {
 	.alloc = riscv_iommu_ir_irq_domain_alloc_irqs,
-	.free = irq_domain_free_irqs_parent,
+	.free = riscv_iommu_ir_irq_domain_free_irqs,
 };
 
 static const struct msi_parent_ops riscv_iommu_ir_msi_parent_ops = {
@@ -221,6 +338,19 @@  struct irq_domain *riscv_iommu_ir_irq_domain_create(struct riscv_iommu_device *i
 		return NULL;
 	}
 
+	if (iommu->caps & RISCV_IOMMU_CAPABILITIES_MSI_FLAT) {
+		/*
+		 * NOTE: The RISC-V IOMMU doesn't actually support isolated MSI because
+		 * there is no MSI message validation (see the comment above
+		 * msi_device_has_isolated_msi()). However, we claim isolated MSI here
+		 * because applying the IOMMU ensures MSI messages may only be delivered
+		 * to the mapped MSI addresses. This allows MSIs to be isolated to
+		 * particular harts/vcpus where the unvalidated MSI messages can be
+		 * tolerated.
+		 */
+		irqdomain->flags |= IRQ_DOMAIN_FLAG_ISOLATED_MSI;
+	}
+
 	irqdomain->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT;
 	irqdomain->msi_parent_ops = &riscv_iommu_ir_msi_parent_ops;
 	irq_domain_update_bus_token(irqdomain, DOMAIN_BUS_MSI_REMAP);
@@ -233,6 +363,7 @@  struct irq_domain *riscv_iommu_ir_irq_domain_create(struct riscv_iommu_device *i
 static void riscv_iommu_ir_free_msi_table(struct riscv_iommu_domain *domain)
 {
 	iommu_free_pages(domain->msi_root);
+	kfree(domain->msi_pte_counts);
 }
 
 void riscv_iommu_ir_irq_domain_remove(struct riscv_iommu_info *info)
@@ -274,6 +405,14 @@  static int riscv_ir_set_imsic_global_config(struct riscv_iommu_device *iommu,
 							     nr_ptes * sizeof(*domain->msi_root));
 		if (!domain->msi_root)
 			return -ENOMEM;
+
+		domain->msi_pte_counts = kcalloc(nr_ptes, sizeof(refcount_t), GFP_KERNEL_ACCOUNT);
+		if (!domain->msi_pte_counts) {
+			iommu_free_pages(domain->msi_root);
+			return -ENOMEM;
+		}
+
+		raw_spin_lock_init(&domain->msi_lock);
 	}
 
 	return 0;
diff --git a/drivers/iommu/riscv/iommu.h b/drivers/iommu/riscv/iommu.h
index 1fe35f1210fb..aeb5642f003c 100644
--- a/drivers/iommu/riscv/iommu.h
+++ b/drivers/iommu/riscv/iommu.h
@@ -34,6 +34,8 @@  struct riscv_iommu_domain {
 	unsigned int pgd_mode;
 	unsigned long *pgd_root;
 	struct riscv_iommu_msipte *msi_root;
+	refcount_t *msi_pte_counts;
+	raw_spinlock_t msi_lock;
 	u64 msi_addr_mask;
 	u64 msi_addr_pattern;
 	u32 group_index_bits;