| Message ID | 20250920203851.2205115-28-ajones@ventanamicro.com |
|---|---|
| State | New |
| Series | iommu/riscv: Add irqbypass support |
On Sat, Sep 20, 2025 at 03:38:58PM -0500, Andrew Jones wrote:
> When setting irq affinity extract the IMSIC address the device
> needs to access and add it to the MSI table. If the device no
> longer needs access to an IMSIC then remove it from the table
> to prohibit access. This allows isolating device MSIs to a set
> of harts so we can now add the IRQ_DOMAIN_FLAG_ISOLATED_MSI IRQ
> domain flag.

IRQ_DOMAIN_FLAG_ISOLATED_MSI has nothing to do with HARTs.

 * Isolated MSI means that HW modeled by an irq_domain on the path from the
 * initiating device to the CPU will validate that the MSI message specifies an
 * interrupt number that the device is authorized to trigger. This must block
 * devices from triggering interrupts they are not authorized to trigger.
 * Currently authorization means the MSI vector is one assigned to the device.

It has to do with each PCI BDF having a unique set of validation/mapping
tables for MSIs that are granular to the interrupt number.

As I understand the spec this is only possible with msiptp? As discussed
previously this has to be a static property and the SW stack doesn't expect
it to change. So if the IR driver sets IRQ_DOMAIN_FLAG_ISOLATED_MSI it has
to always use msiptp?

Further, since the interrupt tables have to be per BDF they cannot be
linked to an iommu_domain! Storing the msiptp in an iommu_domain is totally
wrong?? It needs to somehow be stored in the interrupt layer per-struct
device; check how AMD and Intel have stored their IR tables programmed into
their versions of the DC.

It looks like there is something in here to support HW that doesn't have
msiptp? That's different, and also looks very confused. The IR driver
should never be touching the iommu domain or calling iommu_map! Instead it
probably has to use the SW_MSI mechanism to request mapping the interrupt
controller aperture. You don't get IRQ_DOMAIN_FLAG_ISOLATED_MSI with
something like this though. Look at how ARM GIC works for this mechanism.
Finally, please split this series up. If there are two different ways to
manage the MSI aperture then please split it into two series with a clear
description of how the HW actually works.

Maybe start with the simpler case of no msiptp??

Jason
On Mon, Sep 22, 2025 at 03:43:36PM -0300, Jason Gunthorpe wrote:
> On Sat, Sep 20, 2025 at 03:38:58PM -0500, Andrew Jones wrote:
> > When setting irq affinity extract the IMSIC address the device
> > needs to access and add it to the MSI table. If the device no
> > longer needs access to an IMSIC then remove it from the table
> > to prohibit access. This allows isolating device MSIs to a set
> > of harts so we can now add the IRQ_DOMAIN_FLAG_ISOLATED_MSI IRQ
> > domain flag.
>
> IRQ_DOMAIN_FLAG_ISOLATED_MSI has nothing to do with HARTs.
>
>  * Isolated MSI means that HW modeled by an irq_domain on the path from the
>  * initiating device to the CPU will validate that the MSI message specifies an
>  * interrupt number that the device is authorized to trigger. This must block
>  * devices from triggering interrupts they are not authorized to trigger.
>  * Currently authorization means the MSI vector is one assigned to the device.

Unfortunately the RISC-V IOMMU doesn't have support for this. I've raised
the lack of MSI data validation with the spec writers and I'll try to
raise it again, but I was hoping we could still get
IRQ_DOMAIN_FLAG_ISOLATED_MSI by simply ensuring the MSI addresses only
include the affined harts (and also with the NOTE comment I've put in this
patch to point out the deficiency).

> It has to do with each PCI BDF having a unique set of
> validation/mapping tables for MSIs that are granular to the interrupt
> number.

Interrupt numbers (MSI data) aren't used by the RISC-V IOMMU in any way.

> As I understand the spec this is only possible with msiptp? As
> discussed previously this has to be a static property and the SW stack
> doesn't expect it to change. So if the IR driver sets
> IRQ_DOMAIN_FLAG_ISOLATED_MSI it has to always use msiptp?

Yes, the patch only sets IRQ_DOMAIN_FLAG_ISOLATED_MSI when the IOMMU has
RISCV_IOMMU_CAPABILITIES_MSI_FLAT and it will remain set for the lifetime
of the irqdomain, no matter how the IOMMU is being applied.
> Further, since the interrupt tables have to be per BDF they cannot be
> linked to an iommu_domain! Storing the msiptp in an iommu_domain is
> totally wrong?? It needs to somehow be stored in the interrupt layer
> per-struct device, check how AMD and Intel have stored their IR tables
> programmed into their versions of DC.

The RISC-V IOMMU MSI table is simply a flat address remapping table, which
also has support for MRIFs. The table indices come from an address matching
mechanism used to filter out invalid addresses and to convert valid
addresses into MSI table indices. IOW, the RISC-V MSI table is a simple
translation table, and it even needs to be tied to a particular DMA table
in order to work. Here are some examples:

1. stage1 not BARE
------------------

        stage1        MSI table
  IOVA ------> A ---------> host-MSI-address

2. stage1 is BARE, for example if only stage2 is in use
-------------------------------------------------------

               MSI table
  IOVA == A ---------> host-MSI-address

When used by the host A == host-MSI-address, but at least we can block the
write when an IRQ has been affined to a set of harts that doesn't include
what it's targeting. When used for irqbypass A == guest-MSI-address and
the host-MSI-address will be that of a guest interrupt file. This ensures
a device assigned to a guest can only reach its own vcpus when sending
MSIs.

In the first example, where stage1 is not BARE, the stage1 page tables
must have some IOVA->A mapping, otherwise the MSI table will not get a
chance to do a translation, as the stage1 DMA will fault. This series
ensures stage1 gets an identity mapping for all possible MSI targets and
then leaves it be, using the MSI tables instead for the isolation.

I don't think we can apply a lot of AMD's and Intel's model to RISC-V.

> It looks like there is something in here to support HW that doesn't
> have msiptp? That's different, and also looks very confused.
The only support is to ensure all the host IMSICs are mapped, otherwise we
can't turn on IOMMU_DMA since all MSI writes will cause faults. We don't
set IRQ_DOMAIN_FLAG_ISOLATED_MSI in this case, though, since we don't
bother unmapping MSI addresses of harts that IRQs have been un-affined
from.

> The IR driver should never be touching the iommu domain or calling
> iommu_map!

As pointed out above, the RISC-V IR is quite a different beast than AMD's
and Intel's. Whether or not the IOMMU has MSI table support, the IMSICs
must be mapped in stage1, when stage1 is not BARE. So, in both cases we
roll that mapping into the IR code since there isn't really any better
place for it for the host case and it's necessary for the IR code to
manage it for the virt case. Since IR (or MSI delivery in general) is
dependent upon the stage1 page tables, it's necessary to tie it to the
same IOMMU domain that those page tables are tied to. Patch4's changes to
riscv_iommu_attach_paging_domain() and riscv_iommu_iodir_update() show how
they're tied together.

> Instead it probably has to use the SW_MSI mechanism to request mapping
> the interrupt controller aperture. You don't get
> IRQ_DOMAIN_FLAG_ISOLATED_MSI with something like this though. Look at
> how ARM GIC works for this mechanism.

I'm not seeing how SW_MSI will help here, but so far I've only done some
quick grepping and code skimming.

> Finally, please split this series up, if there are two different ways
> to manage the MSI aperture then please split it into two series with a
> clear description how the HW actually works.
>
> Maybe start with the simpler case of no msiptp??

The first five patches plus the "enable IOMMU_DMA" one will allow paging
domains to be used by default, while paving the way for patches 6-8 to
allow host IRQs to be isolated to the best of our ability (only able to
access IMSICs to which they are affined).
So we could have

  series1: irqdomain + map all imsics + enable IOMMU_DMA

  series2: actually apply irqdomain in order to implement map/unmap of MSI
           ptes based on IRQ affinity - set IRQ_DOMAIN_FLAG_ISOLATED_MSI,
           because that's the best we've got...

  series3: the rest of the patches of this series which introduce
           irqbypass support for the virt use case

Would that be better? Or do you see some need for some patch splits as
well?

Thanks,
drew
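[Editorial note: the address-matching mechanism described above (filtering
invalid addresses and converting valid ones into flat MSI table indices)
can be sketched in Python. This follows my reading of the RISC-V IOMMU
spec's masked address match and extract() function; the pattern/mask
values below are hypothetical, purely for illustration.]

```python
def msi_addr_match(addr: int, pattern: int, mask: int) -> bool:
    """A 4 KiB-aligned write address targets the MSI aperture when its
    PPN agrees with msi_addr_pattern on every bit that is 0 in
    msi_addr_mask (a sketch of the spec's masked address match)."""
    ppn = addr >> 12
    return (ppn & ~mask) == (pattern & ~mask)

def msi_table_index(addr: int, mask: int) -> int:
    """Gather the PPN bits selected by msi_addr_mask into a packed index
    (the spec's extract() function, i.e. a parallel bit extract)."""
    ppn, idx, out = addr >> 12, 0, 0
    for bit in range(52):                   # PPNs fit in 52 bits
        if (mask >> bit) & 1:
            idx |= ((ppn >> bit) & 1) << out
            out += 1
    return idx

# Hypothetical layout: 4 IMSIC pages at PPNs 0x28000..0x28003.
PATTERN, MASK = 0x28000, 0x3
addr = 0x28002 << 12                        # third interrupt file's page
assert msi_addr_match(addr, PATTERN, MASK)
assert msi_table_index(addr, MASK) == 2     # index into the flat MSI table
assert not msi_addr_match(0x1234 << 12, PATTERN, MASK)
```

A matching write is then redirected by the selected MSI PTE; a
non-matching IOVA falls through to the normal stage1/stage2 translation.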
On Mon, Sep 22, 2025 at 04:20:43PM -0500, Andrew Jones wrote:
> On Mon, Sep 22, 2025 at 03:43:36PM -0300, Jason Gunthorpe wrote:
> > On Sat, Sep 20, 2025 at 03:38:58PM -0500, Andrew Jones wrote:
> > > When setting irq affinity extract the IMSIC address the device
> > > needs to access and add it to the MSI table. If the device no
> > > longer needs access to an IMSIC then remove it from the table
> > > to prohibit access. This allows isolating device MSIs to a set
> > > of harts so we can now add the IRQ_DOMAIN_FLAG_ISOLATED_MSI IRQ
> > > domain flag.
> >
> > IRQ_DOMAIN_FLAG_ISOLATED_MSI has nothing to do with HARTs.
> >
> >  * Isolated MSI means that HW modeled by an irq_domain on the path from the
> >  * initiating device to the CPU will validate that the MSI message specifies an
> >  * interrupt number that the device is authorized to trigger. This must block
> >  * devices from triggering interrupts they are not authorized to trigger.
> >  * Currently authorization means the MSI vector is one assigned to the device.
>
> Unfortunately the RISC-V IOMMU doesn't have support for this. I've raised
> the lack of MSI data validation with the spec writers and I'll try to
> raise it again, but I was hoping we could still get
> IRQ_DOMAIN_FLAG_ISOLATED_MSI by simply ensuring the MSI addresses only
> include the affined harts (and also with the NOTE comment I've put in
> this patch to point out the deficiency).

Nope, sorry, it has to validate the incoming target Linux interrupt
vector in some way - that is the whole point and the required security.
We cannot allow devices assigned to VFIO to trigger vectors that belong
to devices outside that VFIO. HARTs should have nothing to do with that
security statement.

> > It has to do with each PCI BDF having a unique set of
> > validation/mapping tables for MSIs that are granular to the interrupt
> > number.
>
> Interrupt numbers (MSI data) aren't used by the RISC-V IOMMU in any way.
Interrupt number is a Linux concept, HW decodes the addr/data pair and
delivers it to some Linux interrupt. Linux doesn't care how the HW treats
the addr/data pair, it can ignore data if it wants.

> 1. stage1 not BARE
> ------------------
>
>         stage1        MSI table
>   IOVA ------> A ---------> host-MSI-address

If you run the driver like this then it should use IOMMU_RESV_SW_MSI to
make an aperture. Follow how ARM GIC works for this. The Linux
architecture is that the iommu_domain owner is responsible to provide a
range for SW_MSI and connect it to the interrupt driver.

> 2. stage1 is BARE, for example if only stage2 is in use
> -------------------------------------------------------
>
>                MSI table
>   IOVA == A ---------> host-MSI-address

And here you'd use HW_MSI to set a fixed aperture that matches
DC.msi_addr_* (?)

> I don't think we can apply a lot of AMD's and Intel's model to RISC-V.

AMD and Intel have the interrupt translation as part of the "DC". That
corresponds to the DC.msiptp related stuff.

ARM has a SW_MSI that applies the S1/S2 translation first and then puts
the interrupt translation in the GIC IP block. That corresponds to what
you just explained above.

Somehow RISC-V has the most challenging properties of both architectures,
and misunderstood what security Linux expects out of interrupt
remapping :|

Use SW_MSI to open the aperture like ARM, use techniques like AMD/Intel
to manage the translation table in the shared DC.

Don't touch the iommu_domain from the interrupt driver, it simply cannot
work.

> > It looks like there is something in here to support HW that doesn't
> > have msiptp? That's different, and also looks very confused.
>
> The only support is to ensure all the host IMSICs are mapped, otherwise
> we can't turn on IOMMU_DMA since all MSI writes will cause faults. We
> don't set IRQ_DOMAIN_FLAG_ISOLATED_MSI in this case, though, since we
> don't bother unmapping MSI addresses of harts that IRQs have been
> un-affined from.
OK, that seems like a good first series. Implement just the SW_MSI to
expose all the IMSICs.

> > Instead it probably has to use the SW_MSI mechanism to request mapping
> > the interrupt controller aperture. You don't get
> > IRQ_DOMAIN_FLAG_ISOLATED_MSI with something like this though. Look at
> > how ARM GIC works for this mechanism.
>
> I'm not seeing how SW_MSI will help here, but so far I've just done some
> quick grepping and code skimming.

SW_MSI is intended to deal with exactly what you are describing - the
IOMMU translates the MSI aperture.

1) The driver declares an IOMMU_RESV_SW_MSI in its reserved_regions

2) When a subsystem wishes to use an iommu_domain it must detect the
   SW_MSI and set up some IOVA for it to use

   a) dma-iommu.c allocates IOVA dynamically per-page via
      iommu_dma_get_msi_page()

   b) Classic VFIO uses iommu_get_msi_cookie() with the SW_MSI range.

   c) iommufd has its own logic in and around iommufd_sw_msi()

3) When the IRQ driver wants to establish a MSI it computes the
   untranslated address it wants and then calls iommu_dma_prepare_msi().
   iommu_dma_prepare_msi() will return a msi_desc with the address
   adjusted.

For instance dma-iommu.c has iommu_dma_prepare_msi() call into
iommu_dma_get_msi_page() which will establish any required mapping for
phys at some IOVA it picks.

The riscv mess has the additional obnoxious complexity that the S1/S2 are
inconsistent. Right now the only way to deal with this would be to only
use one of the S1 or S2 and make that decision when the iommu driver
starts. You can return the right SW_MSI/HW_MSI based on which PAGING
domain style the driver is going to use.

I recommend if the S2 is available you make the driver always use the S2
for everything and ignore the S1 except for explicit two stage
translation setup by a hypervisor. Thus always return HW_MSI. If the
iommu does not support S2 then always use S1 and always return SW_MSI.
Disallow SW from creating a PAGING domain that mismatches the expected
interrupt setup.

There is still quite some mess to enhance the iommu_dma_prepare_msi()
path to support all the modes; that will need new code as ARM doesn't
have the difference between S1/S2 (sigh).

But broadly this is the sketch for where the touchpoints should be to set
up a MSI aperture that has to go through the page table.

> The first five patches plus the "enable IOMMU_DMA" will allow paging
> domains to be used by default, while paving the way for patches 6-8 to
> allow host IRQs to be isolated to the best of our ability (only able to
> access IMSICs to which they are affined). So we could have
>
> series1: irqdomain + map all imsics + enable IOMMU_DMA

Yes, please lets try to get through this first..

> series2: actually apply irqdomain in order to implement map/unmap of MSI
>          ptes based on IRQ affinity - set IRQ_DOMAIN_FLAG_ISOLATED_MSI,
>          because that's the best we've got...

Makes sense, but as above, you need to do something to make it actually
implement IRQ_DOMAIN_FLAG_ISOLATED_MSI. Limited to HARTs is not the same
thing.

This also needs some kind of new iommufd support, and I don't know how it
will work, but setting the DC.msi_addr* should come from userspace and
create a new reserved region.. Maybe we need to create per-domain
reserved regions..

> series3: the rest of the patches of this series which introduce
>          irqbypass support for the virt use case

Yes

> Would that be better? Or do you see some need for some patch splits as
> well?

I did not try to look at the patches for this, but I think your three
themes make a lot of sense. Please include in the cover letter some of
the explanations from these two RFC postings about how the HW works.

Jason
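[Editorial note: for readers unfamiliar with the SW_MSI cookie mentioned
in steps 1-3 above, here is a toy Python model of the dma-iommu behavior
Jason references - iommu_dma_prepare_msi() establishing a mapping for the
doorbell page inside the reserved aperture. The class name and addresses
are hypothetical; this is a sketch, not the real kernel code.]

```python
class SwMsiCookie:
    """Per-iommu_domain SW_MSI state: a reserved IOVA aperture plus the
    phys->IOVA doorbell mappings established on demand (roughly what
    iommu_dma_get_msi_page() does, modeled in miniature)."""

    PAGE = 0x1000

    def __init__(self, base: int, size: int):
        self.base, self.size = base, size
        self.mapped = {}                    # phys page -> IOVA page
        self.next = base

    def prepare_msi(self, phys: int) -> int:
        """The IRQ layer hands over the untranslated doorbell address;
        map its page somewhere in the aperture (once) and return the
        adjusted address the device must actually write to."""
        page = phys & ~(self.PAGE - 1)
        if page not in self.mapped:
            assert self.next < self.base + self.size, "aperture full"
            self.mapped[page] = self.next   # real code: iommu_map() here
            self.next += self.PAGE
        return self.mapped[page] | (phys & (self.PAGE - 1))

cookie = SwMsiCookie(base=0x800_0000, size=0x10_0000)
iova = cookie.prepare_msi(0x2800_0040)      # hypothetical IMSIC doorbell
assert iova == 0x800_0040
# A second vector targeting the same doorbell page reuses the mapping:
assert cookie.prepare_msi(0x2800_0080) == 0x800_0080
```

The design point this models is that the aperture is a property of the
iommu_domain's owner, while the IRQ layer only ever asks for an adjusted
address - it never touches the domain's page tables directly.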
On Mon, Sep 22 2025 at 20:56, Jason Gunthorpe wrote:
> On Mon, Sep 22, 2025 at 04:20:43PM -0500, Andrew Jones wrote:
>> > It has to do with each PCI BDF having a unique set of
>> > validation/mapping tables for MSIs that are granular to the interrupt
>> > number.
>>
>> Interrupt numbers (MSI data) aren't used by the RISC-V IOMMU in any way.
>
> Interrupt number is a Linux concept, HW decodes the addr/data pair and
> delivers it to some Linux interrupt. Linux doesn't care how the HW
> treats the addr/data pair, it can ignore data if it wants.

Let me explain this a bit deeper.

As you said, the interrupt number is a pure kernel software construct,
which is mapped to a hardware interrupt source.

The interrupt domain, which is associated with a hardware interrupt
source, creates the mapping and supplies the resulting configuration to
the hardware, so that the hardware is able to raise an interrupt in the
CPU.

In case of MSI, this configuration is the MSI message (address, data).
That's composed by the domain according to the requirements of the
underlying CPU hardware resource. This underlying hardware resource can
be the CPU's interrupt controller itself or some intermediary hardware
entity.

The kernel reflects this in the interrupt domain hierarchy. The simplest
case for MSI is:

  [ CPU domain ] --- [ MSI domain ] -- device

The flow is as follows:

  device driver allocates an MSI interrupt in the MSI domain

  MSI domain allocates an interrupt in the CPU domain

  CPU domain allocates an interrupt vector and composes the address/data
  pair. If @data is written to @address, the interrupt is raised in the
  CPU

  MSI domain converts the address/data pair into device format and
  writes it into the device.

When the device fires an interrupt it writes @data to @address, which
raises the interrupt in the CPU at the allocated CPU vector. That vector
is then translated to the Linux interrupt number in the interrupt
handling entry code by looking it up in the CPU domain.
With a remapping domain intermediary this looks like this:

  [ CPU domain ] --- [ Remap domain ] --- [ MSI domain ] -- device

  device driver allocates an MSI interrupt in the MSI domain

  MSI domain allocates an interrupt in the Remap domain

  Remap domain allocates a resource in the remap space, e.g. an entry in
  the remap translation table, and then allocates an interrupt in the
  CPU domain.

  CPU domain allocates an interrupt vector and composes the address/data
  pair. If @data is written to @address, the interrupt is raised in the
  CPU

  Remap domain converts the CPU address/data pair to remap table format
  and writes it to the allocated entry in that table. It then composes a
  new address/data pair, which points at the remap table entry.

  MSI domain converts the remap address/data pair into device format and
  writes it into the device.

So when the device fires an interrupt it writes @data to @address, which
triggers the remap unit. The remap unit validates that the address/data
pair is valid for the device and if so it writes the CPU address/data
pair, which raises the interrupt in the CPU at the allocated vector. That
vector is then translated to the Linux interrupt number in the interrupt
handling entry code by looking it up in the CPU domain.

So from a kernel POV, the address/data pairs are just opaque
configuration values, which are written into the remap table and the
device. Whether the content of @data is relevant or not is a hardware
implementation detail. That implementation detail is only relevant for
the interrupt domain code, which handles a specific part of the
hierarchy.

The MSI domain does not need to know anything about the content and the
meaning of @address and @data. It just cares about converting that into
the device specific storage format.

The Remap domain does not need to know anything about the content and
the meaning of the CPU domain provided @address and @data. It just cares
about converting that into the remap table specific format.
The hardware entities do not know about the Linux interrupt number at
all. That relationship is purely software managed as a mapping from the
allocated CPU vector to the Linux interrupt number.

Hope that helps.

tglx
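[Editorial note: the two allocation flows above can be condensed into a
toy Python model showing why fully replacing the addr/data pair in the
remap step is what isolates devices. All class names and address values
here are illustrative, not kernel API.]

```python
class CpuDomain:
    """Allocates a vector and composes the addr/data pair that raises it."""
    def __init__(self):
        self.next_vector = 0
    def alloc(self):
        vec = self.next_vector
        self.next_vector += 1
        # If `data` is written to `address`, vector `vec` fires in the CPU.
        return {"address": 0xFEE0_0000, "data": vec}

class RemapDomain:
    """Per-device table: (device, remap addr/data) -> CPU addr/data."""
    REMAP_ADDR = 0xFE80_0000                # hypothetical remap doorbell
    def __init__(self, cpu):
        self.cpu = cpu
        self.tables = {}                    # device id -> {entry: cpu msg}
    def alloc(self, dev):
        cpu_msg = self.cpu.alloc()
        table = self.tables.setdefault(dev, {})
        entry = len(table)
        table[entry] = cpu_msg              # write CPU pair into the table
        # Compose a new pair pointing at the remap table entry.
        return {"address": self.REMAP_ADDR, "data": entry}
    def deliver(self, dev, msg):
        """HW path: look up the incoming pair in this device's table and
        completely replace it with the stored CPU pair (or block it)."""
        cpu_msg = self.tables.get(dev, {}).get(msg["data"])
        if cpu_msg is None or msg["address"] != self.REMAP_ADDR:
            return None                     # blocked: not authorized
        return cpu_msg["data"]              # vector raised in the CPU

cpu, remap = CpuDomain(), RemapDomain(CpuDomain())
remap = RemapDomain(cpu)
msg_a = remap.alloc("devA")                 # devA gets CPU vector 0
msg_b = remap.alloc("devB")                 # devB gets CPU vector 1
assert remap.deliver("devA", msg_a) == 0
# devB replaying devA's pair can only land on devB's own vector:
assert remap.deliver("devB", msg_a) == 1
# A pair with no table entry for the device is blocked outright:
assert remap.deliver("devB", {"address": 0xFE80_0000, "data": 7}) is None
```

Because the output pair comes solely from the per-device table, a device
can never synthesize a disallowed addr/data pair toward the CPU domain -
which is exactly the IRQ_DOMAIN_FLAG_ISOLATED_MSI guarantee.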
On Tue, Sep 23, 2025 at 12:12:52PM +0200, Thomas Gleixner wrote:
> With a remapping domain intermediary this looks like this:
>
> [ CPU domain ] --- [ Remap domain ] --- [ MSI domain ] -- device
>
> device driver allocates an MSI interrupt in the MSI domain
>
> MSI domain allocates an interrupt in the Remap domain
>
> Remap domain allocates a resource in the remap space, e.g. an entry
> in the remap translation table and then allocates an interrupt in the
> CPU domain.

Thanks!

And to be very crystal clear here, the meaning of
IRQ_DOMAIN_FLAG_ISOLATED_MSI is that the remap domain has a security
feature such that the device can only trigger CPU domain interrupts that
have been explicitly allocated in the remap domain for that device. The
device can never go through the remap domain and trigger some other
device's interrupt.

This is usually done by having the remap domain's HW take in the
Addr/Data pair, do a per-BDF table lookup and then completely replace the
Addr/Data pair with the "remapped" version. By fully replacing the pair,
the remap domain prevents the device from generating a disallowed
addr/data pair toward the CPU domain.

It fundamentally must be done by having the HW do a per-RID/BDF table
lookup based on the incoming MSI addr/data and fully sanitize the
resulting output.

There is some legacy history here. When MSI was first invented the goal
was to make interrupts scalable by removing any state from the CPU side.
The device would be told what Addr/Data to send to the CPU and the CPU
would just take some encoded information in that pair as a delivery
instruction. No state on the CPU side per interrupt.

In the world of virtualization it was realized this is not secure, so the
archs undid the core principle of MSI and the CPU HW has some kind of
state/table entry for every single device interrupt source.
x86/AMD did this by having per-device remapping tables in their IOMMU
device context that are selected by the incoming RID and effectively
completely rewrite the addr/data pair before it reaches the APIC. The
remap table alone now basically specifies where the interrupt is
delivered.

ARM doesn't do remapping; instead the interrupt controller itself has a
table that converts (BDF, Data) into a delivery instruction. It is
inherently secure.

That flag has nothing to do with affinity.

Jason
On Tue, Sep 23, 2025 at 12:12:52PM +0200, Thomas Gleixner wrote:
> On Mon, Sep 22 2025 at 20:56, Jason Gunthorpe wrote:
> > On Mon, Sep 22, 2025 at 04:20:43PM -0500, Andrew Jones wrote:
> >> > It has to do with each PCI BDF having a unique set of
> >> > validation/mapping tables for MSIs that are granular to the interrupt
> >> > number.
> >>
> >> Interrupt numbers (MSI data) aren't used by the RISC-V IOMMU in any way.
> >
> > Interrupt number is a Linux concept, HW decodes the addr/data pair and
> > delivers it to some Linux interrupt. Linux doesn't care how the HW
> > treats the addr/data pair, it can ignore data if it wants.
>
> Let me explain this a bit deeper.
>
> As you said, the interrupt number is a pure kernel software construct,
> which is mapped to a hardware interrupt source.
>
> The interrupt domain, which is associated with a hardware interrupt
> source, creates the mapping and supplies the resulting configuration to
> the hardware, so that the hardware is able to raise an interrupt in the
> CPU.
>
> In case of MSI, this configuration is the MSI message (address, data).
> That's composed by the domain according to the requirements of the
> underlying CPU hardware resource. This underlying hardware resource can
> be the CPU's interrupt controller itself or some intermediary hardware
> entity.
>
> The kernel reflects this in the interrupt domain hierarchy. The simplest
> case for MSI is:
>
>   [ CPU domain ] --- [ MSI domain ] -- device
>
> The flow is as follows:
>
>   device driver allocates an MSI interrupt in the MSI domain
>
>   MSI domain allocates an interrupt in the CPU domain
>
>   CPU domain allocates an interrupt vector and composes the
>   address/data pair. If @data is written to @address, the interrupt is
>   raised in the CPU
>
>   MSI domain converts the address/data pair into device format and
>   writes it into the device.
> When the device fires an interrupt it writes @data to @address, which
> raises the interrupt in the CPU at the allocated CPU vector. That
> vector is then translated to the Linux interrupt number in the
> interrupt handling entry code by looking it up in the CPU domain.
>
> With a remapping domain intermediary this looks like this:
>
>   [ CPU domain ] --- [ Remap domain ] --- [ MSI domain ] -- device
>
>   device driver allocates an MSI interrupt in the MSI domain
>
>   MSI domain allocates an interrupt in the Remap domain
>
>   Remap domain allocates a resource in the remap space, e.g. an entry
>   in the remap translation table, and then allocates an interrupt in
>   the CPU domain.
>
>   CPU domain allocates an interrupt vector and composes the
>   address/data pair. If @data is written to @address, the interrupt is
>   raised in the CPU
>
>   Remap domain converts the CPU address/data pair to remap table format
>   and writes it to the allocated entry in that table. It then composes
>   a new address/data pair, which points at the remap table entry.
>
>   MSI domain converts the remap address/data pair into device format
>   and writes it into the device.
>
> So when the device fires an interrupt it writes @data to @address,
> which triggers the remap unit. The remap unit validates that the
> address/data pair is valid for the device and if so it writes the CPU
> address/data pair, which raises the interrupt in the CPU at the
> allocated vector. That vector is then translated to the Linux
> interrupt number in the interrupt handling entry code by looking it
> up in the CPU domain.
>
> So from a kernel POV, the address/data pairs are just opaque
> configuration values, which are written into the remap table and the
> device. Whether the content of @data is relevant or not is a hardware
> implementation detail. That implementation detail is only relevant for
> the interrupt domain code, which handles a specific part of the
> hierarchy.
> The MSI domain does not need to know anything about the content and the
> meaning of @address and @data. It just cares about converting that into
> the device specific storage format.
>
> The Remap domain does not need to know anything about the content and
> the meaning of the CPU domain provided @address and @data. It just
> cares about converting that into the remap table specific format.
>
> The hardware entities do not know about the Linux interrupt number at
> all. That relationship is purely software managed as a mapping from the
> allocated CPU vector to the Linux interrupt number.
>
> Hope that helps.

Thanks, Thomas! I always appreciate these types of detailed design
descriptions which certainly help pull all the pieces together.

So, I think I got this right, as Patch4 adds the Remap domain, creating
this hierarchy:

  name:   IR-PCI-MSIX-0000:00:01.0-12
   size:   0
   mapped: 3
   flags:  0x00000213
             IRQ_DOMAIN_FLAG_HIERARCHY
             IRQ_DOMAIN_NAME_ALLOCATED
             IRQ_DOMAIN_FLAG_MSI
             IRQ_DOMAIN_FLAG_MSI_DEVICE
   parent: IOMMU-IR-0000:00:01.0-17
    name:   IOMMU-IR-0000:00:01.0-17
     size:   0
     mapped: 3
     flags:  0x00000123
               IRQ_DOMAIN_FLAG_HIERARCHY
               IRQ_DOMAIN_NAME_ALLOCATED
               IRQ_DOMAIN_FLAG_ISOLATED_MSI
               IRQ_DOMAIN_FLAG_MSI_PARENT
     parent: :soc:interrupt-controller@28000000-5
      name:   :soc:interrupt-controller@28000000-5
       size:   0
       mapped: 16
       flags:  0x00000103
                 IRQ_DOMAIN_FLAG_HIERARCHY
                 IRQ_DOMAIN_NAME_ALLOCATED
                 IRQ_DOMAIN_FLAG_MSI_PARENT

But, Patch4 only introduces the irqdomain; the functionality is added
with Patch8. Patch8 introduces
riscv_iommu_ir_get_msipte_idx_from_target() which "converts the CPU
address/data pair to remap table format". For the RISC-V IOMMU, the data
part of the pair is not used and the address undergoes a specified
translation into an index of the MSI table. For the non-virt use case we
skip the "composes a new address/data pair, which points at the remap
table entry" step since we just forward the original with an identity
mapping.
For the virt case we do write a new addr/data pair (Patch15) since we
need to map guest addresses to host addresses (but data is still just
forwarded since the RISC-V IOMMU doesn't support data remapping).

The lack of data remapping is unfortunate, since the part of the design
where "The remap unit validates that the address/data pair is valid for
the device and if so it writes the CPU address/data pair" is only half
true for riscv (since the remap unit always forwards data, we can't
change it in order to implement validation of it).

If we can't set IRQ_DOMAIN_FLAG_ISOLATED_MSI without data validation,
then we'll need to try to fast-track an IOMMU extension for it before we
can use VFIO without having to set allow_unsafe_interrupts.

Thanks,
drew
On Tue, Sep 23, 2025 at 09:37:31AM -0500, Andrew Jones wrote:
> undergoes a specified translation into an index of the MSI table. For the
> non-virt use case we skip the "composes a new address/data pair, which
> points at the remap table entry" step since we just forward the original
> with an identity mapping. For the virt case we do write a new addr/data
> pair (Patch15) since we need to map guest addresses to host addresses
> (but data is still just forwarded since the RISC-V IOMMU doesn't support
> data remapping).

You should banish thinking of non-virt/virt from your lexicon. Linux
doesn't work that way, and trying to force it to is a losing battle. If
you have a remap domain then it should always be remapping.

There is no such idea in Linux as a conditional IRQ domain dependent on
external factors (like how the IOMMU is configured, if the device is
"virt" or not, etc).

Be specific about what you mean.

Jason
On Tue, Sep 23, 2025 at 11:06:46AM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 23, 2025 at 12:12:52PM +0200, Thomas Gleixner wrote:
> > With a remapping domain intermediary this looks like this:
> >
> > [ CPU domain ] --- [ Remap domain ] --- [ MSI domain ] -- device
> >
> > device driver allocates an MSI interrupt in the MSI domain
> >
> > MSI domain allocates an interrupt in the Remap domain
> >
> > Remap domain allocates a resource in the remap space, e.g. an entry
> > in the remap translation table and then allocates an interrupt in the
> > CPU domain.
>
> Thanks!
>
> And to be very crystal clear here, the meaning of
> IRQ_DOMAIN_FLAG_ISOLATED_MSI is that the remap domain has a security
> feature such that the device can only trigger CPU domain interrupts
> that have been explicitly allocated in the remap domain for that
> device. The device can never go through the remap domain and trigger
> some other device's interrupt.
>
> This is usually done by having the remap domain's HW take in the
> Addr/Data pair, do a per-BDF table lookup and then completely replace
> the Addr/Data pair with the "remapped" version. By fully replacing the
> pair, the remap domain prevents the device from generating a disallowed
> addr/data pair toward the CPU domain.
>
> It fundamentally must be done by having the HW do a per-RID/BDF table
> lookup based on the incoming MSI addr/data and fully sanitize the
> resulting output.
>
> There is some legacy history here. When MSI was first invented the
> goal was to make interrupts scalable by removing any state from the
> CPU side. The device would be told what Addr/Data to send to the CPU
> and the CPU would just take some encoded information in that pair as a
> delivery instruction. No state on the CPU side per interrupt.
>
> In the world of virtualization it was realized this is not secure, so
> the archs undid the core principle of MSI and the CPU HW has some kind
> of state/table entry for every single device interrupt source.
> x86/AMD did this by having per-device remapping tables in their IOMMU
> device context that are selected by incoming RID and effectively
> completely rewrite the addr/data pair before it reaches the APIC. The
> remap table alone now basically specifies where the interrupt is
> delivered.
>
> ARM doesn't do remapping, instead the interrupt controller itself has
> a table that converts (BDF,Data) into a delivery instruction. It is
> inherently secure.

Thanks, Jason. All the above information is very much appreciated,
particularly the history.

> That flag has nothing to do with affinity.

So the reason I keep bringing affinity into the context of isolation is
that, for MSI-capable RISC-V, each CPU has its own MSI controller (IMSIC).
As riscv is missing data validation it's closer to the legacy, insecure
description above, but the "The device would be told what Addr/Data to
send to the CPU and the CPU would just take some encoded information in
that pair as a delivery instruction" part becomes "Addr is used to select
a CPU and then the CPU would take some encoded information in Data as the
delivery instruction". Since setting irq affinity is a way to set Addr to
one of a particular set of CPUs, then a device cannot raise interrupts on
CPUs outside that set. And, only interrupts that the allowed set of CPUs
are aware of may be raised. As a device's irqs move around from irqbalance
or a user's selection we can ensure that only the CPU an irq should be
able to reach is reachable by managing the IOMMU MSI table. This gives us
some level of isolation, but there is still the possibility a device may
raise an interrupt it should not be able to when its irqs are affined to
the same CPU as another device's and the malicious/broken device uses the
wrong MSI data. For the non-virt case it's fair to say that's nowhere
near isolated enough.
However, for the virt case, Addr is set to guest interrupt files
(something like virtual IMSICs) which means there will be no other host
device or other guest device irqs sharing those Addrs. Interrupts for
devices assigned to guests are truly isolated (not within the guest, but
we need nested support to fully isolate within the guest anyway).

In v1, I tried to only turn IRQ_DOMAIN_FLAG_ISOLATED_MSI on for the virt
case, but, as you pointed out, that wasn't a good idea. For v2, I was
hoping the comment above the flag was enough, but thinking about it some
more, I agree it's not. I'm not sure what we can do for this other than
an IOMMU spec change at this point.

Thanks,
drew
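Drew's point above (the MSI Addr selects a CPU's interrupt file, Data is forwarded unvalidated, and the IOMMU MSI table only gates which interrupt files are reachable) can be sketched as a toy model in C. All constants and names here (IMSIC_BASE, IMSIC_STRIDE, deliver_msi) are invented for illustration and are not driver code:

```c
#include <stdint.h>

/*
 * Toy model of per-CPU IMSICs: interrupt file i lives at
 * IMSIC_BASE + i * IMSIC_STRIDE.  The MSI Addr selects the CPU, the
 * MSI Data is the interrupt identity delivered there, and an IOMMU
 * MSI-table entry per interrupt file decides whether the device may
 * reach that CPU at all.  Invented constants, for illustration only.
 */
#define IMSIC_BASE   0x28000000ULL
#define IMSIC_STRIDE 0x1000ULL
#define NR_CPUS      4

static int msi_table_mapped[NR_CPUS]; /* 1 = device may target this CPU */

/* Returns the target CPU on success, -1 if the address is unmapped. */
static int deliver_msi(uint64_t addr, uint32_t data, uint32_t *identity)
{
        uint64_t off;

        if (addr < IMSIC_BASE)
                return -1;
        off = addr - IMSIC_BASE;
        if (off % IMSIC_STRIDE || off / IMSIC_STRIDE >= NR_CPUS)
                return -1;
        if (!msi_table_mapped[off / IMSIC_STRIDE])
                return -1; /* IOMMU MSI table prohibits this interrupt file */
        *identity = data; /* Data is forwarded, not validated */
        return (int)(off / IMSIC_STRIDE);
}
```

Note the model captures the thread's security gap: once two devices are affined to the same mapped CPU, nothing stops one from using the other's Data value.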
On Tue, Sep 23, 2025 at 10:12:42AM -0500, Andrew Jones wrote:
> be able to reach is reachable by managing the IOMMU MSI table. This gives
> us some level of isolation, but there is still the possibility a device
> may raise an interrupt it should not be able to when its irqs are affined
> to the same CPU as another device's

Yes, exactly, this is the problem with basic VFIO support as there is
no general idea of a virtualization context..

> and the malicious/broken device uses the wrong MSI data.

And to be clear it is not a malicious/broken device at issue here. In
PCI, MSI is simply a DMA to a magic address. *ANY* device can be
commanded by system software to generate *ANY* address/data on PCIe.

So any VFIO user can effectively generate any MSI it wants. It isn't a
matter of device brokenness.

> near isolated enough. However, for the virt case, Addr is set to guest
> interrupt files (something like virtual IMSICs) which means there will be
> no other host device or other guest device irqs sharing those Addrs.
> Interrupts for devices assigned to guests are truly isolated (not within
> the guest, but we need nested support to fully isolate within the guest
> anyway).

At least this is something, and I do think this is enough security to
be a useful solution. However, Linux has no existing support for the
idea of a VFIO device that only has access to "guest" interrupt HW.

Presumably this is direct injection only now?

I'm not even sure I could give you a sketch of what that would look like,
it involves co-operation between so many orthogonal layers it is hard
to imagine :\

kvm provides the virt context, iommufd controls the MSI aperture, irq
remapping controls the remap table, vfio sets interrupts..

VFIO needs to say 'irq layer only establish an interrupt on this KVM'
as some enforced mode ?

Jason
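For contrast, the full-rewrite scheme Jason describes earlier in the thread (a per-BDF table selected by the incoming message, with the outgoing addr/data pair taken entirely from the table entry) can be modeled the same way. This is a toy sketch, not any real IOMMU's layout; all names and sizes are invented:

```c
#include <stdint.h>

/*
 * Toy model of a per-BDF interrupt remapping table: the incoming MSI
 * data selects an entry and the outgoing addr/data pair comes entirely
 * from that entry, so the device cannot forge a pair it was not
 * assigned.  Invented names and sizes, for illustration only.
 */
#define REMAP_ENTRIES 8

struct remap_entry {
        int valid;
        uint64_t out_addr;
        uint32_t out_data;
};

/* One such table would exist per BDF; a single device is modeled here. */
static struct remap_entry remap_table[REMAP_ENTRIES];

/* Returns 0 and writes the rewritten pair, or -1 to block the MSI. */
static int remap_msi(uint32_t in_data, uint64_t *addr, uint32_t *data)
{
        if (in_data >= REMAP_ENTRIES || !remap_table[in_data].valid)
                return -1; /* not an interrupt this device is authorized for */
        *addr = remap_table[in_data].out_addr;
        *data = remap_table[in_data].out_data;
        return 0;
}
```

Because nothing the device sends survives except the entry selection, even a device commanded to emit an arbitrary addr/data pair can only trigger interrupts explicitly allocated for it, which is the property IRQ_DOMAIN_FLAG_ISOLATED_MSI names.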
On Tue, Sep 23, 2025 at 11:52:51AM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 23, 2025 at 09:37:31AM -0500, Andrew Jones wrote:
> > undergoes a specified translation into an index of the MSI table. For the
> > non-virt use case we skip the "composes a new address/data pair, which
> > points at the remap table entry" step since we just forward the original
> > with an identity mapping. For the virt case we do write a new addr,data
> > pair (Patch15) since we need to map guest addresses to host addresses (but
> > data is still just forwarded since the RISC-V IOMMU doesn't support data
> > remapping).
>
> You should banish thinking of non-virt/virt from your lexicon. Linux
> doesn't work that way, and trying to force it to is a losing battle.

Well, we need to consider virt when the hardware has virt-specific
features that we want to control. We also need to consider virt when
additional address translations to go from guest space to host space are
needed, as in this case.

> If you have a remap domain then it should always be remapping. There
> is no such idea in Linux as a conditional IRQ domain dependent on
> external factors (like how the IOMMU is configured, if the device is
> "virt" or not, etc).

The remap domain is created when the platform supports MSIs and always
does remapping when the IOMMU supports the MSI table. It could even do
remapping when the IOMMU doesn't support the MSI table since it could use
the DMA table instead, but that's left for a later patch series if
hardware actually shows up like that.

The difference between virt and non-virt is which addresses get remapped.
For virt, guest addresses get remapped, for non-virt, we don't remap
guest addresses. And, since we don't remap guest addresses for non-virt,
then, rather than invent some new, arbitrary address, we just use the
host address. Remapping is still in use, but, as I said above, it's an
identity mapping.
(Note, for the current riscv iommu, "remapping" for the non-virt case
just means keeping the set of IMSICs that a device may reach limited to
just what it should be allowed to reach.)

> Be specific about what you mean.

I'm always happy to clarify when asked. I'm not sure what I said that
would lead to thinking remapping was disabled for the non-virt case, but
hopefully what I wrote now clarifies that it is not.

Thanks,
drew
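Drew's "always remapping, sometimes identity" point can be put in code: the lookup path is the same either way, only the table contents differ. A toy sketch, with invented addresses and names, not driver code:

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Toy model of "the remap domain always remaps": every MSI address
 * goes through a table lookup.  For the host case the entries are
 * identity (out == in); for a guest they translate guest IMSIC
 * addresses to host ones.  Either way, an address with no entry is
 * prohibited.  Addresses are invented for illustration.
 */
struct msi_remap {
        uint64_t in_addr;
        uint64_t out_addr;
};

static int remap_addr(const struct msi_remap *tbl, size_t n,
                      uint64_t in, uint64_t *out)
{
        size_t i;

        for (i = 0; i < n; i++) {
                if (tbl[i].in_addr == in) {
                        *out = tbl[i].out_addr;
                        return 0;
                }
        }
        return -1; /* no entry in the MSI table: access prohibited */
}
```

The same lookup serves both cases, which is why the driver never needs a "conditional" IRQ domain; only the entries it installs differ.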
On Tue, Sep 23, 2025 at 12:27:02PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 23, 2025 at 10:12:42AM -0500, Andrew Jones wrote:
> > be able to reach is reachable by managing the IOMMU MSI table. This gives
> > us some level of isolation, but there is still the possibility a device
> > may raise an interrupt it should not be able to when its irqs are affined
> > to the same CPU as another device's
>
> Yes, exactly, this is the problem with basic VFIO support as there is
> no general idea of a virtualization context..
>
> > and the malicious/broken device uses the wrong MSI data.
>
> And to be clear it is not a malicious/broken device at issue here. In
> PCI, MSI is simply a DMA to a magic address. *ANY* device can be
> commanded by system software to generate *ANY* address/data on PCIe.
>
> So any VFIO user can effectively generate any MSI it wants. It isn't a
> matter of device brokenness.
>
> > near isolated enough. However, for the virt case, Addr is set to guest
> > interrupt files (something like virtual IMSICs) which means there will be
> > no other host device or other guest device irqs sharing those Addrs.
> > Interrupts for devices assigned to guests are truly isolated (not within
> > the guest, but we need nested support to fully isolate within the guest
> > anyway).
>
> At least this is something, and I do think this is enough security to
> be a useful solution. However, Linux has no existing support for the
> idea of a VFIO device that only has access to "guest" interrupt HW.
>
> Presumably this is direct injection only now?

Yup

> I'm not even sure I could give you a sketch of what that would look like,
> it involves co-operation between so many orthogonal layers it is hard
> to imagine :\
>
> kvm provides the virt context, iommufd controls the MSI aperture, irq
> remapping controls the remap table, vfio sets interrupts..
>
> VFIO needs to say 'irq layer only establish an interrupt on this KVM'
> as some enforced mode ?
Yes, this is the part that I'd like to lean on you for, since I understand
we want to avoid too much KVM/virt special casing for VFIO/IOMMUFD. I was
thinking that if I bit the bullet and implemented nested support then when
nesting was selected it would be apparent we're in virt context. However,
I was hoping to pull together a solution that works with current QEMU and
VFIO too.

Thanks,
drew
On Tue, Sep 23, 2025 at 10:50:56AM -0500, Andrew Jones wrote:
> Yes, this is the part that I'd like to lean on you for, since I understand
> we want to avoid too much KVM/virt special casing for VFIO/IOMMUFD. I was
> thinking that if I bit the bullet and implemented nested support then when
> nesting was selected it would be apparent we're in virt context. However,
> I was hoping to pull together a solution that works with current QEMU and
> VFIO too.

You probably do have to make nested part of this, but I don't have a
clear picture how you'd tie all the parts together through the nested
API..

Somehow you have to load a msiptp that is effectively linked to the
KVM reliably into the DC for the iommufd controlled devices that are
linked to that KVM. Then synchronize with VFIO that this is done and
it can setup KVM only interrupts somehow. This kvm entanglement is the
"virt" you have mentioned many times.

The direct injection interrupt path is already quite a confusing
thing..

Jason
On Tue, Sep 23, 2025 at 01:23:02PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 23, 2025 at 10:50:56AM -0500, Andrew Jones wrote:
> > Yes, this is the part that I'd like to lean on you for, since I understand
> > we want to avoid too much KVM/virt special casing for VFIO/IOMMUFD. I was
> > thinking that if I bit the bullet and implemented nested support then when
> > nesting was selected it would be apparent we're in virt context. However,
> > I was hoping to pull together a solution that works with current QEMU and
> > VFIO too.
>
> You probably do have to make nested part of this, but I don't have a
> clear picture how you'd tie all the parts together through the nested
> API..
>
> Somehow you have to load a msiptp that is effectively linked to the
> KVM reliably into the DC for the iommufd controlled devices that are
> linked to that KVM. Then synchronize with VFIO that this is done and
> it can setup KVM only interrupts somehow. This kvm entanglement is the
> "virt" you have mentioned many times.

Yes, the VFIO+KVM irqbypass framework currently provides much of this
support (which patches 10-17 of this series leverage). But, if using
nested is necessary in order to "signal" to the relevant subsystems that
the device's irqs will truly be isolated, then I'll need to start figuring
out how iommufd can fit into the mix (or if we'll need an iommufd-specific
implementation for a new mix).

> The direct injection interrupt path is already quite a confusing
> thing..

No argument there :-)

Thanks,
drew
On Mon, 22 Sep 2025 20:56:51 -0300, Jason Gunthorpe wrote:
> We no longer need to support 32 bit builds and we missed this while
> cleaning up.
> Right now the only way to deal with this would be to only use one of
> the S1 or S2 and make that decision when the iommu driver starts. You
> can return the right SW_MSI/HW_MSI based on which PAGING domain style
> the driver is going to use.
> I recommend if the S2 is available you make the driver always use the
> S2 for everything and ignore the S1 except for explicit two stage
> translation setup by a hypervisor. Thus always return HW_MSI.
> If the iommu does not support S2 then always use S1 and always return
> SW_MSI.

I strongly agree with this suggestion. On the one hand, the confusing
design of RISC-V, particularly the translation rules of the msix_table,
leads to different translation behaviors in S1 and S2 modes; on the other
hand, designing a proper caching mechanism for the msix_table in both S1
and S2 modes is quite challenging.

> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Thanks for the patch!

Best regards,
jinvas
diff --git a/drivers/iommu/riscv/iommu-ir.c b/drivers/iommu/riscv/iommu-ir.c
index 290d91a6c6cd..b97768cac4be 100644
--- a/drivers/iommu/riscv/iommu-ir.c
+++ b/drivers/iommu/riscv/iommu-ir.c
@@ -4,6 +4,7 @@
  *
  * Copyright © 2025 Ventana Micro Systems Inc.
  */
+#include <linux/cleanup.h>
 #include <linux/irqchip/riscv-imsic.h>
 #include <linux/irqdomain.h>
 #include <linux/msi.h>
@@ -106,6 +107,20 @@ static size_t riscv_iommu_ir_nr_msiptes(struct riscv_iommu_domain *domain)
 	return max_idx + 1;
 }
 
+static void riscv_iommu_ir_set_pte(struct riscv_iommu_msipte *pte, u64 addr)
+{
+	pte->pte = FIELD_PREP(RISCV_IOMMU_MSIPTE_M, 3) |
+		   riscv_iommu_phys_to_ppn(addr) |
+		   FIELD_PREP(RISCV_IOMMU_MSIPTE_V, 1);
+	pte->mrif_info = 0;
+}
+
+static void riscv_iommu_ir_clear_pte(struct riscv_iommu_msipte *pte)
+{
+	pte->pte = 0;
+	pte->mrif_info = 0;
+}
+
 static void riscv_iommu_ir_msitbl_inval(struct riscv_iommu_domain *domain,
 					struct riscv_iommu_msipte *pte)
 {
@@ -149,19 +164,99 @@ static void riscv_iommu_ir_msitbl_inval(struct riscv_iommu_domain *domain,
 	rcu_read_unlock();
 }
 
+static void riscv_iommu_ir_msitbl_map(struct riscv_iommu_domain *domain, size_t idx,
+				      phys_addr_t addr)
+{
+	struct riscv_iommu_msipte *pte;
+
+	if (!domain->msi_root)
+		return;
+
+	if (!refcount_inc_not_zero(&domain->msi_pte_counts[idx])) {
+		scoped_guard(raw_spinlock_irqsave, &domain->msi_lock) {
+			if (refcount_read(&domain->msi_pte_counts[idx]) == 0) {
+				pte = &domain->msi_root[idx];
+				riscv_iommu_ir_set_pte(pte, addr);
+				riscv_iommu_ir_msitbl_inval(domain, pte);
+				refcount_set(&domain->msi_pte_counts[idx], 1);
+			} else {
+				refcount_inc(&domain->msi_pte_counts[idx]);
+			}
+		}
+	}
+}
+
+static void riscv_iommu_ir_msitbl_unmap(struct riscv_iommu_domain *domain, size_t idx)
+{
+	struct riscv_iommu_msipte *pte;
+
+	if (!domain->msi_root)
+		return;
+
+	scoped_guard(raw_spinlock_irqsave, &domain->msi_lock) {
+		if (refcount_dec_and_test(&domain->msi_pte_counts[idx])) {
+			pte = &domain->msi_root[idx];
+			riscv_iommu_ir_clear_pte(pte);
+			riscv_iommu_ir_msitbl_inval(domain, pte);
+		}
+	}
+}
+
+static size_t riscv_iommu_ir_get_msipte_idx_from_target(struct riscv_iommu_domain *domain,
+							struct irq_data *data, phys_addr_t *addr)
+{
+	struct msi_msg msg;
+
+	BUG_ON(irq_chip_compose_msi_msg(data, &msg));
+
+	*addr = ((phys_addr_t)msg.address_hi << 32) | msg.address_lo;
+
+	return riscv_iommu_ir_compute_msipte_idx(domain, *addr);
+}
+
+static int riscv_iommu_ir_irq_set_affinity(struct irq_data *data,
+					   const struct cpumask *dest, bool force)
+{
+	struct riscv_iommu_info *info = data->domain->host_data;
+	struct riscv_iommu_domain *domain = info->domain;
+	phys_addr_t old_addr, new_addr;
+	size_t old_idx, new_idx;
+	int ret;
+
+	old_idx = riscv_iommu_ir_get_msipte_idx_from_target(domain, data, &old_addr);
+
+	ret = irq_chip_set_affinity_parent(data, dest, force);
+	if (ret < 0)
+		return ret;
+
+	new_idx = riscv_iommu_ir_get_msipte_idx_from_target(domain, data, &new_addr);
+
+	if (new_idx == old_idx)
+		return ret;
+
+	riscv_iommu_ir_msitbl_unmap(domain, old_idx);
+	riscv_iommu_ir_msitbl_map(domain, new_idx, new_addr);
+
+	return ret;
+}
+
 static struct irq_chip riscv_iommu_ir_irq_chip = {
 	.name			= "IOMMU-IR",
 	.irq_ack		= irq_chip_ack_parent,
 	.irq_mask		= irq_chip_mask_parent,
 	.irq_unmask		= irq_chip_unmask_parent,
-	.irq_set_affinity	= irq_chip_set_affinity_parent,
+	.irq_set_affinity	= riscv_iommu_ir_irq_set_affinity,
 };
 
 static int riscv_iommu_ir_irq_domain_alloc_irqs(struct irq_domain *irqdomain,
 						unsigned int irq_base,
 						unsigned int nr_irqs, void *arg)
 {
+	struct riscv_iommu_info *info = irqdomain->host_data;
+	struct riscv_iommu_domain *domain = info->domain;
 	struct irq_data *data;
+	phys_addr_t addr;
+	size_t idx;
 	int i, ret;
 
 	ret = irq_domain_alloc_irqs_parent(irqdomain, irq_base, nr_irqs, arg);
@@ -171,14 +266,36 @@ static int riscv_iommu_ir_irq_domain_alloc_irqs(struct irq_domain *irqdomain,
 	for (i = 0; i < nr_irqs; i++) {
 		data = irq_domain_get_irq_data(irqdomain, irq_base + i);
 		data->chip = &riscv_iommu_ir_irq_chip;
+		idx = riscv_iommu_ir_get_msipte_idx_from_target(domain, data, &addr);
+		riscv_iommu_ir_msitbl_map(domain, idx, addr);
 	}
 
 	return 0;
 }
 
+static void riscv_iommu_ir_irq_domain_free_irqs(struct irq_domain *irqdomain,
+						unsigned int irq_base,
+						unsigned int nr_irqs)
+{
+	struct riscv_iommu_info *info = irqdomain->host_data;
+	struct riscv_iommu_domain *domain = info->domain;
+	struct irq_data *data;
+	phys_addr_t addr;
+	size_t idx;
+	int i;
+
+	for (i = 0; i < nr_irqs; i++) {
+		data = irq_domain_get_irq_data(irqdomain, irq_base + i);
+		idx = riscv_iommu_ir_get_msipte_idx_from_target(domain, data, &addr);
+		riscv_iommu_ir_msitbl_unmap(domain, idx);
+	}
+
+	irq_domain_free_irqs_parent(irqdomain, irq_base, nr_irqs);
+}
+
 static const struct irq_domain_ops riscv_iommu_ir_irq_domain_ops = {
 	.alloc	= riscv_iommu_ir_irq_domain_alloc_irqs,
-	.free	= irq_domain_free_irqs_parent,
+	.free	= riscv_iommu_ir_irq_domain_free_irqs,
 };
 
 static const struct msi_parent_ops riscv_iommu_ir_msi_parent_ops = {
@@ -221,6 +338,19 @@ struct irq_domain *riscv_iommu_ir_irq_domain_create(struct riscv_iommu_device *i
 		return NULL;
 	}
 
+	if (iommu->caps & RISCV_IOMMU_CAPABILITIES_MSI_FLAT) {
+		/*
+		 * NOTE: The RISC-V IOMMU doesn't actually support isolated MSI because
+		 * there is no MSI message validation (see the comment above
+		 * msi_device_has_isolated_msi()). However, we claim isolated MSI here
+		 * because applying the IOMMU ensures MSI messages may only be delivered
+		 * to the mapped MSI addresses. This allows MSIs to be isolated to
+		 * particular harts/vcpus where the unvalidated MSI messages can be
+		 * tolerated.
+		 */
+		irqdomain->flags |= IRQ_DOMAIN_FLAG_ISOLATED_MSI;
+	}
+
 	irqdomain->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT;
 	irqdomain->msi_parent_ops = &riscv_iommu_ir_msi_parent_ops;
 	irq_domain_update_bus_token(irqdomain, DOMAIN_BUS_MSI_REMAP);
@@ -233,6 +363,7 @@ struct irq_domain *riscv_iommu_ir_irq_domain_create(struct riscv_iommu_device *i
 static void riscv_iommu_ir_free_msi_table(struct riscv_iommu_domain *domain)
 {
 	iommu_free_pages(domain->msi_root);
+	kfree(domain->msi_pte_counts);
 }
 
 void riscv_iommu_ir_irq_domain_remove(struct riscv_iommu_info *info)
@@ -274,6 +405,14 @@ static int riscv_ir_set_imsic_global_config(struct riscv_iommu_device *iommu,
 					    nr_ptes * sizeof(*domain->msi_root));
 		if (!domain->msi_root)
 			return -ENOMEM;
+
+		domain->msi_pte_counts = kcalloc(nr_ptes, sizeof(refcount_t), GFP_KERNEL_ACCOUNT);
+		if (!domain->msi_pte_counts) {
+			iommu_free_pages(domain->msi_root);
+			return -ENOMEM;
+		}
+
+		raw_spin_lock_init(&domain->msi_lock);
 	}
 
 	return 0;
diff --git a/drivers/iommu/riscv/iommu.h b/drivers/iommu/riscv/iommu.h
index 1fe35f1210fb..aeb5642f003c 100644
--- a/drivers/iommu/riscv/iommu.h
+++ b/drivers/iommu/riscv/iommu.h
@@ -34,6 +34,8 @@ struct riscv_iommu_domain {
 	unsigned int pgd_mode;
 	unsigned long *pgd_root;
 	struct riscv_iommu_msipte *msi_root;
+	refcount_t *msi_pte_counts;
+	raw_spinlock_t msi_lock;
 	u64 msi_addr_mask;
 	u64 msi_addr_pattern;
 	u32 group_index_bits;
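The refcounting scheme the patch introduces (the first user of an MSI-table index installs the PTE, later users only bump the count, and the last unmap clears it) can be reduced to a few lines. This toy model uses invented names, and it elides the raw spinlock and IOMMU invalidation that the real code performs around the 0 -> 1 and 1 -> 0 transitions:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model of refcounted MSI-table entries.  The first map of an
 * index installs the translation, later maps only bump the count,
 * and the last unmap tears it down.  Locking and the per-transition
 * IOMMU invalidation of the real driver are intentionally omitted.
 */
#define NR_PTES 4

static unsigned int pte_refs[NR_PTES];
static uint64_t ptes[NR_PTES]; /* 0 = invalid entry */

static void msitbl_map(size_t idx, uint64_t addr)
{
        if (pte_refs[idx]++ == 0)
                ptes[idx] = addr; /* first user installs the PTE */
}

static void msitbl_unmap(size_t idx)
{
        if (--pte_refs[idx] == 0)
                ptes[idx] = 0; /* last user clears the PTE */
}
```

This is why two irqs affined to the same IMSIC can share one table entry: the entry only disappears once the last irq targeting that address moves away or is freed.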
When setting irq affinity extract the IMSIC address the device
needs to access and add it to the MSI table. If the device no
longer needs access to an IMSIC then remove it from the table
to prohibit access. This allows isolating device MSIs to a set
of harts so we can now add the IRQ_DOMAIN_FLAG_ISOLATED_MSI IRQ
domain flag.

Signed-off-by: Andrew Jones <ajones@ventanamicro.com>
---
 drivers/iommu/riscv/iommu-ir.c | 143 ++++++++++++++++++++++++++++++++-
 drivers/iommu/riscv/iommu.h    |   2 +
 2 files changed, 143 insertions(+), 2 deletions(-)