Message ID: cover.1714494653.git.tjeznach@rivosinc.com
Series: Linux RISC-V IOMMU Support
On 2024/5/1 4:01, Tomasz Jeznach wrote:
> Advertise IOMMU device and its core API.
> Only minimal implementation for single identity domain type, without
> per-group domain protection.
>
> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>

with some nits below.

> ---
>  drivers/iommu/riscv/iommu.c | 64 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 64 insertions(+)

[..]

>  static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
>  {
>  	u64 ddtp;
> @@ -71,6 +126,7 @@ static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
>
>  void riscv_iommu_remove(struct riscv_iommu_device *iommu)
>  {
> +	iommu_device_unregister(&iommu->iommu);
>  	iommu_device_sysfs_remove(&iommu->iommu);
>  }
>
> @@ -96,8 +152,16 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
>  		goto err_sysfs;
>  	}
>
> +	rc = iommu_device_register(&iommu->iommu, &riscv_iommu_ops, iommu->dev);
> +	if (rc) {
> +		dev_err_probe(iommu->dev, rc, "cannot register iommu interface\n");
> +		goto err_iommu;
> +	}
> +
>  	return 0;
>
> +err_iommu:
> +	iommu_device_sysfs_remove(&iommu->iommu);
>  err_sysfs:
>  	return rc;
>  }

It's better to make the goto label indicate what is going to be handled.
So it's more readable to make it like this:

err_remove_sysfs:
	iommu_device_sysfs_remove(&iommu->iommu);

Best regards,
baolu
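Concretely, the suggestion lands the function in this shape — a minimal
sketch assuming the surrounding code from the quoted hunks (the sysfs
error string is inferred, not quoted):

int riscv_iommu_init(struct riscv_iommu_device *iommu)
{
	int rc;

	rc = iommu_device_sysfs_add(&iommu->iommu, NULL, NULL, "riscv-iommu@%s",
				    dev_name(iommu->dev));
	if (rc)
		return dev_err_probe(iommu->dev, rc,
				     "cannot register sysfs interface\n");

	rc = iommu_device_register(&iommu->iommu, &riscv_iommu_ops, iommu->dev);
	if (rc) {
		dev_err_probe(iommu->dev, rc, "cannot register iommu interface\n");
		goto err_remove_sysfs;
	}

	return 0;

err_remove_sysfs:
	/* The label now names the cleanup performed, not the jump origin. */
	iommu_device_sysfs_remove(&iommu->iommu);
	return rc;
}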
On 2024/5/1 4:01, Tomasz Jeznach wrote:
> Introduce device driver for PCIe implementation
> of RISC-V IOMMU architected hardware.
>
> IOMMU hardware and system support for MSI or MSI-X is
> required by this implementation.
>
> Vendor and device identifiers used in this patch
> matches QEMU implementation of the RISC-V IOMMU PCIe
> device, from Rivos VID (0x1efd) range allocated by the PCI-SIG.
>
> MAINTAINERS | added iommu-pci.c already covered by matching pattern.
>
> Link: https://lore.kernel.org/qemu-devel/20240307160319.675044-1-dbarboza@ventanamicro.com/
> Co-developed-by: Nick Kossifidis <mick@ics.forth.gr>
> Signed-off-by: Nick Kossifidis <mick@ics.forth.gr>
> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
> ---
>  drivers/iommu/riscv/Kconfig     |   5 ++
>  drivers/iommu/riscv/Makefile    |   1 +
>  drivers/iommu/riscv/iommu-pci.c | 119 ++++++++++++++++++++++++++++++++
>  3 files changed, 125 insertions(+)
>  create mode 100644 drivers/iommu/riscv/iommu-pci.c

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>

Best regards,
baolu
On 2024/5/1 4:01, Tomasz Jeznach wrote:
> +static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
> +{
> +	u64 ddtp;
> +
> +	/*
> +	 * Make sure the IOMMU is switched off or in pass-through mode during regular
> +	 * boot flow and disable translation when we boot into a kexec kernel and the
> +	 * previous kernel left them enabled.
> +	 */
> +	ddtp = riscv_iommu_readq(iommu, RISCV_IOMMU_REG_DDTP);
> +	if (ddtp & RISCV_IOMMU_DDTP_BUSY)
> +		return -EBUSY;
> +
> +	if (FIELD_GET(RISCV_IOMMU_DDTP_MODE, ddtp) > RISCV_IOMMU_DDTP_MODE_BARE) {
> +		if (!is_kdump_kernel())

Is kdump supported for RISC-V architectures? If so, the documentation
in Documentation/admin-guide/kdump/kdump.rst might need an update.

There is a possibility of ongoing DMAs during the boot process of the
kdump capture kernel because there's a small chance of legacy DMA setups
targeting any memory location. Kdump typically allows these ongoing DMA
transfers to complete, assuming they were intended for valid memory
regions.

The IOMMU subsystem implements a default domain deferred attachment
mechanism for this. In the kdump capture kernel, the whole device
context tables are copied from the original kernel and will be
overridden once the device driver calls the kernel DMA interface for the
first time. This assumes that all old DMA transfers are completed after
the driver's takeover.

Will you consider this for RISC-V architecture as well?

> +			return -EBUSY;
> +		riscv_iommu_disable(iommu);
> +	}
> +
> +	/* Configure accesses to in-memory data structures for CPU-native byte order. */
> +	if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) != !!(iommu->fctl & RISCV_IOMMU_FCTL_BE)) {
> +		if (!(iommu->caps & RISCV_IOMMU_CAP_END))
> +			return -EINVAL;
> +		riscv_iommu_writel(iommu, RISCV_IOMMU_REG_FCTL,
> +				   iommu->fctl ^ RISCV_IOMMU_FCTL_BE);
> +		iommu->fctl = riscv_iommu_readl(iommu, RISCV_IOMMU_REG_FCTL);
> +		if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) != !!(iommu->fctl & RISCV_IOMMU_FCTL_BE))
> +			return -EINVAL;
> +	}
> +
> +	return 0;
> +}

Best regards,
baolu
On Wed, May 01, 2024 at 06:26:20PM +0800, Baolu Lu wrote:
> On 2024/5/1 4:01, Tomasz Jeznach wrote:
> > +static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
> > +{
> > [..]
> > +	if (FIELD_GET(RISCV_IOMMU_DDTP_MODE, ddtp) > RISCV_IOMMU_DDTP_MODE_BARE) {
> > +		if (!is_kdump_kernel())
>
> Is kdump supported for RISC-V architectures? If so, the documentation
> in Documentation/admin-guide/kdump/kdump.rst might need an update.
>
> There is a possibility of ongoing DMAs during the boot process of the
> kdump capture kernel because there's a small chance of legacy DMA setups
> targeting any memory location. Kdump typically allows these ongoing DMA
> transfers to complete, assuming they were intended for valid memory
> regions.
>
> The IOMMU subsystem implements a default domain deferred attachment
> mechanism for this. In the kdump capture kernel, the whole device
> context tables are copied from the original kernel and will be
> overridden once the device driver calls the kernel DMA interface for the
> first time. This assumes that all old DMA transfers are completed after
> the driver's takeover.
>
> Will you consider this for RISC-V architecture as well?

It seems we decided not to do that mess in ARM..

New architectures doing kdump should put the iommu in a full blocking
state before handing over the next kernel, and this implies that
device drivers need to cleanly suspend their DMAs before going into
the next kernel.

Jason
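For this driver, the hand-over Jason describes would amount to forcing
the device directory into OFF mode on shutdown; in OFF mode no device
context is reachable, so all inbound DMA is blocked. A hedged sketch,
reusing the mode-setting helper posted later in this series — the
shutdown hook itself and its platform wiring are assumptions, not part
of the posted patches:

/* Hypothetical shutdown path: quiesce translation before kexec/kdump. */
static void riscv_iommu_shutdown(struct platform_device *pdev)
{
	struct riscv_iommu_device *iommu = platform_get_drvdata(pdev);

	/* Block all inbound DMA before the next kernel takes over. */
	riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_MODE_OFF);
}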
On Tue, Apr 30, 2024 at 01:01:57PM -0700, Tomasz Jeznach wrote:

> +#define iommu_domain_to_riscv(iommu_domain) \
> +	container_of(iommu_domain, struct riscv_iommu_domain, domain)
> +
> +#define dev_to_domain(dev) \
> +	iommu_domain_to_riscv(dev_iommu_priv_get(dev))

Please use the priv properly and put a struct around it, you'll
certainly need this eventually to do the rest of the advanced
features.

> +static void riscv_iommu_bond_unlink(struct riscv_iommu_domain *domain, struct device *dev)
> +{
> +	struct riscv_iommu_bond *bond, *found = NULL;
> +	unsigned long flags;
> +
> +	if (!domain)
> +		return;
> +
> +	spin_lock_irqsave(&domain->lock, flags);

This is never locked from an irq, you don't need to use the irqsave
variations.

> +	list_for_each_entry_rcu(bond, &domain->bonds, list) {
> +		if (bond->dev == dev) {
> +			list_del_rcu(&bond->list);
> +			found = bond;
> +		}
> +	}
> +	spin_unlock_irqrestore(&domain->lock, flags);
> +
> +	/* Release and wait for all read-rcu critical sections have completed. */
> +	kfree_rcu(found, rcu);
> +	synchronize_rcu();

Please no, synchronize_rcu() on a path like this is not so
reasonable.. Also you don't need kfree_rcu() if you write it like this.

This still looks better to do what I said before, put the iommu not
the dev in the bond struct.

> +static struct iommu_domain *riscv_iommu_alloc_paging_domain(struct device *dev)
> +{
> +	struct riscv_iommu_domain *domain;
> +	struct riscv_iommu_device *iommu;
> +
> +	iommu = dev ? dev_to_iommu(dev) : NULL;
> +	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> +	if (!domain)
> +		return ERR_PTR(-ENOMEM);
> +
> +	INIT_LIST_HEAD_RCU(&domain->bonds);
> +	spin_lock_init(&domain->lock);
> +	domain->numa_node = NUMA_NO_NODE;
> +
> +	/*
> +	 * Follow system address translation mode.
> +	 * RISC-V IOMMU ATP mode values match RISC-V CPU SATP mode values.
> +	 */
> +	domain->pgd_mode = satp_mode >> SATP_MODE_SHIFT;

This seems really strange, the iommu paging domains should be
unrelated to what the CPU is doing. There is no connection between
these two concepts.

Just pick a size that the iommu supports.

The number of radix levels is a tunable a lot of iommus have that we
haven't really exposed to anything else yet.

> +	/*
> +	 * Note: RISC-V Privilege spec mandates that virtual addresses
> +	 * need to be sign-extended, so if (VA_BITS - 1) is set, all
> +	 * bits >= VA_BITS need to also be set or else we'll get a
> +	 * page fault. However the code that creates the mappings
> +	 * above us (e.g. iommu_dma_alloc_iova()) won't do that for us
> +	 * for now, so we'll end up with invalid virtual addresses
> +	 * to map. As a workaround until we get this sorted out
> +	 * limit the available virtual addresses to VA_BITS - 1.
> +	 */
> +	domain->domain.geometry.aperture_start = 0;
> +	domain->domain.geometry.aperture_end = DMA_BIT_MASK(VA_BITS - 1);
> +	domain->domain.geometry.force_aperture = true;

Yikes.. This is probably the best solution long term anyhow, unless
you need to use the last page in VFIO for some reason.

>  static int riscv_iommu_device_domain_type(struct device *dev)
>  {
> -	return IOMMU_DOMAIN_IDENTITY;
> +	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> +
> +	if (iommu->ddt_mode == RISCV_IOMMU_DDTP_MODE_BARE)
> +		return IOMMU_DOMAIN_IDENTITY;
> +

Is there even a point of binding an iommu driver if the HW can't
support a DDT table? Just return -ENODEV from probe_device?

Logically an iommu block that can't decode the RID has no association
at all with a Linux struct device :)

Jason
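A minimal sketch of the wrapper Jason is asking for:
dev_iommu_priv_set()/dev_iommu_priv_get() would carry a driver-owned
struct, with the domain hanging off it (the field set here is an
assumption; advanced features would grow it):

/* Hypothetical per-device private data, installed from probe_device(). */
struct riscv_iommu_info {
	struct riscv_iommu_domain *domain;	/* currently attached domain */
	/* later: PASID state, ATS RIDs, ... */
};

#define dev_to_domain(dev) \
	(((struct riscv_iommu_info *)dev_iommu_priv_get(dev))->domain)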
On Tue, Apr 30, 2024 at 01:01:55PM -0700, Tomasz Jeznach wrote:
> Introduce device context allocation and device directory tree
> management including capabilities discovery sequence, as described
> in Chapter 2.1 of the RISC-V IOMMU Architecture Specification.
>
> Device directory mode will be auto detected using DDTP WARL property,
> using highest mode supported by the driver and hardware. If none
> supported can be configured, driver will fall back to global pass-through.
>
> First level DDTP page can be located in I/O (detected using DDTP WARL)
> and system memory.
>
> Only simple identity and release (blocking) protection domains are
> supported by this implementation.

Why rename the concept? We call it a BLOCKING domain, just use that
name please.

> +static int riscv_iommu_attach_release_domain(struct iommu_domain *iommu_domain,
> +					     struct device *dev)
> +{
> +	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> +
> +	if (iommu->ddt_mode > RISCV_IOMMU_DDTP_MODE_BARE)
> +		riscv_iommu_iodir_update(iommu, dev, RISCV_IOMMU_FSC_BARE, 0);
> +
> +	return 0;
> +}
> +
> +static struct iommu_domain riscv_iommu_release_domain = {
> +	.type = IOMMU_DOMAIN_BLOCKED,
> +	.ops = &(const struct iommu_domain_ops) {
> +		.attach_dev = riscv_iommu_attach_release_domain,
> +	}
> +};

'riscv_iommu_release_domain' doesn't make sense..

Jason
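The requested change is purely naming — the statically-defined domain
is the subsystem's BLOCKING domain, so call it that. The quoted
definition under that naming, body unchanged:

static struct iommu_domain riscv_iommu_blocking_domain = {
	.type = IOMMU_DOMAIN_BLOCKED,
	.ops = &(const struct iommu_domain_ops) {
		.attach_dev = riscv_iommu_attach_blocking_domain,
	}
};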
On Tue, Apr 30, 2024 at 01:01:50PM -0700, Tomasz Jeznach wrote:
> This patch series introduces support for RISC-V IOMMU architected
> hardware into the Linux kernel.

It seems in reasonable shape now, at least in terms of implementing
the domain logic.

It would be nice if you'd run it through clang-format and correct some
of the minor misformatting it will point out. We still like to have an
80 col line limit in most cases. There are many overages here that
aren't well justified.

And you could consider the nitpicky style advice to use 'reverse
christmas tree' for the variable declarations like most of the
subsystem is trending toward.

Jason
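For reference, 'reverse christmas tree' orders local declarations from
the longest line down to the shortest, for example (illustrative only):

	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
	struct riscv_iommu_command cmd;
	unsigned long flags;
	u64 ddtp;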
On 5/1/24 4:01 AM, Tomasz Jeznach wrote:
> @@ -128,6 +489,7 @@ void riscv_iommu_remove(struct riscv_iommu_device *iommu)
>  {
>  	iommu_device_unregister(&iommu->iommu);
>  	iommu_device_sysfs_remove(&iommu->iommu);
> +	riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_MODE_OFF);
>  }
>
>  int riscv_iommu_init(struct riscv_iommu_device *iommu)
> @@ -138,12 +500,13 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
>  	if (rc)
>  		return dev_err_probe(iommu->dev, rc, "unexpected device state\n");
>
> -	/*
> -	 * Placeholder for a complete IOMMU device initialization.
> -	 * For now, only bare minimum: enable global identity mapping mode and register sysfs.
> -	 */
> -	riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP,
> -			   FIELD_PREP(RISCV_IOMMU_DDTP_MODE, RISCV_IOMMU_DDTP_MODE_BARE));
> +	rc = riscv_iommu_iodir_alloc(iommu);
> +	if (rc)
> +		goto err_init;
> +
> +	rc = riscv_iommu_iodir_set_mode(iommu, RISCV_IOMMU_DDTP_MODE_MAX);
> +	if (rc)
> +		goto err_init;
>
>  	rc = iommu_device_sysfs_add(&iommu->iommu, NULL, NULL, "riscv-iommu@%s",
>  				    dev_name(iommu->dev));

The device directory root page might be allocated in
riscv_iommu_iodir_alloc(),

+	if (!iommu->ddt_root) {
+		iommu->ddt_root = riscv_iommu_get_pages(iommu, 0);
+		iommu->ddt_phys = __pa(iommu->ddt_root);
+	}

But I didn't find any place to free it in the error paths. Did I
overlook anything?

Best regards,
baolu
On 5/2/24 9:38 AM, Baolu Lu wrote:
> On 5/1/24 4:01 AM, Tomasz Jeznach wrote:
>> [..]
>> +	rc = riscv_iommu_iodir_alloc(iommu);
>> +	if (rc)
>> +		goto err_init;
>> [..]
>
> The device directory root page might be allocated in
> riscv_iommu_iodir_alloc(),
>
> +	if (!iommu->ddt_root) {
> +		iommu->ddt_root = riscv_iommu_get_pages(iommu, 0);
> +		iommu->ddt_phys = __pa(iommu->ddt_root);
> +	}
>
> But I didn't find any place to free it in the error paths. Did I
> overlook anything?

I found the answer myself. devres_alloc() is used, so the page memory
resources are managed by the driver core automatically. Please ignore
the above comment.

Not sure about the pages for the paging domain. If the release of all
pages is deferred, it may cause high memory consumption. We have a
trade-off here. On one hand, this might speed up the domain mapping
since all pages might be preallocated; on the other hand, it will cost
more memory.

Best regards,
baolu
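For readers unfamiliar with the mechanism: devres ties an allocation's
lifetime to the struct device, so error paths and driver unbind release
it without explicit free calls. A hedged sketch of how such a page
helper could be built on it — the real riscv_iommu_get_pages() may
differ in detail:

struct riscv_iommu_devres {
	void *addr;
	int order;
};

static void riscv_iommu_devres_pages_release(struct device *dev, void *res)
{
	struct riscv_iommu_devres *devres = res;

	free_pages((unsigned long)devres->addr, devres->order);
}

static void *riscv_iommu_get_pages(struct riscv_iommu_device *iommu, int order)
{
	struct riscv_iommu_devres *devres;
	void *addr;

	addr = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
	if (!addr)
		return NULL;

	devres = devres_alloc(riscv_iommu_devres_pages_release,
			      sizeof(*devres), GFP_KERNEL);
	if (!devres) {
		free_pages((unsigned long)addr, order);
		return NULL;
	}

	devres->addr = addr;
	devres->order = order;
	/* From here on, the driver core frees the pages on unbind/error. */
	devres_add(iommu->dev, devres);
	return addr;
}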
On 5/1/24 4:01 AM, Tomasz Jeznach wrote:
> Introduce device context allocation and device directory tree
> management including capabilities discovery sequence, as described
> in Chapter 2.1 of the RISC-V IOMMU Architecture Specification.
>
> Device directory mode will be auto detected using DDTP WARL property,
> using highest mode supported by the driver and hardware. If none
> supported can be configured, driver will fall back to global pass-through.
>
> First level DDTP page can be located in I/O (detected using DDTP WARL)
> and system memory.
>
> Only simple identity and release (blocking) protection domains are
> supported by this implementation.
>
> Co-developed-by: Nick Kossifidis <mick@ics.forth.gr>
> Signed-off-by: Nick Kossifidis <mick@ics.forth.gr>
> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>

Best regards,
baolu
On 5/1/24 10:20 PM, Jason Gunthorpe wrote:
> On Wed, May 01, 2024 at 06:26:20PM +0800, Baolu Lu wrote:
>> On 2024/5/1 4:01, Tomasz Jeznach wrote:
>>> [..]
>>> +	if (FIELD_GET(RISCV_IOMMU_DDTP_MODE, ddtp) > RISCV_IOMMU_DDTP_MODE_BARE) {
>>> +		if (!is_kdump_kernel())
>> [..]
>> Will you consider this for RISC-V architecture as well?
> It seems we decided not to do that mess in ARM..
>
> New architectures doing kdump should put the iommu in a full blocking
> state before handing over the next kernel, and this implies that
> device drivers need to cleanly suspend their DMAs before going into
> the next kernel.

Glad to hear that. :-)

With the above consideration, the driver should treat it as an error
case when the iommu is not in the blocking state while running in the
kdump kernel, right?

If so, probably the iommu driver should always return failure when the
iommu is not in the blocking state. However, the RISC-V logic is:

- if this is a kdump kernel, just disable the iommu;
- otherwise, treat it as a failure case.

This logic seems problematic.

Best regards,
baolu
On Wed, May 1, 2024 at 7:25 PM Baolu Lu <baolu.lu@linux.intel.com> wrote:
>
> On 5/1/24 10:20 PM, Jason Gunthorpe wrote:
> > [..]
> > New architectures doing kdump should put the iommu in a full blocking
> > state before handing over the next kernel, and this implies that
> > device drivers need to cleanly suspend their DMAs before going into
> > the next kernel.
>
> Glad to hear that. :-)
>
> With the above consideration, the driver should treat it as an error
> case when the iommu is not in the blocking state while running in the
> kdump kernel, right?
>
> If so, probably the iommu driver should always return failure when the
> iommu is not in the blocking state. However, the RISC-V logic is:
>
> - if this is a kdump kernel, just disable the iommu;
> - otherwise, treat it as a failure case.
>
> This logic seems problematic.
>

The initial implementation recognized this as an error and failed to
initialize when the IOMMU was in a non-default state (disabled or
pass-through). Ideally we should have a proper shutdown sequence for
the kdump case, and quiesce the IOMMU before the kdump kernel runs.
However, this path is problematic enough that I'd prefer not to add
any other complications to this patch set. Definitely, the IOMMU
should be put into a full blocking state before handing over to the
kdump kernel. Will get back to that at some point.

Dropping all IOMMU config in the kdump path is IMHO the compromise we
can accept, leaving the kdump kernel an option to reconfigure the
IOMMU in later stages of the IOMMU initialization.

> Best regards,
> baolu

Best,
- Tomasz
On 5/1/24 4:01 AM, Tomasz Jeznach wrote:
> +/*
> + * Send IOTLB.INVAL for whole address space for ranges larger than 2MB.
> + * This limit will be replaced with range invalidations, if supported by
> + * the hardware, when RISC-V IOMMU architecture specification update for
> + * range invalidations update will be available.
> + */
> +#define RISCV_IOMMU_IOTLB_INVAL_LIMIT	(2 << 20)
> +
> +static void riscv_iommu_iotlb_inval(struct riscv_iommu_domain *domain,
> +				    unsigned long start, unsigned long end)
> +{
> +	struct riscv_iommu_bond *bond;
> +	struct riscv_iommu_device *iommu, *prev;
> +	struct riscv_iommu_command cmd;
> +	unsigned long len = end - start + 1;
> +	unsigned long iova;
> +
> +	rcu_read_lock();
> +
> +	prev = NULL;
> +	list_for_each_entry_rcu(bond, &domain->bonds, list) {
> +		iommu = dev_to_iommu(bond->dev);
> +
> +		riscv_iommu_cmd_inval_vma(&cmd);
> +		riscv_iommu_cmd_inval_set_pscid(&cmd, domain->pscid);
> +		if (len && len >= RISCV_IOMMU_IOTLB_INVAL_LIMIT) {
> +			for (iova = start; iova < end; iova += PAGE_SIZE) {
> +				riscv_iommu_cmd_inval_set_addr(&cmd, iova);
> +				riscv_iommu_cmd_send(iommu, &cmd, 0);
> +			}
> +		} else {
> +			riscv_iommu_cmd_send(iommu, &cmd, 0);
> +		}
> +
> +		/*
> +		 * IOTLB invalidation request can be safely omitted if already sent
> +		 * to the IOMMU for the same PSCID, and with domain->bonds list
> +		 * arranged based on the device's IOMMU, it's sufficient to check
> +		 * last device the invalidation was sent to.
> +		 */
> +		if (iommu == prev)
> +			continue;
> +
> +		prev = iommu;
> +		riscv_iommu_cmd_send(iommu, &cmd, 0);
> +	}

I don't quite follow why we don't move the "if (iommu == prev)" check
to the top and remove the last riscv_iommu_cmd_send(). My understanding
is that we could make it simple, like below:

	prev = NULL;
	list_for_each_entry_rcu(bond, &domain->bonds, list) {
		iommu = dev_to_iommu(bond->dev);
		if (iommu == prev)
			continue;

		/*
		 * Send an invalidation request to the request queue
		 * without wait.
		 */
		... ...

		prev = iommu;
	}

> +
> +	prev = NULL;
> +	list_for_each_entry_rcu(bond, &domain->bonds, list) {
> +		iommu = dev_to_iommu(bond->dev);
> +		if (iommu == prev)
> +			continue;
> +
> +		prev = iommu;
> +		riscv_iommu_cmd_iofence(&cmd);
> +		riscv_iommu_cmd_send(iommu, &cmd, RISCV_IOMMU_QUEUE_TIMEOUT);
> +	}
> +	rcu_read_unlock();
> +}

Best regards,
baolu
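Folded together, the helper becomes two symmetric per-IOMMU passes: one
that queues the invalidation, one that queues an IOFENCE and waits. A
sketch assuming the helpers quoted above; note the range test here is
written to follow the 2MB policy stated in the patch comment
(page-by-page invalidation only below the limit):

static void riscv_iommu_iotlb_inval(struct riscv_iommu_domain *domain,
				    unsigned long start, unsigned long end)
{
	struct riscv_iommu_device *iommu, *prev;
	struct riscv_iommu_command cmd;
	struct riscv_iommu_bond *bond;
	unsigned long len = end - start + 1;
	unsigned long iova;

	rcu_read_lock();

	prev = NULL;
	list_for_each_entry_rcu(bond, &domain->bonds, list) {
		iommu = dev_to_iommu(bond->dev);

		/* The bond list is ordered by IOMMU: send once per IOMMU. */
		if (iommu == prev)
			continue;

		riscv_iommu_cmd_inval_vma(&cmd);
		riscv_iommu_cmd_inval_set_pscid(&cmd, domain->pscid);
		if (len && len < RISCV_IOMMU_IOTLB_INVAL_LIMIT) {
			/* Small range: invalidate page by page. */
			for (iova = start; iova < end; iova += PAGE_SIZE) {
				riscv_iommu_cmd_inval_set_addr(&cmd, iova);
				riscv_iommu_cmd_send(iommu, &cmd, 0);
			}
		} else {
			/* Large range: flush the whole PSCID. */
			riscv_iommu_cmd_send(iommu, &cmd, 0);
		}
		prev = iommu;
	}

	prev = NULL;
	list_for_each_entry_rcu(bond, &domain->bonds, list) {
		iommu = dev_to_iommu(bond->dev);
		if (iommu == prev)
			continue;

		/* Fence and wait for the queued invalidations to complete. */
		riscv_iommu_cmd_iofence(&cmd);
		riscv_iommu_cmd_send(iommu, &cmd, RISCV_IOMMU_QUEUE_TIMEOUT);
		prev = iommu;
	}
	rcu_read_unlock();
}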
On 5/1/24 4:01 AM, Tomasz Jeznach wrote:
> Introduce device command submission and fault reporting queues,
> as described in Chapter 3.1 and 3.2 of the RISC-V IOMMU Architecture
> Specification.
>
> Command and fault queues are instantiated in contiguous system memory
> local to IOMMU device domain, or mapped from fixed I/O space provided
> by the hardware implementation. Detection of the location and maximum
> allowed size of the queue utilize WARL properties of queue base control
> register. Driver implementation will try to allocate up to 128KB of
> system memory, while respecting hardware supported maximum queue size.
>
> Interrupts allocation is based on interrupt vectors availability and
> distributed to all queues in simple round-robin fashion. For hardware
> Implementation with fixed event type to interrupt vector assignment
> IVEC WARL property is used to discover such mappings.
>
> Address translation, command and queue fault handling in this change
> is limited to simple fault reporting without taking any action.
>
> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>

Best regards,
baolu
On Wed, May 1, 2024 at 8:52 PM Baolu Lu <baolu.lu@linux.intel.com> wrote:
>
> On 5/1/24 4:01 AM, Tomasz Jeznach wrote:
> > +static void riscv_iommu_iotlb_inval(struct riscv_iommu_domain *domain,
> > +				    unsigned long start, unsigned long end)
> > +{
> > [..]
> > +		/*
> > +		 * IOTLB invalidation request can be safely omitted if already sent
> > +		 * to the IOMMU for the same PSCID, and with domain->bonds list
> > +		 * arranged based on the device's IOMMU, it's sufficient to check
> > +		 * last device the invalidation was sent to.
> > +		 */
> > +		if (iommu == prev)
> > +			continue;
> > +
> > +		prev = iommu;
> > +		riscv_iommu_cmd_send(iommu, &cmd, 0);
> > +	}
>
> I don't quite follow why we don't move the "if (iommu == prev)" check
> to the top and remove the last riscv_iommu_cmd_send(). My understanding
> is that we could make it simple, like below:
>
>	prev = NULL;
>	list_for_each_entry_rcu(bond, &domain->bonds, list) {
>		iommu = dev_to_iommu(bond->dev);
>		if (iommu == prev)
>			continue;
>
>		/*
>		 * Send an invalidation request to the request queue
>		 * without wait.
>		 */
>		... ...
>
>		prev = iommu;
>	}

Oh, thanks for spotting that. The code sections were very likely
reordered while rebasing the patches...

> [..]
>
> Best regards,
> baolu

Best regards,
- Tomasz
On Wed, May 1, 2024 at 7:56 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Apr 30, 2024 at 01:01:57PM -0700, Tomasz Jeznach wrote:
>
> > +#define iommu_domain_to_riscv(iommu_domain) \
> > +	container_of(iommu_domain, struct riscv_iommu_domain, domain)
> > +
> > +#define dev_to_domain(dev) \
> > +	iommu_domain_to_riscv(dev_iommu_priv_get(dev))
>
> Please use the priv properly and put a struct around it, you'll
> certainly need this eventually to do the rest of the advanced
> features.

Done. Yes, indeed, I was going to introduce a proper struct in
follow-up patches anyway. Pulled this change sooner.

> > +static void riscv_iommu_bond_unlink(struct riscv_iommu_domain *domain, struct device *dev)
> > +{
> > [..]
> > +	spin_lock_irqsave(&domain->lock, flags);
>
> This is never locked from an irq, you don't need to use the irqsave
> variations.

Good point, done in v4.

> > +	list_for_each_entry_rcu(bond, &domain->bonds, list) {
> > [..]
> > +	/* Release and wait for all read-rcu critical sections have completed. */
> > +	kfree_rcu(found, rcu);
> > +	synchronize_rcu();
>
> Please no, synchronize_rcu() on a path like this is not so
> reasonable.. Also you don't need kfree_rcu() if you write it like this.
>
> This still looks better to do what I said before, put the iommu not
> the dev in the bond struct.

I was trying not to duplicate data in the bond struct and use whatever
is available to be referenced from the dev pointer (e.g. iommu / ids /
private iommu dev data).

If I'm reading core iommu code correctly, device pointer and iommu
pointers should be valid between _probe_device and _release_device
calls. I've moved synchronize_rcu out of the domain attach path to
_release_device, LMK if that would be ok for now. I'll take a second
look at reworking the other patches to avoid storing dev pointers at
all.

> > +static struct iommu_domain *riscv_iommu_alloc_paging_domain(struct device *dev)
> > +{
> > [..]
> > +	/*
> > +	 * Follow system address translation mode.
> > +	 * RISC-V IOMMU ATP mode values match RISC-V CPU SATP mode values.
> > +	 */
> > +	domain->pgd_mode = satp_mode >> SATP_MODE_SHIFT;
>
> This seems really strange, the iommu paging domains should be
> unrelated to what the CPU is doing. There is no connection between
> these two concepts.
>
> Just pick a size that the iommu supports.
>
> The number of radix levels is a tunable a lot of iommus have that we
> haven't really exposed to anything else yet.

Makes sense. I've left an option to pick the mode from the MMU for
cases where the dev/iommu is not known at allocation time (with
iommu_domain_alloc()). I'd guess it's reasonable to assume IOMMU
supported page modes will match the MMU.

> > +	/*
> > +	 * Note: RISC-V Privilege spec mandates that virtual addresses
> > +	 * need to be sign-extended, so if (VA_BITS - 1) is set, all
> > +	 * bits >= VA_BITS need to also be set or else we'll get a
> > +	 * page fault. However the code that creates the mappings
> > +	 * above us (e.g. iommu_dma_alloc_iova()) won't do that for us
> > +	 * for now, so we'll end up with invalid virtual addresses
> > +	 * to map. As a workaround until we get this sorted out
> > +	 * limit the available virtual addresses to VA_BITS - 1.
> > +	 */
> > +	domain->domain.geometry.aperture_start = 0;
> > +	domain->domain.geometry.aperture_end = DMA_BIT_MASK(VA_BITS - 1);
> > +	domain->domain.geometry.force_aperture = true;
>
> Yikes.. This is probably the best solution long term anyhow, unless
> you need to use the last page in VFIO for some reason.
>
> >  static int riscv_iommu_device_domain_type(struct device *dev)
> >  {
> > -	return IOMMU_DOMAIN_IDENTITY;
> > +	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> > +
> > +	if (iommu->ddt_mode == RISCV_IOMMU_DDTP_MODE_BARE)
> > +		return IOMMU_DOMAIN_IDENTITY;
> > +
>
> Is there even a point of binding an iommu driver if the HW can't
> support a DDT table? Just return -ENODEV from probe_device?
>
> Logically an iommu block that can't decode the RID has no association
> at all with a Linux struct device :)

Done. Good point ;)

Thanks for review,
- Tomasz

> Jason
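The agreed direction — refusing to bind when the hardware cannot decode
requester IDs — would surface in the probe path roughly like this. A
hedged sketch, since the v4 shape isn't quoted here; the real
probe_device would also resolve the IOMMU from the fwspec and set up
per-device data:

static struct iommu_device *riscv_iommu_probe_device(struct device *dev)
{
	struct riscv_iommu_device *iommu = dev_to_iommu(dev);

	/*
	 * Without a device directory the IOMMU cannot associate
	 * transactions with this device at all, so do not bind.
	 */
	if (!iommu || iommu->ddt_mode <= RISCV_IOMMU_DDTP_MODE_BARE)
		return ERR_PTR(-ENODEV);

	return &iommu->iommu;
}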
On Fri, May 03, 2024 at 10:44:14AM -0700, Tomasz Jeznach wrote:
> > [..]
> > This still looks better to do what I said before, put the iommu not
> > the dev in the bond struct.
>
> I was trying not to duplicate data in the bond struct and use whatever
> is available to be referenced from the dev pointer (e.g. iommu / ids /
> private iommu dev data).

I'm not sure that is a valuable goal considering the RCU
complexity.. But I suppose it would be a bit of a hassle to replicate
the ids list into bond structs. Maybe something to do when you get
to ATS since you'll probably want to replicate the ATS RIDs. (see what
Intel did, which I think is pretty good)

> If I'm reading core iommu code correctly, device pointer and iommu
> pointers should be valid between _probe_device and _release_device
> calls. I've moved synchronize_rcu out of the domain attach path to
> _release_device, LMK if that would be ok for now. I'll take a second
> look at reworking the other patches to avoid storing dev pointers at
> all.

Yes, that seems better.. I'm happier to see device hot-unplug be slow
than detach.

There is another issue with the RCU that I haven't wrapped my head
around..

Technically we can have concurrent map/unmap/invalidation alongside
device attach/detach. Does the invalidation under RCU work correctly?

For detach I think yes:

 Inv CPU                          Detach CPU

 write io_pte                     Update device descriptor
 rcu_read_lock
 list_for_each
   <make invalidation command>    <make descriptor inv cmd>
   dma_wmb()                      dma_wmb()
   <doorbell>                     <cmd doorbell>
 rcu_read_unlock
                                  list_del_rcu()
                                  <wipe ASID>

In this case I think we never miss an invalidation, the list_del is
always after the HW has been fully fenced, so I don't think we can
have any issue. Maybe a spurious invalidation if the ASID gets
re-used, but who cares.

Attach is different..

 Inv CPU                          Attach CPU

 write io_pte
 rcu_read_lock
 list_for_each // empty
                                  list_add_rcu()
                                  Update device descriptor
                                  <make descriptor inv cmd>
                                  dma_wmb()
                                  <cmd doorbell>
 rcu_read_unlock

As above shows we can "miss" an invalidation. The issue is narrow, the
io_pte could still be sitting in write buffers in "Inv CPU" and not
yet globally visible. "Attach CPU" could get the device descriptor
installed in the IOMMU and the IOMMU could walk an io_pte that is in
the old state. Effectively this is because there is no release/acquire
barrier passing the io_pte store from the Inv CPU to the Attach CPU to
the IOMMU.

It seems like it should be solvable somehow:
 1) Inv CPU releases all the io ptes
 2) Attach CPU acquires the io ptes before updating the DDT
 3) Inv CPU acquires the RCU list in such a way that either attach
    CPU will acquire the io_pte or inv CPU will acquire the RCU list.
 4) Either invalidation works or we release the new iopte to the SMMU
    and don't need it.

But #3 is a really weird statement. smp_mb() on both sides may do the
job??

> > The number of radix levels is a tunable a lot of iommus have that we
> > haven't really exposed to anything else yet.
>
> Makes sense. I've left an option to pick the mode from the MMU for
> cases where the dev/iommu is not known at allocation time (with
> iommu_domain_alloc()). I'd guess it's reasonable to assume IOMMU
> supported page modes will match the MMU.

Reasonable, but for this case you'd be best to have a global static
that unifies the capability of all the iommu instances. Then you can
pick the right thing from the installed iommus, and arguably, that is
the right thing to do in all cases as we'd slightly prefer domains
that work everywhere in that edge case.

Jason
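A sketch of that global-static approach: each probed instance narrows a
system-wide limit, and a domain allocated without a target device falls
back to it. Both the variable and the per-instance helper here are
hypothetical, not from the posted series:

/* Hypothetical system-wide limit: page-table levels all IOMMUs support. */
static int riscv_iommu_pgd_levels = 5;	/* start optimistic, at Sv57 */
static DEFINE_MUTEX(riscv_iommu_pgd_levels_lock);

static void riscv_iommu_narrow_pgd_levels(struct riscv_iommu_device *iommu)
{
	mutex_lock(&riscv_iommu_pgd_levels_lock);
	/* Keep only what this newly probed instance can also walk. */
	riscv_iommu_pgd_levels = min(riscv_iommu_pgd_levels,
				     riscv_iommu_instance_pgd_levels(iommu));
	mutex_unlock(&riscv_iommu_pgd_levels_lock);
}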
On Fri, May 3, 2024 at 11:11 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, May 03, 2024 at 10:44:14AM -0700, Tomasz Jeznach wrote:
> > [..]
> > If I'm reading core iommu code correctly, device pointer and iommu
> > pointers should be valid between _probe_device and _release_device
> > calls. [..]
>
> Yes, that seems better.. I'm happier to see device hot-unplug be slow
> than detach.
>
> There is another issue with the RCU that I haven't wrapped my head
> around..
>
> Technically we can have concurrent map/unmap/invalidation alongside
> device attach/detach. Does the invalidation under RCU work correctly?
>
> [..]
>
> Attach is different..
>
>  Inv CPU                          Attach CPU
>
>  write io_pte
>  rcu_read_lock
>  list_for_each // empty
>                                   list_add_rcu()
>                                   Update device descriptor
>                                   <make descriptor inv cmd>
>                                   dma_wmb()
>                                   <cmd doorbell>
>  rcu_read_unlock
>
> As above shows we can "miss" an invalidation. [..]
>
> It seems like it should be solvable somehow:
>  1) Inv CPU releases all the io ptes
>  2) Attach CPU acquires the io ptes before updating the DDT
>  3) Inv CPU acquires the RCU list in such a way that either attach
>     CPU will acquire the io_pte or inv CPU will acquire the RCU list.
>  4) Either invalidation works or we release the new iopte to the SMMU
>     and don't need it.
>
> But #3 is a really weird statement. smp_mb() on both sides may do the
> job??

Actual attach sequence is slightly different.

 Inv CPU                          Attach CPU

 write io_pte
 rcu_read_lock
 list_for_each // empty
                                  list_add_rcu()
                                  IOTLB.INVAL(PSCID)
                                  <make descriptor inv cmd>
                                  dma_wmb()
                                  <cmd doorbell>
 rcu_read_unlock

I've tried to cover this case with riscv_iommu_iotlb_inval() called
before the attached domain is visible to the device. This is something
to optimize to avoid invalidating the whole PSCID if the domain was
already attached to the same IOMMU, but I'd leave it for a separate
patch series.

> > > The number of radix levels is a tunable a lot of iommus have that we
> > > haven't really exposed to anything else yet.
> >
> > Makes sense. I've left an option to pick the mode from the MMU for
> > cases where the dev/iommu is not known at allocation time (with
> > iommu_domain_alloc()). I'd guess it's reasonable to assume IOMMU
> > supported page modes will match the MMU.
>
> Reasonable, but for this case you'd be best to have a global static
> that unifies the capability of all the iommu instances. Then you can
> pick the right thing from the installed iommus, and arguably, that is
> the right thing to do in all cases as we'd slightly prefer domains
> that work everywhere in that edge case.

That is a viable option. If you're ok with it, I'll address this as a
separate patch. I've been trying to avoid adding more global variables,
but it might not be avoidable.

> Jason

Best,
- Tomasz
On Fri, May 03, 2024 at 12:44:09PM -0700, Tomasz Jeznach wrote:
> > [..]
> > But #3 is a really weird statement. smp_mb() on both sides may do the
> > job??
>
> Actual attach sequence is slightly different.
>
>  Inv CPU                          Attach CPU
>
>  write io_pte
>  rcu_read_lock
>  list_for_each // empty
>                                   list_add_rcu()
>                                   IOTLB.INVAL(PSCID)
>                                   <make descriptor inv cmd>
>                                   dma_wmb()
>                                   <cmd doorbell>
>  rcu_read_unlock
>
> I've tried to cover this case with riscv_iommu_iotlb_inval() called
> before the attached domain is visible to the device.

That invalidation shouldn't do anything. If this is the first attach
of a PSCID then the PSCID had better already be empty, it won't become
non-empty until the DDT entry is installed.

And if it is the second attach then the Inv CPU is already taking care
of things, no need to invalidate at all.

Regardless, there is still a theoretical race that the IOPTEs haven't
been made visible yet because there is still no synchronization with
the CPU writing them.

So, I don't think this solves any problem. I believe you need the
appropriate kind of CPU barrier here instead of an invalidation.

Jason
On Sun, May 5, 2024 at 8:46 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, May 03, 2024 at 12:44:09PM -0700, Tomasz Jeznach wrote:
> > [..]
> > Actual attach sequence is slightly different.
> >
> >  Inv CPU                          Attach CPU
> >
> >  write io_pte
> >  rcu_read_lock
> >  list_for_each // empty
> >                                   list_add_rcu()
> >                                   IOTLB.INVAL(PSCID)
> >                                   <make descriptor inv cmd>
> >                                   dma_wmb()
> >                                   <cmd doorbell>
> >  rcu_read_unlock
> >
> > I've tried to cover this case with riscv_iommu_iotlb_inval() called
> > before the attached domain is visible to the device.
>
> That invalidation shouldn't do anything. If this is the first attach
> of a PSCID then the PSCID had better already be empty, it won't become
> non-empty until the DDT entry is installed.
>
> And if it is the second attach then the Inv CPU is already taking care
> of things, no need to invalidate at all.
>
> Regardless, there is still a theoretical race that the IOPTEs haven't
> been made visible yet because there is still no synchronization with
> the CPU writing them.
>
> So, I don't think this solves any problem. I believe you need the
> appropriate kind of CPU barrier here instead of an invalidation.
>

Yes. There was a small, but still plausible, race w/ IOPTEs visibility
to the IOMMU. For v5 I'm adding two barriers to the inval/detach flow,
which I believe should cover it.

1) In riscv_iommu_iotlb_inval(), an unconditional dma_wmb() to make any
pending writes to PTEs visible to the IOMMU device. This should cover
the case when the list_add_rcu() update is not yet visible in the
_iotlb_inval() sequence, for the first time the domain is attached to
the IOMMU.

 Inv CPU                          Attach CPU

 write io_pte
 dma_wmb (1)
 rcu_read_lock
 list_for_each // empty
                                  list_add_rcu()
                                  smp_wmb (2)
                                  Update device descriptor
                                  <make descriptor inv cmd>
                                  // PTEs are visible to the HW (*1)
                                  dma_wmb()
                                  <cmd doorbell>
 rcu_read_unlock

2) In riscv_iommu_bond_link(), a write memory barrier to ensure the
list update is visible before the IOMMU descriptor update. If stale
data has been fetched by the HW, the inval CPU will run the
iotlb-invalidation sequence. There is a possibility that the IOMMU will
fetch correct PTEs and receive an unnecessary IOTLB inval, but I don't
think anyone would care.

 Inv CPU                          Attach CPU

 write io_pte                     list_add_rcu()
                                  smp_wmb (2)
                                  Update device descriptor
                                  <make descriptor inv cmd>
                                  // HW might fetch stale PTEs
                                  dma_wmb()
                                  <cmd doorbell>
 dma_wmb (1)
 rcu_read_lock
 list_for_each // non-empty (*2)
 <make iotlb inval cmd>
 dma_wmb()
 <cmd doorbell>
 rcu_read_unlock

3) I've also updated riscv_iommu_bond_unlink() to wipe the PSCID cache
on the last domain unlink from the IOMMU.

Thank you for pointing this out. Let me know if that makes sense.

Best,
- Tomasz

> Jason
On Mon, May 06, 2024 at 07:22:07PM -0700, Tomasz Jeznach wrote:
> On Sun, May 5, 2024 at 8:46 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > [..]
> > So, I don't think this solves any problem. I believe you need the
> > appropriate kind of CPU barrier here instead of an invalidation.
>
> Yes. There was a small, but still plausible, race w/ IOPTEs visibility
> to the IOMMU. For v5 I'm adding two barriers to the inval/detach flow,
> which I believe should cover it.
>
> 1) In riscv_iommu_iotlb_inval(), an unconditional dma_wmb() to make any
> pending writes to PTEs visible to the IOMMU device. This should cover
> the case when the list_add_rcu() update is not yet visible in the
> _iotlb_inval() sequence, for the first time the domain is attached to
> the IOMMU.
>
> [..]
>
> 2) In riscv_iommu_bond_link(), a write memory barrier to ensure the
> list update is visible before the IOMMU descriptor update. If stale
> data has been fetched by the HW, the inval CPU will run the
> iotlb-invalidation sequence. There is a possibility that the IOMMU will
> fetch correct PTEs and receive an unnecessary IOTLB inval, but I don't
> think anyone would care.
>
> [..]
>
> 3) I've also updated riscv_iommu_bond_unlink() to wipe the PSCID cache
> on the last domain unlink from the IOMMU.
>
> Thank you for pointing this out. Let me know if that makes sense.

I'm not an expert in barriers, but I think you need the more expensive
"mb" in both cases.

The inv side is both releasing the write and acquiring the list
read. IIRC READ_ONCE is not a full acquire?

The Attach side is both releasing the list_add_rcu() and acquiring the
iopte.

rcu is still a benefit, there is no cache line sharing and there is
only one full barrier, not two, like a spinlock.

And a big fat comment in both sides explaining this :)

Jason
On Tue, May 7, 2024 at 9:51 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Mon, May 06, 2024 at 07:22:07PM -0700, Tomasz Jeznach wrote:
> > [..]
> > 3) I've also updated riscv_iommu_bond_unlink() to wipe the PSCID cache
> > on the last domain unlink from the IOMMU.
> >
> > Thank you for pointing this out. Let me know if that makes sense.
>
> I'm not an expert in barriers, but I think you need the more expensive
> "mb" in both cases.
>
> The inv side is both releasing the write and acquiring the list
> read. IIRC READ_ONCE is not a full acquire?
>
> The Attach side is both releasing the list_add_rcu() and acquiring the
> iopte.
>
> rcu is still a benefit, there is no cache line sharing and there is
> only one full barrier, not two, like a spinlock.
>
> And a big fat comment in both sides explaining this :)
>

I'm not an expert in barriers either, but I've checked offline with
one ;) And the conclusion is that we need FENCE W,W (or stronger) on
the attach side, and FENCE RW,RW in the invalidation sequence.
Hardware access to the PTE/DC is sequential, with implied FENCE R,R
barriers.

As the 'attach' sequence is a rare event anyway, I'll go with "mb" on
both sides, as suggested. Unless there are other opinions on that,
I'll update the barriers to match this conclusion and try to capture
the bond / inval / attach synchronization assumptions in a long
comment.

Jason, thanks again for pointing this out.

> Jason

Best,
- Tomasz
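A sketch of where the agreed-upon full barriers would land, under the
bond/inval structure discussed in this thread (helper bodies are
abbreviated to the synchronization-relevant parts; this is the v5
direction, not the posted code):

static int riscv_iommu_bond_link(struct riscv_iommu_domain *domain,
				 struct device *dev)
{
	struct riscv_iommu_bond *bond;

	bond = kzalloc(sizeof(*bond), GFP_KERNEL);
	if (!bond)
		return -ENOMEM;
	bond->dev = dev;

	spin_lock(&domain->lock);
	list_add_rcu(&bond->list, &domain->bonds);
	spin_unlock(&domain->lock);

	/*
	 * Full barrier: publish the bond before the device descriptor
	 * update that follows. Pairs with the smp_mb() in
	 * riscv_iommu_iotlb_inval(), so a concurrent invalidation either
	 * walks this bond or the hardware cannot yet reach the stale
	 * PTEs through a not-yet-published descriptor.
	 */
	smp_mb();
	return 0;
}

static void riscv_iommu_iotlb_inval(struct riscv_iommu_domain *domain,
				    unsigned long start, unsigned long end)
{
	/*
	 * Full barrier: make the just-written PTEs globally visible
	 * before reading the bond list. Pairs with the smp_mb() in
	 * riscv_iommu_bond_link(): a racing attach either observes the
	 * new PTEs or this walk observes its freshly linked bond.
	 */
	smp_mb();

	rcu_read_lock();
	/* per-IOMMU IOTLB.INVAL + IOFENCE passes, as discussed earlier */
	rcu_read_unlock();
}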