| Message ID | 20260507180646.40356-1-gbatra@linux.ibm.com (mailing list archive) |
|---|---|
| State | Changes Requested |
| Series | [v2] powerpc/pseries/iommu: export DMA window data to user space |
| Context | Check | Description |
|---|---|---|
| snowpatch_ozlabs/github-powerpc_selftests | success | Successfully ran 10 jobs. |
| snowpatch_ozlabs/github-powerpc_ppctests | success | Successfully ran 10 jobs. |
| snowpatch_ozlabs/github-powerpc_sparse | success | Successfully ran 4 jobs. |
| snowpatch_ozlabs/github-powerpc_clang | success | Successfully ran 5 jobs. |
| snowpatch_ozlabs/github-powerpc_kernel_qemu | success | Successfully ran 22 jobs. |
Hi Gaurav,

On 07/05/26 11:36 pm, Gaurav Batra wrote:
> Export PowerPC DMA window information (both default 2GB and Dynamic
> larger window) to user space via sysfs. Each of these DMA windows has
> attributes like size of the window, page size backing the window, mode,
> etc. Each of these atributes is exported for user space consumption as a
> file.
>
> PowerPC Host Bridge (PHB) can have multiple devices/functions sharing
> the same DMA window. For each PHB, iommu registration creates an iommu
> device under "/sys/devices/virtual/iommu".
>
> These devices will have 2 groups created to export Default and DDW
> attributes.
>
> Reviewed-by: Brian King <brking@linux.ibm.com>
> Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> Reviewed-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>

I do not see R-b tags provided on the list after review comments. Not sure if I am missing the email or were these provided privately ? Sharing some review comments inline below ..

> Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
> ---
> V1 -> V2 change log:
>
> 1. Shiva: "weight" the it_map for the bitmap. This avoids using an extra
> counter in the table. Please look into how iommu_debugfs_weight_get()
> does this
>
> Response: Incorporated changes
>
> 2. Vaibhav: If the DMA window is not available, show function should just
> return ENOENT so that userspace know the error instantly instead of
> having to parse the sysfs contents.
>
> Response: Incorporated changes, returning ENODATA
>
> 3. Vaibhav: All the show functions have similar template. Please convert
> them to macros expansion to reduce code volume.
>
> Response: Incorporated changes
>
> 4. Vaibhav: These new attributes are PSeries specific but they are being
> setup in ppc generic iommu code at arch/powerpc/kernel/iommu.c. Can
> you move these attributes to arch/powerpc/platforms/pseries/iommu.c
>
> Response: I have split the attributes and moved them to pseries specific
> files.
The original group "spapr-tce-iommu", is moved to PowerNV code > base to retain the legacy functionality. > > I tested the changes both on Pseries and PowerNV. > > 5. Vaibhav: It would be better to use function iommu_table_inuse_tces() as > a callback in iommu_table_ops which can be implemented by pseries and > powernv code differently. > > Response: the function is no longer needed after changes in #1 > > 6. Vaibhav: Since sysfs is ABI can you propose appropriate entries under > Documentation/ABI/testing > > Response: Added documentation > > ...sfs-devices-virtual-iommu-dma_window_attrs | 21 ++ > .../arch/powerpc/dma_window_attributes.rst | 65 +++++ > arch/powerpc/include/asm/pci-bridge.h | 4 + > arch/powerpc/kernel/iommu.c | 16 +- > arch/powerpc/platforms/powernv/pci-ioda.c | 16 ++ > arch/powerpc/platforms/pseries/iommu.c | 261 ++++++++++++++++++ > arch/powerpc/platforms/pseries/pci_dlpar.c | 2 + > arch/powerpc/platforms/pseries/pseries.h | 1 + > arch/powerpc/platforms/pseries/setup.c | 2 + > 9 files changed, 373 insertions(+), 15 deletions(-) > create mode 100644 Documentation/ABI/testing/sysfs-devices-virtual-iommu-dma_window_attrs > create mode 100644 Documentation/arch/powerpc/dma_window_attributes.rst > > diff --git a/Documentation/ABI/testing/sysfs-devices-virtual-iommu-dma_window_attrs b/Documentation/ABI/testing/sysfs-devices-virtual-iommu-dma_window_attrs > new file mode 100644 > index 000000000000..18ba63874276 > --- /dev/null > +++ b/Documentation/ABI/testing/sysfs-devices-virtual-iommu-dma_window_attrs > @@ -0,0 +1,21 @@ > +What: /sys/devices/virtual/iommu/<iommu-isolation>/spapr-tce-ddw/* > +Date: Oct 2025 > +Contact: linuxppc-dev@lists.ozlabs.org > +Description: read only > + For each IOMMU isolation unit spapr-tce-ddw sub-directory provides > + attributes to query information related to the bigger Dynamic DMA > + window (DDW) in the PowerPC virtualized platforms. 
> + > + See Documentation/arch/powerpc/dma_window_attributes.rst for more > + information. > + > +What: /sys/devices/virtual/iommu/<iommu-isolation>/spapr-tce-dma/* > +Date: Oct 2025 > +Contact: linuxppc-dev@lists.ozlabs.org > +Description: read only > + For each IOMMU isolation unit spapr-tce-dma sub-directory provides > + attributes to query information related to the default 2GB DMA > + window in the PowerPC virtualized platforms. > + > + See Documentation/arch/powerpc/dma_window_attributes.rst for more > + information. > diff --git a/Documentation/arch/powerpc/dma_window_attributes.rst b/Documentation/arch/powerpc/dma_window_attributes.rst > new file mode 100644 > index 000000000000..8bd9aec8539d > --- /dev/null > +++ b/Documentation/arch/powerpc/dma_window_attributes.rst > @@ -0,0 +1,65 @@ > +.. SPDX-License-Identifier: GPL-2.0 > + > +===================== > +DMA Window Attributes > +===================== > + > +In PowerPC architecture there are 2 types of DMA windows - > + > +1. Default 2GB DMA window which is backed by 4K page size > +2. A bigger Dynamic DMA Window (DDW) which is backed by larger page size > + (64K or 2MB) > + > +A dedicated device will have both the DMA windows instantiated but an SR-IOV > +device will only have the bigger Dynamic DMA Window. > + > +The attributes of these 2 DMA windows are exported to user space via sysfs. > +Each IOMMU isolation unit will have its directory created under > +/sys/devices/virtual/iommu. > + > +As an exapmple, iommu-phb0001 s/exapmple/example ? > + > +Under each IOMMU isolation unit, there will be a group of attributes for > +"Default 2GB DMA Window" and "Dynamic DMA Window" - spapr-tce-dma and > +spapr-tce-ddw respectively. 
> + > +Attributes under each group > + > +spapr-tce-ddw: > +direct_address dynamic_address dynamic_size window_type > +direct_size dynamic_pages_mapped page_size > + > +spapr-tce-dma: > +dynamic_address dynamic_pages_mapped dynamic_size page_size > + > + > +The bigger Dynamic DMA Window is configured into pre-mapped and/or dynamically > +allocated TCEs. If the DDW is in "Hybrid" mode, then both the Direct > +(pre-mapped) and Dynamic part of the DMA window will have valid values. Hybrid > +mode is valid only for SR-IOV devices. > + > +DMA Window properties: > + > +direct_address Starting address of the pre-mapped DMA window > +direct_size Size of the pre-mapped DMA Window > +dynamic_address Starting address of the dynamic allocations > +dynamic_size Size of the dynamic allocation window > +dynamic_pages_mapped Pages mapped for DMA by dynamic allocations > +page_size Page size backing the DMA window > +window_type Type of the DMA Window (Direct/Dynamic/Hybrid) > + > + > +An example of DDW attributes for an SR-IOV device:: > + > + $ cd /sys/devices/virtual/iommu/iommu-phb0001/spapr-tce-ddw > + > + $ grep . 
* > + > + direct_address:0x800000000000000 <-- Starting addr of pre-mapped Window > + direct_size:137438953472 <-- Size of pre-mapped Window (128GB) > + dynamic_address:0x800002000000000 <-- Starting addr of Dynamic allocations > + dynamic_size:412316860416 <-- Size of dynamic allocation window (384GB) > + dynamic_pages_mapped:270 <-- Pages mapped by dynamic allocations > + page_size:2097152 <-- DMA window page size (2MB) > + window_type:Hybrid <-- window has both pre-mapped and > + dynamic sections > diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h > index 1dae53130782..9b09178aca5e 100644 > --- a/arch/powerpc/include/asm/pci-bridge.h > +++ b/arch/powerpc/include/asm/pci-bridge.h > @@ -124,6 +124,10 @@ struct pci_controller { > resource_size_t dma_window_base_cur; > resource_size_t dma_window_size; > > +#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV) > + const struct attribute_group **iommu_groups; > +#endif > + > #ifdef CONFIG_PPC64 > unsigned long buid; > struct pci_dn *pci_data; > diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c > index 0ce71310b7d9..d6242e3f77da 100644 > --- a/arch/powerpc/kernel/iommu.c > +++ b/arch/powerpc/kernel/iommu.c > @@ -1269,24 +1269,10 @@ static const struct iommu_ops spapr_tce_iommu_ops = { > .device_group = spapr_tce_iommu_device_group, > }; > > -static struct attribute *spapr_tce_iommu_attrs[] = { > - NULL, > -}; > - > -static struct attribute_group spapr_tce_iommu_group = { > - .name = "spapr-tce-iommu", > - .attrs = spapr_tce_iommu_attrs, > -}; > - > -static const struct attribute_group *spapr_tce_iommu_groups[] = { > - &spapr_tce_iommu_group, > - NULL, > -}; > - > void ppc_iommu_register_device(struct pci_controller *phb) > { > iommu_device_sysfs_add(&phb->iommu, phb->parent, > - spapr_tce_iommu_groups, "iommu-phb%04x", > + phb->iommu_groups, "iommu-phb%04x", > phb->global_number); > iommu_device_register(&phb->iommu, &spapr_tce_iommu_ops, > 
phb->parent); > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c > index 1c78fdfb7b03..0887f154955e 100644 > --- a/arch/powerpc/platforms/powernv/pci-ioda.c > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c > @@ -2493,6 +2493,20 @@ static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = { > .shutdown = pnv_pci_ioda_shutdown, > }; > > +static struct attribute *pnv_tce_iommu_attrs[] = { > + NULL, > +}; > + > +static struct attribute_group pnv_tce_iommu_group = { > + .name = "spapr-tce-iommu", > + .attrs = pnv_tce_iommu_attrs, > +}; > + > +static const struct attribute_group *pnv_tce_iommu_groups[] = { > + &pnv_tce_iommu_group, > + NULL, > +}; > + > static void __init pnv_pci_init_ioda_phb(struct device_node *np, > u64 hub_id, int ioda_type) > { > @@ -2697,6 +2711,8 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np, > hose->controller_ops = pnv_pci_ioda_controller_ops; > } > > + hose->iommu_groups = pnv_tce_iommu_groups; > + > ppc_md.pcibios_default_alignment = pnv_pci_default_alignment; > > #ifdef CONFIG_PCI_IOV > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c > index 5497b130e026..28be7a45761d 100644 > --- a/arch/powerpc/platforms/pseries/iommu.c > +++ b/arch/powerpc/platforms/pseries/iommu.c > @@ -56,6 +56,20 @@ enum { > DDW_EXT_LIMITED_ADDR_MODE = 3 > }; > > +/* used by sysfs when querying Dynamic/Default DMA Window data */ > +struct dma_win_data { > + u32 page_size; > + u64 direct_address; > + u64 direct_size; > + u64 dynamic_address; > + u64 dynamic_size; > + u32 dynamic_pages_mapped; > + char window_type[15]; > +}; > + > +#define SPAPR_SUCCESS 0 > +#define SPAPR_ERROR -1 > + > static struct iommu_table *iommu_pseries_alloc_table(int node) > { > struct iommu_table *tbl; > @@ -837,6 +851,253 @@ static struct device_node *pci_dma_find(struct device_node *dn, > return rdn; > } > > +/* Get DDW information for the device */ > +static int 
gather_ddw_info(struct device *dev, struct dma_win_data *data) > +{ > + struct iommu_device *iommu; > + struct pci_controller *phb; > + struct device_node *dn; > + struct pci_dn *pci; > + const __be32 *prop = NULL; > + bool ddw_direct = false; > + bool found = false; > + struct iommu_table *tbl; > + u32 pgshift; > + struct dynamic_dma_window_prop *p; > + > + memset(data, 0, sizeof(*data)); > + > + iommu = dev_get_drvdata(dev); > + phb = container_of(iommu, struct pci_controller, iommu); > + dn = phb->dn; > + > + if (!dn) > + return SPAPR_ERROR; > + > + pci = PCI_DN(dn); > + if (!pci || !pci->table_group) > + return SPAPR_ERROR; > + Should we also hold a dn ref with of_node_get(dn) before proceeding with of_get_property calls ? > + /* Find DDW */ > + prop = of_get_property(dn, DIRECT64_PROPNAME, NULL); > + if (prop) { > + ddw_direct = true; > + found = true; > + } else { > + prop = of_get_property(dn, DMA64_PROPNAME, NULL); > + if (prop) > + found = true; > + } > + > + /* NO DDW */ > + if (!found) .. then release dn ref here if not found .. > + return SPAPR_ERROR; > + > + p = (struct dynamic_dma_window_prop *)prop; > + > + pgshift = be32_to_cpu(p->tce_shift); > + if (pgshift != 0xc && pgshift != 0x10 && pgshift != 0x15) Can we have macros for 0xc, 0x10 and 0x15 respectively ? > + data->page_size = 0; > + else > + data->page_size = 1 << pgshift; > + > + /* Check if DDW has table associated with it. Having a table associated with > + * DDW is indicative that is has some dynamic TCE allocations. In this case the > + * DDW can be fully Dynamic or in Hybrid mode. For SR-IOV DDW is on index 0, > + * for dedicated adapter on index 1. > + */ > + found = false; > + for (int i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) { > + tbl = pci->table_group->tables[i]; Can another thread do a kfree(table_group) via iommu_pseries_free_group() during hotplug remove before we reach here? 
> + > + if (tbl && tbl->it_index == be32_to_cpu(p->liobn)) { > + found = true; > + break; > + } > + } Is it possible that another thread changes bitmap before we reach bitmap_weight below ? If table is found, we may want to safely access its bitamp (consider using tbl->largepool.lock?). > + > + /* set the parameters depnding on the DDW type */ s/depnding/depending ? > + if (ddw_direct && found) { /* Hybrid */ > + data->direct_address = be64_to_cpu(p->dma_base); > + data->dynamic_size = (u64)(tbl->it_size << tbl->it_page_shift); > + > + data->dynamic_address = data->direct_address > + + (u64)(1UL << be32_to_cpu(p->window_shift)) > + - data->dynamic_size; > + > + data->direct_size = data->dynamic_address - data->direct_address; > + data->dynamic_pages_mapped = bitmap_weight(tbl->it_map, tbl->it_size); > + > + sprintf(data->window_type, "%s", "Hybrid"); Preferably use snprintf for safety. I see two more instances below. > + } else if (ddw_direct && !found) { /* Direct */ > + data->direct_address = be64_to_cpu(p->dma_base); > + data->direct_size = (u64)(1UL << be32_to_cpu(p->window_shift)); > + > + sprintf(data->window_type, "%s", "Direct"); > + } else { /* Dynamic */ > + data->dynamic_address = be64_to_cpu(p->dma_base); > + data->dynamic_size = (u64)(1UL << be32_to_cpu(p->window_shift)); > + data->dynamic_pages_mapped = bitmap_weight(tbl->it_map, tbl->it_size); > + > + sprintf(data->window_type, "%s", "Dynamic"); > + } > + .. release dn ref with of_node_put() before returning. Similarly applicable for gather_dma_info() also. 
> + return SPAPR_SUCCESS; > +} > + > +/* Get DDW information for the device */ > +static int gather_dma_info(struct device *dev, struct dma_win_data *data) > +{ > + struct iommu_device *iommu; > + struct pci_controller *phb; > + struct device_node *dn; > + struct pci_dn *pci; > + const __be32 *prop = NULL; > + struct iommu_table *tbl; > + unsigned long offset, size, liobn; > + > + memset(data, 0, sizeof(*data)); > + > + iommu = dev_get_drvdata(dev); > + phb = container_of(iommu, struct pci_controller, iommu); > + dn = phb->dn; > + > + if (!dn) > + return SPAPR_ERROR; > + > + pci = PCI_DN(dn); > + if (!pci || !pci->table_group) > + return SPAPR_ERROR; > + > + /* search for default DMA window */ > + prop = of_get_property(dn, "ibm,dma-window", NULL); > + > + if (!prop) > + return SPAPR_ERROR; > + > + /* default DMA Window is always at index 0 */ > + tbl = pci->table_group->tables[0]; > + if (!tbl) > + return SPAPR_ERROR; > + > + of_parse_dma_window(dn, prop, &liobn, &offset, &size); > + > + data->dynamic_address = offset; > + data->dynamic_size = size; > + data->page_size = 1ULL << IOMMU_PAGE_SHIFT_4K; > + data->dynamic_pages_mapped = bitmap_weight(tbl->it_map, tbl->it_size); > + > + return SPAPR_SUCCESS; > +} > + > +#define DEVICE_SHOW_DDW(_name, _fmt) \ > +ssize_t ddw_##_name##_show(struct device *dev, \ > + struct device_attribute *attr,\ > + char *buf) \ > +{ \ > + int rc = 0; \ > + struct dma_win_data data; \ > + \ > + rc = gather_ddw_info(dev, &data); \ > + \ > + if (rc == SPAPR_SUCCESS) \ > + return sysfs_emit(buf, _fmt, data._name); \ > + else \ > + return -ENODATA; \ > +} \ > + > +#define DEVICE_SHOW_DMA(_name, _fmt) \ > +ssize_t dma_##_name##_show(struct device *dev, \ > + struct device_attribute *attr,\ > + char *buf) \ > +{ \ > + int rc = 0; \ > + struct dma_win_data data; \ > + \ > + rc = gather_dma_info(dev, &data); \ > + \ > + if (rc == SPAPR_SUCCESS) \ > + return sysfs_emit(buf, _fmt, data._name); \ > + else \ > + return -ENODATA; \ > +} \ > + > 
+static DEVICE_SHOW_DDW(direct_address, "%#llx\n"); > +static DEVICE_SHOW_DDW(direct_size, "%lld\n"); > +static DEVICE_SHOW_DDW(page_size, "%d\n"); > +static DEVICE_SHOW_DDW(window_type, "%s\n"); > +static DEVICE_SHOW_DDW(dynamic_address, "%#llx\n"); > +static DEVICE_SHOW_DDW(dynamic_size, "%lld\n"); > +static DEVICE_SHOW_DDW(dynamic_pages_mapped, "%d\n"); > +static DEVICE_SHOW_DMA(dynamic_address, "%#llx\n"); > +static DEVICE_SHOW_DMA(dynamic_size, "%lld\n"); > +static DEVICE_SHOW_DMA(page_size, "%d\n"); > +static DEVICE_SHOW_DMA(dynamic_pages_mapped, "%d\n"); > + > +#define DEVICE_ATTR_DDW(_name) \ > + struct device_attribute dev_attr_ddw_##_name = \ > + __ATTR(_name, 0444, ddw_##_name##_show, NULL) > +#define DEVICE_ATTR_DMA(_name) \ > + struct device_attribute dev_attr_dma_##_name = \ > + __ATTR(_name, 0444, dma_##_name##_show, NULL) > + > +static DEVICE_ATTR_DDW(direct_address); > +static DEVICE_ATTR_DDW(direct_size); > +static DEVICE_ATTR_DDW(page_size); > +static DEVICE_ATTR_DDW(window_type); > +static DEVICE_ATTR_DDW(dynamic_address); > +static DEVICE_ATTR_DDW(dynamic_size); > +static DEVICE_ATTR_DDW(dynamic_pages_mapped); > +static DEVICE_ATTR_DMA(dynamic_address); > +static DEVICE_ATTR_DMA(dynamic_size); > +static DEVICE_ATTR_DMA(page_size); > +static DEVICE_ATTR_DMA(dynamic_pages_mapped); > + > +static struct attribute *spapr_tce_ddw_attrs[] = { > + &dev_attr_ddw_direct_address.attr, > + &dev_attr_ddw_direct_size.attr, > + &dev_attr_ddw_page_size.attr, > + &dev_attr_ddw_window_type.attr, > + &dev_attr_ddw_dynamic_address.attr, > + &dev_attr_ddw_dynamic_size.attr, > + &dev_attr_ddw_dynamic_pages_mapped.attr, > + NULL, > +}; > + > +static struct attribute *spapr_tce_dma_attrs[] = { > + &dev_attr_dma_dynamic_address.attr, > + &dev_attr_dma_dynamic_size.attr, > + &dev_attr_dma_page_size.attr, > + &dev_attr_dma_dynamic_pages_mapped.attr, > + NULL, > +}; > + > +static struct attribute_group spapr_tce_ddw_group = { > + .name = "spapr-tce-ddw", > + .attrs = 
spapr_tce_ddw_attrs, > +}; > + > +static struct attribute_group spapr_tce_dma_group = { > + .name = "spapr-tce-dma", > + .attrs = spapr_tce_dma_attrs, > +}; > + > +static struct attribute *spapr_tce_iommu_attrs[] = { > + NULL, > +}; > + > +static struct attribute_group spapr_tce_iommu_group = { > + .name = "spapr-tce-iommu", > + .attrs = spapr_tce_iommu_attrs, > +}; > + > +const struct attribute_group *spapr_tce_iommu_groups[] = { > + &spapr_tce_iommu_group, > + &spapr_tce_ddw_group, > + &spapr_tce_dma_group, > + NULL, > +}; > + > static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus) > { > struct iommu_table *tbl; > diff --git a/arch/powerpc/platforms/pseries/pci_dlpar.c b/arch/powerpc/platforms/pseries/pci_dlpar.c > index 8c77ec7980de..b457451a2814 100644 > --- a/arch/powerpc/platforms/pseries/pci_dlpar.c > +++ b/arch/powerpc/platforms/pseries/pci_dlpar.c > @@ -45,6 +45,8 @@ struct pci_controller *init_phb_dynamic(struct device_node *dn) > pci_process_bridge_OF_ranges(phb, dn, 0); > phb->controller_ops = pseries_pci_controller_ops; > > + phb->iommu_groups = spapr_tce_iommu_groups; > + > pci_devs_phb_init_dynamic(phb); > > pseries_msi_allocate_domains(phb); > diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h > index 3968a6970fa8..4cf0b7a4e96a 100644 > --- a/arch/powerpc/platforms/pseries/pseries.h > +++ b/arch/powerpc/platforms/pseries/pseries.h > @@ -128,4 +128,5 @@ struct iommu_group *pSeries_pci_device_group(struct pci_controller *hose, > struct pci_dev *pdev); > #endif > > +extern const struct attribute_group *spapr_tce_iommu_groups[]; > #endif /* _PSERIES_PSERIES_H */ > diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c > index 50b26ed8432d..4d877aae0560 100644 > --- a/arch/powerpc/platforms/pseries/setup.c > +++ b/arch/powerpc/platforms/pseries/setup.c > @@ -512,6 +512,8 @@ static void __init pSeries_discover_phbs(void) > isa_bridge_find_early(phb); > 
phb->controller_ops = pseries_pci_controller_ops; > > + phb->iommu_groups = spapr_tce_iommu_groups; > + > /* create pci_dn's for DT nodes under this PHB */ > pci_devs_phb_init_dynamic(phb); > > base-commit: 192c0159402e6bfbe13de6f8379546943297783d
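Editorial note on the review above: the Hybrid branch of gather_ddw_info() and the request for named page-shift constants can be checked in isolation with a small user-space sketch. This is not code from the patch. The names DDW_PGSHIFT_4K, DDW_PGSHIFT_64K, DDW_PGSHIFT_2M, ddw_page_size(), struct hybrid_layout and ddw_hybrid_layout() are hypothetical, and the window_shift of 39 used in the note below is an assumption inferred from the 128GB + 384GB sizes in the documentation example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the shifts the patch tests as 0xc, 0x10 and 0x15. */
#define DDW_PGSHIFT_4K   12
#define DDW_PGSHIFT_64K  16
#define DDW_PGSHIFT_2M   21

/* Page size in bytes for a supported shift, 0 otherwise (same policy as the patch). */
static inline uint32_t ddw_page_size(uint32_t pgshift)
{
	switch (pgshift) {
	case DDW_PGSHIFT_4K:
	case DDW_PGSHIFT_64K:
	case DDW_PGSHIFT_2M:
		return UINT32_C(1) << pgshift;
	default:
		return 0;
	}
}

/* Hypothetical container for the four values computed in the Hybrid branch. */
struct hybrid_layout {
	uint64_t direct_address;
	uint64_t direct_size;
	uint64_t dynamic_address;
	uint64_t dynamic_size;
};

/*
 * Same arithmetic as the Hybrid branch: the dynamically allocated region
 * occupies the top of the window, the pre-mapped region the bottom.
 */
static inline struct hybrid_layout ddw_hybrid_layout(uint64_t dma_base,
						     uint32_t window_shift,
						     uint64_t it_size,
						     uint32_t it_page_shift)
{
	struct hybrid_layout l;
	uint64_t window_size = UINT64_C(1) << window_shift;

	l.direct_address = dma_base;
	l.dynamic_size = it_size << it_page_shift;	/* 64-bit operand, so no truncation */
	assert(l.dynamic_size <= window_size);		/* dynamic part must fit in the window */
	l.dynamic_address = dma_base + window_size - l.dynamic_size;
	l.direct_size = l.dynamic_address - l.direct_address;
	return l;
}
```

Feeding it the documentation example (dma_base 0x800000000000000, an assumed window_shift of 39, and 196608 TCEs of 2MB) reproduces the values shown there: dynamic_address 0x800002000000000, direct_size 137438953472 and dynamic_size 412316860416.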
Hi Gaurav,
kernel test robot noticed the following build warnings:
[auto build test WARNING on 192c0159402e6bfbe13de6f8379546943297783d]
url: https://github.com/intel-lab-lkp/linux/commits/Gaurav-Batra/powerpc-pseries-iommu-export-DMA-window-data-to-user-space/20260510-175116
base: 192c0159402e6bfbe13de6f8379546943297783d
patch link: https://lore.kernel.org/r/20260507180646.40356-1-gbatra%40linux.ibm.com
patch subject: [PATCH v2] powerpc/pseries/iommu: export DMA window data to user space
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
docutils: docutils (Docutils 0.21.2, Python 3.13.5, on linux)
reproduce: (https://download.01.org/0day-ci/archive/20260510/202605101820.ZpQl79bh-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202605101820.ZpQl79bh-lkp@intel.com/
All warnings (new ones prefixed by >>):
Documentation/userspace-api/landlock:453: ./include/uapi/linux/landlock.h:45: ERROR: Unknown target name: "network flags". [docutils]
Documentation/userspace-api/landlock:453: ./include/uapi/linux/landlock.h:50: ERROR: Unknown target name: "scope flags". [docutils]
Documentation/userspace-api/landlock:453: ./include/uapi/linux/landlock.h:24: ERROR: Unknown target name: "filesystem flags". [docutils]
Documentation/userspace-api/landlock:462: ./include/uapi/linux/landlock.h:153: ERROR: Unknown target name: "filesystem flags". [docutils]
Documentation/userspace-api/landlock:462: ./include/uapi/linux/landlock.h:176: ERROR: Unknown target name: "network flags". [docutils]
>> Documentation/arch/powerpc/dma_window_attributes.rst: WARNING: document isn't included in any toctree [toc.not_included]
Documentation/networking/skbuff:36: ./include/linux/skbuff.h:181: WARNING: Failed to create a cross reference. A title or caption not found: 'crc' [ref.ref]
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Gaurav Batra <gbatra@linux.ibm.com> writes:

Thanks for the v2 patch. My review comments below:

General comment. I see some issues in the patch that checkpatch would have flagged. Can you please also ensure that there are no checkpatch related warning before you send the patch

Optional comment: Please split the patch into 2 , moving the DOC changes into separate patch.

> Export PowerPC DMA window information (both default 2GB and Dynamic
> larger window) to user space via sysfs. Each of these DMA windows has
> attributes like size of the window, page size backing the window, mode,
> etc. Each of these atributes is exported for user space consumption as a
> file.
>
> PowerPC Host Bridge (PHB) can have multiple devices/functions sharing
> the same DMA window. For each PHB, iommu registration creates an iommu
> device under "/sys/devices/virtual/iommu".
>
> These devices will have 2 groups created to export Default and DDW
> attributes.
>
> Reviewed-by: Brian King <brking@linux.ibm.com>
> Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com>

Thanks for incorporating my review comments from the previous iteration. However I dont remember reviewing the v2 of this patch before. Can you please avoid presumptively adding my R-b until I have a chance to review the patch.

> Reviewed-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
> Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com>
> ---
> V1 -> V2 change log:
>
> 1. Shiva: "weight" the it_map for the bitmap. This avoids using an extra
> counter in the table. Please look into how iommu_debugfs_weight_get()
> does this
>
> Response: Incorporated changes
>
> 2. Vaibhav: If the DMA window is not available, show function should just
> return ENOENT so that userspace know the error instantly instead of
> having to parse the sysfs contents.
>
> Response: Incorporated changes, returning ENODATA
>
> 3. Vaibhav: All the show functions have similar template. Please convert
> them to macros expansion to reduce code volume.
> > Response: Incorporated changes > > 4. Vaibhav: These new attributes are PSeries specific but they are being > setup in ppc generic iommu code at arch/powerpc/kernel/iommu.c. Can > you move these attributes to arch/powerpc/platforms/pseries/iommu.c > > Response: I have split the attributes and moved them to pseries specific > files. The original group "spapr-tce-iommu", is moved to PowerNV code > base to retain the legacy functionality. > > I tested the changes both on Pseries and PowerNV. > > 5. Vaibhav: It would be better to use function iommu_table_inuse_tces() as > a callback in iommu_table_ops which can be implemented by pseries and > powernv code differently. > > Response: the function is no longer needed after changes in #1 > > 6. Vaibhav: Since sysfs is ABI can you propose appropriate entries under > Documentation/ABI/testing > > Response: Added documentation > > ...sfs-devices-virtual-iommu-dma_window_attrs | 21 ++ > .../arch/powerpc/dma_window_attributes.rst | 65 +++++ > arch/powerpc/include/asm/pci-bridge.h | 4 + > arch/powerpc/kernel/iommu.c | 16 +- > arch/powerpc/platforms/powernv/pci-ioda.c | 16 ++ > arch/powerpc/platforms/pseries/iommu.c | 261 ++++++++++++++++++ > arch/powerpc/platforms/pseries/pci_dlpar.c | 2 + > arch/powerpc/platforms/pseries/pseries.h | 1 + > arch/powerpc/platforms/pseries/setup.c | 2 + > 9 files changed, 373 insertions(+), 15 deletions(-) > create mode 100644 Documentation/ABI/testing/sysfs-devices-virtual-iommu-dma_window_attrs > create mode 100644 Documentation/arch/powerpc/dma_window_attributes.rst > > diff --git a/Documentation/ABI/testing/sysfs-devices-virtual-iommu-dma_window_attrs b/Documentation/ABI/testing/sysfs-devices-virtual-iommu-dma_window_attrs > new file mode 100644 > index 000000000000..18ba63874276 > --- /dev/null > +++ b/Documentation/ABI/testing/sysfs-devices-virtual-iommu-dma_window_attrs > @@ -0,0 +1,21 @@ > +What: > /sys/devices/virtual/iommu/<iommu-isolation>/spapr-tce-ddw/* Suggested: 
s/iommu-isolation/iommu-group/ > +Date: Oct 2025 > +Contact: linuxppc-dev@lists.ozlabs.org > +Description: read only > + For each IOMMU isolation unit spapr-tce-ddw sub-directory provides > + attributes to query information related to the bigger Dynamic DMA > + window (DDW) in the PowerPC virtualized platforms. > + > + See Documentation/arch/powerpc/dma_window_attributes.rst for more > + information. > + > +What: /sys/devices/virtual/iommu/<iommu-isolation>/spapr-tce-dma/* > +Date: Oct 2025 > +Contact: linuxppc-dev@lists.ozlabs.org > +Description: read only > + For each IOMMU isolation unit spapr-tce-dma sub-directory provides > + attributes to query information related to the default 2GB DMA > + window in the PowerPC virtualized platforms. > + > + See Documentation/arch/powerpc/dma_window_attributes.rst for more > + information. sysfs ABI documentation typically describes all the attribute files rather then directory. Please add details of the individual attributes that you are adding here. > diff --git a/Documentation/arch/powerpc/dma_window_attributes.rst b/Documentation/arch/powerpc/dma_window_attributes.rst > new file mode 100644 > index 000000000000..8bd9aec8539d > --- /dev/null > +++ b/Documentation/arch/powerpc/dma_window_attributes.rst > @@ -0,0 +1,65 @@ > +.. SPDX-License-Identifier: GPL-2.0 > + > +===================== > +DMA Window Attributes > +===================== > + > +In PowerPC architecture there are 2 types of DMA windows - > + This is only true for PPC64-PSeries not for PPC64-PowerNV > +1. Default 2GB DMA window which is backed by 4K page size > +2. A bigger Dynamic DMA Window (DDW) which is backed by larger page size > + (64K or 2MB) > + > +A dedicated device will have both the DMA windows instantiated but an SR-IOV > +device will only have the bigger Dynamic DMA Window. In context of PSeries please give some context abt 'dedicated device' > + > +The attributes of these 2 DMA windows are exported to user space via sysfs. 
> +Each IOMMU isolation unit will have its directory created under > +/sys/devices/virtual/iommu. > + > +As an exapmple, iommu-phb0001 > + > +Under each IOMMU isolation unit, there will be a group of attributes for > +"Default 2GB DMA Window" and "Dynamic DMA Window" - spapr-tce-dma and > +spapr-tce-ddw respectively. > + > +Attributes under each group > + > +spapr-tce-ddw: > +direct_address dynamic_address dynamic_size window_type > +direct_size dynamic_pages_mapped page_size > + > +spapr-tce-dma: > +dynamic_address dynamic_pages_mapped dynamic_size page_size > + > + > +The bigger Dynamic DMA Window is configured into pre-mapped and/or dynamically > +allocated TCEs. If the DDW is in "Hybrid" mode, then both the Direct > +(pre-mapped) and Dynamic part of the DMA window will have valid values. Hybrid > +mode is valid only for SR-IOV devices. > + > +DMA Window properties: > + > +direct_address Starting address of the pre-mapped DMA window > +direct_size Size of the pre-mapped DMA Window > +dynamic_address Starting address of the dynamic allocations > +dynamic_size Size of the dynamic allocation window > +dynamic_pages_mapped Pages mapped for DMA by dynamic allocations > +page_size Page size backing the DMA window > +window_type Type of the DMA Window (Direct/Dynamic/Hybrid) > + these attributes should also be documented in the sysfs/ABI > + > +An example of DDW attributes for an SR-IOV device:: > + > + $ cd /sys/devices/virtual/iommu/iommu-phb0001/spapr-tce-ddw > + > + $ grep . 
* > + > + direct_address:0x800000000000000 <-- Starting addr of pre-mapped Window > + direct_size:137438953472 <-- Size of pre-mapped Window (128GB) > + dynamic_address:0x800002000000000 <-- Starting addr of Dynamic allocations > + dynamic_size:412316860416 <-- Size of dynamic allocation window (384GB) > + dynamic_pages_mapped:270 <-- Pages mapped by dynamic allocations > + page_size:2097152 <-- DMA window page size (2MB) > + window_type:Hybrid <-- window has both pre-mapped and > + dynamic sections Suggested: This documentation can be improved by moving details on sysfs attrs and adding details on how 2 different types of DMA windows are allocated and managed. > diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h > index 1dae53130782..9b09178aca5e 100644 > --- a/arch/powerpc/include/asm/pci-bridge.h > +++ b/arch/powerpc/include/asm/pci-bridge.h > @@ -124,6 +124,10 @@ struct pci_controller { > resource_size_t dma_window_base_cur; > resource_size_t dma_window_size; > > +#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV) > + const struct attribute_group **iommu_groups; > +#endif Ideally addition of new members to a struct should be done at the end to preserve KABI. 
Naming issue: s/iommu_groups/iommu_group_attrs/ > + > #ifdef CONFIG_PPC64 > unsigned long buid; > struct pci_dn *pci_data; > diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c > index 0ce71310b7d9..d6242e3f77da 100644 > --- a/arch/powerpc/kernel/iommu.c > +++ b/arch/powerpc/kernel/iommu.c > @@ -1269,24 +1269,10 @@ static const struct iommu_ops spapr_tce_iommu_ops = { > .device_group = spapr_tce_iommu_device_group, > }; > > -static struct attribute *spapr_tce_iommu_attrs[] = { > - NULL, > -}; > - > -static struct attribute_group spapr_tce_iommu_group = { > - .name = "spapr-tce-iommu", > - .attrs = spapr_tce_iommu_attrs, > -}; > - > -static const struct attribute_group *spapr_tce_iommu_groups[] = { > - &spapr_tce_iommu_group, > - NULL, > -}; > - > void ppc_iommu_register_device(struct pci_controller *phb) > { > iommu_device_sysfs_add(&phb->iommu, phb->parent, > - spapr_tce_iommu_groups, "iommu-phb%04x", > + phb->iommu_groups, "iommu-phb%04x", > phb->global_number); > iommu_device_register(&phb->iommu, &spapr_tce_iommu_ops, > phb->parent); Since you are changing this code, can you check for NULL phb->iommu_groups and also check for returned errors from these two functions(). In case phb->iommu_groups == NULL you can ignore registering sysfs. That will take care of POWERNV case. 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c > index 1c78fdfb7b03..0887f154955e 100644 > --- a/arch/powerpc/platforms/powernv/pci-ioda.c > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c > @@ -2493,6 +2493,20 @@ static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = { > .shutdown = pnv_pci_ioda_shutdown, > }; > > +static struct attribute *pnv_tce_iommu_attrs[] = { > + NULL, > +}; > + > +static struct attribute_group pnv_tce_iommu_group = { > + .name = "spapr-tce-iommu", > + .attrs = pnv_tce_iommu_attrs, > +}; > + > +static const struct attribute_group *pnv_tce_iommu_groups[] = { > + &pnv_tce_iommu_group, > + NULL, > +}; > + > static void __init pnv_pci_init_ioda_phb(struct device_node *np, > u64 hub_id, int ioda_type) > { > @@ -2697,6 +2711,8 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np, > hose->controller_ops = pnv_pci_ioda_controller_ops; > } > > + hose->iommu_groups = pnv_tce_iommu_groups; > + See the previous comment for optimization. This proposed hunk can be removed. > ppc_md.pcibios_default_alignment = pnv_pci_default_alignment; > > #ifdef CONFIG_PCI_IOV > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c > index 5497b130e026..28be7a45761d 100644 > --- a/arch/powerpc/platforms/pseries/iommu.c > +++ b/arch/powerpc/platforms/pseries/iommu.c > @@ -56,6 +56,20 @@ enum { > DDW_EXT_LIMITED_ADDR_MODE = 3 > }; > > +/* used by sysfs when querying Dynamic/Default DMA Window data */ > +struct dma_win_data { > + u32 page_size; > + u64 direct_address; > + u64 direct_size; > + u64 dynamic_address; > + u64 dynamic_size; > + u32 dynamic_pages_mapped; > + char window_type[15]; Why do you need to hold a string representation of the window_type. Can this be replaced by an enum that holds much smaller space. 
> +}; > + > +#define SPAPR_SUCCESS 0 > +#define SPAPR_ERROR -1

Returning 0 or -1 is a common and well-known convention for kernel functions, so you need not create separate macros for them. Also, the indentation looks strange.

> + > static struct iommu_table *iommu_pseries_alloc_table(int node) > { > struct iommu_table *tbl; > @@ -837,6 +851,253 @@ static struct device_node *pci_dma_find(struct device_node *dn, > return rdn; > } > > +/* Get DDW information for the device */ > +static int gather_ddw_info(struct device *dev, struct dma_win_data *data) > +{ > + struct iommu_device *iommu; > + struct pci_controller *phb; > + struct device_node *dn; > + struct pci_dn *pci; > + const __be32 *prop = NULL; > + bool ddw_direct = false; > + bool found = false; > + struct iommu_table *tbl; > + u32 pgshift; > + struct dynamic_dma_window_prop *p; > + > + memset(data, 0, sizeof(*data)); > + > + iommu = dev_get_drvdata(dev); > + phb = container_of(iommu, struct pci_controller, iommu); > + dn = phb->dn; > + > + if (!dn) > + return SPAPR_ERROR; > + > + pci = PCI_DN(dn); > + if (!pci || !pci->table_group) > + return SPAPR_ERROR; > + > + /* Find DDW */ > + prop = of_get_property(dn, DIRECT64_PROPNAME, NULL); > + if (prop) { > + ddw_direct = true; > + found = true; > + } else { > + prop = of_get_property(dn, DMA64_PROPNAME, NULL); > + if (prop) > + found = true; > + } > + > + /* NO DDW */ > + if (!found) > + return SPAPR_ERROR; > + > + p = (struct dynamic_dma_window_prop *)prop; > + > + pgshift = be32_to_cpu(p->tce_shift); > + if (pgshift != 0xc && pgshift != 0x10 && pgshift != 0x15) > + data->page_size = 0; > + else > + data->page_size = 1 << pgshift; > + > + /* Check if DDW has table associated with it. Having a table associated with > + * DDW is indicative that is has some dynamic TCE allocations. In this case the > + * DDW can be fully Dynamic or in Hybrid mode. For SR-IOV DDW is on index 0, > + * for dedicated adapter on index 1. 
> + */ > + found = false; > + for (int i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) { Variable Naming: avoid 'i' . Also please hoist the loop variable > + tbl = pci->table_group->tables[i]; > + > + if (tbl && tbl->it_index == be32_to_cpu(p->liobn)) { > + found = true; > + break; > + } > + } > + > + /* set the parameters depnding on the DDW type */ > + if (ddw_direct && found) { /* Hybrid */ > + data->direct_address = be64_to_cpu(p->dma_base); > + data->dynamic_size = (u64)(tbl->it_size << > tbl->it_page_shift); May want to check for possible overflow > + > + data->dynamic_address = data->direct_address > + + (u64)(1UL << be32_to_cpu(p->window_shift)) > + - > data->dynamic_size; May want to check for possible overflow > + > + data->direct_size = data->dynamic_address - data->direct_address; > + data->dynamic_pages_mapped = bitmap_weight(tbl->it_map, tbl->it_size); > + > + sprintf(data->window_type, "%s", "Hybrid"); > + } else if (ddw_direct && !found) { /* Direct */ > + data->direct_address = be64_to_cpu(p->dma_base); > + data->direct_size = (u64)(1UL << be32_to_cpu(p->window_shift)); > + > + sprintf(data->window_type, "%s", "Direct"); > + } else { /* Dynamic */ > + data->dynamic_address = be64_to_cpu(p->dma_base); > + data->dynamic_size = (u64)(1UL << be32_to_cpu(p->window_shift)); > + data->dynamic_pages_mapped = bitmap_weight(tbl->it_map, tbl->it_size); > + > + sprintf(data->window_type, "%s", "Dynamic"); > + } > + > + return SPAPR_SUCCESS; > +} > + > +/* Get DDW information for the device */ > +static int gather_dma_info(struct device *dev, struct dma_win_data *data) > +{ > + struct iommu_device *iommu; > + struct pci_controller *phb; > + struct device_node *dn; > + struct pci_dn *pci; > + const __be32 *prop = NULL; > + struct iommu_table *tbl; > + unsigned long offset, size, liobn; > + > + memset(data, 0, sizeof(*data)); > + > + iommu = dev_get_drvdata(dev); > + phb = container_of(iommu, struct pci_controller, iommu); > + dn = phb->dn; > + > + if (!dn) > + 
return SPAPR_ERROR; > + > + pci = PCI_DN(dn); > + if (!pci || !pci->table_group) > + return SPAPR_ERROR; > + > + /* search for default DMA window */ > + prop = of_get_property(dn, "ibm,dma-window", NULL); > + > + if (!prop) > + return SPAPR_ERROR; > + > + /* default DMA Window is always at index 0 */ > + tbl = pci->table_group->tables[0]; > + if (!tbl) > + return SPAPR_ERROR; > + > + of_parse_dma_window(dn, prop, &liobn, &offset, &size); > + > + data->dynamic_address = offset; > + data->dynamic_size = size; > + data->page_size = 1ULL << IOMMU_PAGE_SHIFT_4K; > + data->dynamic_pages_mapped = bitmap_weight(tbl->it_map, tbl->it_size); > + > + return SPAPR_SUCCESS; > +} > + > +#define DEVICE_SHOW_DDW(_name, _fmt) \ > +ssize_t ddw_##_name##_show(struct device *dev, \ > + struct device_attribute *attr,\ > + char *buf) \ > +{ \ > + int rc = 0; \ > + struct dma_win_data data; \ > + \ > + rc = gather_ddw_info(dev, &data); \ > + \ > + if (rc == SPAPR_SUCCESS) \ > + return sysfs_emit(buf, _fmt, data._name); \ > + else \ > + return -ENODATA; \ > +} \ All the device tree data that gather_{ddw dma}_info() collects except bitmap_weight is static in nature and need not be refreshed at each call to xx_show(). This can be optimized. > + > +#define DEVICE_SHOW_DMA(_name, _fmt) \ > +ssize_t dma_##_name##_show(struct device *dev, \ > + struct device_attribute *attr,\ > + char *buf) \ > +{ \ > + int rc = 0; \ > + struct dma_win_data data; \ > + \ > + rc = gather_dma_info(dev, &data); \ > + \ > + if (rc == SPAPR_SUCCESS) \ > + return sysfs_emit(buf, _fmt, data._name); \ > + else \ > + return -ENODATA; \ > +} \ > + Indentation looks strange. 
Also, can you just return the 'rc' from gather_{ddw dma}_info back from xx_show rather than ENODATA?

> +static DEVICE_SHOW_DDW(direct_address, "%#llx\n"); > +static DEVICE_SHOW_DDW(direct_size, "%lld\n"); > +static DEVICE_SHOW_DDW(page_size, "%d\n"); > +static DEVICE_SHOW_DDW(window_type, "%s\n"); > +static DEVICE_SHOW_DDW(dynamic_address, "%#llx\n"); > +static DEVICE_SHOW_DDW(dynamic_size, "%lld\n"); > +static DEVICE_SHOW_DDW(dynamic_pages_mapped, "%d\n"); > +static DEVICE_SHOW_DMA(dynamic_address, "%#llx\n"); > +static DEVICE_SHOW_DMA(dynamic_size, "%lld\n"); > +static DEVICE_SHOW_DMA(page_size, "%d\n"); > +static DEVICE_SHOW_DMA(dynamic_pages_mapped, "%d\n");

Avoid putting '\n's at the end of strings. Makes parsing contents tricky.

> + > +#define DEVICE_ATTR_DDW(_name) \ > + struct device_attribute dev_attr_ddw_##_name = \ > + __ATTR(_name, 0444, ddw_##_name##_show, NULL) > +#define DEVICE_ATTR_DMA(_name) \ > + struct device_attribute dev_attr_dma_##_name = \ > + __ATTR(_name, 0444, dma_##_name##_show, NULL) > + > +static DEVICE_ATTR_DDW(direct_address); > +static DEVICE_ATTR_DDW(direct_size); > +static DEVICE_ATTR_DDW(page_size); > +static DEVICE_ATTR_DDW(window_type); > +static DEVICE_ATTR_DDW(dynamic_address); > +static DEVICE_ATTR_DDW(dynamic_size); > +static DEVICE_ATTR_DDW(dynamic_pages_mapped); > +static DEVICE_ATTR_DMA(dynamic_address); > +static DEVICE_ATTR_DMA(dynamic_size); > +static DEVICE_ATTR_DMA(page_size); > +static DEVICE_ATTR_DMA(dynamic_pages_mapped); > + > +static struct attribute *spapr_tce_ddw_attrs[] = { > + &dev_attr_ddw_direct_address.attr, > + &dev_attr_ddw_direct_size.attr, > + &dev_attr_ddw_page_size.attr, > + &dev_attr_ddw_window_type.attr, > + &dev_attr_ddw_dynamic_address.attr, > + &dev_attr_ddw_dynamic_size.attr, > + &dev_attr_ddw_dynamic_pages_mapped.attr, > + NULL, > +}; > + > +static struct attribute *spapr_tce_dma_attrs[] = { > + &dev_attr_dma_dynamic_address.attr, > + &dev_attr_dma_dynamic_size.attr, > + 
&dev_attr_dma_page_size.attr, > + &dev_attr_dma_dynamic_pages_mapped.attr, > + NULL, > +}; > + > +static struct attribute_group spapr_tce_ddw_group = { > + .name = "spapr-tce-ddw", > + .attrs = spapr_tce_ddw_attrs, > +}; > + > +static struct attribute_group spapr_tce_dma_group = { > + .name = "spapr-tce-dma", > + .attrs = spapr_tce_dma_attrs, > +}; > + > +static struct attribute *spapr_tce_iommu_attrs[] = { > + NULL, > +}; > + > +static struct attribute_group spapr_tce_iommu_group = { > + .name = "spapr-tce-iommu", > + .attrs = spapr_tce_iommu_attrs, > +}; > + > +const struct attribute_group *spapr_tce_iommu_groups[] = { > + &spapr_tce_iommu_group, > + &spapr_tce_ddw_group, > + &spapr_tce_dma_group, > + NULL, > +}; > + > static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus) > { > struct iommu_table *tbl; > diff --git a/arch/powerpc/platforms/pseries/pci_dlpar.c b/arch/powerpc/platforms/pseries/pci_dlpar.c > index 8c77ec7980de..b457451a2814 100644 > --- a/arch/powerpc/platforms/pseries/pci_dlpar.c > +++ b/arch/powerpc/platforms/pseries/pci_dlpar.c > @@ -45,6 +45,8 @@ struct pci_controller *init_phb_dynamic(struct device_node *dn) > pci_process_bridge_OF_ranges(phb, dn, 0); > phb->controller_ops = pseries_pci_controller_ops; > > + phb->iommu_groups = spapr_tce_iommu_groups; > + > pci_devs_phb_init_dynamic(phb); > > pseries_msi_allocate_domains(phb); > diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h > index 3968a6970fa8..4cf0b7a4e96a 100644 > --- a/arch/powerpc/platforms/pseries/pseries.h > +++ b/arch/powerpc/platforms/pseries/pseries.h > @@ -128,4 +128,5 @@ struct iommu_group *pSeries_pci_device_group(struct pci_controller *hose, > struct pci_dev *pdev); > #endif > > +extern const struct attribute_group *spapr_tce_iommu_groups[]; > #endif /* _PSERIES_PSERIES_H */ > diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c > index 50b26ed8432d..4d877aae0560 100644 > --- 
a/arch/powerpc/platforms/pseries/setup.c > +++ b/arch/powerpc/platforms/pseries/setup.c > @@ -512,6 +512,8 @@ static void __init pSeries_discover_phbs(void) > isa_bridge_find_early(phb); > phb->controller_ops = pseries_pci_controller_ops; > > + phb->iommu_groups = spapr_tce_iommu_groups; > + > /* create pci_dn's for DT nodes under this PHB */ > pci_devs_phb_init_dynamic(phb); > > base-commit: 192c0159402e6bfbe13de6f8379546943297783d > -- > 2.39.3 >
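Two of the review comments above (replace the window_type string with an enum, and check the shifts for possible overflow) can be illustrated with a small user-space sketch. This is not code from the patch: the enum, its constants, and the helper names ddw_window_type_name(), window_size_from_shift() and table_bytes() are hypothetical, and plain C99 types stand in for the kernel's u32/u64. In kernel code the helpers in include/linux/overflow.h (for example check_shl_overflow()) would probably be the more natural fit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical enum standing in for the char window_type[15] member. */
enum ddw_window_type {
	DDW_WINDOW_DIRECT,
	DDW_WINDOW_DYNAMIC,
	DDW_WINDOW_HYBRID,
};

/* Map the enum back to the strings the v2 patch emits via sysfs. */
static const char *ddw_window_type_name(enum ddw_window_type type)
{
	switch (type) {
	case DDW_WINDOW_DIRECT:
		return "Direct";
	case DDW_WINDOW_DYNAMIC:
		return "Dynamic";
	case DDW_WINDOW_HYBRID:
		return "Hybrid";
	}
	return "Unknown";
}

/*
 * Compute 1 << shift as a 64-bit size.
 * Returns false and leaves *size untouched when the shift would overflow.
 */
static bool window_size_from_shift(uint32_t shift, uint64_t *size)
{
	if (shift >= 64)
		return false;
	*size = UINT64_C(1) << shift;
	return true;
}

/*
 * Compute entries << page_shift (table size in bytes).
 * Returns false when the result does not fit in 64 bits.
 */
static bool table_bytes(uint64_t entries, uint32_t page_shift, uint64_t *bytes)
{
	if (page_shift >= 64 || entries > (UINT64_MAX >> page_shift))
		return false;
	*bytes = entries << page_shift;
	return true;
}
```

With the numbers from the documentation example, a window_shift of 39 gives the 512GB total window, and 196608 entries at a 2MB page shift (21) give the 384GB dynamic section.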
Hello Harsh, My response to your locking device_node suggestion is below inline. Please let me know if you don't agree with my reasoning. Thanks Gaurav On 5/8/26 12:04 PM, Harsh Prateek Bora wrote: > Hi Gaurav, > > On 07/05/26 11:36 pm, Gaurav Batra wrote: >> Export PowerPC DMA window information (both default 2GB and Dynamic >> larger window) to user space via sysfs. Each of these DMA windows has >> attributes like size of the window, page size backing the window, mode, >> etc. Each of these atributes is exported for user space consumption as a >> file. >> >> PowerPC Host Bridge (PHB) can have multiple devices/functions sharing >> the same DMA window. For each PHB, iommu registration creates an iommu >> device under "/sys/devices/virtual/iommu". >> >> These devices will have 2 groups created to export Default and DDW >> attributes. >> >> Reviewed-by: Brian King <brking@linux.ibm.com> >> Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com> >> Reviewed-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> > > I do not see R-b tags provided on the list after review comments. > Not sure if I am missing the email or were these provided privately ? > Sharing some review comments inline below .. > >> Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com> >> --- >> V1 -> V2 change log: >> >> 1. Shiva: "weight" the it_map for the bitmap. This avoids using an extra >> counter in the table. Please look into how >> iommu_debugfs_weight_get() >> does this >> >> Response: Incorporated changes >> >> 2. Vaibhav: If the DMA window is not available, show function should >> just >> return ENOENT so that userspace know the error instantly instead of >> having to parse the sysfs contents. >> >> Response: Incorporated changes, returning ENODATA >> >> 3. Vaibhav: All the show functions have similar template. Please convert >> them to macros expansion to reduce code volume. >> >> Response: Incorporated changes >> >> 4. 
Vaibhav: These new attributes are PSeries specific but they are being >> setup in ppc generic iommu code at arch/powerpc/kernel/iommu.c. Can >> you move these attributes to arch/powerpc/platforms/pseries/iommu.c >> >> Response: I have split the attributes and moved them to pseries >> specific >> files. The original group "spapr-tce-iommu", is moved to PowerNV >> code >> base to retain the legacy functionality. >> >> I tested the changes both on Pseries and PowerNV. >> >> 5. Vaibhav: It would be better to use function >> iommu_table_inuse_tces() as >> a callback in iommu_table_ops which can be implemented by pseries >> and >> powernv code differently. >> >> Response: the function is no longer needed after changes in #1 >> >> 6. Vaibhav: Since sysfs is ABI can you propose appropriate entries under >> Documentation/ABI/testing >> >> Response: Added documentation >> >> ...sfs-devices-virtual-iommu-dma_window_attrs | 21 ++ >> .../arch/powerpc/dma_window_attributes.rst | 65 +++++ >> arch/powerpc/include/asm/pci-bridge.h | 4 + >> arch/powerpc/kernel/iommu.c | 16 +- >> arch/powerpc/platforms/powernv/pci-ioda.c | 16 ++ >> arch/powerpc/platforms/pseries/iommu.c | 261 ++++++++++++++++++ >> arch/powerpc/platforms/pseries/pci_dlpar.c | 2 + >> arch/powerpc/platforms/pseries/pseries.h | 1 + >> arch/powerpc/platforms/pseries/setup.c | 2 + >> 9 files changed, 373 insertions(+), 15 deletions(-) >> create mode 100644 >> Documentation/ABI/testing/sysfs-devices-virtual-iommu-dma_window_attrs >> create mode 100644 >> Documentation/arch/powerpc/dma_window_attributes.rst >> >> diff --git >> a/Documentation/ABI/testing/sysfs-devices-virtual-iommu-dma_window_attrs >> b/Documentation/ABI/testing/sysfs-devices-virtual-iommu-dma_window_attrs >> new file mode 100644 >> index 000000000000..18ba63874276 >> --- /dev/null >> +++ >> b/Documentation/ABI/testing/sysfs-devices-virtual-iommu-dma_window_attrs >> @@ -0,0 +1,21 @@ >> +What: /sys/devices/virtual/iommu/<iommu-isolation>/spapr-tce-ddw/* >> 
+Date: Oct 2025 >> +Contact: linuxppc-dev@lists.ozlabs.org >> +Description: read only >> + For each IOMMU isolation unit spapr-tce-ddw sub-directory provides >> + attributes to query information related to the bigger Dynamic DMA >> + window (DDW) in the PowerPC virtualized platforms. >> + >> + See Documentation/arch/powerpc/dma_window_attributes.rst for more >> + information. >> + >> +What: /sys/devices/virtual/iommu/<iommu-isolation>/spapr-tce-dma/* >> +Date: Oct 2025 >> +Contact: linuxppc-dev@lists.ozlabs.org >> +Description: read only >> + For each IOMMU isolation unit spapr-tce-dma sub-directory provides >> + attributes to query information related to the default 2GB DMA >> + window in the PowerPC virtualized platforms. >> + >> + See Documentation/arch/powerpc/dma_window_attributes.rst for more >> + information. >> diff --git a/Documentation/arch/powerpc/dma_window_attributes.rst >> b/Documentation/arch/powerpc/dma_window_attributes.rst >> new file mode 100644 >> index 000000000000..8bd9aec8539d >> --- /dev/null >> +++ b/Documentation/arch/powerpc/dma_window_attributes.rst >> @@ -0,0 +1,65 @@ >> +.. SPDX-License-Identifier: GPL-2.0 >> + >> +===================== >> +DMA Window Attributes >> +===================== >> + >> +In PowerPC architecture there are 2 types of DMA windows - >> + >> +1. Default 2GB DMA window which is backed by 4K page size >> +2. A bigger Dynamic DMA Window (DDW) which is backed by larger page >> size >> + (64K or 2MB) >> + >> +A dedicated device will have both the DMA windows instantiated but >> an SR-IOV >> +device will only have the bigger Dynamic DMA Window. >> + >> +The attributes of these 2 DMA windows are exported to user space via >> sysfs. >> +Each IOMMU isolation unit will have its directory created under >> +/sys/devices/virtual/iommu. >> + >> +As an exapmple, iommu-phb0001 > > s/exapmple/example ? 
> >> + >> +Under each IOMMU isolation unit, there will be a group of attributes >> for >> +"Default 2GB DMA Window" and "Dynamic DMA Window" - spapr-tce-dma and >> +spapr-tce-ddw respectively. >> + >> +Attributes under each group >> + >> +spapr-tce-ddw: >> +direct_address dynamic_address dynamic_size window_type >> +direct_size dynamic_pages_mapped page_size >> + >> +spapr-tce-dma: >> +dynamic_address dynamic_pages_mapped dynamic_size page_size >> + >> + >> +The bigger Dynamic DMA Window is configured into pre-mapped and/or >> dynamically >> +allocated TCEs. If the DDW is in "Hybrid" mode, then both the Direct >> +(pre-mapped) and Dynamic part of the DMA window will have valid >> values. Hybrid >> +mode is valid only for SR-IOV devices. >> + >> +DMA Window properties: >> + >> +direct_address Starting address of the pre-mapped DMA >> window >> +direct_size Size of the pre-mapped DMA Window >> +dynamic_address Starting address of the dynamic allocations >> +dynamic_size Size of the dynamic allocation window >> +dynamic_pages_mapped Pages mapped for DMA by dynamic allocations >> +page_size Page size backing the DMA window >> +window_type Type of the DMA Window >> (Direct/Dynamic/Hybrid) >> + >> + >> +An example of DDW attributes for an SR-IOV device:: >> + >> + $ cd /sys/devices/virtual/iommu/iommu-phb0001/spapr-tce-ddw >> + >> + $ grep . 
* >> + >> + direct_address:0x800000000000000 <-- Starting addr of >> pre-mapped Window >> + direct_size:137438953472 <-- Size of pre-mapped Window >> (128GB) >> + dynamic_address:0x800002000000000 <-- Starting addr of Dynamic >> allocations >> + dynamic_size:412316860416 <-- Size of dynamic >> allocation window (384GB) >> + dynamic_pages_mapped:270 <-- Pages mapped by dynamic >> allocations >> + page_size:2097152 <-- DMA window page size (2MB) >> + window_type:Hybrid <-- window has both >> pre-mapped and >> + dynamic sections >> diff --git a/arch/powerpc/include/asm/pci-bridge.h >> b/arch/powerpc/include/asm/pci-bridge.h >> index 1dae53130782..9b09178aca5e 100644 >> --- a/arch/powerpc/include/asm/pci-bridge.h >> +++ b/arch/powerpc/include/asm/pci-bridge.h >> @@ -124,6 +124,10 @@ struct pci_controller { >> resource_size_t dma_window_base_cur; >> resource_size_t dma_window_size; >> +#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV) >> + const struct attribute_group **iommu_groups; >> +#endif >> + >> #ifdef CONFIG_PPC64 >> unsigned long buid; >> struct pci_dn *pci_data; >> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c >> index 0ce71310b7d9..d6242e3f77da 100644 >> --- a/arch/powerpc/kernel/iommu.c >> +++ b/arch/powerpc/kernel/iommu.c >> @@ -1269,24 +1269,10 @@ static const struct iommu_ops >> spapr_tce_iommu_ops = { >> .device_group = spapr_tce_iommu_device_group, >> }; >> -static struct attribute *spapr_tce_iommu_attrs[] = { >> - NULL, >> -}; >> - >> -static struct attribute_group spapr_tce_iommu_group = { >> - .name = "spapr-tce-iommu", >> - .attrs = spapr_tce_iommu_attrs, >> -}; >> - >> -static const struct attribute_group *spapr_tce_iommu_groups[] = { >> - &spapr_tce_iommu_group, >> - NULL, >> -}; >> - >> void ppc_iommu_register_device(struct pci_controller *phb) >> { >> iommu_device_sysfs_add(&phb->iommu, phb->parent, >> - spapr_tce_iommu_groups, "iommu-phb%04x", >> + phb->iommu_groups, "iommu-phb%04x", >> phb->global_number); 
>> iommu_device_register(&phb->iommu, &spapr_tce_iommu_ops, >> phb->parent); >> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c >> b/arch/powerpc/platforms/powernv/pci-ioda.c >> index 1c78fdfb7b03..0887f154955e 100644 >> --- a/arch/powerpc/platforms/powernv/pci-ioda.c >> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c >> @@ -2493,6 +2493,20 @@ static const struct pci_controller_ops >> pnv_npu_ocapi_ioda_controller_ops = { >> .shutdown = pnv_pci_ioda_shutdown, >> }; >> +static struct attribute *pnv_tce_iommu_attrs[] = { >> + NULL, >> +}; >> + >> +static struct attribute_group pnv_tce_iommu_group = { >> + .name = "spapr-tce-iommu", >> + .attrs = pnv_tce_iommu_attrs, >> +}; >> + >> +static const struct attribute_group *pnv_tce_iommu_groups[] = { >> + &pnv_tce_iommu_group, >> + NULL, >> +}; >> + >> static void __init pnv_pci_init_ioda_phb(struct device_node *np, >> u64 hub_id, int ioda_type) >> { >> @@ -2697,6 +2711,8 @@ static void __init pnv_pci_init_ioda_phb(struct >> device_node *np, >> hose->controller_ops = pnv_pci_ioda_controller_ops; >> } >> + hose->iommu_groups = pnv_tce_iommu_groups; >> + >> ppc_md.pcibios_default_alignment = pnv_pci_default_alignment; >> #ifdef CONFIG_PCI_IOV >> diff --git a/arch/powerpc/platforms/pseries/iommu.c >> b/arch/powerpc/platforms/pseries/iommu.c >> index 5497b130e026..28be7a45761d 100644 >> --- a/arch/powerpc/platforms/pseries/iommu.c >> +++ b/arch/powerpc/platforms/pseries/iommu.c >> @@ -56,6 +56,20 @@ enum { >> DDW_EXT_LIMITED_ADDR_MODE = 3 >> }; >> +/* used by sysfs when querying Dynamic/Default DMA Window data */ >> +struct dma_win_data { >> + u32 page_size; >> + u64 direct_address; >> + u64 direct_size; >> + u64 dynamic_address; >> + u64 dynamic_size; >> + u32 dynamic_pages_mapped; >> + char window_type[15]; >> +}; >> + >> +#define SPAPR_SUCCESS 0 >> +#define SPAPR_ERROR -1 >> + >> static struct iommu_table *iommu_pseries_alloc_table(int node) >> { >> struct iommu_table *tbl; >> @@ -837,6 +851,253 @@ static struct 
device_node *pci_dma_find(struct >> device_node *dn, >> return rdn; >> } >> +/* Get DDW information for the device */ >> +static int gather_ddw_info(struct device *dev, struct dma_win_data >> *data) >> +{ >> + struct iommu_device *iommu; >> + struct pci_controller *phb; >> + struct device_node *dn; >> + struct pci_dn *pci; >> + const __be32 *prop = NULL; >> + bool ddw_direct = false; >> + bool found = false; >> + struct iommu_table *tbl; >> + u32 pgshift; >> + struct dynamic_dma_window_prop *p; >> + >> + memset(data, 0, sizeof(*data)); >> + >> + iommu = dev_get_drvdata(dev); >> + phb = container_of(iommu, struct pci_controller, iommu); >> + dn = phb->dn; >> + >> + if (!dn) >> + return SPAPR_ERROR; >> + >> + pci = PCI_DN(dn); >> + if (!pci || !pci->table_group) >> + return SPAPR_ERROR; >> + >

Here is the sequence of events when a PHB is registered and its IOMMU device is created:

1. The PHB device_node is created.
2. The IOMMU device is created with the default DMA window. All the DMA tables hang off the PHB device_node.
3. The IOMMU device is registered and its sysfs files/attributes are created. This is where the patch creates its attributes as well.

When we DLPAR remove a PHB, the sequence of events is:

1. The sysfs entries for the IOMMU device of the PHB are deleted.
2. The device_node of the PHB is deleted.

So, while *_show() is executing, it is holding the kobject of the sysfs attribute. If a DLPAR remove of the PHB is issued from another thread, the DLPAR thread gets blocked while removing the sysfs attribute:

device_del() --> device_remove_attrs()

As such, we are guaranteed that until the _show() interface has completed, the whole infrastructure is intact - namely, the PHB device_node and the DMA table_group.

I have tested this by putting the _show() interface in a long sleep and executing a DLPAR remove of the PHB from another terminal.

> Should we also hold a dn ref with of_node_get(dn) before proceeding
> with of_get_property calls ?

Not needed, as explained above.
> >> + /* Find DDW */ >> + prop = of_get_property(dn, DIRECT64_PROPNAME, NULL); >> + if (prop) { >> + ddw_direct = true; >> + found = true; >> + } else { >> + prop = of_get_property(dn, DMA64_PROPNAME, NULL); >> + if (prop) >> + found = true; >> + } >> + >> + /* NO DDW */ >> + if (!found) > > .. then release dn ref here if not found .. not needed > >> + return SPAPR_ERROR; >> + >> + p = (struct dynamic_dma_window_prop *)prop; >> + >> + pgshift = be32_to_cpu(p->tce_shift); >> + if (pgshift != 0xc && pgshift != 0x10 && pgshift != 0x15) > > Can we have macros for 0xc, 0x10 and 0x15 respectively ? > >> + data->page_size = 0; >> + else >> + data->page_size = 1 << pgshift; >> + >> + /* Check if DDW has table associated with it. Having a table >> associated with >> + * DDW is indicative that is has some dynamic TCE allocations. >> In this case the >> + * DDW can be fully Dynamic or in Hybrid mode. For SR-IOV DDW is >> on index 0, >> + * for dedicated adapter on index 1. >> + */ >> + found = false; >> + for (int i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) { >> + tbl = pci->table_group->tables[i]; > > Can another thread do a kfree(table_group) via > iommu_pseries_free_group() during hotplug remove before we reach here? not possible, as explained above. This will get called only when the PHB device_node is deleted. > >> + >> + if (tbl && tbl->it_index == be32_to_cpu(p->liobn)) { >> + found = true; >> + break; >> + } >> + } > > Is it possible that another thread changes bitmap before we reach > bitmap_weight below ? If table is found, we may want to safely access > its bitamp (consider using tbl->largepool.lock?). yes, other thread can change the bitmap before we reach here. But, the DMA attributes are exported via sysfs as a way to get a peek at the DMA window properties at that moment. The bitmap doesn't have to be 100% accurate. This just indicates, at that moment, how many TCEs are mapped. 
> >> + >> + /* set the parameters depnding on the DDW type */ > > s/depnding/depending ? > >> + if (ddw_direct && found) { /* Hybrid */ >> + data->direct_address = be64_to_cpu(p->dma_base); >> + data->dynamic_size = (u64)(tbl->it_size << tbl->it_page_shift); >> + >> + data->dynamic_address = data->direct_address >> + + (u64)(1UL << >> be32_to_cpu(p->window_shift)) >> + - data->dynamic_size; >> + >> + data->direct_size = data->dynamic_address - >> data->direct_address; >> + data->dynamic_pages_mapped = bitmap_weight(tbl->it_map, >> tbl->it_size); >> + >> + sprintf(data->window_type, "%s", "Hybrid"); > > Preferably use snprintf for safety. I see two more instances below. > >> + } else if (ddw_direct && !found) { /* Direct */ >> + data->direct_address = be64_to_cpu(p->dma_base); >> + data->direct_size = (u64)(1UL << be32_to_cpu(p->window_shift)); >> + >> + sprintf(data->window_type, "%s", "Direct"); >> + } else { /* Dynamic */ >> + data->dynamic_address = be64_to_cpu(p->dma_base); >> + data->dynamic_size = (u64)(1UL << >> be32_to_cpu(p->window_shift)); >> + data->dynamic_pages_mapped = bitmap_weight(tbl->it_map, >> tbl->it_size); >> + >> + sprintf(data->window_type, "%s", "Dynamic"); >> + } >> + > > .. release dn ref with of_node_put() before returning. not needed as explained above. > > Similarly applicable for gather_dma_info() also. 
>
>> +	return SPAPR_SUCCESS;
>> +}
>> +
>> +/* Get DDW information for the device */
>> +static int gather_dma_info(struct device *dev, struct dma_win_data
>> *data)
>> +{
>> +	struct iommu_device *iommu;
>> +	struct pci_controller *phb;
>> +	struct device_node *dn;
>> +	struct pci_dn *pci;
>> +	const __be32 *prop = NULL;
>> +	struct iommu_table *tbl;
>> +	unsigned long offset, size, liobn;
>> +
>> +	memset(data, 0, sizeof(*data));
>> +
>> +	iommu = dev_get_drvdata(dev);
>> +	phb = container_of(iommu, struct pci_controller, iommu);
>> +	dn = phb->dn;
>> +
>> +	if (!dn)
>> +		return SPAPR_ERROR;
>> +
>> +	pci = PCI_DN(dn);
>> +	if (!pci || !pci->table_group)
>> +		return SPAPR_ERROR;
>> +
>> +	/* search for default DMA window */
>> +	prop = of_get_property(dn, "ibm,dma-window", NULL);
>> +
>> +	if (!prop)
>> +		return SPAPR_ERROR;
>> +
>> +	/* default DMA Window is always at index 0 */
>> +	tbl = pci->table_group->tables[0];
>> +	if (!tbl)
>> +		return SPAPR_ERROR;
>> +
>> +	of_parse_dma_window(dn, prop, &liobn, &offset, &size);
>> +
>> +	data->dynamic_address = offset;
>> +	data->dynamic_size = size;
>> +	data->page_size = 1ULL << IOMMU_PAGE_SHIFT_4K;
>> +	data->dynamic_pages_mapped = bitmap_weight(tbl->it_map,
>> tbl->it_size);
>> +
>> +	return SPAPR_SUCCESS;
>> +}
>> +
>> +#define DEVICE_SHOW_DDW(_name, _fmt) \
>> +ssize_t ddw_##_name##_show(struct device *dev, \
>> +		struct device_attribute *attr,\
>> +		char *buf) \
>> +{ \
>> +	int rc = 0; \
>> +	struct dma_win_data data; \
>> +	\
>> +	rc = gather_ddw_info(dev, &data); \
>> +	\
>> +	if (rc == SPAPR_SUCCESS) \
>> +		return sysfs_emit(buf, _fmt, data._name); \
>> +	else \
>> +		return -ENODATA; \
>> +} \
>> +
>> +#define DEVICE_SHOW_DMA(_name, _fmt) \
>> +ssize_t dma_##_name##_show(struct device *dev, \
>> +		struct device_attribute *attr,\
>> +		char *buf) \
>> +{ \
>> +	int rc = 0; \
>> +	struct dma_win_data data; \
>> +	\
>> +	rc = gather_dma_info(dev, &data); \
>> +	\
>> +	if (rc == SPAPR_SUCCESS) \
>> +		return sysfs_emit(buf, _fmt, data._name); \
>> +	else \
>> +		return -ENODATA; \
>> +} \
>> +
>> +static DEVICE_SHOW_DDW(direct_address, "%#llx\n");
>> +static DEVICE_SHOW_DDW(direct_size, "%lld\n");
>> +static DEVICE_SHOW_DDW(page_size, "%d\n");
>> +static DEVICE_SHOW_DDW(window_type, "%s\n");
>> +static DEVICE_SHOW_DDW(dynamic_address, "%#llx\n");
>> +static DEVICE_SHOW_DDW(dynamic_size, "%lld\n");
>> +static DEVICE_SHOW_DDW(dynamic_pages_mapped, "%d\n");
>> +static DEVICE_SHOW_DMA(dynamic_address, "%#llx\n");
>> +static DEVICE_SHOW_DMA(dynamic_size, "%lld\n");
>> +static DEVICE_SHOW_DMA(page_size, "%d\n");
>> +static DEVICE_SHOW_DMA(dynamic_pages_mapped, "%d\n");
>> +
>> +#define DEVICE_ATTR_DDW(_name) \
>> +	struct device_attribute dev_attr_ddw_##_name = \
>> +		__ATTR(_name, 0444, ddw_##_name##_show, NULL)
>> +#define DEVICE_ATTR_DMA(_name) \
>> +	struct device_attribute dev_attr_dma_##_name = \
>> +		__ATTR(_name, 0444, dma_##_name##_show, NULL)
>> +
>> +static DEVICE_ATTR_DDW(direct_address);
>> +static DEVICE_ATTR_DDW(direct_size);
>> +static DEVICE_ATTR_DDW(page_size);
>> +static DEVICE_ATTR_DDW(window_type);
>> +static DEVICE_ATTR_DDW(dynamic_address);
>> +static DEVICE_ATTR_DDW(dynamic_size);
>> +static DEVICE_ATTR_DDW(dynamic_pages_mapped);
>> +static DEVICE_ATTR_DMA(dynamic_address);
>> +static DEVICE_ATTR_DMA(dynamic_size);
>> +static DEVICE_ATTR_DMA(page_size);
>> +static DEVICE_ATTR_DMA(dynamic_pages_mapped);
>> +
>> +static struct attribute *spapr_tce_ddw_attrs[] = {
>> +	&dev_attr_ddw_direct_address.attr,
>> +	&dev_attr_ddw_direct_size.attr,
>> +	&dev_attr_ddw_page_size.attr,
>> +	&dev_attr_ddw_window_type.attr,
>> +	&dev_attr_ddw_dynamic_address.attr,
>> +	&dev_attr_ddw_dynamic_size.attr,
>> +	&dev_attr_ddw_dynamic_pages_mapped.attr,
>> +	NULL,
>> +};
>> +
>> +static struct attribute *spapr_tce_dma_attrs[] = {
>> +	&dev_attr_dma_dynamic_address.attr,
>> +	&dev_attr_dma_dynamic_size.attr,
>> +	&dev_attr_dma_page_size.attr,
>> +	&dev_attr_dma_dynamic_pages_mapped.attr,
>> +	NULL,
>> +};
>> +
>> +static struct attribute_group spapr_tce_ddw_group = {
>> +	.name = "spapr-tce-ddw",
>> +	.attrs = spapr_tce_ddw_attrs,
>> +};
>> +
>> +static struct attribute_group spapr_tce_dma_group = {
>> +	.name = "spapr-tce-dma",
>> +	.attrs = spapr_tce_dma_attrs,
>> +};
>> +
>> +static struct attribute *spapr_tce_iommu_attrs[] = {
>> +	NULL,
>> +};
>> +
>> +static struct attribute_group spapr_tce_iommu_group = {
>> +	.name = "spapr-tce-iommu",
>> +	.attrs = spapr_tce_iommu_attrs,
>> +};
>> +
>> +const struct attribute_group *spapr_tce_iommu_groups[] = {
>> +	&spapr_tce_iommu_group,
>> +	&spapr_tce_ddw_group,
>> +	&spapr_tce_dma_group,
>> +	NULL,
>> +};
>> +
>>  static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
>>  {
>>  	struct iommu_table *tbl;
>> diff --git a/arch/powerpc/platforms/pseries/pci_dlpar.c
>> b/arch/powerpc/platforms/pseries/pci_dlpar.c
>> index 8c77ec7980de..b457451a2814 100644
>> --- a/arch/powerpc/platforms/pseries/pci_dlpar.c
>> +++ b/arch/powerpc/platforms/pseries/pci_dlpar.c
>> @@ -45,6 +45,8 @@ struct pci_controller *init_phb_dynamic(struct
>> device_node *dn)
>>  	pci_process_bridge_OF_ranges(phb, dn, 0);
>>  	phb->controller_ops = pseries_pci_controller_ops;
>> +	phb->iommu_groups = spapr_tce_iommu_groups;
>> +
>>  	pci_devs_phb_init_dynamic(phb);
>>  	pseries_msi_allocate_domains(phb);
>> diff --git a/arch/powerpc/platforms/pseries/pseries.h
>> b/arch/powerpc/platforms/pseries/pseries.h
>> index 3968a6970fa8..4cf0b7a4e96a 100644
>> --- a/arch/powerpc/platforms/pseries/pseries.h
>> +++ b/arch/powerpc/platforms/pseries/pseries.h
>> @@ -128,4 +128,5 @@ struct iommu_group
>> *pSeries_pci_device_group(struct pci_controller *hose,
>>  					     struct pci_dev *pdev);
>>  #endif
>> +extern const struct attribute_group *spapr_tce_iommu_groups[];
>>  #endif /* _PSERIES_PSERIES_H */
>> diff --git a/arch/powerpc/platforms/pseries/setup.c
>> b/arch/powerpc/platforms/pseries/setup.c
>> index 50b26ed8432d..4d877aae0560 100644
>> --- a/arch/powerpc/platforms/pseries/setup.c
>> +++ b/arch/powerpc/platforms/pseries/setup.c
>> @@ -512,6 +512,8 @@ static void __init pSeries_discover_phbs(void)
>>  		isa_bridge_find_early(phb);
>>  		phb->controller_ops = pseries_pci_controller_ops;
>> +		phb->iommu_groups = spapr_tce_iommu_groups;
>> +
>>  		/* create pci_dn's for DT nodes under this PHB */
>>  		pci_devs_phb_init_dynamic(phb);
>> base-commit: 192c0159402e6bfbe13de6f8379546943297783d
>
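For reference, this is roughly how the -ENODATA return from the show macros above would look to a user-space consumer: read() on the attribute fails and errno carries the code, so nothing has to be parsed to detect a missing window. The sketch below is only an illustration under that assumption. The helper name read_window_attr() is hypothetical and not part of the patch, and it assumes a single read() returns the whole attribute, which is the usual case for sysfs files.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): read one sysfs attribute
 * into buf, NUL-terminate it and strip a trailing newline.
 *
 * Returns the number of bytes stored, or a negative errno value. With the
 * show functions quoted above, a missing window would surface as -ENODATA.
 */
ssize_t read_window_attr(const char *path, char *buf, size_t len)
{
	ssize_t n;
	int fd;

	assert(path && buf);
	if (len == 0)
		return -EINVAL;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	/* Assumption: the attribute fits in one read(). */
	n = read(fd, buf, len - 1);
	if (n < 0) {
		int err = errno;	/* save before close() can overwrite it */

		close(fd);
		return -err;
	}
	close(fd);

	if (n > 0 && buf[n - 1] == '\n')
		n--;
	buf[n] = '\0';
	return n;
}
```

A caller would pass a path such as /sys/devices/virtual/iommu/iommu-phb0001/spapr-tce-ddw/window_type (the example PHB name from the documentation in this patch) and treat a return of -ENODATA as "no such window on this PHB".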
diff --git a/Documentation/ABI/testing/sysfs-devices-virtual-iommu-dma_window_attrs b/Documentation/ABI/testing/sysfs-devices-virtual-iommu-dma_window_attrs
new file mode 100644
index 000000000000..18ba63874276
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-devices-virtual-iommu-dma_window_attrs
@@ -0,0 +1,21 @@
+What:		/sys/devices/virtual/iommu/<iommu-isolation>/spapr-tce-ddw/*
+Date:		Oct 2025
+Contact:	linuxppc-dev@lists.ozlabs.org
+Description:	read only
+		For each IOMMU isolation unit spapr-tce-ddw sub-directory provides
+		attributes to query information related to the bigger Dynamic DMA
+		window (DDW) in the PowerPC virtualized platforms.
+
+		See Documentation/arch/powerpc/dma_window_attributes.rst for more
+		information.
+
+What:		/sys/devices/virtual/iommu/<iommu-isolation>/spapr-tce-dma/*
+Date:		Oct 2025
+Contact:	linuxppc-dev@lists.ozlabs.org
+Description:	read only
+		For each IOMMU isolation unit spapr-tce-dma sub-directory provides
+		attributes to query information related to the default 2GB DMA
+		window in the PowerPC virtualized platforms.
+
+		See Documentation/arch/powerpc/dma_window_attributes.rst for more
+		information.
diff --git a/Documentation/arch/powerpc/dma_window_attributes.rst b/Documentation/arch/powerpc/dma_window_attributes.rst
new file mode 100644
index 000000000000..8bd9aec8539d
--- /dev/null
+++ b/Documentation/arch/powerpc/dma_window_attributes.rst
@@ -0,0 +1,65 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+DMA Window Attributes
+=====================
+
+In PowerPC architecture there are 2 types of DMA windows -
+
+1. Default 2GB DMA window which is backed by 4K page size
+2. A bigger Dynamic DMA Window (DDW) which is backed by larger page size
+   (64K or 2MB)
+
+A dedicated device will have both the DMA windows instantiated but an SR-IOV
+device will only have the bigger Dynamic DMA Window.
+
+The attributes of these 2 DMA windows are exported to user space via sysfs.
+Each IOMMU isolation unit will have its directory created under
+/sys/devices/virtual/iommu.
+
+As an example, iommu-phb0001
+
+Under each IOMMU isolation unit, there will be a group of attributes for
+"Default 2GB DMA Window" and "Dynamic DMA Window" - spapr-tce-dma and
+spapr-tce-ddw respectively.
+
+Attributes under each group
+
+spapr-tce-ddw:
+direct_address  dynamic_address  dynamic_size  window_type
+direct_size  dynamic_pages_mapped  page_size
+
+spapr-tce-dma:
+dynamic_address  dynamic_pages_mapped  dynamic_size  page_size
+
+
+The bigger Dynamic DMA Window is configured into pre-mapped and/or dynamically
+allocated TCEs. If the DDW is in "Hybrid" mode, then both the Direct
+(pre-mapped) and Dynamic part of the DMA window will have valid values. Hybrid
+mode is valid only for SR-IOV devices.
+
+DMA Window properties:
+
+direct_address          Starting address of the pre-mapped DMA window
+direct_size             Size of the pre-mapped DMA Window
+dynamic_address         Starting address of the dynamic allocations
+dynamic_size            Size of the dynamic allocation window
+dynamic_pages_mapped    Pages mapped for DMA by dynamic allocations
+page_size               Page size backing the DMA window
+window_type             Type of the DMA Window (Direct/Dynamic/Hybrid)
+
+
+An example of DDW attributes for an SR-IOV device::
+
+    $ cd /sys/devices/virtual/iommu/iommu-phb0001/spapr-tce-ddw
+
+    $ grep . *
+
+    direct_address:0x800000000000000    <-- Starting addr of pre-mapped Window
+    direct_size:137438953472            <-- Size of pre-mapped Window (128GB)
+    dynamic_address:0x800002000000000   <-- Starting addr of Dynamic allocations
+    dynamic_size:412316860416           <-- Size of dynamic allocation window (384GB)
+    dynamic_pages_mapped:270            <-- Pages mapped by dynamic allocations
+    page_size:2097152                   <-- DMA window page size (2MB)
+    window_type:Hybrid                  <-- window has both pre-mapped and
+                                            dynamic sections
diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 1dae53130782..9b09178aca5e 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -124,6 +124,10 @@ struct pci_controller {
 	resource_size_t dma_window_base_cur;
 	resource_size_t dma_window_size;
 
+#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
+	const struct attribute_group **iommu_groups;
+#endif
+
 #ifdef CONFIG_PPC64
 	unsigned long buid;
 	struct pci_dn *pci_data;
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 0ce71310b7d9..d6242e3f77da 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1269,24 +1269,10 @@ static const struct iommu_ops spapr_tce_iommu_ops = {
 	.device_group = spapr_tce_iommu_device_group,
 };
 
-static struct attribute *spapr_tce_iommu_attrs[] = {
-	NULL,
-};
-
-static struct attribute_group spapr_tce_iommu_group = {
-	.name = "spapr-tce-iommu",
-	.attrs = spapr_tce_iommu_attrs,
-};
-
-static const struct attribute_group *spapr_tce_iommu_groups[] = {
-	&spapr_tce_iommu_group,
-	NULL,
-};
-
 void ppc_iommu_register_device(struct pci_controller *phb)
 {
 	iommu_device_sysfs_add(&phb->iommu, phb->parent,
-				spapr_tce_iommu_groups, "iommu-phb%04x",
+				phb->iommu_groups, "iommu-phb%04x",
 				phb->global_number);
 	iommu_device_register(&phb->iommu, &spapr_tce_iommu_ops,
 				phb->parent);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 1c78fdfb7b03..0887f154955e 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2493,6 +2493,20 @@ static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = {
 	.shutdown = pnv_pci_ioda_shutdown,
 };
 
+static struct attribute *pnv_tce_iommu_attrs[] = {
+	NULL,
+};
+
+static struct attribute_group pnv_tce_iommu_group = {
+	.name = "spapr-tce-iommu",
+	.attrs = pnv_tce_iommu_attrs,
+};
+
+static const struct attribute_group *pnv_tce_iommu_groups[] = {
+	&pnv_tce_iommu_group,
+	NULL,
+};
+
 static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 					 u64 hub_id, int ioda_type)
 {
@@ -2697,6 +2711,8 @@ static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 		hose->controller_ops = pnv_pci_ioda_controller_ops;
 	}
 
+	hose->iommu_groups = pnv_tce_iommu_groups;
+
 	ppc_md.pcibios_default_alignment = pnv_pci_default_alignment;
 
 #ifdef CONFIG_PCI_IOV
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 5497b130e026..28be7a45761d 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -56,6 +56,20 @@ enum {
 	DDW_EXT_LIMITED_ADDR_MODE = 3
 };
 
+/* used by sysfs when querying Dynamic/Default DMA Window data */
+struct dma_win_data {
+	u32 page_size;
+	u64 direct_address;
+	u64 direct_size;
+	u64 dynamic_address;
+	u64 dynamic_size;
+	u32 dynamic_pages_mapped;
+	char window_type[15];
+};
+
+#define SPAPR_SUCCESS 0
+#define SPAPR_ERROR -1
+
 static struct iommu_table *iommu_pseries_alloc_table(int node)
 {
 	struct iommu_table *tbl;
@@ -837,6 +851,253 @@ static struct device_node *pci_dma_find(struct device_node *dn,
 	return rdn;
 }
 
+/* Get DDW information for the device */
+static int gather_ddw_info(struct device *dev, struct dma_win_data *data)
+{
+	struct iommu_device *iommu;
+	struct pci_controller *phb;
+	struct device_node *dn;
+	struct pci_dn *pci;
+	const __be32 *prop = NULL;
+	bool ddw_direct = false;
+	bool found = false;
+	struct iommu_table *tbl;
+	u32 pgshift;
+	struct dynamic_dma_window_prop *p;
+
+	memset(data, 0, sizeof(*data));
+
+	iommu = dev_get_drvdata(dev);
+	phb = container_of(iommu, struct pci_controller, iommu);
+	dn = phb->dn;
+
+	if (!dn)
+		return SPAPR_ERROR;
+
+	pci = PCI_DN(dn);
+	if (!pci || !pci->table_group)
+		return SPAPR_ERROR;
+
+	/* Find DDW */
+	prop = of_get_property(dn, DIRECT64_PROPNAME, NULL);
+	if (prop) {
+		ddw_direct = true;
+		found = true;
+	} else {
+		prop = of_get_property(dn, DMA64_PROPNAME, NULL);
+		if (prop)
+			found = true;
+	}
+
+	/* NO DDW */
+	if (!found)
+		return SPAPR_ERROR;
+
+	p = (struct dynamic_dma_window_prop *)prop;
+
+	pgshift = be32_to_cpu(p->tce_shift);
+	if (pgshift != 0xc && pgshift != 0x10 && pgshift != 0x15)
+		data->page_size = 0;
+	else
+		data->page_size = 1 << pgshift;
+
+	/* Check if DDW has table associated with it. Having a table associated with
+	 * DDW is indicative that it has some dynamic TCE allocations. In this case the
+	 * DDW can be fully Dynamic or in Hybrid mode. For SR-IOV DDW is on index 0,
+	 * for dedicated adapter on index 1.
+	 */
+	found = false;
+	for (int i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
+		tbl = pci->table_group->tables[i];
+
+		if (tbl && tbl->it_index == be32_to_cpu(p->liobn)) {
+			found = true;
+			break;
+		}
+	}
+
+	/* set the parameters depending on the DDW type */
+	if (ddw_direct && found) { /* Hybrid */
+		data->direct_address = be64_to_cpu(p->dma_base);
+		data->dynamic_size = (u64)(tbl->it_size << tbl->it_page_shift);
+
+		data->dynamic_address = data->direct_address +
+				(u64)(1UL << be32_to_cpu(p->window_shift))
+				- data->dynamic_size;
+
+		data->direct_size = data->dynamic_address - data->direct_address;
+		data->dynamic_pages_mapped = bitmap_weight(tbl->it_map, tbl->it_size);
+
+		sprintf(data->window_type, "%s", "Hybrid");
+	} else if (ddw_direct && !found) { /* Direct */
+		data->direct_address = be64_to_cpu(p->dma_base);
+		data->direct_size = (u64)(1UL << be32_to_cpu(p->window_shift));
+
+		sprintf(data->window_type, "%s", "Direct");
+	} else { /* Dynamic */
+		data->dynamic_address = be64_to_cpu(p->dma_base);
+		data->dynamic_size = (u64)(1UL << be32_to_cpu(p->window_shift));
+		data->dynamic_pages_mapped = bitmap_weight(tbl->it_map, tbl->it_size);
+
+		sprintf(data->window_type, "%s", "Dynamic");
+	}
+
+	return SPAPR_SUCCESS;
+}
+
+/* Get DDW information for the device */
+static int gather_dma_info(struct device *dev, struct dma_win_data *data)
+{
+	struct iommu_device *iommu;
+	struct pci_controller *phb;
+	struct device_node *dn;
+	struct pci_dn *pci;
+	const __be32 *prop = NULL;
+	struct iommu_table *tbl;
+	unsigned long offset, size, liobn;
+
+	memset(data, 0, sizeof(*data));
+
+	iommu = dev_get_drvdata(dev);
+	phb = container_of(iommu, struct pci_controller, iommu);
+	dn = phb->dn;
+
+	if (!dn)
+		return SPAPR_ERROR;
+
+	pci = PCI_DN(dn);
+	if (!pci || !pci->table_group)
+		return SPAPR_ERROR;
+
+	/* search for default DMA window */
+	prop = of_get_property(dn, "ibm,dma-window", NULL);
+
+	if (!prop)
+		return SPAPR_ERROR;
+
+	/* default DMA Window is always at index 0 */
+	tbl = pci->table_group->tables[0];
+	if (!tbl)
+		return SPAPR_ERROR;
+
+	of_parse_dma_window(dn, prop, &liobn, &offset, &size);
+
+	data->dynamic_address = offset;
+	data->dynamic_size = size;
+	data->page_size = 1ULL << IOMMU_PAGE_SHIFT_4K;
+	data->dynamic_pages_mapped = bitmap_weight(tbl->it_map, tbl->it_size);
+
+	return SPAPR_SUCCESS;
+}
+
+#define DEVICE_SHOW_DDW(_name, _fmt) \
+ssize_t ddw_##_name##_show(struct device *dev, \
+		struct device_attribute *attr,\
+		char *buf) \
+{ \
+	int rc = 0; \
+	struct dma_win_data data; \
+	\
+	rc = gather_ddw_info(dev, &data); \
+	\
+	if (rc == SPAPR_SUCCESS) \
+		return sysfs_emit(buf, _fmt, data._name); \
+	else \
+		return -ENODATA; \
+} \
+
+#define DEVICE_SHOW_DMA(_name, _fmt) \
+ssize_t dma_##_name##_show(struct device *dev, \
+		struct device_attribute *attr,\
+		char *buf) \
+{ \
+	int rc = 0; \
+	struct dma_win_data data; \
+	\
+	rc = gather_dma_info(dev, &data); \
+	\
+	if (rc == SPAPR_SUCCESS) \
+		return sysfs_emit(buf, _fmt, data._name); \
+	else \
+		return -ENODATA; \
+} \
+
+static DEVICE_SHOW_DDW(direct_address, "%#llx\n");
+static DEVICE_SHOW_DDW(direct_size, "%lld\n");
+static DEVICE_SHOW_DDW(page_size, "%d\n");
+static DEVICE_SHOW_DDW(window_type, "%s\n");
+static DEVICE_SHOW_DDW(dynamic_address, "%#llx\n");
+static DEVICE_SHOW_DDW(dynamic_size, "%lld\n");
+static DEVICE_SHOW_DDW(dynamic_pages_mapped, "%d\n");
+static DEVICE_SHOW_DMA(dynamic_address, "%#llx\n");
+static DEVICE_SHOW_DMA(dynamic_size, "%lld\n");
+static DEVICE_SHOW_DMA(page_size, "%d\n");
+static DEVICE_SHOW_DMA(dynamic_pages_mapped, "%d\n");
+
+#define DEVICE_ATTR_DDW(_name) \
+	struct device_attribute dev_attr_ddw_##_name = \
+		__ATTR(_name, 0444, ddw_##_name##_show, NULL)
+#define DEVICE_ATTR_DMA(_name) \
+	struct device_attribute dev_attr_dma_##_name = \
+		__ATTR(_name, 0444, dma_##_name##_show, NULL)
+
+static DEVICE_ATTR_DDW(direct_address);
+static DEVICE_ATTR_DDW(direct_size);
+static DEVICE_ATTR_DDW(page_size);
+static DEVICE_ATTR_DDW(window_type);
+static DEVICE_ATTR_DDW(dynamic_address);
+static DEVICE_ATTR_DDW(dynamic_size);
+static DEVICE_ATTR_DDW(dynamic_pages_mapped);
+static DEVICE_ATTR_DMA(dynamic_address);
+static DEVICE_ATTR_DMA(dynamic_size);
+static DEVICE_ATTR_DMA(page_size);
+static DEVICE_ATTR_DMA(dynamic_pages_mapped);
+
+static struct attribute *spapr_tce_ddw_attrs[] = {
+	&dev_attr_ddw_direct_address.attr,
+	&dev_attr_ddw_direct_size.attr,
+	&dev_attr_ddw_page_size.attr,
+	&dev_attr_ddw_window_type.attr,
+	&dev_attr_ddw_dynamic_address.attr,
+	&dev_attr_ddw_dynamic_size.attr,
+	&dev_attr_ddw_dynamic_pages_mapped.attr,
+	NULL,
+};
+
+static struct attribute *spapr_tce_dma_attrs[] = {
+	&dev_attr_dma_dynamic_address.attr,
+	&dev_attr_dma_dynamic_size.attr,
+	&dev_attr_dma_page_size.attr,
+	&dev_attr_dma_dynamic_pages_mapped.attr,
+	NULL,
+};
+
+static struct attribute_group spapr_tce_ddw_group = {
+	.name = "spapr-tce-ddw",
+	.attrs = spapr_tce_ddw_attrs,
+};
+
+static struct attribute_group spapr_tce_dma_group = {
+	.name = "spapr-tce-dma",
+	.attrs = spapr_tce_dma_attrs,
+};
+
+static struct attribute *spapr_tce_iommu_attrs[] = {
+	NULL,
+};
+
+static struct attribute_group spapr_tce_iommu_group = {
+	.name = "spapr-tce-iommu",
+	.attrs = spapr_tce_iommu_attrs,
+};
+
+const struct attribute_group *spapr_tce_iommu_groups[] = {
+	&spapr_tce_iommu_group,
+	&spapr_tce_ddw_group,
+	&spapr_tce_dma_group,
+	NULL,
+};
+
 static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
 {
 	struct iommu_table *tbl;
diff --git a/arch/powerpc/platforms/pseries/pci_dlpar.c b/arch/powerpc/platforms/pseries/pci_dlpar.c
index 8c77ec7980de..b457451a2814 100644
--- a/arch/powerpc/platforms/pseries/pci_dlpar.c
+++ b/arch/powerpc/platforms/pseries/pci_dlpar.c
@@ -45,6 +45,8 @@ struct pci_controller *init_phb_dynamic(struct device_node *dn)
 	pci_process_bridge_OF_ranges(phb, dn, 0);
 	phb->controller_ops = pseries_pci_controller_ops;
 
+	phb->iommu_groups = spapr_tce_iommu_groups;
+
 	pci_devs_phb_init_dynamic(phb);
 
 	pseries_msi_allocate_domains(phb);
diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
index 3968a6970fa8..4cf0b7a4e96a 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -128,4 +128,5 @@ struct iommu_group *pSeries_pci_device_group(struct pci_controller *hose,
 					     struct pci_dev *pdev);
 #endif
 
+extern const struct attribute_group *spapr_tce_iommu_groups[];
 #endif /* _PSERIES_PSERIES_H */
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index 50b26ed8432d..4d877aae0560 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -512,6 +512,8 @@ static void __init pSeries_discover_phbs(void)
 		isa_bridge_find_early(phb);
 		phb->controller_ops = pseries_pci_controller_ops;
 
+		phb->iommu_groups = spapr_tce_iommu_groups;
+
 		/* create pci_dn's for DT nodes under this PHB */
 		pci_devs_phb_init_dynamic(phb);
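
To make the Hybrid branch of gather_ddw_info() easier to check against the documentation example, the address arithmetic can be replayed in user space. The sketch below is not part of the patch: struct ddw_layout and hybrid_layout() are hypothetical names, and the inputs used in the note after it (window_shift of 39, 196608 TCE entries of 2MB) are assumptions inferred from the 128GB + 384GB split shown in dma_window_attributes.rst, not values stated in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the four address/size fields. */
struct ddw_layout {
	uint64_t direct_address;
	uint64_t direct_size;
	uint64_t dynamic_address;
	uint64_t dynamic_size;
};

/*
 * Replay the Hybrid-mode arithmetic: the pre-mapped part sits at the
 * bottom of the window and the dynamic part occupies the top.
 *
 * dma_base and window_shift stand in for the DDW property fields,
 * it_size and it_page_shift for the iommu_table fields.
 */
struct ddw_layout hybrid_layout(uint64_t dma_base, unsigned int window_shift,
				uint64_t it_size, unsigned int it_page_shift)
{
	struct ddw_layout l;
	uint64_t window_size;

	assert(window_shift < 64 && it_page_shift < 64);
	window_size = UINT64_C(1) << window_shift;

	l.direct_address = dma_base;
	l.dynamic_size = it_size << it_page_shift;
	l.dynamic_address = dma_base + window_size - l.dynamic_size;
	l.direct_size = l.dynamic_address - l.direct_address;
	return l;
}
```

Calling hybrid_layout(0x800000000000000, 39, 196608, 21) gives a dynamic_address of 0x800002000000000, a direct_size of 137438953472 (128GB) and a dynamic_size of 412316860416 (384GB), which agrees with the sample output in the documentation added by this patch.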