| Message ID | 20251120-dmabuf-vfio-v9-6-d7f71607f371@nvidia.com |
|---|---|
| State | New |
| Headers | show |
| Series | vfio/pci: Allow MMIO regions to be exported through dma-buf | expand |
On 11/20/25 10:28, Leon Romanovsky wrote: > From: Leon Romanovsky <leonro@nvidia.com> > > Add dma_buf_phys_vec_to_sgt() and dma_buf_free_sgt() helpers to convert > an array of MMIO physical address ranges into scatter-gather tables with > proper DMA mapping. > > These common functions are a starting point and support any PCI > drivers creating mappings from their BAR's MMIO addresses. VFIO is one > case, as shortly will be RDMA. We can review existing DRM drivers to > refactor them separately. We hope this will evolve into routines to > help common DRM that include mixed CPU and MMIO mappings. > > Compared to the dma_map_resource() abuse this implementation handles > the complicated PCI P2P scenarios properly, especially when an IOMMU > is enabled: > > - Direct bus address mapping without IOVA allocation for > PCI_P2PDMA_MAP_BUS_ADDR, using pci_p2pdma_bus_addr_map(). This > happens if the IOMMU is enabled but the PCIe switch ACS flags allow > transactions to avoid the host bridge. > > Further, this handles the slightly obscure, case of MMIO with a > phys_addr_t that is different from the physical BAR programming > (bus offset). The phys_addr_t is converted to a dma_addr_t and > accommodates this effect. This enables certain real systems to > work, especially on ARM platforms. > > - Mapping through host bridge with IOVA allocation and DMA_ATTR_MMIO > attribute for MMIO memory regions (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE). > This happens when the IOMMU is enabled and the ACS flags are forcing > all traffic to the IOMMU - ie for virtualization systems. > > - Cases where P2P is not supported through the host bridge/CPU. The > P2P subsystem is the proper place to detect this and block it. > > Helper functions fill_sg_entry() and calc_sg_nents() handle the > scatter-gather table construction, splitting large regions into > UINT_MAX-sized chunks to fit within sg->length field limits. > > Since the physical address based DMA API forbids use of the CPU list > of the scatterlist this will produce a mangled scatterlist that has > a fully zero-length and NULL'd CPU list. The list is 0 length, > all the struct page pointers are NULL and zero sized. This is stronger > and more robust than the existing mangle_sg_table() technique. It is > a future project to migrate DMABUF as a subsystem away from using > scatterlist for this data structure. > > Reviewed-by: Kevin Tian <kevin.tian@intel.com> > Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> > Tested-by: Alex Mastro <amastro@fb.com> > Tested-by: Nicolin Chen <nicolinc@nvidia.com> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Could be that this will backfire at some point, but I think we will never know without trying. Acked-by: Christian König <christian.koenig@amd.com> Regards, Christian. > --- > drivers/dma-buf/Makefile | 2 +- > drivers/dma-buf/dma-buf-mapping.c | 248 ++++++++++++++++++++++++++++++++++++++ > include/linux/dma-buf-mapping.h | 17 +++ > include/linux/dma-buf.h | 11 ++ > 4 files changed, 277 insertions(+), 1 deletion(-) > > diff --git a/drivers/dma-buf/Makefile b/drivers/dma-buf/Makefile > index 70ec901edf2c..2008fb7481b3 100644 > --- a/drivers/dma-buf/Makefile > +++ b/drivers/dma-buf/Makefile > @@ -1,6 +1,6 @@ > # SPDX-License-Identifier: GPL-2.0-only > obj-y := dma-buf.o dma-fence.o dma-fence-array.o dma-fence-chain.o \ > - dma-fence-unwrap.o dma-resv.o > + dma-fence-unwrap.o dma-resv.o dma-buf-mapping.o > obj-$(CONFIG_DMABUF_HEAPS) += dma-heap.o > obj-$(CONFIG_DMABUF_HEAPS) += heaps/ > obj-$(CONFIG_SYNC_FILE) += sync_file.o > diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c > new file mode 100644 > index 000000000000..de494bcac5e9 > --- /dev/null > +++ b/drivers/dma-buf/dma-buf-mapping.c > @@ -0,0 +1,248 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +/* > + * DMA BUF Mapping Helpers > + * > + */ > +#include <linux/dma-buf-mapping.h> > +#include <linux/dma-resv.h> > + > +static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length, > + dma_addr_t addr) > +{ > + unsigned int len, nents; > + int i; > + > + nents = DIV_ROUND_UP(length, UINT_MAX); > + for (i = 0; i < nents; i++) { > + len = min_t(size_t, length, UINT_MAX); > + length -= len; > + /* > + * DMABUF abuses scatterlist to create a scatterlist > + * that does not have any CPU list, only the DMA list. > + * Always set the page related values to NULL to ensure > + * importers can't use it. The phys_addr based DMA API > + * does not require the CPU list for mapping or unmapping. > + */ > + sg_set_page(sgl, NULL, 0, 0); > + sg_dma_address(sgl) = addr + i * UINT_MAX; > + sg_dma_len(sgl) = len; > + sgl = sg_next(sgl); > + } > + > + return sgl; > +} > + > +static unsigned int calc_sg_nents(struct dma_iova_state *state, > + struct dma_buf_phys_vec *phys_vec, > + size_t nr_ranges, size_t size) > +{ > + unsigned int nents = 0; > + size_t i; > + > + if (!state || !dma_use_iova(state)) { > + for (i = 0; i < nr_ranges; i++) > + nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX); > + } else { > + /* > + * In IOVA case, there is only one SG entry which spans > + * for whole IOVA address space, but we need to make sure > + * that it fits sg->length, maybe we need more. > + */ > + nents = DIV_ROUND_UP(size, UINT_MAX); > + } > + > + return nents; > +} > + > +/** > + * struct dma_buf_dma - holds DMA mapping information > + * @sgt: Scatter-gather table > + * @state: DMA IOVA state relevant in IOMMU-based DMA > + * @size: Total size of DMA transfer > + */ > +struct dma_buf_dma { > + struct sg_table sgt; > + struct dma_iova_state *state; > + size_t size; > +}; > + > +/** > + * dma_buf_phys_vec_to_sgt - Returns the scatterlist table of the attachment > + * from arrays of physical vectors. This funciton is intended for MMIO memory > + * only. > + * @attach: [in] attachment whose scatterlist is to be returned > + * @provider: [in] p2pdma provider > + * @phys_vec: [in] array of physical vectors > + * @nr_ranges: [in] number of entries in phys_vec array > + * @size: [in] total size of phys_vec > + * @dir: [in] direction of DMA transfer > + * > + * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR > + * on error. May return -EINTR if it is interrupted by a signal. > + * > + * On success, the DMA addresses and lengths in the returned scatterlist are > + * PAGE_SIZE aligned. > + * > + * A mapping must be unmapped by using dma_buf_free_sgt(). > + * > + * NOTE: This function is intended for exporters. If direct traffic routing is > + * mandatory exporter should call routing pci_p2pdma_map_type() before calling > + * this function. > + */ > +struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach, > + struct p2pdma_provider *provider, > + struct dma_buf_phys_vec *phys_vec, > + size_t nr_ranges, size_t size, > + enum dma_data_direction dir) > +{ > + unsigned int nents, mapped_len = 0; > + struct dma_buf_dma *dma; > + struct scatterlist *sgl; > + dma_addr_t addr; > + size_t i; > + int ret; > + > + dma_resv_assert_held(attach->dmabuf->resv); > + > + if (WARN_ON(!attach || !attach->dmabuf || !provider)) > + /* This function is supposed to work on MMIO memory only */ > + return ERR_PTR(-EINVAL); > + > + dma = kzalloc(sizeof(*dma), GFP_KERNEL); > + if (!dma) > + return ERR_PTR(-ENOMEM); > + > + switch (pci_p2pdma_map_type(provider, attach->dev)) { > + case PCI_P2PDMA_MAP_BUS_ADDR: > + /* > + * There is no need in IOVA at all for this flow. > + */ > + break; > + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: > + dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL); > + if (!dma->state) { > + ret = -ENOMEM; > + goto err_free_dma; > + } > + > + dma_iova_try_alloc(attach->dev, dma->state, 0, size); > + break; > + default: > + ret = -EINVAL; > + goto err_free_dma; > + } > + > + nents = calc_sg_nents(dma->state, phys_vec, nr_ranges, size); > + ret = sg_alloc_table(&dma->sgt, nents, GFP_KERNEL | __GFP_ZERO); > + if (ret) > + goto err_free_state; > + > + sgl = dma->sgt.sgl; > + > + for (i = 0; i < nr_ranges; i++) { > + if (!dma->state) { > + addr = pci_p2pdma_bus_addr_map(provider, > + phys_vec[i].paddr); > + } else if (dma_use_iova(dma->state)) { > + ret = dma_iova_link(attach->dev, dma->state, > + phys_vec[i].paddr, 0, > + phys_vec[i].len, dir, > + DMA_ATTR_MMIO); > + if (ret) > + goto err_unmap_dma; > + > + mapped_len += phys_vec[i].len; > + } else { > + addr = dma_map_phys(attach->dev, phys_vec[i].paddr, > + phys_vec[i].len, dir, > + DMA_ATTR_MMIO); > + ret = dma_mapping_error(attach->dev, addr); > + if (ret) > + goto err_unmap_dma; > + } > + > + if (!dma->state || !dma_use_iova(dma->state)) > + sgl = fill_sg_entry(sgl, phys_vec[i].len, addr); > + } > + > + if (dma->state && dma_use_iova(dma->state)) { > + WARN_ON_ONCE(mapped_len != size); > + ret = dma_iova_sync(attach->dev, dma->state, 0, mapped_len); > + if (ret) > + goto err_unmap_dma; > + > + sgl = fill_sg_entry(sgl, mapped_len, dma->state->addr); > + } > + > + dma->size = size; > + > + /* > + * No CPU list included — set orig_nents = 0 so others can detect > + * this via SG table (use nents only). > + */ > + dma->sgt.orig_nents = 0; > + > + > + /* > + * SGL must be NULL to indicate that SGL is the last one > + * and we allocated correct number of entries in sg_alloc_table() > + */ > + WARN_ON_ONCE(sgl); > + return &dma->sgt; > + > +err_unmap_dma: > + if (!i || !dma->state) { > + ; /* Do nothing */ > + } else if (dma_use_iova(dma->state)) { > + dma_iova_destroy(attach->dev, dma->state, mapped_len, dir, > + DMA_ATTR_MMIO); > + } else { > + for_each_sgtable_dma_sg(&dma->sgt, sgl, i) > + dma_unmap_phys(attach->dev, sg_dma_address(sgl), > + sg_dma_len(sgl), dir, DMA_ATTR_MMIO); > + } > + sg_free_table(&dma->sgt); > +err_free_state: > + kfree(dma->state); > +err_free_dma: > + kfree(dma); > + return ERR_PTR(ret); > +} > +EXPORT_SYMBOL_NS_GPL(dma_buf_phys_vec_to_sgt, "DMA_BUF"); > + > +/** > + * dma_buf_free_sgt- unmaps the buffer > + * @attach: [in] attachment to unmap buffer from > + * @sgt: [in] scatterlist info of the buffer to unmap > + * @direction: [in] direction of DMA transfer > + * > + * This unmaps a DMA mapping for @attached obtained > + * by dma_buf_phys_vec_to_sgt(). > + */ > +void dma_buf_free_sgt(struct dma_buf_attachment *attach, struct sg_table *sgt, > + enum dma_data_direction dir) > +{ > + struct dma_buf_dma *dma = container_of(sgt, struct dma_buf_dma, sgt); > + int i; > + > + dma_resv_assert_held(attach->dmabuf->resv); > + > + if (!dma->state) { > + ; /* Do nothing */ > + } else if (dma_use_iova(dma->state)) { > + dma_iova_destroy(attach->dev, dma->state, dma->size, dir, > + DMA_ATTR_MMIO); > + } else { > + struct scatterlist *sgl; > + > + for_each_sgtable_dma_sg(sgt, sgl, i) > + dma_unmap_phys(attach->dev, sg_dma_address(sgl), > + sg_dma_len(sgl), dir, DMA_ATTR_MMIO); > + } > + > + sg_free_table(sgt); > + kfree(dma->state); > + kfree(dma); > + > +} > +EXPORT_SYMBOL_NS_GPL(dma_buf_free_sgt, "DMA_BUF"); > diff --git a/include/linux/dma-buf-mapping.h b/include/linux/dma-buf-mapping.h > new file mode 100644 > index 000000000000..a3c0ce2d3a42 > --- /dev/null > +++ b/include/linux/dma-buf-mapping.h > @@ -0,0 +1,17 @@ > +/* SPDX-License-Identifier: GPL-2.0-only */ > +/* > + * DMA BUF Mapping Helpers > + * > + */ > +#ifndef __DMA_BUF_MAPPING_H__ > +#define __DMA_BUF_MAPPING_H__ > +#include <linux/dma-buf.h> > + > +struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach, > + struct p2pdma_provider *provider, > + struct dma_buf_phys_vec *phys_vec, > + size_t nr_ranges, size_t size, > + enum dma_data_direction dir); > +void dma_buf_free_sgt(struct dma_buf_attachment *attach, struct sg_table *sgt, > + enum dma_data_direction dir); > +#endif > diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h > index d58e329ac0e7..0bc492090237 100644 > --- a/include/linux/dma-buf.h > +++ b/include/linux/dma-buf.h > @@ -22,6 +22,7 @@ > #include <linux/fs.h> > #include <linux/dma-fence.h> > #include <linux/wait.h> > +#include <linux/pci-p2pdma.h> > > struct device; > struct dma_buf; > @@ -530,6 +531,16 @@ struct dma_buf_export_info { > void *priv; > }; > > +/** > + * struct dma_buf_phys_vec - describe continuous chunk of memory > + * @paddr: physical address of that chunk > + * @len: Length of this chunk > + */ > +struct dma_buf_phys_vec { > + phys_addr_t paddr; > + size_t len; > +}; > + > /** > * DEFINE_DMA_BUF_EXPORT_INFO - helper macro for exporters > * @name: export-info name >
On Thu, Nov 20, 2025 at 10:33:36AM +0100, Christian König wrote: > On 11/20/25 10:28, Leon Romanovsky wrote: > > From: Leon Romanovsky <leonro@nvidia.com> > > > > Add dma_buf_phys_vec_to_sgt() and dma_buf_free_sgt() helpers to convert > > an array of MMIO physical address ranges into scatter-gather tables with > > proper DMA mapping. <...> > > Reviewed-by: Kevin Tian <kevin.tian@intel.com> > > Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> > > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> > > Tested-by: Alex Mastro <amastro@fb.com> > > Tested-by: Nicolin Chen <nicolinc@nvidia.com> > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com> > > Could be that this will backfire at some point, but I think we will never know without trying. > > Acked-by: Christian König <christian.koenig@amd.com> Thanks a lot.
On Thu, Nov 20, 2025 at 11:28:25AM +0200, Leon Romanovsky wrote: > +static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length, > + dma_addr_t addr) > +{ > + unsigned int len, nents; > + int i; > + > + nents = DIV_ROUND_UP(length, UINT_MAX); > + for (i = 0; i < nents; i++) { > + len = min_t(size_t, length, UINT_MAX); > + length -= len; > + /* > + * DMABUF abuses scatterlist to create a scatterlist > + * that does not have any CPU list, only the DMA list. > + * Always set the page related values to NULL to ensure > + * importers can't use it. The phys_addr based DMA API > + * does not require the CPU list for mapping or unmapping. > + */ > + sg_set_page(sgl, NULL, 0, 0); > + sg_dma_address(sgl) = addr + i * UINT_MAX; (i * UINT_MAX) happens in 32-bit before being promoted to dma_addr_t for addition with addr. Overflows for i >=2 when length >= 8 GiB. Needs a cast: sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX; Discovered this while debugging why dma-buf import was failing for an 8 GiB dma-buf using my earlier toy program [1]. It was surfaced by ib_umem_find_best_pgsz() returning 0 due to malformed scatterlist, which bubbles up as an EINVAL. $ ./test_dmabuf 0000:05:00.0 3 4 0 0x200000000 opening 0000:05:00.0 via /dev/vfio/56 allocating dma_buf bar_idx=4, bar_offset=0x0, size=0x200000000 allocated dma_buf fd=6 discovered 4 ibv devices: mlx5_0 mlx5_1 mlx5_2 mlx5_3 opened ibv device 3: mlx5_3 test_dmabuf.c:154 Condition failed: 'mr' (errno=22: Invalid argument) $ sudo retsnoop -e mlx5_ib_reg_user_mr_dmabuf -a 'mlx5*' -a 'ib_umem*' -a '*umr*' -a 'vfio_pci*' -a 'dma_buf_*' -x EINVAL -T Receiving data... 13:56:22.257907 -> 13:56:22.258275 TID/PID 948895/948895 (test_dmabuf/test_dmabuf): FUNCTION CALLS RESULT DURATION -------------------------------------------- -------------------- --------- → mlx5_ib_reg_user_mr_dmabuf ↔ mlx5r_umr_resource_init [0] 2.224us → ib_umem_dmabuf_get → ib_umem_dmabuf_get_with_dma_device ↔ dma_buf_get [0xff11012a6a098c00] 0.972us → dma_buf_dynamic_attach ↔ vfio_pci_dma_buf_attach [0] 2.003us ← dma_buf_dynamic_attach [0xff1100012793e400] 10.566us ← ib_umem_dmabuf_get_with_dma_device [0xff110127a6c74480] 15.794us ← ib_umem_dmabuf_get [0xff110127a6c74480] 25.258us → mlx5_ib_init_dmabuf_mr → ib_umem_dmabuf_map_pages → dma_buf_map_attachment → vfio_pci_dma_buf_map ↔ dma_buf_map [0xff1100012977f700] 4.918us ← vfio_pci_dma_buf_map [0xff1100012977f700] 8.362us ← dma_buf_map_attachment [0xff1100012977f700] 10.956us ← ib_umem_dmabuf_map_pages [0] 17.336us ↔ ib_umem_find_best_pgsz [0] 6.280us → ib_umem_dmabuf_unmap_pages → dma_buf_unmap_attachment → vfio_pci_dma_buf_unmap ↔ dma_buf_unmap [void] 2.023us ← vfio_pci_dma_buf_unmap [void] 6.700us ← dma_buf_unmap_attachment [void] 8.142us ← ib_umem_dmabuf_unmap_pages [void] 14.953us ← mlx5_ib_init_dmabuf_mr [-EINVAL] 67.272us → mlx5r_umr_revoke_mr → mlx5r_umr_post_send_wait → mlx5r_umr_post_send ↔ mlx5r_begin_wqe [0] 1.703us ↔ mlx5r_finish_wqe [void] 1.633us ↔ mlx5r_ring_db [void] 1.312us ← mlx5r_umr_post_send [0] 27.451us ← mlx5r_umr_post_send_wait [0] 126.541us ← mlx5r_umr_revoke_mr [0] 141.925us → ib_umem_release → ib_umem_dmabuf_release ↔ ib_umem_dmabuf_revoke [void] 1.582us ↔ dma_buf_detach [void] 3.765us ↔ dma_buf_put [void] 0.531us ← ib_umem_dmabuf_release [void] 23.315us ← ib_umem_release [void] 40.301us ← mlx5_ib_reg_user_mr_dmabuf [-EINVAL] 363.280us [1] https://lore.kernel.org/all/aQkLcAxEn4qmF3c4@devgpu015.cco6.facebook.com/ Alex
On Tue, Nov 25, 2025 at 04:18:03PM -0800, Alex Mastro wrote: > On Thu, Nov 20, 2025 at 11:28:25AM +0200, Leon Romanovsky wrote: > > +static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length, > > + dma_addr_t addr) > > +{ > > + unsigned int len, nents; > > + int i; > > + > > + nents = DIV_ROUND_UP(length, UINT_MAX); > > + for (i = 0; i < nents; i++) { > > + len = min_t(size_t, length, UINT_MAX); > > + length -= len; > > + /* > > + * DMABUF abuses scatterlist to create a scatterlist > > + * that does not have any CPU list, only the DMA list. > > + * Always set the page related values to NULL to ensure > > + * importers can't use it. The phys_addr based DMA API > > + * does not require the CPU list for mapping or unmapping. > > + */ > > + sg_set_page(sgl, NULL, 0, 0); > > + sg_dma_address(sgl) = addr + i * UINT_MAX; > > (i * UINT_MAX) happens in 32-bit before being promoted to dma_addr_t for > addition with addr. Overflows for i >=2 when length >= 8 GiB. Needs a cast: > > sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX; > > Discovered this while debugging why dma-buf import was failing for > an 8 GiB dma-buf using my earlier toy program [1]. It was surfaced by > ib_umem_find_best_pgsz() returning 0 due to malformed scatterlist, which bubbles > up as an EINVAL. > Thanks a lot for testing & reporting this! However, I believe the casting approach is a little fragile (and potentially prone to issues depending on how dma_addr_t is sized on different platforms). Thus, approaching this with accumulation seems better as it avoids the multiplication logic entirely, maybe something like the following (untested) diff ? --- a/drivers/dma-buf/dma-buf-mapping.c +++ b/drivers/dma-buf/dma-buf-mapping.c @@ -252,14 +252,14 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length, nents = DIV_ROUND_UP(length, UINT_MAX); for (i = 0; i < nents; i++) { len = min_t(size_t, length, UINT_MAX); - length -= len; /* * DMABUF abuses scatterlist to create a scatterlist * that does not have any CPU list, only the DMA list. * Always set the page related values to NULL to ensure * importers can't use it. The phys_addr based DMA API * does not require the CPU list for mapping or unmapping. */ sg_set_page(sgl, NULL, 0, 0); - sg_dma_address(sgl) = addr + i * UINT_MAX; + sg_dma_address(sgl) = addr; sg_dma_len(sgl) = len; + + addr += len; + length -= len; sgl = sg_next(sgl); } Thanks, Praan
On Wed, Nov 26, 2025 at 01:12:40PM +0000, Pranjal Shrivastava wrote: > On Tue, Nov 25, 2025 at 04:18:03PM -0800, Alex Mastro wrote: > > On Thu, Nov 20, 2025 at 11:28:25AM +0200, Leon Romanovsky wrote: > > > +static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length, > > > + dma_addr_t addr) > > > +{ > > > + unsigned int len, nents; > > > + int i; > > > + > > > + nents = DIV_ROUND_UP(length, UINT_MAX); > > > + for (i = 0; i < nents; i++) { > > > + len = min_t(size_t, length, UINT_MAX); > > > + length -= len; > > > + /* > > > + * DMABUF abuses scatterlist to create a scatterlist > > > + * that does not have any CPU list, only the DMA list. > > > + * Always set the page related values to NULL to ensure > > > + * importers can't use it. The phys_addr based DMA API > > > + * does not require the CPU list for mapping or unmapping. > > > + */ > > > + sg_set_page(sgl, NULL, 0, 0); > > > + sg_dma_address(sgl) = addr + i * UINT_MAX; > > > > (i * UINT_MAX) happens in 32-bit before being promoted to dma_addr_t for > > addition with addr. Overflows for i >=2 when length >= 8 GiB. Needs a cast: > > > > sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX; > > > > Discovered this while debugging why dma-buf import was failing for > > an 8 GiB dma-buf using my earlier toy program [1]. It was surfaced by > > ib_umem_find_best_pgsz() returning 0 due to malformed scatterlist, which bubbles > > up as an EINVAL. > > > > Thanks a lot for testing & reporting this! > > However, I believe the casting approach is a little fragile (and > potentially prone to issues depending on how dma_addr_t is sized on > different platforms). Thus, approaching this with accumulation seems > better as it avoids the multiplication logic entirely, maybe something > like the following (untested) diff ? If the function input range is well-formed, then all values in [addr..addr+length) must be expressible by dma_addr_t, so I don't think overflow after casting is possible as long as nents is valid. That said, `nents = DIV_ROUND_UP(length, UINT_MAX)` is simply broken on any system where size_t is 32b. I don't know if that's a practical consideration for these code paths though. > > --- a/drivers/dma-buf/dma-buf-mapping.c > +++ b/drivers/dma-buf/dma-buf-mapping.c > @@ -252,14 +252,14 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length, > nents = DIV_ROUND_UP(length, UINT_MAX); > for (i = 0; i < nents; i++) { > len = min_t(size_t, length, UINT_MAX); > - length -= len; > /* > * DMABUF abuses scatterlist to create a scatterlist > * that does not have any CPU list, only the DMA list. > * Always set the page related values to NULL to ensure > * importers can't use it. The phys_addr based DMA API > * does not require the CPU list for mapping or unmapping. > */ > sg_set_page(sgl, NULL, 0, 0); > - sg_dma_address(sgl) = addr + i * UINT_MAX; > + sg_dma_address(sgl) = addr; > sg_dma_len(sgl) = len; > + > + addr += len; > + length -= len; > sgl = sg_next(sgl); > } > > Thanks, > Praan
On Wed, Nov 26, 2025 at 08:08:24AM -0800, Alex Mastro wrote: > On Wed, Nov 26, 2025 at 01:12:40PM +0000, Pranjal Shrivastava wrote: > > On Tue, Nov 25, 2025 at 04:18:03PM -0800, Alex Mastro wrote: > > > On Thu, Nov 20, 2025 at 11:28:25AM +0200, Leon Romanovsky wrote: > > > > +static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length, > > > > + dma_addr_t addr) > > > > +{ > > > > + unsigned int len, nents; > > > > + int i; > > > > + > > > > + nents = DIV_ROUND_UP(length, UINT_MAX); > > > > + for (i = 0; i < nents; i++) { > > > > + len = min_t(size_t, length, UINT_MAX); > > > > + length -= len; > > > > + /* > > > > + * DMABUF abuses scatterlist to create a scatterlist > > > > + * that does not have any CPU list, only the DMA list. > > > > + * Always set the page related values to NULL to ensure > > > > + * importers can't use it. The phys_addr based DMA API > > > > + * does not require the CPU list for mapping or unmapping. > > > > + */ > > > > + sg_set_page(sgl, NULL, 0, 0); > > > > + sg_dma_address(sgl) = addr + i * UINT_MAX; > > > > > > (i * UINT_MAX) happens in 32-bit before being promoted to dma_addr_t for > > > addition with addr. Overflows for i >=2 when length >= 8 GiB. Needs a cast: > > > > > > sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX; Yeah, and i should not be signed. > > > Discovered this while debugging why dma-buf import was failing for > > > an 8 GiB dma-buf using my earlier toy program [1]. It was surfaced by > > > ib_umem_find_best_pgsz() returning 0 due to malformed scatterlist, which bubbles > > > up as an EINVAL. > > > > > > > Thanks a lot for testing & reporting this! > > > > However, I believe the casting approach is a little fragile (and > > potentially prone to issues depending on how dma_addr_t is sized on > > different platforms). Thus, approaching this with accumulation seems > > better as it avoids the multiplication logic entirely, maybe something > > like the following (untested) diff ? > > If the function input range is well-formed, then all values in > [addr..addr+length) must be expressible by dma_addr_t, so I don't think overflow > after casting is possible as long as nents is valid. It is probably not perfect, but validate_dmabuf_input() limits length to a valid size_t The signature is: bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state, phys_addr_t phys, size_t size) And that function should fail if size is too large. I think it mostly does, but it looks like there are a few little misses: iova_align(iovad, size + iova_off), return ALIGN(size, iovad->granule); etc are all unchecked math that could overflow. > That said, `nents = DIV_ROUND_UP(length, UINT_MAX)` is simply broken on any > system where size_t is 32b. I don't know if that's a practical consideration for > these code paths though. Yeah, that's a good point. Casting to u64 will trigger 64 bit device errors on 32 bit too. // DIV_ROUND_UP that is safe at the type limits nents = size / UINT_MAX; if (size % UINT_MAX) nents++; Compiler should turn the % into bit math. Jason
diff --git a/drivers/dma-buf/Makefile b/drivers/dma-buf/Makefile index 70ec901edf2c..2008fb7481b3 100644 --- a/drivers/dma-buf/Makefile +++ b/drivers/dma-buf/Makefile @@ -1,6 +1,6 @@ # SPDX-License-Identifier: GPL-2.0-only obj-y := dma-buf.o dma-fence.o dma-fence-array.o dma-fence-chain.o \ - dma-fence-unwrap.o dma-resv.o + dma-fence-unwrap.o dma-resv.o dma-buf-mapping.o obj-$(CONFIG_DMABUF_HEAPS) += dma-heap.o obj-$(CONFIG_DMABUF_HEAPS) += heaps/ obj-$(CONFIG_SYNC_FILE) += sync_file.o diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c new file mode 100644 index 000000000000..de494bcac5e9 --- /dev/null +++ b/drivers/dma-buf/dma-buf-mapping.c @@ -0,0 +1,248 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * DMA BUF Mapping Helpers + * + */ +#include <linux/dma-buf-mapping.h> +#include <linux/dma-resv.h> + +static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length, + dma_addr_t addr) +{ + unsigned int len, nents; + int i; + + nents = DIV_ROUND_UP(length, UINT_MAX); + for (i = 0; i < nents; i++) { + len = min_t(size_t, length, UINT_MAX); + length -= len; + /* + * DMABUF abuses scatterlist to create a scatterlist + * that does not have any CPU list, only the DMA list. + * Always set the page related values to NULL to ensure + * importers can't use it. The phys_addr based DMA API + * does not require the CPU list for mapping or unmapping. + */ + sg_set_page(sgl, NULL, 0, 0); + sg_dma_address(sgl) = addr + i * UINT_MAX; + sg_dma_len(sgl) = len; + sgl = sg_next(sgl); + } + + return sgl; +} + +static unsigned int calc_sg_nents(struct dma_iova_state *state, + struct dma_buf_phys_vec *phys_vec, + size_t nr_ranges, size_t size) +{ + unsigned int nents = 0; + size_t i; + + if (!state || !dma_use_iova(state)) { + for (i = 0; i < nr_ranges; i++) + nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX); + } else { + /* + * In IOVA case, there is only one SG entry which spans + * for whole IOVA address space, but we need to make sure + * that it fits sg->length, maybe we need more. + */ + nents = DIV_ROUND_UP(size, UINT_MAX); + } + + return nents; +} + +/** + * struct dma_buf_dma - holds DMA mapping information + * @sgt: Scatter-gather table + * @state: DMA IOVA state relevant in IOMMU-based DMA + * @size: Total size of DMA transfer + */ +struct dma_buf_dma { + struct sg_table sgt; + struct dma_iova_state *state; + size_t size; +}; + +/** + * dma_buf_phys_vec_to_sgt - Returns the scatterlist table of the attachment + * from arrays of physical vectors. This funciton is intended for MMIO memory + * only. + * @attach: [in] attachment whose scatterlist is to be returned + * @provider: [in] p2pdma provider + * @phys_vec: [in] array of physical vectors + * @nr_ranges: [in] number of entries in phys_vec array + * @size: [in] total size of phys_vec + * @dir: [in] direction of DMA transfer + * + * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR + * on error. May return -EINTR if it is interrupted by a signal. + * + * On success, the DMA addresses and lengths in the returned scatterlist are + * PAGE_SIZE aligned. + * + * A mapping must be unmapped by using dma_buf_free_sgt(). + * + * NOTE: This function is intended for exporters. If direct traffic routing is + * mandatory exporter should call routing pci_p2pdma_map_type() before calling + * this function. + */ +struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach, + struct p2pdma_provider *provider, + struct dma_buf_phys_vec *phys_vec, + size_t nr_ranges, size_t size, + enum dma_data_direction dir) +{ + unsigned int nents, mapped_len = 0; + struct dma_buf_dma *dma; + struct scatterlist *sgl; + dma_addr_t addr; + size_t i; + int ret; + + dma_resv_assert_held(attach->dmabuf->resv); + + if (WARN_ON(!attach || !attach->dmabuf || !provider)) + /* This function is supposed to work on MMIO memory only */ + return ERR_PTR(-EINVAL); + + dma = kzalloc(sizeof(*dma), GFP_KERNEL); + if (!dma) + return ERR_PTR(-ENOMEM); + + switch (pci_p2pdma_map_type(provider, attach->dev)) { + case PCI_P2PDMA_MAP_BUS_ADDR: + /* + * There is no need in IOVA at all for this flow. + */ + break; + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: + dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL); + if (!dma->state) { + ret = -ENOMEM; + goto err_free_dma; + } + + dma_iova_try_alloc(attach->dev, dma->state, 0, size); + break; + default: + ret = -EINVAL; + goto err_free_dma; + } + + nents = calc_sg_nents(dma->state, phys_vec, nr_ranges, size); + ret = sg_alloc_table(&dma->sgt, nents, GFP_KERNEL | __GFP_ZERO); + if (ret) + goto err_free_state; + + sgl = dma->sgt.sgl; + + for (i = 0; i < nr_ranges; i++) { + if (!dma->state) { + addr = pci_p2pdma_bus_addr_map(provider, + phys_vec[i].paddr); + } else if (dma_use_iova(dma->state)) { + ret = dma_iova_link(attach->dev, dma->state, + phys_vec[i].paddr, 0, + phys_vec[i].len, dir, + DMA_ATTR_MMIO); + if (ret) + goto err_unmap_dma; + + mapped_len += phys_vec[i].len; + } else { + addr = dma_map_phys(attach->dev, phys_vec[i].paddr, + phys_vec[i].len, dir, + DMA_ATTR_MMIO); + ret = dma_mapping_error(attach->dev, addr); + if (ret) + goto err_unmap_dma; + } + + if (!dma->state || !dma_use_iova(dma->state)) + sgl = fill_sg_entry(sgl, phys_vec[i].len, addr); + } + + if (dma->state && dma_use_iova(dma->state)) { + WARN_ON_ONCE(mapped_len != size); + ret = dma_iova_sync(attach->dev, dma->state, 0, mapped_len); + if (ret) + goto err_unmap_dma; + + sgl = fill_sg_entry(sgl, mapped_len, dma->state->addr); + } + + dma->size = size; + + /* + * No CPU list included — set orig_nents = 0 so others can detect + * this via SG table (use nents only). + */ + dma->sgt.orig_nents = 0; + + + /* + * SGL must be NULL to indicate that SGL is the last one + * and we allocated correct number of entries in sg_alloc_table() + */ + WARN_ON_ONCE(sgl); + return &dma->sgt; + +err_unmap_dma: + if (!i || !dma->state) { + ; /* Do nothing */ + } else if (dma_use_iova(dma->state)) { + dma_iova_destroy(attach->dev, dma->state, mapped_len, dir, + DMA_ATTR_MMIO); + } else { + for_each_sgtable_dma_sg(&dma->sgt, sgl, i) + dma_unmap_phys(attach->dev, sg_dma_address(sgl), + sg_dma_len(sgl), dir, DMA_ATTR_MMIO); + } + sg_free_table(&dma->sgt); +err_free_state: + kfree(dma->state); +err_free_dma: + kfree(dma); + return ERR_PTR(ret); +} +EXPORT_SYMBOL_NS_GPL(dma_buf_phys_vec_to_sgt, "DMA_BUF"); + +/** + * dma_buf_free_sgt- unmaps the buffer + * @attach: [in] attachment to unmap buffer from + * @sgt: [in] scatterlist info of the buffer to unmap + * @direction: [in] direction of DMA transfer + * + * This unmaps a DMA mapping for @attached obtained + * by dma_buf_phys_vec_to_sgt(). + */ +void dma_buf_free_sgt(struct dma_buf_attachment *attach, struct sg_table *sgt, + enum dma_data_direction dir) +{ + struct dma_buf_dma *dma = container_of(sgt, struct dma_buf_dma, sgt); + int i; + + dma_resv_assert_held(attach->dmabuf->resv); + + if (!dma->state) { + ; /* Do nothing */ + } else if (dma_use_iova(dma->state)) { + dma_iova_destroy(attach->dev, dma->state, dma->size, dir, + DMA_ATTR_MMIO); + } else { + struct scatterlist *sgl; + + for_each_sgtable_dma_sg(sgt, sgl, i) + dma_unmap_phys(attach->dev, sg_dma_address(sgl), + sg_dma_len(sgl), dir, DMA_ATTR_MMIO); + } + + sg_free_table(sgt); + kfree(dma->state); + kfree(dma); + +} +EXPORT_SYMBOL_NS_GPL(dma_buf_free_sgt, "DMA_BUF"); diff --git a/include/linux/dma-buf-mapping.h b/include/linux/dma-buf-mapping.h new file mode 100644 index 000000000000..a3c0ce2d3a42 --- /dev/null +++ b/include/linux/dma-buf-mapping.h @@ -0,0 +1,17 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * DMA BUF Mapping Helpers + * + */ +#ifndef __DMA_BUF_MAPPING_H__ +#define __DMA_BUF_MAPPING_H__ +#include <linux/dma-buf.h> + +struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach, + struct p2pdma_provider *provider, + struct dma_buf_phys_vec *phys_vec, + size_t nr_ranges, size_t size, + enum dma_data_direction dir); +void dma_buf_free_sgt(struct dma_buf_attachment *attach, struct sg_table *sgt, + enum dma_data_direction dir); +#endif diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index d58e329ac0e7..0bc492090237 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -22,6 +22,7 @@ #include <linux/fs.h> #include <linux/dma-fence.h> #include <linux/wait.h> +#include <linux/pci-p2pdma.h> struct device; struct dma_buf; @@ -530,6 +531,16 @@ struct dma_buf_export_info { void *priv; }; +/** + * struct dma_buf_phys_vec - describe continuous chunk of memory + * @paddr: physical address of that chunk + * @len: Length of this chunk + */ +struct dma_buf_phys_vec { + phys_addr_t paddr; + size_t len; +}; + /** * DEFINE_DMA_BUF_EXPORT_INFO - helper macro for exporters * @name: export-info name