| Message ID | 1044f7aa09836d63de964d4eb6e646b3071c1fdb.1760368250.git.leon@kernel.org |
|---|---|
| State | New |
| Series | vfio/pci: Allow MMIO regions to be exported through dma-buf |

On Mon, Oct 13, 2025 at 06:26:03PM +0300, Leon Romanovsky wrote:
> The DMA API now has a new flow, and has gained phys_addr_t support, so
> it no longer needs struct pages to perform P2P mapping.

That's news to me.  All the pci_p2pdma_map_state machinery is still
based on pgmaps and thus pages.

> Lifecycle management can be delegated to the user, DMABUF for instance
> has a suitable invalidation protocol that does not require struct page.

How?

On Thu, Oct 16, 2025 at 11:30:06PM -0700, Christoph Hellwig wrote:
> On Mon, Oct 13, 2025 at 06:26:03PM +0300, Leon Romanovsky wrote:
> > The DMA API now has a new flow, and has gained phys_addr_t support, so
> > it no longer needs struct pages to perform P2P mapping.
>
> That's news to me.  All the pci_p2pdma_map_state machinery is still
> based on pgmaps and thus pages.

We had this discussion already three months ago:

https://lore.kernel.org/all/20250729131502.GJ36037@nvidia.com/

These couple patches make the core pci_p2pdma_map_state machinery work
on struct p2pdma_provider, and pgmap is just one way to get a
p2pdma_provider *

The struct page paths through pgmap go page->pgmap->mem to get
p2pdma_provider.

The non-struct page paths just have a p2pdma_provider * without a
pgmap. In this series VFIO uses

+	*provider = pcim_p2pdma_provider(pdev, bar);

To get the provider for a specific BAR.

> > Lifecycle management can be delegated to the user, DMABUF for instance
> > has a suitable invalidation protocol that does not require struct page.
>
> How?

I think I've answered this three times now - for DMABUF the DMABUF
invalidation scheme is used to control the lifetime and no DMA mapping
outlives the provider, and the provider doesn't outlive the driver.

Hotplug works fine. VFIO gets the driver removal callback, it
invalidates all the DMABUFs, refuses to re-validate them, destroys the
P2P provider, and ends its driver. There is no lifetime issue.

Obviously you cannot use the new p2provider mechanism without some
kind of protection against use after hot unplug, but it doesn't have
to be struct page based.

For VFIO the invalidation scheme is linked to dma_buf_move_notify(),
for instance the hot unplug case goes:

static const struct vfio_device_ops vfio_pci_ops = {
	.close_device	= vfio_pci_core_close_device,
	  vfio_pci_dma_buf_cleanup(vdev);
	    dma_buf_move_notify(priv->dmabuf);

And then if we follow that into an importer like RDMA:

static struct dma_buf_attach_ops mlx5_ib_dmabuf_attach_ops = {
	.move_notify = mlx5_ib_dmabuf_invalidate_cb,
	  mlx5r_umr_update_mr_pas(mr, MLX5_IB_UPD_XLT_ZAP);
	  ib_umem_dmabuf_unmap_pages(umem_dmabuf);
	    dma_buf_unmap_attachment(umem_dmabuf->attach, umem_dmabuf->sgt, DMA_BIDIRECTIONAL);
	      vfio_pci_dma_buf_unmap()

XLT_ZAP tells the HW to stop doing DMA and the unmap_pages ->
unmap_attachment -> vfio_pci_dma_buf_unmap() flow will tear down the
DMA API mapping and remove it from the IOMMU. All of this happens
before device_driver remove completes.

There is no lifecycle issue here and we don't need pgmap to solve a
lifecycle problem or to help find the p2pdma_provider.

Jason

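The call chain above can be illustrated with a small user-space model. This is only a sketch of the ordering Jason describes, not kernel code: every `toy_*` name is hypothetical, the importer callback stands in for `move_notify`, and the two booleans stand in for the hardware translation state and the DMA API mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy user-space model; every toy_* name is hypothetical, not a kernel API. */
struct toy_dmabuf;

struct toy_attachment {
	/* Importer callback, standing in for dma_buf_attach_ops.move_notify. */
	void (*move_notify)(struct toy_attachment *att);
	struct toy_dmabuf *buf;
	bool hw_dma_enabled;	/* importer hardware may still issue DMA */
	bool mapped;		/* a DMA mapping of the exporter's MMIO exists */
};

struct toy_dmabuf {
	bool revoked;		/* exporter refuses new mappings once set */
	struct toy_attachment *att[4];
	size_t nr_att;
};

/* Exporter map path: fails once the buffer has been revoked. */
static bool toy_map(struct toy_attachment *att)
{
	if (att->buf->revoked)
		return false;
	att->mapped = true;
	att->hw_dma_enabled = true;
	return true;
}

/* Importer side: stop the hardware first, then drop the mapping. */
static void toy_importer_invalidate(struct toy_attachment *att)
{
	att->hw_dma_enabled = false;	/* stands in for the XLT_ZAP step */
	att->mapped = false;		/* stands in for unmap_attachment */
}

/* Exporter side: revoke, then synchronously notify every importer. */
static void toy_exporter_cleanup(struct toy_dmabuf *buf)
{
	buf->revoked = true;
	for (size_t i = 0; i < buf->nr_att; i++)
		buf->att[i]->move_notify(buf->att[i]);
}

/* Returns 0 if the ordering invariants of the model hold. */
int toy_invalidate_demo(void)
{
	struct toy_dmabuf buf = { 0 };
	struct toy_attachment att = {
		.move_notify = toy_importer_invalidate,
		.buf = &buf,
	};

	buf.att[buf.nr_att++] = &att;
	if (!toy_map(&att))
		return 1;		/* mapping must work before unplug */
	toy_exporter_cleanup(&buf);
	if (att.mapped || att.hw_dma_enabled)
		return 2;		/* nothing may survive the cleanup */
	if (toy_map(&att))
		return 3;		/* re-validation must be refused */
	return 0;
}
```

Calling `toy_invalidate_demo()` from a `main()` exercises the three properties claimed in the email: mappings work before removal, none survive the synchronous notification, and a revoked buffer cannot be mapped again.
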
On Fri, Oct 17, 2025 at 08:53:20AM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 16, 2025 at 11:30:06PM -0700, Christoph Hellwig wrote:
> > On Mon, Oct 13, 2025 at 06:26:03PM +0300, Leon Romanovsky wrote:
> > > The DMA API now has a new flow, and has gained phys_addr_t support, so
> > > it no longer needs struct pages to perform P2P mapping.
> >
> > That's news to me.  All the pci_p2pdma_map_state machinery is still
> > based on pgmaps and thus pages.
>
> We had this discussion already three months ago:
>
> https://lore.kernel.org/all/20250729131502.GJ36037@nvidia.com/
>
> These couple patches make the core pci_p2pdma_map_state machinery work
> on struct p2pdma_provider, and pgmap is just one way to get a
> p2pdma_provider *
>
> The struct page paths through pgmap go page->pgmap->mem to get
> p2pdma_provider.
>
> The non-struct page paths just have a p2pdma_provider * without a
> pgmap. In this series VFIO uses
>
> +	*provider = pcim_p2pdma_provider(pdev, bar);
>
> To get the provider for a specific BAR.

And what protects that lifetime?  I've not seen anyone actually
building the proper lifetime management.  And if someone did the patches
need to clearly point to that.

> I think I've answered this three times now - for DMABUF the DMABUF
> invalidation scheme is used to control the lifetime and no DMA mapping
> outlives the provider, and the provider doesn't outlive the driver.

How?

> Obviously you cannot use the new p2provider mechanism without some
> kind of protection against use after hot unplug, but it doesn't have
> to be struct page based.

And how does this interact with everyone else expecting pgmap based
lifetime management?

On Mon, Oct 20, 2025 at 05:27:02AM -0700, Christoph Hellwig wrote:
> On Fri, Oct 17, 2025 at 08:53:20AM -0300, Jason Gunthorpe wrote:
> > On Thu, Oct 16, 2025 at 11:30:06PM -0700, Christoph Hellwig wrote:
> > > On Mon, Oct 13, 2025 at 06:26:03PM +0300, Leon Romanovsky wrote:
> > > > The DMA API now has a new flow, and has gained phys_addr_t support, so
> > > > it no longer needs struct pages to perform P2P mapping.
> > >
> > > That's news to me.  All the pci_p2pdma_map_state machinery is still
> > > based on pgmaps and thus pages.
> >
> > We had this discussion already three months ago:
> >
> > https://lore.kernel.org/all/20250729131502.GJ36037@nvidia.com/
> >
> > These couple patches make the core pci_p2pdma_map_state machinery work
> > on struct p2pdma_provider, and pgmap is just one way to get a
> > p2pdma_provider *
> >
> > The struct page paths through pgmap go page->pgmap->mem to get
> > p2pdma_provider.
> >
> > The non-struct page paths just have a p2pdma_provider * without a
> > pgmap. In this series VFIO uses
> >
> > +	*provider = pcim_p2pdma_provider(pdev, bar);
> >
> > To get the provider for a specific BAR.
>
> And what protects that life time?  I've not seen anyone actually
> building the proper lifetime management.  And if someone did the patches
> need to clearly point to that.

It is this series!

The above API gives a lifetime that is driver bound. The calling
driver must ensure it stops using provider and stops doing DMA with it
before remove() completes.

This VFIO series does that through the move_notify callchain I showed
in the previous email. This callchain is always triggered before
remove() of the VFIO PCI driver is completed.

> > I think I've answered this three times now - for DMABUF the DMABUF
> > invalidation scheme is used to control the lifetime and no DMA mapping
> > outlives the provider, and the provider doesn't outlive the driver.
>
> How?

I explained it in detail in the message you are replying to. If
something is not clear can you please be more specific?

Is it the mmap in VFIO perhaps that is causing these questions?

VFIO uses a PFNMAP VMA, so you can't pin_user_pages() it. It uses
unmap_mapping_range() during its remove() path to get rid of the VMA
PTEs.

The DMA activity doesn't use the mmap *at all*. It isn't like NVMe
which relies on the ZONE_DEVICE pages and VMAs to link drivers
together.

Instead the DMABUF FD is used to pass the MMIO pages between VFIO and
another driver. DMABUF has a built in invalidation mechanism that VFIO
triggers before remove(). The invalidation removes access from the
other driver.

This is different than NVMe which has no invalidation. NVMe does
unmap_mapping_range() on the VMA and waits for all the short lived
pgmap references to clear. We don't need anything like that because
DMABUF invalidation is synchronous.

The full picture for VFIO is something like:

[startup]
  MMIO is acquired from the pci_resource
  p2p_providers are set up

[runtime]
  MMIO is mapped into PFNMAP VMAs
  MMIO is linked to a DMABUF FD
  DMABUF FD gets DMA mapped using the p2p_provider

[unplug]
  unmap_mapping_range() is called so all VMAs are emptied out and the
  fault handler prevents new PTEs
    ** No access to the MMIO through VMAs is possible **

  vfio_pci_dma_buf_cleanup() is called which prevents new DMABUF
  mappings from starting, and does dma_buf_move_notify() on all the
  open DMABUF FDs to invalidate other drivers. Other drivers stop
  doing DMA and we need to free the IOVA from the IOMMU/etc.
    ** No DMA access from other drivers is possible now **

  Any still open DMABUF FD will fail inside VFIO immediately due to
  the priv->revoked checks.
    ** No code touches the p2p_provider anymore **

  The p2p_provider is destroyed by devm.

> > Obviously you cannot use the new p2provider mechanism without some
> > kind of protection against use after hot unplug, but it doesn't have
> > to be struct page based.
>
> And how does this interact with everyone else expecting pgmap based
> lifetime management.

They continue to use pgmap and nothing changes for them.

The pgmap path always waited until nothing was using the pgmap and
thus provider before allowing device driver remove() to complete.

The refactoring doesn't change the lifecycle model, it just provides
entry points to access the driver bound lifetime model directly
instead of being forced to use pgmap.

Leon, can you add some remarks to the comments about what the rules
are to call pcim_p2pdma_provider() ?

Jason

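The [startup] / [runtime] / [unplug] sequence in the message above can be modelled as a tiny state machine. This is a hedged sketch of the ordering rule only, under the simplifying assumption that a single `revoked` flag gates both page faults and new DMA mappings. All `toy_*` names are hypothetical and nothing here is a kernel interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the teardown ordering; all toy_* names are hypothetical. */
struct toy_vfio_dev {
	bool provider_alive;	/* models the devm-managed p2p provider */
	bool revoked;		/* models the priv->revoked check */
	int vma_ptes;		/* PTEs mapping MMIO into PFNMAP VMAs */
	int dma_mappings;	/* importer DMA mappings made via the DMABUF */
};

/* [startup] the provider is created together with the driver binding. */
static void toy_startup(struct toy_vfio_dev *d)
{
	d->provider_alive = true;
}

/* [runtime] fault handler: installs a PTE unless the device is revoked. */
static bool toy_fault(struct toy_vfio_dev *d)
{
	if (d->revoked)
		return false;
	d->vma_ptes++;
	return true;
}

/* [runtime] DMA mapping through the provider, refused after revoke. */
static bool toy_dma_map(struct toy_vfio_dev *d)
{
	if (d->revoked)
		return false;
	assert(d->provider_alive);	/* a mapping never outlives the provider */
	d->dma_mappings++;
	return true;
}

/* The provider may only go away once nothing can reach it any more. */
static void toy_provider_destroy(struct toy_vfio_dev *d)
{
	assert(d->revoked && d->vma_ptes == 0 && d->dma_mappings == 0);
	d->provider_alive = false;
}

/* [unplug] the three steps run in this order before remove() completes. */
static void toy_unplug(struct toy_vfio_dev *d)
{
	d->revoked = true;	/* new faults and new mappings now fail */
	d->vma_ptes = 0;	/* models unmap_mapping_range() */
	d->dma_mappings = 0;	/* models synchronous move_notify + unmap */
	toy_provider_destroy(d);	/* models the devm teardown, done last */
}

/* Returns 0 if the lifecycle invariants of the model hold. */
int toy_lifecycle_demo(void)
{
	struct toy_vfio_dev d = { 0 };

	toy_startup(&d);
	if (!toy_fault(&d) || !toy_dma_map(&d))
		return 1;
	toy_unplug(&d);
	if (toy_fault(&d) || toy_dma_map(&d))
		return 2;	/* no access is possible after unplug */
	return d.provider_alive ? 3 : 0;
}
```

The point of the model is the assertion in `toy_provider_destroy()`: the provider is released only after both the VMA path and the DMA path have been emptied and closed, which is the driver-bound lifetime rule stated in the email.
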
On Mon, Oct 20, 2025 at 09:58:54AM -0300, Jason Gunthorpe wrote:

[...]

> Leon, can you add some remarks to the comments about what the rules
> are to call pcim_p2pdma_provider() ?

Yes, sure.

Thanks

>
> Jason

On Mon, Oct 20, 2025 at 09:58:54AM -0300, Jason Gunthorpe wrote:
> I explained it in detail in the message you are replying to. If
> something is not clear can you please be more specific?
>
> Is it the mmap in VFIO perhaps that is causing these questions?
>
> VFIO uses a PFNMAP VMA, so you can't pin_user_pages() it. It uses
> unmap_mapping_range() during its remove() path to get rid of the VMA
> PTEs.

This all needs to go into the explanation.

> Instead the DMABUF FD is used to pass the MMIO pages between VFIO and
> another driver. DMABUF has a built in invalidation mechanism that VFIO
> triggers before remove(). The invalidation removes access from the
> other driver.
>
> This is different than NVMe which has no invalidation. NVMe does
> unmap_mapping_range() on the VMA and waits for all the short lived
> pgmap references to clear. We don't need anything like that because
> DMABUF invalidation is synchronous.

Please add documentation for this model to the source tree.

On Wed, Oct 22, 2025 at 12:10:35AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 20, 2025 at 09:58:54AM -0300, Jason Gunthorpe wrote:
> > I explained it in detail in the message you are replying to. If
> > something is not clear can you please be more specific?
> >
> > Is it the mmap in VFIO perhaps that is causing these questions?
> >
> > VFIO uses a PFNMAP VMA, so you can't pin_user_pages() it. It uses
> > unmap_mapping_range() during its remove() path to get rid of the VMA
> > PTEs.
>
> This all needs to go into the explanation.
>
> > Instead the DMABUF FD is used to pass the MMIO pages between VFIO and
> > another driver. DMABUF has a built in invalidation mechanism that VFIO
> > triggers before remove(). The invalidation removes access from the
> > other driver.
> >
> > This is different than NVMe which has no invalidation. NVMe does
> > unmap_mapping_range() on the VMA and waits for all the short lived
> > pgmap references to clear. We don't need anything like that because
> > DMABUF invalidation is synchronous.
>
> Please add documentation for this model to the source tree.

Okay, let's see what we can come up with. I think explaining the dmabuf
model with respect to the p2p provider in the new common dmabuf
mapping API code would make sense.

Jason

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 78e108e47254..59cd6fb40e83 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -28,9 +28,8 @@ struct pci_p2pdma {
 };
 
 struct pci_p2pdma_pagemap {
-	struct pci_dev *provider;
-	u64 bus_offset;
 	struct dev_pagemap pgmap;
+	struct p2pdma_provider mem;
 };
 
 static struct pci_p2pdma_pagemap *to_p2p_pgmap(struct dev_pagemap *pgmap)
@@ -204,8 +203,8 @@ static void p2pdma_page_free(struct page *page)
 {
 	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page));
 	/* safe to dereference while a reference is held to the percpu ref */
-	struct pci_p2pdma *p2pdma =
-		rcu_dereference_protected(pgmap->provider->p2pdma, 1);
+	struct pci_p2pdma *p2pdma = rcu_dereference_protected(
+		to_pci_dev(pgmap->mem.owner)->p2pdma, 1);
 	struct percpu_ref *ref;
 
 	gen_pool_free_owner(p2pdma->pool, (uintptr_t)page_to_virt(page),
@@ -270,14 +269,15 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
 
 static void pci_p2pdma_unmap_mappings(void *data)
 {
-	struct pci_dev *pdev = data;
+	struct pci_p2pdma_pagemap *p2p_pgmap = data;
 
 	/*
 	 * Removing the alloc attribute from sysfs will call
 	 * unmap_mapping_range() on the inode, teardown any existing userspace
 	 * mappings and prevent new ones from being created.
 	 */
-	sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr,
+	sysfs_remove_file_from_group(&p2p_pgmap->mem.owner->kobj,
+				     &p2pmem_alloc_attr.attr,
 				     p2pmem_group.name);
 }
 
@@ -328,10 +328,9 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	pgmap->nr_range = 1;
 	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
 	pgmap->ops = &p2pdma_pgmap_ops;
-
-	p2p_pgmap->provider = pdev;
-	p2p_pgmap->bus_offset = pci_bus_address(pdev, bar) -
-		pci_resource_start(pdev, bar);
+	p2p_pgmap->mem.owner = &pdev->dev;
+	p2p_pgmap->mem.bus_offset =
+		pci_bus_address(pdev, bar) - pci_resource_start(pdev, bar);
 
 	addr = devm_memremap_pages(&pdev->dev, pgmap);
 	if (IS_ERR(addr)) {
@@ -340,7 +339,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	}
 
 	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings,
-					 pdev);
+					 p2p_pgmap);
 	if (error)
 		goto pages_free;
 
@@ -972,16 +971,16 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 }
 EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
 
-static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
-						    struct device *dev)
+static enum pci_p2pdma_map_type
+pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev)
 {
 	enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED;
-	struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
+	struct pci_dev *pdev = to_pci_dev(provider->owner);
 	struct pci_dev *client;
 	struct pci_p2pdma *p2pdma;
 	int dist;
 
-	if (!provider->p2pdma)
+	if (!pdev->p2pdma)
 		return PCI_P2PDMA_MAP_NOT_SUPPORTED;
 
 	if (!dev_is_pci(dev))
@@ -990,7 +989,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 	client = to_pci_dev(dev);
 
 	rcu_read_lock();
-	p2pdma = rcu_dereference(provider->p2pdma);
+	p2pdma = rcu_dereference(pdev->p2pdma);
 
 	if (p2pdma)
 		type = xa_to_value(xa_load(&p2pdma->map_types,
@@ -998,7 +997,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 	rcu_read_unlock();
 
 	if (type == PCI_P2PDMA_MAP_UNKNOWN)
-		return calc_map_type_and_dist(provider, client, &dist, true);
+		return calc_map_type_and_dist(pdev, client, &dist, true);
 
 	return type;
 }
@@ -1006,9 +1005,13 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
 		struct device *dev, struct page *page)
 {
-	state->pgmap = page_pgmap(page);
-	state->map = pci_p2pdma_map_type(state->pgmap, dev);
-	state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
+	struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page_pgmap(page));
+
+	if (state->mem == &p2p_pgmap->mem)
+		return;
+
+	state->mem = &p2p_pgmap->mem;
+	state->map = pci_p2pdma_map_type(&p2p_pgmap->mem, dev);
 }
 
 /**
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 951f81a38f3a..1400f3ad4299 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -16,6 +16,16 @@
 struct block_device;
 struct scatterlist;
 
+/**
+ * struct p2pdma_provider
+ *
+ * A p2pdma provider is a range of MMIO address space available to the CPU.
+ */
+struct p2pdma_provider {
+	struct device *owner;
+	u64 bus_offset;
+};
+
 #ifdef CONFIG_PCI_P2PDMA
 int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 		u64 offset);
@@ -139,11 +149,11 @@ enum pci_p2pdma_map_type {
 };
 
 struct pci_p2pdma_map_state {
-	struct dev_pagemap *pgmap;
+	struct p2pdma_provider *mem;
 	enum pci_p2pdma_map_type map;
-	u64 bus_off;
 };
 
+
 /* helper for pci_p2pdma_state(), do not use directly */
 void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
 		struct device *dev, struct page *page);
@@ -162,8 +172,7 @@ pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev,
 		struct page *page)
 {
 	if (IS_ENABLED(CONFIG_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
-		if (state->pgmap != page_pgmap(page))
-			__pci_p2pdma_update_state(state, dev, page);
+		__pci_p2pdma_update_state(state, dev, page);
 		return state->map;
 	}
 	return PCI_P2PDMA_MAP_NONE;
@@ -181,7 +190,7 @@ static inline dma_addr_t
 pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr)
 {
 	WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR);
-	return paddr + state->bus_off;
+	return paddr + state->mem->bus_offset;
 }
 
 #endif /* _LINUX_PCI_P2P_H */
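
Two details of the diff are easy to check in isolation: the state now caches a provider pointer instead of a pgmap pointer, and the bus address is the CPU physical address plus the provider's `bus_offset`. The user-space sketch below mirrors only that arithmetic and the early-return cache. The `toy_*` names, the `lookups` counter and the two BAR addresses are assumptions made up for illustration, not values or interfaces from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Toy user-space model; the toy_* names are hypothetical, not kernel APIs. */
struct toy_provider {
	uint64_t bus_offset;	/* bus address minus CPU physical address */
};

struct toy_map_state {
	const struct toy_provider *mem;	/* last provider seen, NULL at start */
	unsigned int lookups;		/* counts the slow-path map type lookups */
};

/* Mirrors the early return added to __pci_p2pdma_update_state(). */
static void toy_update_state(struct toy_map_state *st,
			     const struct toy_provider *p)
{
	if (st->mem == p)
		return;		/* same provider: keep the cached result */
	st->mem = p;
	st->lookups++;		/* stands in for the map type computation */
}

/* Mirrors pci_p2pdma_bus_addr_map(): paddr + provider bus offset. */
static uint64_t toy_bus_addr(const struct toy_map_state *st, uint64_t paddr)
{
	return paddr + st->mem->bus_offset;
}

/* Returns 0 if the cache and the address arithmetic behave as expected. */
int toy_bus_addr_demo(void)
{
	const uint64_t cpu_start = 0x4000000000ULL;	/* assumed CPU address */
	const uint64_t bus_start = 0xe0000000ULL;	/* assumed bus address */
	/* Unsigned subtraction wraps, so a bus address below the CPU one works. */
	struct toy_provider prov = { .bus_offset = bus_start - cpu_start };
	struct toy_map_state st = { 0 };

	toy_update_state(&st, &prov);
	toy_update_state(&st, &prov);	/* second call hits the cache */
	if (st.lookups != 1)
		return 1;
	if (toy_bus_addr(&st, cpu_start + 0x1000) != bus_start + 0x1000)
		return 2;
	return 0;
}
```

Because the offset is stored as an unsigned 64-bit difference, the addition in `toy_bus_addr()` yields the right bus address whether the BAR's bus address is above or below its CPU physical address.
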