From patchwork Mon Mar 12 19:35:15 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Logan Gunthorpe X-Patchwork-Id: 884819 X-Patchwork-Delegate: bhelgaas@google.com Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-pci-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=deltatee.com Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 400SsD0Cz5z9sQn for ; Tue, 13 Mar 2018 06:36:16 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932458AbeCLTf6 (ORCPT ); Mon, 12 Mar 2018 15:35:58 -0400 Received: from ale.deltatee.com ([207.54.116.67]:54764 "EHLO ale.deltatee.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932313AbeCLTf4 (ORCPT ); Mon, 12 Mar 2018 15:35:56 -0400 Received: from cgy1-donard.priv.deltatee.com ([172.16.1.31]) by ale.deltatee.com with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from ) id 1evTEJ-000663-QD; Mon, 12 Mar 2018 13:35:34 -0600 Received: from gunthorp by cgy1-donard.priv.deltatee.com with local (Exim 4.89) (envelope-from ) id 1evTEG-0000kx-Ac; Mon, 12 Mar 2018 13:35:28 -0600 From: Logan Gunthorpe To: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org, linux-nvdimm@lists.01.org, linux-block@vger.kernel.org Cc: Stephen Bates , Christoph Hellwig , Jens Axboe , Keith Busch , Sagi Grimberg , Bjorn Helgaas , Jason Gunthorpe , Max Gurtovoy , Dan Williams , =?utf-8?b?SsOpcsO0bWUgR2xp?= =?utf-8?q?sse?= , Benjamin Herrenschmidt , Alex Williamson , Logan Gunthorpe Date: Mon, 12 Mar 2018 13:35:15 -0600 Message-Id: 
<20180312193525.2855-2-logang@deltatee.com> X-Mailer: git-send-email 2.11.0 In-Reply-To: <20180312193525.2855-1-logang@deltatee.com> References: <20180312193525.2855-1-logang@deltatee.com> MIME-Version: 1.0 X-SA-Exim-Connect-IP: 172.16.1.31 X-SA-Exim-Rcpt-To: linux-nvme@lists.infradead.org, linux-nvdimm@lists.01.org, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-rdma@vger.kernel.org, linux-block@vger.kernel.org, sbates@raithlin.com, hch@lst.de, axboe@kernel.dk, sagi@grimberg.me, bhelgaas@google.com, jgg@mellanox.com, maxg@mellanox.com, keith.busch@intel.com, dan.j.williams@intel.com, benh@kernel.crashing.org, jglisse@redhat.com, alex.williamson@redhat.com, logang@deltatee.com X-SA-Exim-Mail-From: gunthorp@deltatee.com X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on ale.deltatee.com X-Spam-Level: X-Spam-Status: No, score=-6.7 required=5.0 tests=ALL_TRUSTED,BAYES_00, MYRULES_FREE, T_RP_MATCHES_RCVD autolearn=no autolearn_force=no version=3.4.1 Subject: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory X-SA-Exim-Version: 4.2.1 (built Tue, 02 Aug 2016 21:08:31 +0000) X-SA-Exim-Scanned: Yes (on ale.deltatee.com) Sender: linux-pci-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pci@vger.kernel.org Some PCI devices may have memory mapped in a BAR space that's intended for use in peer-to-peer transactions. In order to enable such transactions the memory must be registered with ZONE_DEVICE pages so it can be used by DMA interfaces in existing drivers. Add an interface for other subsystems to find and allocate chunks of P2P memory as necessary to facilitate transfers between two PCI peers: int pci_p2pdma_add_client(); struct pci_dev *pci_p2pmem_find(); void *pci_alloc_p2pmem(); The new interface requires a driver to collect a list of client devices involved in the transaction with the pci_p2pmem_add_client*() functions then call pci_p2pmem_find() to obtain any suitable P2P memory. 
Once this is done the list is bound to the memory and the calling driver is free to add and remove clients as necessary (adding incompatible clients will fail). With a suitable p2pmem device, memory can then be allocated with pci_alloc_p2pmem() for use in DMA transactions. Depending on hardware, using peer-to-peer memory may reduce the bandwidth of the transfer but would significantly reduce pressure on system memory. This may be desirable in many cases: for example, a system could be designed with a small CPU connected to a PCI switch by a small number of lanes which would maximize the number of lanes available to connect to NVMe devices. The code is designed to only utilize the p2pmem device if all the devices involved in a transfer are behind the same PCI switch. This is because we have no way of knowing whether peer-to-peer routing between PCIe Root Ports is supported (PCIe r4.0, sec 1.3.1). Additionally, the benefits of P2P transfers that go through the RC are limited to only reducing DRAM usage and, in some cases, coding convenience. This commit includes significant rework and feedback from Christoph Hellwig. Signed-off-by: Christoph Hellwig Signed-off-by: Logan Gunthorpe --- drivers/pci/Kconfig | 16 ++ drivers/pci/Makefile | 1 + drivers/pci/p2pdma.c | 679 +++++++++++++++++++++++++++++++++++++++++++++ include/linux/memremap.h | 18 ++ include/linux/pci-p2pdma.h | 101 +++++++ include/linux/pci.h | 4 + 6 files changed, 819 insertions(+) create mode 100644 drivers/pci/p2pdma.c create mode 100644 include/linux/pci-p2pdma.h diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index 34b56a8f8480..d59f6f5ddfcd 100644 --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -124,6 +124,22 @@ config PCI_PASID If unsure, say N. 
+config PCI_P2PDMA + bool "PCI peer-to-peer transfer support" + depends on ZONE_DEVICE + select GENERIC_ALLOCATOR + help + Enables drivers to do PCI peer-to-peer transactions to and from + BARs that are exposed in other devices that are part of + the hierarchy where peer-to-peer DMA is guaranteed by the PCI + specification to work (i.e. anything below a single PCI bridge). + + Many PCIe root complexes do not support P2P transactions and + it's hard to tell which support it at all, so at this time you + will need a PCIe switch. + + If unsure, say N. + config PCI_LABEL def_bool y if (DMI || ACPI) depends on PCI diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile index 941970936840..45e0ff6f3213 100644 --- a/drivers/pci/Makefile +++ b/drivers/pci/Makefile @@ -26,6 +26,7 @@ obj-$(CONFIG_PCI_MSI) += msi.o obj-$(CONFIG_PCI_ATS) += ats.o obj-$(CONFIG_PCI_IOV) += iov.o +obj-$(CONFIG_PCI_P2PDMA) += p2pdma.o # # ACPI Related PCI FW Functions diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c new file mode 100644 index 000000000000..0ee917381dce --- /dev/null +++ b/drivers/pci/p2pdma.c @@ -0,0 +1,679 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * PCI Peer 2 Peer DMA support. + * + * Copyright (c) 2016-2018, Logan Gunthorpe + * Copyright (c) 2016-2017, Microsemi Corporation + * Copyright (c) 2017, Christoph Hellwig + * Copyright (c) 2018, Eideticom Inc. 
+ * + */ + +#include +#include +#include +#include +#include +#include +#include + +struct pci_p2pdma { + struct percpu_ref devmap_ref; + struct completion devmap_ref_done; + struct gen_pool *pool; + bool p2pmem_published; +}; + +static void pci_p2pdma_percpu_release(struct percpu_ref *ref) +{ + struct pci_p2pdma *p2p = + container_of(ref, struct pci_p2pdma, devmap_ref); + + complete_all(&p2p->devmap_ref_done); +} + +static void pci_p2pdma_percpu_kill(void *data) +{ + struct percpu_ref *ref = data; + + if (percpu_ref_is_dying(ref)) + return; + + percpu_ref_kill(ref); +} + +static void pci_p2pdma_release(void *data) +{ + struct pci_dev *pdev = data; + + if (!pdev->p2pdma) + return; + + wait_for_completion(&pdev->p2pdma->devmap_ref_done); + percpu_ref_exit(&pdev->p2pdma->devmap_ref); + + gen_pool_destroy(pdev->p2pdma->pool); + pdev->p2pdma = NULL; +} + +static int pci_p2pdma_setup(struct pci_dev *pdev) +{ + int error = -ENOMEM; + struct pci_p2pdma *p2p; + + p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL); + if (!p2p) + return -ENOMEM; + + p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev)); + if (!p2p->pool) + goto out; + + init_completion(&p2p->devmap_ref_done); + error = percpu_ref_init(&p2p->devmap_ref, + pci_p2pdma_percpu_release, 0, GFP_KERNEL); + if (error) + goto out_pool_destroy; + + percpu_ref_switch_to_atomic_sync(&p2p->devmap_ref); + + error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev); + if (error) + goto out_pool_destroy; + + pdev->p2pdma = p2p; + + return 0; + +out_pool_destroy: + gen_pool_destroy(p2p->pool); +out: + devm_kfree(&pdev->dev, p2p); + return error; +} + +/** + * pci_p2pdma_add_resource - add memory for use as p2p memory + * @pdev: the device to add the memory to + * @bar: PCI BAR to add + * @size: size of the memory to add, may be zero to use the whole BAR + * @offset: offset into the PCI BAR + * + * The memory will be given ZONE_DEVICE struct pages so that it may + * be used with any DMA request. 
+ */ +int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, + u64 offset) +{ + struct dev_pagemap *pgmap; + void *addr; + int error; + + if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM)) + return -EINVAL; + + if (offset >= pci_resource_len(pdev, bar)) + return -EINVAL; + + if (!size) + size = pci_resource_len(pdev, bar) - offset; + + if (size + offset > pci_resource_len(pdev, bar)) + return -EINVAL; + + if (!pdev->p2pdma) { + error = pci_p2pdma_setup(pdev); + if (error) + return error; + } + + pgmap = devm_kzalloc(&pdev->dev, sizeof(*pgmap), GFP_KERNEL); + if (!pgmap) + return -ENOMEM; + + pgmap->res.start = pci_resource_start(pdev, bar) + offset; + pgmap->res.end = pgmap->res.start + size - 1; + pgmap->res.flags = pci_resource_flags(pdev, bar); + pgmap->ref = &pdev->p2pdma->devmap_ref; + pgmap->type = MEMORY_DEVICE_PCI_P2PDMA; + + addr = devm_memremap_pages(&pdev->dev, pgmap); + if (IS_ERR(addr)) { + error = PTR_ERR(addr); + goto pgmap_free; + } + + error = gen_pool_add_virt(pdev->p2pdma->pool, (unsigned long)addr, + pci_bus_address(pdev, bar) + offset, + resource_size(&pgmap->res), dev_to_node(&pdev->dev)); + if (error) + goto pgmap_free; + + error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_percpu_kill, + &pdev->p2pdma->devmap_ref); + if (error) + goto pgmap_free; + + pci_info(pdev, "added peer-to-peer DMA memory %pR\n", + &pgmap->res); + + return 0; + +pgmap_free: + devres_free(pgmap); + return error; +} +EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource); + +static struct pci_dev *find_parent_pci_dev(struct device *dev) +{ + struct device *parent; + + dev = get_device(dev); + + while (dev) { + if (dev_is_pci(dev)) + return to_pci_dev(dev); + + parent = get_device(dev->parent); + put_device(dev); + dev = parent; + } + + return NULL; +} + +/* + * If a device is behind a switch, we try to find the upstream bridge + * port of the switch. 
This requires two calls to pci_upstream_bridge(): + * one for the upstream port on the switch, one on the upstream port + * for the next level in the hierarchy. Because of this, devices connected + * to the root port will be rejected. + */ +static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev) +{ + struct pci_dev *up1, *up2; + + if (!pdev) + return NULL; + + up1 = pci_dev_get(pci_upstream_bridge(pdev)); + if (!up1) + return NULL; + + up2 = pci_dev_get(pci_upstream_bridge(up1)); + pci_dev_put(up1); + + return up2; +} + +/* + * This function checks if two PCI devices are behind the same switch. + * (ie. they share the same second upstream port as returned by + * get_upstream_bridge_port().) + * + * Future work could expand this to handle hierarchies of switches + * so any devices whose traffic can be routed without going through + * the root complex could be used. For now, we limit it to just one + * level of switch. + * + * This function returns the "distance" between the devices. 0 meaning + * they are the same device, 1 meaning they are behind the same switch. + * If they are not behind the same switch, -1 is returned. 
+ */ +static int __upstream_bridges_match(struct pci_dev *upstream, + struct pci_dev *client) +{ + struct pci_dev *dma_up; + int ret = 1; + + dma_up = get_upstream_bridge_port(client); + + if (!dma_up) { + dev_dbg(&client->dev, "not a PCI device behind a bridge\n"); + ret = -1; + goto out; + } + + if (upstream != dma_up) { + dev_dbg(&client->dev, + "does not reside on the same upstream bridge\n"); + ret = -1; + goto out; + } + +out: + pci_dev_put(dma_up); + return ret; +} + +static int upstream_bridges_match(struct pci_dev *provider, + struct pci_dev *client) +{ + struct pci_dev *upstream; + int ret; + + if (provider == client) + return 0; + + upstream = get_upstream_bridge_port(provider); + if (!upstream) { + pci_warn(provider, "not behind a PCI bridge\n"); + return -1; + } + + ret = __upstream_bridges_match(upstream, client); + + pci_dev_put(upstream); + + return ret; +} + +struct pci_p2pdma_client { + struct list_head list; + struct pci_dev *client; + struct pci_dev *provider; +}; + +/** + * pci_p2pdma_add_client - allocate a new element in a client device list + * @head: list head of p2pdma clients + * @dev: device to add to the list + * + * This adds @dev to a list of clients used by a p2pdma device. + * This list should be passed to pci_p2pmem_find(). Once pci_p2pmem_find() has + * been called successfully, the list will be bound to a specific p2pdma + * device and new clients can only be added to the list if they are + * supported by that p2pdma device. + * + * The caller is expected to have a lock which protects @head as necessary + * so that none of the pci_p2p functions can be called concurrently + * on that list. + * + * Returns 0 if the client was successfully added. 
+ */ +int pci_p2pdma_add_client(struct list_head *head, struct device *dev) +{ + struct pci_p2pdma_client *item, *new_item; + struct pci_dev *provider = NULL; + struct pci_dev *client; + int ret; + + if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops) { + dev_warn(dev, + "cannot be used for peer-to-peer DMA because the driver makes use of dma_virt_ops\n"); + return -ENODEV; + } + + + client = find_parent_pci_dev(dev); + if (!client) { + dev_warn(dev, + "cannot be used for peer-to-peer DMA as it is not a PCI device\n"); + return -ENODEV; + } + + item = list_first_entry_or_null(head, struct pci_p2pdma_client, list); + if (item && item->provider) { + provider = item->provider; + + if (upstream_bridges_match(provider, client) < 0) { + ret = -EXDEV; + goto put_client; + } + } + + new_item = kzalloc(sizeof(*new_item), GFP_KERNEL); + if (!new_item) { + ret = -ENOMEM; + goto put_client; + } + + new_item->client = client; + new_item->provider = pci_dev_get(provider); + + list_add_tail(&new_item->list, head); + + return 0; + +put_client: + pci_dev_put(client); + return ret; +} +EXPORT_SYMBOL_GPL(pci_p2pdma_add_client); + +static void pci_p2pdma_client_free(struct pci_p2pdma_client *item) +{ + list_del(&item->list); + pci_dev_put(item->client); + pci_dev_put(item->provider); + kfree(item); +} + +/** + * pci_p2pdma_remove_client - remove and free a new p2pdma client + * @head: list head of p2pdma clients + * @dev: device to remove from the list + * + * This removes @dev from a list of clients used by a p2pdma device. + * The caller is expected to have a lock which protects @head as necessary + * so that none of the pci_p2p functions can be called concurrently + * on that list. 
+ */ +void pci_p2pdma_remove_client(struct list_head *head, struct device *dev) +{ + struct pci_p2pdma_client *pos, *tmp; + struct pci_dev *pdev; + + pdev = find_parent_pci_dev(dev); + if (!pdev) + return; + + list_for_each_entry_safe(pos, tmp, head, list) { + if (pos->client != pdev) + continue; + + pci_p2pdma_client_free(pos); + } + + pci_dev_put(pdev); +} +EXPORT_SYMBOL_GPL(pci_p2pdma_remove_client); + +/** + * pci_p2pdma_client_list_free - free an entire list of p2pdma clients + * @head: list head of p2pdma clients + * + * This removes all devices in a list of clients used by a p2pdma device. + * The caller is expected to have a lock which protects @head as necessary + * so that none of the pci_p2pdma functions can be called concurrently + * on that list. + */ +void pci_p2pdma_client_list_free(struct list_head *head) +{ + struct pci_p2pdma_client *pos, *tmp; + + list_for_each_entry_safe(pos, tmp, head, list) + pci_p2pdma_client_free(pos); +} +EXPORT_SYMBOL_GPL(pci_p2pdma_client_list_free); + +/** + * pci_p2pdma_distance - Determine the cumulative distance between + * a p2pdma provider and the clients in use. + * @provider: p2pdma provider to check against the client list + * @clients: list of devices to check (NULL-terminated) + * + * Returns -1 if any of the clients are not compatible (that is, not behind + * the same switch as the provider), otherwise returns a non-negative number + * where the lower number is the preferable choice. (If there's one client + * that's the same as the provider it will return 0, which is the best choice). + * + * For now, "compatible" means the provider and the clients are all behind + * the same switch. This cuts out cases that may work but is safest for the + * user. Future work can expand this to cases with nested switches. 
+ */ +int pci_p2pdma_distance(struct pci_dev *provider, struct list_head *clients) +{ + struct pci_p2pdma_client *pos; + struct pci_dev *upstream; + int ret; + int distance = 0; + + upstream = get_upstream_bridge_port(provider); + if (!upstream) { + pci_warn(provider, "not behind a PCI bridge\n"); + return -1; + } + + list_for_each_entry(pos, clients, list) { + if (pos->client == provider) + continue; + + ret = __upstream_bridges_match(upstream, pos->client); + if (ret < 0) + goto no_match; + + distance += ret; + } + + ret = distance; + +no_match: + pci_dev_put(upstream); + return ret; +} +EXPORT_SYMBOL_GPL(pci_p2pdma_distance); + +/** + * pci_p2pdma_assign_provider - Check compatibility (as per pci_p2pdma_distance) + * and assign a provider to a list of clients + * @provider: p2pdma provider to assign to the client list + * @clients: list of devices to check (NULL-terminated) + * + * Returns false if any of the clients are not compatible, true if the + * provider was successfully assigned to the clients. + */ +bool pci_p2pdma_assign_provider(struct pci_dev *provider, + struct list_head *clients) +{ + struct pci_p2pdma_client *pos; + + if (pci_p2pdma_distance(provider, clients) < 0) + return false; + + list_for_each_entry(pos, clients, list) + pos->provider = provider; + + return true; +} +EXPORT_SYMBOL_GPL(pci_p2pdma_assign_provider); + +/** + * pci_has_p2pmem - check if a given PCI device has published any p2pmem + * @pdev: PCI device to check + */ +bool pci_has_p2pmem(struct pci_dev *pdev) +{ + return pdev->p2pdma && pdev->p2pdma->p2pmem_published; +} +EXPORT_SYMBOL_GPL(pci_has_p2pmem); + +/** + * pci_p2pmem_find - find a peer-to-peer DMA memory device compatible with + * the specified list of clients and shortest distance (as determined + * by pci_p2pdma_distance()) + * @clients: list of devices to check (NULL-terminated) + * + * If multiple devices are behind the same switch, the one "closest" to the + * client devices in use will be chosen first. 
(So if one of the providers is + * the same as one of the clients, that provider will be used ahead of any + * other providers that are unrelated). If multiple providers are an equal + * distance away, one will be chosen at random. + * + * Returns a pointer to the PCI device with a reference taken (use pci_dev_put + * to return the reference) or NULL if no compatible device is found. The + * found provider will also be assigned to the client list. + */ +struct pci_dev *pci_p2pmem_find(struct list_head *clients) +{ + struct pci_dev *pdev = NULL; + struct pci_p2pdma_client *pos; + int distance; + int closest_distance = INT_MAX; + struct pci_dev **closest_pdevs; + int ties = 0; + const int max_ties = PAGE_SIZE / sizeof(*closest_pdevs); + int i; + + closest_pdevs = kmalloc(PAGE_SIZE, GFP_KERNEL); + + while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev))) { + if (!pci_has_p2pmem(pdev)) + continue; + + distance = pci_p2pdma_distance(pdev, clients); + if (distance < 0 || distance > closest_distance) + continue; + + if (distance == closest_distance && ties >= max_ties) + continue; + + if (distance < closest_distance) { + for (i = 0; i < ties; i++) + pci_dev_put(closest_pdevs[i]); + + ties = 0; + closest_distance = distance; + } + + closest_pdevs[ties++] = pci_dev_get(pdev); + } + + if (ties) + pdev = pci_dev_get(closest_pdevs[prandom_u32_max(ties)]); + + for (i = 0; i < ties; i++) + pci_dev_put(closest_pdevs[i]); + + if (pdev) + list_for_each_entry(pos, clients, list) + pos->provider = pdev; + + kfree(closest_pdevs); + return pdev; +} +EXPORT_SYMBOL_GPL(pci_p2pmem_find); + +/** + * pci_alloc_p2pmem - allocate peer-to-peer DMA memory + * @pdev: the device to allocate memory from + * @size: number of bytes to allocate + * + * Returns the allocated memory or NULL on error. 
+ */ +void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size) +{ + void *ret; + + if (unlikely(!pdev->p2pdma)) + return NULL; + + if (unlikely(!percpu_ref_tryget_live(&pdev->p2pdma->devmap_ref))) + return NULL; + + ret = (void *)gen_pool_alloc(pdev->p2pdma->pool, size); + + if (unlikely(!ret)) + percpu_ref_put(&pdev->p2pdma->devmap_ref); + + return ret; +} +EXPORT_SYMBOL_GPL(pci_alloc_p2pmem); + +/** + * pci_free_p2pmem - release peer-to-peer DMA memory + * @pdev: the device the memory was allocated from + * @addr: address of the memory that was allocated + * @size: number of bytes that were allocated + */ +void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size) +{ + gen_pool_free(pdev->p2pdma->pool, (uintptr_t)addr, size); + percpu_ref_put(&pdev->p2pdma->devmap_ref); +} +EXPORT_SYMBOL_GPL(pci_free_p2pmem); + +/** + * pci_p2pmem_virt_to_bus - return the PCI bus address for a given virtual + * address obtained with pci_alloc_p2pmem() + * @pdev: the device the memory was allocated from + * @addr: address of the memory that was allocated + */ +pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr) +{ + if (!addr) + return 0; + if (!pdev->p2pdma) + return 0; + + /* + * Note: when we added the memory to the pool we used the PCI + * bus address as the physical address. So gen_pool_virt_to_phys() + * actually returns the bus address despite the misleading name. 
+ */ + return gen_pool_virt_to_phys(pdev->p2pdma->pool, (unsigned long)addr); +} +EXPORT_SYMBOL_GPL(pci_p2pmem_virt_to_bus); + +/** + * pci_p2pmem_alloc_sgl - allocate peer-to-peer DMA memory in a scatterlist + * @pdev: the device to allocate memory from + * @sgl: the allocated scatterlist + * @nents: the number of SG entries in the list + * @length: number of bytes to allocate + * + * Returns 0 on success, -ENOMEM on failure + */ +int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl, + unsigned int *nents, u32 length) +{ + struct scatterlist *sg; + void *addr; + + sg = kzalloc(sizeof(*sg), GFP_KERNEL); + if (!sg) + return -ENOMEM; + + sg_init_table(sg, 1); + + addr = pci_alloc_p2pmem(pdev, length); + if (!addr) + goto out_free_sg; + + sg_set_buf(sg, addr, length); + *sgl = sg; + *nents = 1; + return 0; + +out_free_sg: + kfree(sg); + return -ENOMEM; +} +EXPORT_SYMBOL_GPL(pci_p2pmem_alloc_sgl); + +/** + * pci_p2pmem_free_sgl - free a scatterlist allocated by pci_p2pmem_alloc_sgl() + * @pdev: the device the memory was allocated from + * @sgl: the allocated scatterlist + * @nents: the number of SG entries in the list + */ +void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl, + unsigned int nents) +{ + struct scatterlist *sg; + int count; + + if (!sgl || !nents) + return; + + for_each_sg(sgl, sg, nents, count) + pci_free_p2pmem(pdev, sg_virt(sg), sg->length); + kfree(sgl); +} +EXPORT_SYMBOL_GPL(pci_p2pmem_free_sgl); + +/** + * pci_p2pmem_publish - publish the peer-to-peer DMA memory for use by + * other devices with pci_p2pmem_find() + * @pdev: the device with peer-to-peer DMA memory to publish + * @publish: set to true to publish the memory, false to unpublish it + */ +void pci_p2pmem_publish(struct pci_dev *pdev, bool publish) +{ + if (!pdev->p2pdma) + return; + + pdev->p2pdma->p2pmem_published = publish; +} +EXPORT_SYMBOL_GPL(pci_p2pmem_publish); diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 7b4899c06f49..9e907c338a44 
100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -53,11 +53,16 @@ struct vmem_altmap { * driver can hotplug the device memory using ZONE_DEVICE and with that memory * type. Any page of a process can be migrated to such memory. However no one * should be allow to pin such memory so that it can always be evicted. + * + * MEMORY_DEVICE_PCI_P2PDMA: + * Device memory residing in a PCI BAR intended for use with Peer-to-Peer + * transactions. */ enum memory_type { MEMORY_DEVICE_HOST = 0, MEMORY_DEVICE_PRIVATE, MEMORY_DEVICE_PUBLIC, + MEMORY_DEVICE_PCI_P2PDMA, }; /* @@ -161,6 +166,19 @@ static inline void vmem_altmap_free(struct vmem_altmap *altmap, } #endif /* CONFIG_ZONE_DEVICE */ +#ifdef CONFIG_PCI_P2PDMA +static inline bool is_pci_p2pdma_page(const struct page *page) +{ + return is_zone_device_page(page) && + page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA; +} +#else /* CONFIG_PCI_P2PDMA */ +static inline bool is_pci_p2pdma_page(const struct page *page) +{ + return false; +} +#endif /* CONFIG_PCI_P2PDMA */ + #if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC) static inline bool is_device_private_page(const struct page *page) { diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h new file mode 100644 index 000000000000..1f7856ff098b --- /dev/null +++ b/include/linux/pci-p2pdma.h @@ -0,0 +1,101 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * PCI Peer 2 Peer DMA support. + * + * Copyright (c) 2016-2018, Logan Gunthorpe + * Copyright (c) 2016-2017, Microsemi Corporation + * Copyright (c) 2017, Christoph Hellwig + * Copyright (c) 2018, Eideticom Inc. 
+ * + */ + +#ifndef _LINUX_PCI_P2PDMA_H +#define _LINUX_PCI_P2PDMA_H + +#include + +struct block_device; +struct scatterlist; + +#ifdef CONFIG_PCI_P2PDMA +int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, + u64 offset); +int pci_p2pdma_add_client(struct list_head *head, struct device *dev); +void pci_p2pdma_remove_client(struct list_head *head, struct device *dev); +void pci_p2pdma_client_list_free(struct list_head *head); +int pci_p2pdma_distance(struct pci_dev *provider, struct list_head *clients); +bool pci_p2pdma_assign_provider(struct pci_dev *provider, + struct list_head *clients); +bool pci_has_p2pmem(struct pci_dev *pdev); +struct pci_dev *pci_p2pmem_find(struct list_head *clients); +void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size); +void pci_free_p2pmem(struct pci_dev *pdev, void *addr, size_t size); +pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, void *addr); +int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl, + unsigned int *nents, u32 length); +void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl, + unsigned int nents); +void pci_p2pmem_publish(struct pci_dev *pdev, bool publish); +#else /* CONFIG_PCI_P2PDMA */ +static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, + size_t size, u64 offset) +{ + return 0; +} +static inline int pci_p2pdma_add_client(struct list_head *head, + struct device *dev) +{ + return 0; +} +static inline void pci_p2pdma_remove_client(struct list_head *head, + struct device *dev) +{ +} +static inline void pci_p2pdma_client_list_free(struct list_head *head) +{ +} +static inline int pci_p2pdma_distance(struct pci_dev *provider, + struct list_head *clients) +{ + return -1; +} +static inline bool pci_p2pdma_assign_provider(struct pci_dev *provider, + struct list_head *clients) +{ + return false; +} +static inline bool pci_has_p2pmem(struct pci_dev *pdev) +{ + return false; +} +static inline struct pci_dev *pci_p2pmem_find(struct 
list_head *clients) +{ + return NULL; +} +static inline void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size) +{ + return NULL; +} +static inline void pci_free_p2pmem(struct pci_dev *pdev, void *addr, + size_t size) +{ +} +static inline pci_bus_addr_t pci_p2pmem_virt_to_bus(struct pci_dev *pdev, + void *addr) +{ + return 0; +} +static inline int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, + struct scatterlist **sgl, unsigned int *nents, u32 length) +{ + return -ENODEV; +} +static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev, + struct scatterlist *sgl, unsigned int nents) +{ +} +static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish) +{ +} +#endif /* CONFIG_PCI_P2PDMA */ +#endif /* _LINUX_PCI_P2PDMA_H */ diff --git a/include/linux/pci.h b/include/linux/pci.h index 024a1beda008..437e42615896 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -276,6 +276,7 @@ struct pcie_link_state; struct pci_vpd; struct pci_sriov; struct pci_ats; +struct pci_p2pdma; /* The pci_dev structure describes PCI devices */ struct pci_dev { @@ -429,6 +430,9 @@ struct pci_dev { #ifdef CONFIG_PCI_PASID u16 pasid_features; #endif +#ifdef CONFIG_PCI_P2PDMA + struct pci_p2pdma *p2pdma; +#endif phys_addr_t rom; /* Physical address if not from BAR */ size_t romlen; /* Length if not from BAR */ char *driver_override; /* Driver name to force a match */ From patchwork Mon Mar 12 19:35:16 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Logan Gunthorpe X-Patchwork-Id: 884818 X-Patchwork-Delegate: bhelgaas@google.com Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-pci-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) 
header.from=deltatee.com Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 400Srh3msXz9sRc for ; Tue, 13 Mar 2018 06:35:48 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932282AbeCLTfq (ORCPT ); Mon, 12 Mar 2018 15:35:46 -0400 Received: from ale.deltatee.com ([207.54.116.67]:54640 "EHLO ale.deltatee.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932313AbeCLTfj (ORCPT ); Mon, 12 Mar 2018 15:35:39 -0400 Received: from cgy1-donard.priv.deltatee.com ([172.16.1.31]) by ale.deltatee.com with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from ) id 1evTEJ-000664-QE; Mon, 12 Mar 2018 13:35:33 -0600 Received: from gunthorp by cgy1-donard.priv.deltatee.com with local (Exim 4.89) (envelope-from ) id 1evTEG-0000l0-GN; Mon, 12 Mar 2018 13:35:28 -0600 From: Logan Gunthorpe To: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org, linux-nvdimm@lists.01.org, linux-block@vger.kernel.org Cc: Stephen Bates , Christoph Hellwig , Jens Axboe , Keith Busch , Sagi Grimberg , Bjorn Helgaas , Jason Gunthorpe , Max Gurtovoy , Dan Williams , =?utf-8?b?SsOpcsO0bWUgR2xp?= =?utf-8?q?sse?= , Benjamin Herrenschmidt , Alex Williamson , Logan Gunthorpe Date: Mon, 12 Mar 2018 13:35:16 -0600 Message-Id: <20180312193525.2855-3-logang@deltatee.com> X-Mailer: git-send-email 2.11.0 In-Reply-To: <20180312193525.2855-1-logang@deltatee.com> References: <20180312193525.2855-1-logang@deltatee.com> X-SA-Exim-Connect-IP: 172.16.1.31 X-SA-Exim-Rcpt-To: linux-nvme@lists.infradead.org, linux-nvdimm@lists.01.org, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-rdma@vger.kernel.org, linux-block@vger.kernel.org, sbates@raithlin.com, hch@lst.de, axboe@kernel.dk, sagi@grimberg.me, bhelgaas@google.com, jgg@mellanox.com, maxg@mellanox.com, keith.busch@intel.com, dan.j.williams@intel.com, 
benh@kernel.crashing.org, jglisse@redhat.com, alex.williamson@redhat.com, logang@deltatee.com X-SA-Exim-Mail-From: gunthorp@deltatee.com X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on ale.deltatee.com X-Spam-Level: X-Spam-Status: No, score=-6.5 required=5.0 tests=ALL_TRUSTED,BAYES_00, MYRULES_FREE,MYRULES_NO_TEXT,T_RP_MATCHES_RCVD autolearn=no autolearn_force=no version=3.4.1 Subject: [PATCH v3 02/11] PCI/P2PDMA: Add sysfs group to display p2pmem stats X-SA-Exim-Version: 4.2.1 (built Tue, 02 Aug 2016 21:08:31 +0000) X-SA-Exim-Scanned: Yes (on ale.deltatee.com) Sender: linux-pci-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pci@vger.kernel.org Add a sysfs group to display statistics about P2P memory that is registered in each PCI device. Attributes in the group display the total amount of P2P memory, the amount available and whether it is published or not. Signed-off-by: Logan Gunthorpe --- Documentation/ABI/testing/sysfs-bus-pci | 25 +++++++++++++++ drivers/pci/p2pdma.c | 54 +++++++++++++++++++++++++++++++++ 2 files changed, 79 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci index 44d4b2be92fd..044812c816d0 100644 --- a/Documentation/ABI/testing/sysfs-bus-pci +++ b/Documentation/ABI/testing/sysfs-bus-pci @@ -323,3 +323,28 @@ Description: This is similar to /sys/bus/pci/drivers_autoprobe, but affects only the VFs associated with a specific PF. + +What: /sys/bus/pci/devices/.../p2pmem/available +Date: November 2017 +Contact: Logan Gunthorpe +Description: + If the device has any Peer-to-Peer memory registered, this + file contains the amount of memory that has not been + allocated (in decimal). + +What: /sys/bus/pci/devices/.../p2pmem/size +Date: November 2017 +Contact: Logan Gunthorpe +Description: + If the device has any Peer-to-Peer memory registered, this + file contains the total amount of memory that the device + provides (in decimal). 
+ +What: /sys/bus/pci/devices/.../p2pmem/published +Date: November 2017 +Contact: Logan Gunthorpe +Description: + If the device has any Peer-to-Peer memory registered, this + file contains a '1' if the memory has been published for + use inside the kernel or a '0' if it is only intended + for use within the driver that registered it. diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c index 0ee917381dce..fd4789566a56 100644 --- a/drivers/pci/p2pdma.c +++ b/drivers/pci/p2pdma.c @@ -24,6 +24,54 @@ struct pci_p2pdma { bool p2pmem_published; }; +static ssize_t size_show(struct device *dev, struct device_attribute *attr, + char *buf) +{ + struct pci_dev *pdev = to_pci_dev(dev); + size_t size = 0; + + if (pdev->p2pdma->pool) + size = gen_pool_size(pdev->p2pdma->pool); + + return snprintf(buf, PAGE_SIZE, "%zu\n", size); +} +static DEVICE_ATTR_RO(size); + +static ssize_t available_show(struct device *dev, struct device_attribute *attr, + char *buf) +{ + struct pci_dev *pdev = to_pci_dev(dev); + size_t avail = 0; + + if (pdev->p2pdma->pool) + avail = gen_pool_avail(pdev->p2pdma->pool); + + return snprintf(buf, PAGE_SIZE, "%zu\n", avail); +} +static DEVICE_ATTR_RO(available); + +static ssize_t published_show(struct device *dev, struct device_attribute *attr, + char *buf) +{ + struct pci_dev *pdev = to_pci_dev(dev); + + return snprintf(buf, PAGE_SIZE, "%d\n", + pdev->p2pdma->p2pmem_published); +} +static DEVICE_ATTR_RO(published); + +static struct attribute *p2pmem_attrs[] = { + &dev_attr_size.attr, + &dev_attr_available.attr, + &dev_attr_published.attr, + NULL, +}; + +static const struct attribute_group p2pmem_group = { + .attrs = p2pmem_attrs, + .name = "p2pmem", +}; + static void pci_p2pdma_percpu_release(struct percpu_ref *ref) { struct pci_p2pdma *p2p = @@ -53,6 +101,7 @@ static void pci_p2pdma_release(void *data) percpu_ref_exit(&pdev->p2pdma->devmap_ref); gen_pool_destroy(pdev->p2pdma->pool); + sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group); pdev->p2pdma = 
NULL; } @@ -83,9 +132,14 @@ static int pci_p2pdma_setup(struct pci_dev *pdev) pdev->p2pdma = p2p; + error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group); + if (error) + goto out_pool_destroy; + return 0; out_pool_destroy: + pdev->p2pdma = NULL; gen_pool_destroy(p2p->pool); out: devm_kfree(&pdev->dev, p2p); From patchwork Mon Mar 12 19:35:17 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Logan Gunthorpe X-Patchwork-Id: 884823 X-Patchwork-Delegate: bhelgaas@google.com Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-pci-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=deltatee.com Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 400Sv56wFmz9sRX for ; Tue, 13 Mar 2018 06:37:53 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932497AbeCLThM (ORCPT ); Mon, 12 Mar 2018 15:37:12 -0400 Received: from ale.deltatee.com ([207.54.116.67]:54642 "EHLO ale.deltatee.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932303AbeCLTfj (ORCPT ); Mon, 12 Mar 2018 15:35:39 -0400 Received: from cgy1-donard.priv.deltatee.com ([172.16.1.31]) by ale.deltatee.com with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from ) id 1evTEJ-000665-QD; Mon, 12 Mar 2018 13:35:32 -0600 Received: from gunthorp by cgy1-donard.priv.deltatee.com with local (Exim 4.89) (envelope-from ) id 1evTEG-0000l3-K9; Mon, 12 Mar 2018 13:35:28 -0600 From: Logan Gunthorpe To: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org, linux-nvdimm@lists.01.org, linux-block@vger.kernel.org 
Cc: Stephen Bates , Christoph Hellwig , Jens Axboe , Keith Busch , Sagi Grimberg , Bjorn Helgaas , Jason Gunthorpe , Max Gurtovoy , Dan Williams , =?utf-8?b?SsOpcsO0bWUgR2xp?= =?utf-8?q?sse?= , Benjamin Herrenschmidt , Alex Williamson , Logan Gunthorpe Date: Mon, 12 Mar 2018 13:35:17 -0600 Message-Id: <20180312193525.2855-4-logang@deltatee.com> X-Mailer: git-send-email 2.11.0 In-Reply-To: <20180312193525.2855-1-logang@deltatee.com> References: <20180312193525.2855-1-logang@deltatee.com> X-SA-Exim-Connect-IP: 172.16.1.31 X-SA-Exim-Rcpt-To: linux-nvme@lists.infradead.org, linux-nvdimm@lists.01.org, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-rdma@vger.kernel.org, linux-block@vger.kernel.org, sbates@raithlin.com, hch@lst.de, axboe@kernel.dk, sagi@grimberg.me, bhelgaas@google.com, jgg@mellanox.com, maxg@mellanox.com, keith.busch@intel.com, dan.j.williams@intel.com, benh@kernel.crashing.org, jglisse@redhat.com, alex.williamson@redhat.com, logang@deltatee.com X-SA-Exim-Mail-From: gunthorp@deltatee.com X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on ale.deltatee.com X-Spam-Level: X-Spam-Status: No, score=-6.5 required=5.0 tests=ALL_TRUSTED,BAYES_00, MYRULES_FREE,MYRULES_NO_TEXT,T_RP_MATCHES_RCVD autolearn=no autolearn_force=no version=3.4.1 Subject: [PATCH v3 03/11] PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset X-SA-Exim-Version: 4.2.1 (built Tue, 02 Aug 2016 21:08:31 +0000) X-SA-Exim-Scanned: Yes (on ale.deltatee.com) Sender: linux-pci-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pci@vger.kernel.org The DMA address used when mapping PCI P2P memory must be the PCI bus address. Thus, introduce pci_p2pdma_[un]map_sg() to map the correct addresses when using P2P memory. For this, we assume that an SGL passed to these functions contains either all P2P memory or no P2P memory. 
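The translation from a CPU physical address to a PCI bus address can be illustrated with a minimal user-space sketch. This is not kernel code: struct bar_window and p2p_phys_to_bus() are hypothetical names used only for this example, standing in for the values the kernel obtains from pci_resource_start() and pci_bus_address() for the registered BAR.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one BAR registered as P2P memory. */
struct bar_window {
	uint64_t phys_start;	/* CPU physical address of the BAR start */
	uint64_t bus_start;	/* PCI bus address of the BAR start */
	uint64_t len;		/* size of the window in bytes */
};

/*
 * Translate a CPU physical address inside the window into the PCI bus
 * address that a peer device would have to use as its DMA address.
 */
static uint64_t p2p_phys_to_bus(const struct bar_window *w, uint64_t paddr)
{
	/* The address must lie inside the registered window. */
	assert(paddr >= w->phys_start && paddr - w->phys_start < w->len);

	/* Keep the offset into the BAR, swap the base address. */
	return w->bus_start + (paddr - w->phys_start);
}
```

On platforms where the bus and CPU address spaces coincide, the two start addresses are equal and the translation is the identity.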
Signed-off-by: Logan Gunthorpe --- drivers/pci/p2pdma.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++ include/linux/memremap.h | 1 + include/linux/pci-p2pdma.h | 13 ++++++++++++ 3 files changed, 65 insertions(+) diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c index fd4789566a56..ab810c3a93eb 100644 --- a/drivers/pci/p2pdma.c +++ b/drivers/pci/p2pdma.c @@ -190,6 +190,8 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, pgmap->res.flags = pci_resource_flags(pdev, bar); pgmap->ref = &pdev->p2pdma->devmap_ref; pgmap->type = MEMORY_DEVICE_PCI_P2PDMA; + pgmap->pci_p2pdma_bus_offset = pci_bus_address(pdev, bar) - + pci_resource_start(pdev, bar); addr = devm_memremap_pages(&pdev->dev, pgmap); if (IS_ERR(addr)) { @@ -731,3 +733,52 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish) pdev->p2pdma->p2pmem_published = publish; } EXPORT_SYMBOL_GPL(pci_p2pmem_publish); + +/** + * pci_p2pdma_map_sg - map a PCI peer-to-peer sg for DMA + * @dev: device doing the DMA request + * @sg: scatter list to map + * @nents: elements in the scatterlist + * @dir: DMA direction + * + * Returns the number of SG entries mapped + */ +int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents, + enum dma_data_direction dir) +{ + struct dev_pagemap *pgmap; + struct scatterlist *s; + phys_addr_t paddr; + int i; + + /* + * p2pdma mappings are not compatible with devices that use + * dma_virt_ops. 
+ */ + if (IS_ENABLED(CONFIG_DMA_VIRT_OPS) && dev->dma_ops == &dma_virt_ops) + return 0; + + for_each_sg(sg, s, nents, i) { + pgmap = sg_page(s)->pgmap; + paddr = sg_phys(s); + + s->dma_address = paddr + pgmap->pci_p2pdma_bus_offset; + sg_dma_len(s) = s->length; + } + + return nents; +} +EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg); + +/** + * pci_p2pdma_unmap_sg - unmap a PCI peer-to-peer sg for DMA + * @dev: device doing the DMA request + * @sg: scatter list to unmap + * @nents: elements in the scatterlist + * @dir: DMA direction + */ +void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents, + enum dma_data_direction dir) +{ +} +EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg); diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 9e907c338a44..1660f64ce96f 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -125,6 +125,7 @@ struct dev_pagemap { struct device *dev; void *data; enum memory_type type; + u64 pci_p2pdma_bus_offset; }; #ifdef CONFIG_ZONE_DEVICE diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h index 1f7856ff098b..59eb218bdb25 100644 --- a/include/linux/pci-p2pdma.h +++ b/include/linux/pci-p2pdma.h @@ -36,6 +36,10 @@ int pci_p2pmem_alloc_sgl(struct pci_dev *pdev, struct scatterlist **sgl, void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl, unsigned int nents); void pci_p2pmem_publish(struct pci_dev *pdev, bool publish); +int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents, + enum dma_data_direction dir); +void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents, + enum dma_data_direction dir); #else /* CONFIG_PCI_P2PDMA */ static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset) @@ -97,5 +101,14 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev, static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish) { } +static inline int pci_p2pdma_map_sg(struct device 
*dev, + struct scatterlist *sg, int nents, enum dma_data_direction dir) +{ + return 0; +} +static inline void pci_p2pdma_unmap_sg(struct device *dev, + struct scatterlist *sg, int nents, enum dma_data_direction dir) +{ +} #endif /* CONFIG_PCI_P2PDMA */ #endif /* _LINUX_PCI_P2P_H */ From patchwork Mon Mar 12 19:35:18 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Logan Gunthorpe X-Patchwork-Id: 884826 X-Patchwork-Delegate: bhelgaas@google.com Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-pci-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=deltatee.com Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 400Svl5k61z9sRX for ; Tue, 13 Mar 2018 06:38:27 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751620AbeCLThI (ORCPT ); Mon, 12 Mar 2018 15:37:08 -0400 Received: from ale.deltatee.com ([207.54.116.67]:54654 "EHLO ale.deltatee.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932294AbeCLTfj (ORCPT ); Mon, 12 Mar 2018 15:35:39 -0400 Received: from cgy1-donard.priv.deltatee.com ([172.16.1.31]) by ale.deltatee.com with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from ) id 1evTEJ-000666-QD; Mon, 12 Mar 2018 13:35:33 -0600 Received: from gunthorp by cgy1-donard.priv.deltatee.com with local (Exim 4.89) (envelope-from ) id 1evTEG-0000l6-Nf; Mon, 12 Mar 2018 13:35:28 -0600 From: Logan Gunthorpe To: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org, linux-nvdimm@lists.01.org, linux-block@vger.kernel.org Cc: Stephen Bates , Christoph 
Hellwig , Jens Axboe , Keith Busch , Sagi Grimberg , Bjorn Helgaas , Jason Gunthorpe , Max Gurtovoy , Dan Williams , =?utf-8?b?SsOpcsO0bWUgR2xp?= =?utf-8?q?sse?= , Benjamin Herrenschmidt , Alex Williamson , Logan Gunthorpe Date: Mon, 12 Mar 2018 13:35:18 -0600 Message-Id: <20180312193525.2855-5-logang@deltatee.com> X-Mailer: git-send-email 2.11.0 In-Reply-To: <20180312193525.2855-1-logang@deltatee.com> References: <20180312193525.2855-1-logang@deltatee.com> X-SA-Exim-Connect-IP: 172.16.1.31 X-SA-Exim-Rcpt-To: linux-nvme@lists.infradead.org, linux-nvdimm@lists.01.org, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-rdma@vger.kernel.org, linux-block@vger.kernel.org, sbates@raithlin.com, hch@lst.de, axboe@kernel.dk, sagi@grimberg.me, bhelgaas@google.com, jgg@mellanox.com, maxg@mellanox.com, keith.busch@intel.com, dan.j.williams@intel.com, benh@kernel.crashing.org, jglisse@redhat.com, alex.williamson@redhat.com, logang@deltatee.com X-SA-Exim-Mail-From: gunthorp@deltatee.com X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on ale.deltatee.com X-Spam-Level: X-Spam-Status: No, score=-6.7 required=5.0 tests=ALL_TRUSTED,BAYES_00, MYRULES_NO_TEXT,T_RP_MATCHES_RCVD autolearn=no autolearn_force=no version=3.4.1 Subject: [PATCH v3 04/11] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches X-SA-Exim-Version: 4.2.1 (built Tue, 02 Aug 2016 21:08:31 +0000) X-SA-Exim-Scanned: Yes (on ale.deltatee.com) Sender: linux-pci-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pci@vger.kernel.org For peer-to-peer transactions to work the downstream ports in each switch must not have the ACS flags set. At this time there is no way to dynamically change the flags and update the corresponding IOMMU groups so this is done at enumeration time before the groups are assigned. This effectively means that if CONFIG_PCI_P2PDMA is selected then all devices behind any PCIe switch will be in the same IOMMU group. 
This implies that individual devices behind any switch cannot be assigned to separate VMs because there is no isolation between them. Additionally, any malicious PCIe devices will be able to DMA to memory exposed by other EPs in the same domain, as TLPs will not be checked by the IOMMU. Given that the intended use case of P2P Memory is for users with custom hardware designed for purpose, we do not expect distributors to ever need to enable this option. Users that want to use P2P must have compiled a custom kernel with this configuration option and understand the implications regarding ACS. They will either not require ACS or will have designed the system in such a way that devices that require isolation will be separate from those using P2P transactions. Signed-off-by: Logan Gunthorpe --- drivers/pci/Kconfig | 9 +++++++++ drivers/pci/p2pdma.c | 44 ++++++++++++++++++++++++++++++++++++++++++++ drivers/pci/pci.c | 6 ++++++ include/linux/pci-p2pdma.h | 5 +++++ 4 files changed, 64 insertions(+) diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index d59f6f5ddfcd..c7a9d155baca 100644 --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -138,6 +138,15 @@ config PCI_P2PDMA it's hard to tell which support it at all, so at this time you will need a PCIe switch. + Enabling this option will also disable ACS on all ports behind + any PCIe switch. This effectively puts all devices behind any + switch into the same IOMMU group, which implies that individual + devices behind any switch will not be able to be assigned to + separate VMs because there is no isolation between them. + Additionally, any malicious PCIe devices will be able to DMA + to memory exposed by other EPs in the same domain as TLPs will + not be checked by the IOMMU. + If unsure, say N. 
config PCI_LABEL diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c index ab810c3a93eb..3e70b0662def 100644 --- a/drivers/pci/p2pdma.c +++ b/drivers/pci/p2pdma.c @@ -264,6 +264,50 @@ static struct pci_dev *get_upstream_bridge_port(struct pci_dev *pdev) } /* + * pci_p2pdma_disable_acs - disable ACS flags for ports in PCI + * bridges/switches + * @pdev: device to disable ACS flags for + * + * The ACS flags for P2P Request Redirect and P2P Completion Redirect need + * to be disabled on any downstream port in any switch in order for + * the TLPs to not be forwarded up to the RC which is not what we want + * for P2P. + * + * This function is called when the devices are first enumerated and + * will result in all devices behind any switch to be in the same IOMMU + * group. At this time there is no way to "hotplug" IOMMU groups so we rely + * on this largish hammer. If you need the devices to be in separate groups + * don't enable CONFIG_PCI_P2PDMA. + * + * Returns 1 if the ACS bits for this device were cleared, otherwise 0. + */ +int pci_p2pdma_disable_acs(struct pci_dev *pdev) +{ + struct pci_dev *up; + int pos; + u16 ctrl; + + up = get_upstream_bridge_port(pdev); + if (!up) + return 0; + pci_dev_put(up); + + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS); + if (!pos) + return 0; + + pci_info(pdev, "disabling ACS flags for peer-to-peer DMA\n"); + + pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl); + + ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR); + + pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl); + + return 1; +} + +/* * This function checks if two PCI devices are behind the same switch. * (ie. they share the same second upstream port as returned by * get_upstream_bridge_port().) 
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index f6a4dd10d9b0..e5da8f482e94 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include #include @@ -2826,6 +2827,11 @@ static void pci_std_enable_acs(struct pci_dev *dev) */ void pci_enable_acs(struct pci_dev *dev) { +#ifdef CONFIG_PCI_P2PDMA + if (pci_p2pdma_disable_acs(dev)) + return; +#endif + if (!pci_acs_enable) return; diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h index 59eb218bdb25..2a2bf2ca018e 100644 --- a/include/linux/pci-p2pdma.h +++ b/include/linux/pci-p2pdma.h @@ -18,6 +18,7 @@ struct block_device; struct scatterlist; #ifdef CONFIG_PCI_P2PDMA +int pci_p2pdma_disable_acs(struct pci_dev *pdev); int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset); int pci_p2pdma_add_client(struct list_head *head, struct device *dev); @@ -41,6 +42,10 @@ int pci_p2pdma_map_sg(struct device *dev, struct scatterlist *sg, int nents, void pci_p2pdma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction dir); #else /* CONFIG_PCI_P2PDMA */ +static inline int pci_p2pdma_disable_acs(struct pci_dev *pdev) +{ + return 0; +} static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset) { From patchwork Mon Mar 12 19:35:19 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Logan Gunthorpe X-Patchwork-Id: 884827 X-Patchwork-Delegate: bhelgaas@google.com Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-pci-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=deltatee.com Received: from vger.kernel.org 
(vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 400Svx2HnJz9sRX for ; Tue, 13 Mar 2018 06:38:37 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751787AbeCLThH (ORCPT ); Mon, 12 Mar 2018 15:37:07 -0400 Received: from ale.deltatee.com ([207.54.116.67]:54652 "EHLO ale.deltatee.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932302AbeCLTfj (ORCPT ); Mon, 12 Mar 2018 15:35:39 -0400 Received: from cgy1-donard.priv.deltatee.com ([172.16.1.31]) by ale.deltatee.com with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from ) id 1evTEK-000667-12; Mon, 12 Mar 2018 13:35:34 -0600 Received: from gunthorp by cgy1-donard.priv.deltatee.com with local (Exim 4.89) (envelope-from ) id 1evTEG-0000l9-RC; Mon, 12 Mar 2018 13:35:28 -0600 From: Logan Gunthorpe To: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org, linux-nvdimm@lists.01.org, linux-block@vger.kernel.org Cc: Stephen Bates , Christoph Hellwig , Jens Axboe , Keith Busch , Sagi Grimberg , Bjorn Helgaas , Jason Gunthorpe , Max Gurtovoy , Dan Williams , =?utf-8?b?SsOpcsO0bWUgR2xp?= =?utf-8?q?sse?= , Benjamin Herrenschmidt , Alex Williamson , Logan Gunthorpe , Jonathan Corbet Date: Mon, 12 Mar 2018 13:35:19 -0600 Message-Id: <20180312193525.2855-6-logang@deltatee.com> X-Mailer: git-send-email 2.11.0 In-Reply-To: <20180312193525.2855-1-logang@deltatee.com> References: <20180312193525.2855-1-logang@deltatee.com> X-SA-Exim-Connect-IP: 172.16.1.31 X-SA-Exim-Rcpt-To: linux-nvme@lists.infradead.org, linux-nvdimm@lists.01.org, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-rdma@vger.kernel.org, linux-block@vger.kernel.org, sbates@raithlin.com, hch@lst.de, axboe@kernel.dk, sagi@grimberg.me, bhelgaas@google.com, jgg@mellanox.com, maxg@mellanox.com, keith.busch@intel.com, dan.j.williams@intel.com, benh@kernel.crashing.org, jglisse@redhat.com, 
alex.williamson@redhat.com, logang@deltatee.com, corbet@lwn.net X-SA-Exim-Mail-From: gunthorp@deltatee.com X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on ale.deltatee.com X-Spam-Level: X-Spam-Status: No, score=-6.5 required=5.0 tests=ALL_TRUSTED,BAYES_00, MYRULES_FREE,MYRULES_NO_TEXT,T_RP_MATCHES_RCVD autolearn=no autolearn_force=no version=3.4.1 Subject: [PATCH v3 05/11] PCI/P2PDMA: Add P2P DMA driver writer's documentation X-SA-Exim-Version: 4.2.1 (built Tue, 02 Aug 2016 21:08:31 +0000) X-SA-Exim-Scanned: Yes (on ale.deltatee.com) Sender: linux-pci-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pci@vger.kernel.org Add a reStructuredText file describing how to write drivers with support for P2P DMA transactions. The document describes how to use the APIs that were added in the previous few commits. Also add an index for the PCI documentation tree even though this is the only PCI document that has been converted to reStructuredText at this time. Signed-off-by: Logan Gunthorpe Cc: Jonathan Corbet --- Documentation/PCI/index.rst | 14 ++++ Documentation/PCI/p2pdma.rst | 164 +++++++++++++++++++++++++++++++++++++++++++ Documentation/index.rst | 3 +- 3 files changed, 180 insertions(+), 1 deletion(-) create mode 100644 Documentation/PCI/index.rst create mode 100644 Documentation/PCI/p2pdma.rst diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst new file mode 100644 index 000000000000..2fdc4b3c291d --- /dev/null +++ b/Documentation/PCI/index.rst @@ -0,0 +1,14 @@ +================================== +Linux PCI Driver Developer's Guide +================================== + +.. toctree:: + + p2pdma + +.. 
only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/PCI/p2pdma.rst b/Documentation/PCI/p2pdma.rst new file mode 100644 index 000000000000..d7edd48a3941 --- /dev/null +++ b/Documentation/PCI/p2pdma.rst @@ -0,0 +1,164 @@ +============================ +PCI Peer-to-Peer DMA Support +============================ + +The PCI bus has pretty decent support for performing DMA transfers +between two endpoints on the bus. This type of transaction is +henceforth called Peer-to-Peer (or P2P). However, there are a number of +issues that make P2P transactions tricky to do in a perfectly safe way. + +One of the biggest issues is that PCI Root Complexes are not required +to support forwarding packets between Root Ports. To make things worse, +there is no simple way to determine if a given Root Complex supports +this or not (see PCIe r4.0, sec 1.3.1). Therefore, as of this writing, +the kernel only supports doing P2P when the endpoints involved are all +behind a PCIe Switch as this guarantees the packets will always be routable. + +The second issue is that to make use of existing interfaces in Linux, +memory that is used for P2P transactions needs to be backed by struct +pages. However, PCI BARs are not typically cache coherent, so there are +a few corner case gotchas with these pages and developers need to +be careful about what they do with them. + + +Driver Writer's Guide +===================== + +In a given P2P implementation there may be three or more different +types of kernel drivers in play: + +* Provider - A driver which provides or publishes P2P resources like + memory or doorbell registers to other drivers. +* Client - A driver which makes use of a resource by setting up a + DMA transaction to it. +* Orchestrator - A driver which orchestrates the flow of data between + clients and providers. + +In many cases there could be overlap between these three types (i.e. 
+it may be typical for a driver to be both a provider and a client). + +For example, in the NVMe Target Copy Offload implementation: + +* The NVMe PCI driver is a client, provider and orchestrator + in that it exposes any CMB (Controller Memory Buffer) as a P2P memory + resource (provider), it accepts P2P memory pages as buffers in requests + to be used directly (client) and it can also make use of the CMB as + submission queue entries. +* The RDMA driver is a client in this arrangement so that an RNIC + can DMA directly to the memory exposed by the NVMe device. +* The NVMe Target driver (nvmet) can orchestrate the data from the RNIC + to the P2P memory (CMB) and then to the NVMe device (and vice versa). + +This is currently the only arrangement supported by the kernel but +one could imagine slight tweaks to this that would allow for the same +functionality. For example, if a specific RNIC added a BAR with some +memory behind it, its driver could add support as a P2P provider and +then the NVMe Target could use the RNIC's memory instead of the CMB +in cases where the NVMe cards in use do not have CMB support. + + +Provider Drivers +---------------- + +A provider simply needs to register a BAR (or a portion of a BAR) +as a P2P DMA resource using :c:func:`pci_p2pdma_add_resource()`. +This will register struct pages for all the specified memory. + +After that it may optionally publish all of its resources as +P2P memory using :c:func:`pci_p2pmem_publish()`. This will allow +any orchestrator drivers to find and use the memory. When marked in +this way, the resource must be regular memory with no side effects. + +For the time being this is fairly rudimentary in that all resources +are typically going to be P2P memory. Future work will likely expand +this to include other types of resources like doorbells. 
+ + +Client Drivers +-------------- + +A client driver typically only has to conditionally change its DMA map +routine to use the mapping functions :c:func:`pci_p2pdma_map_sg()` and +:c:func:`pci_p2pdma_unmap_sg()` instead of the usual :c:func:`dma_map_sg()` +functions. + +The client may also, optionally, make use of +:c:func:`is_pci_p2pdma_page()` to determine when to use the P2P mapping +functions and when to use the regular mapping functions. In some +situations, it may be more appropriate to use a flag to indicate a +given request is P2P memory and map appropriately (for example the +block layer uses a flag to keep P2P memory out of queues that do not +have P2P client support). It is important to ensure that struct pages that +back P2P memory stay out of code that does not have support for them. + + +Orchestrator Drivers +-------------------- + +The first task an orchestrator driver must do is compile a list of +all client drivers that will be involved in a given transaction. For +example, the NVMe Target driver creates a list including all NVMe drives +and the RNIC in use. The list is stored as an anonymous struct +list_head which must be initialized with the usual INIT_LIST_HEAD. +The list of clients may then be added to, removed from and freed +with the functions :c:func:`pci_p2pdma_add_client()`, +:c:func:`pci_p2pdma_remove_client()` and +:c:func:`pci_p2pdma_client_list_free()`. + +With the client list in hand, the orchestrator may then call +:c:func:`pci_p2pmem_find()` to obtain a published P2P memory provider +that is supported by (i.e. behind the same switch as) all the clients. If more +than one provider is supported, the one nearest to all the clients will +be chosen first. If more than one provider is an equal distance +away, the one returned will be chosen at random. 
This function returns the PCI +device to use for the provider with a reference taken and therefore +when it's no longer needed it should be returned with pci_dev_put(). + +Alternatively, if the orchestrator knows (via some other means) +which provider it wants to use it may use :c:func:`pci_has_p2pmem()` +to determine if it has P2P memory and :c:func:`pci_p2pdma_distance()` +to determine the cumulative distance between it and a potential +list of clients. + +With a supported provider in hand, the driver can then call +:c:func:`pci_p2pdma_assign_provider()` to assign the provider +to the client list. This function returns false if any of the +clients are unsupported by the provider. + +Once a provider is assigned to a client list via either +:c:func:`pci_p2pmem_find()` or :c:func:`pci_p2pdma_assign_provider()`, +the list is permanently bound to the provider such that any new clients +added to the list must be supported by the already selected provider. +If they are not supported, :c:func:`pci_p2pdma_add_client()` will return +an error. In this way, orchestrators are free to add and remove devices +without having to recheck support or tear down existing transfers to +change P2P providers. + +Once a provider is selected, the orchestrator can then use +:c:func:`pci_alloc_p2pmem()` and :c:func:`pci_free_p2pmem()` to +allocate P2P memory from the provider. :c:func:`pci_p2pmem_alloc_sgl()` +and :c:func:`pci_p2pmem_free_sgl()` are convenience functions for +allocating scatter-gather lists with P2P memory. + +Struct Page Caveats +------------------- + +Driver writers should be very careful about not passing these special +struct pages to code that isn't prepared for it. At this time, the kernel +interfaces do not have any checks for ensuring this. This obviously +precludes passing these pages to userspace. + +P2P memory is also technically IO memory but should never have any side +effects behind it. 
Thus, the order of loads and stores should not be important +and ioreadX(), iowriteX() and friends should not be necessary. +However, as the memory is not cache coherent, if access ever needs to +be protected by a spinlock then :c:func:`mmiowb()` must be used before +unlocking the lock. (See ACQUIRES VS I/O ACCESSES in +Documentation/memory-barriers.txt) + + +P2P DMA API Functions +===================== + +.. kernel-doc:: drivers/pci/p2pdma.c + :export: diff --git a/Documentation/index.rst b/Documentation/index.rst index ef5080cbf009..c31bf0918413 100644 --- a/Documentation/index.rst +++ b/Documentation/index.rst @@ -45,7 +45,7 @@ the kernel interface as seen by application developers. .. toctree:: :maxdepth: 2 - userspace-api/index + userspace-api/index Introduction to kernel development @@ -88,6 +88,7 @@ needed). sound/index crypto/index filesystems/index + PCI/index Architecture-specific documentation ----------------------------------- From patchwork Mon Mar 12 19:35:20 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Logan Gunthorpe X-Patchwork-Id: 884824 X-Patchwork-Delegate: bhelgaas@google.com Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-pci-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=deltatee.com Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 400SvN55QRz9sRX for ; Tue, 13 Mar 2018 06:38:08 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932407AbeCLThL (ORCPT ); Mon, 12 Mar 2018 15:37:11 -0400 Received: from ale.deltatee.com ([207.54.116.67]:54638 "EHLO ale.deltatee.com" rhost-flags-OK-OK-OK-OK) by 
vger.kernel.org with ESMTP id S932311AbeCLTfj (ORCPT ); Mon, 12 Mar 2018 15:35:39 -0400 Received: from cgy1-donard.priv.deltatee.com ([172.16.1.31]) by ale.deltatee.com with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from ) id 1evTEJ-000668-QD; Mon, 12 Mar 2018 13:35:32 -0600 Received: from gunthorp by cgy1-donard.priv.deltatee.com with local (Exim 4.89) (envelope-from ) id 1evTEG-0000lC-Ui; Mon, 12 Mar 2018 13:35:28 -0600 From: Logan Gunthorpe To: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org, linux-nvdimm@lists.01.org, linux-block@vger.kernel.org Cc: Stephen Bates , Christoph Hellwig , Jens Axboe , Keith Busch , Sagi Grimberg , Bjorn Helgaas , Jason Gunthorpe , Max Gurtovoy , Dan Williams , =?utf-8?b?SsOpcsO0bWUgR2xp?= =?utf-8?q?sse?= , Benjamin Herrenschmidt , Alex Williamson , Logan Gunthorpe Date: Mon, 12 Mar 2018 13:35:20 -0600 Message-Id: <20180312193525.2855-7-logang@deltatee.com> X-Mailer: git-send-email 2.11.0 In-Reply-To: <20180312193525.2855-1-logang@deltatee.com> References: <20180312193525.2855-1-logang@deltatee.com> X-SA-Exim-Connect-IP: 172.16.1.31 X-SA-Exim-Rcpt-To: linux-nvme@lists.infradead.org, linux-nvdimm@lists.01.org, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-rdma@vger.kernel.org, linux-block@vger.kernel.org, sbates@raithlin.com, hch@lst.de, axboe@kernel.dk, sagi@grimberg.me, bhelgaas@google.com, jgg@mellanox.com, maxg@mellanox.com, keith.busch@intel.com, dan.j.williams@intel.com, benh@kernel.crashing.org, jglisse@redhat.com, alex.williamson@redhat.com, logang@deltatee.com X-SA-Exim-Mail-From: gunthorp@deltatee.com X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on ale.deltatee.com X-Spam-Level: X-Spam-Status: No, score=-6.5 required=5.0 tests=ALL_TRUSTED,BAYES_00, MYRULES_FREE,MYRULES_NO_TEXT,T_RP_MATCHES_RCVD autolearn=no autolearn_force=no version=3.4.1 Subject: [PATCH v3 06/11] block: Introduce PCI P2P flags for 
request and request queue X-SA-Exim-Version: 4.2.1 (built Tue, 02 Aug 2016 21:08:31 +0000) X-SA-Exim-Scanned: Yes (on ale.deltatee.com) Sender: linux-pci-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pci@vger.kernel.org

QUEUE_FLAG_PCI_P2PDMA is introduced, meaning a driver's request queue
supports targeting P2P memory.

REQ_PCI_P2PDMA is introduced to indicate that a particular bio request is
directed to/from PCI P2P memory. A request with this flag is not accepted
unless the corresponding queue has the QUEUE_FLAG_PCI_P2PDMA flag set.

Signed-off-by: Logan Gunthorpe Reviewed-by: Sagi Grimberg Reviewed-by: Christoph Hellwig --- block/blk-core.c | 3 +++ include/linux/blk_types.h | 18 +++++++++++++++++- include/linux/blkdev.h | 3 +++ 3 files changed, 23 insertions(+), 1 deletion(-) diff --git a/block/blk-core.c b/block/blk-core.c index 6d82c4f7fadd..a2f113738b85 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2183,6 +2183,9 @@ generic_make_request_checks(struct bio *bio) if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q)) goto not_supported; + if ((bio->bi_opf & REQ_PCI_P2PDMA) && !blk_queue_pci_p2pdma(q)) + goto not_supported; + if (should_fail_bio(bio)) goto end_io; diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index bf18b95ed92d..490122c85b3f 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -274,6 +274,10 @@ enum req_flag_bits { __REQ_BACKGROUND, /* background IO */ __REQ_NOWAIT, /* Don't wait if request will block */ +#ifdef CONFIG_PCI_P2PDMA + __REQ_PCI_P2PDMA, /* request is to/from P2P memory */ +#endif + /* command specific flags for REQ_OP_WRITE_ZEROES: */ __REQ_NOUNMAP, /* do not free blocks when zeroing */ @@ -298,6 +302,18 @@ enum req_flag_bits { #define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND) #define REQ_NOWAIT (1ULL << __REQ_NOWAIT) +#ifdef CONFIG_PCI_P2PDMA +/* + * Currently SGLs do not support mixed P2P and regular memory so + * requests with P2P memory must not be merged.
+ */ +#define REQ_PCI_P2PDMA (1ULL << __REQ_PCI_P2PDMA) +#define REQ_IS_PCI_P2PDMA(req) ((req)->cmd_flags & REQ_PCI_P2PDMA) +#else +#define REQ_PCI_P2PDMA 0 +#define REQ_IS_PCI_P2PDMA(req) 0 +#endif /* CONFIG_PCI_P2PDMA */ + #define REQ_NOUNMAP (1ULL << __REQ_NOUNMAP) #define REQ_DRV (1ULL << __REQ_DRV) @@ -306,7 +322,7 @@ enum req_flag_bits { (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER) #define REQ_NOMERGE_FLAGS \ - (REQ_NOMERGE | REQ_PREFLUSH | REQ_FUA) + (REQ_NOMERGE | REQ_PREFLUSH | REQ_FUA | REQ_PCI_P2PDMA) #define bio_op(bio) \ ((bio)->bi_opf & REQ_OP_MASK) diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index ed63f3b69c12..0b4a386c73ea 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -698,6 +698,7 @@ struct request_queue { #define QUEUE_FLAG_SCSI_PASSTHROUGH 27 /* queue supports SCSI commands */ #define QUEUE_FLAG_QUIESCED 28 /* queue has been quiesced */ #define QUEUE_FLAG_PREEMPT_ONLY 29 /* only process REQ_PREEMPT requests */ +#define QUEUE_FLAG_PCI_P2PDMA 30 /* device supports pci p2p requests */ #define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \ (1 << QUEUE_FLAG_SAME_COMP) | \ @@ -793,6 +794,8 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q) #define blk_queue_dax(q) test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags) #define blk_queue_scsi_passthrough(q) \ test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags) +#define blk_queue_pci_p2pdma(q) \ + test_bit(QUEUE_FLAG_PCI_P2PDMA, &(q)->queue_flags) #define blk_noretry_request(rq) \ ((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \ From patchwork Mon Mar 12 19:35:21 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Logan Gunthorpe X-Patchwork-Id: 884825 X-Patchwork-Delegate: bhelgaas@google.com Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; 
spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-pci-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=deltatee.com Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 400Svc50jkz9sRX for ; Tue, 13 Mar 2018 06:38:20 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932256AbeCLThK (ORCPT ); Mon, 12 Mar 2018 15:37:10 -0400 Received: from ale.deltatee.com ([207.54.116.67]:54648 "EHLO ale.deltatee.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932292AbeCLTfj (ORCPT ); Mon, 12 Mar 2018 15:35:39 -0400 Received: from cgy1-donard.priv.deltatee.com ([172.16.1.31]) by ale.deltatee.com with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from ) id 1evTEJ-000669-QD; Mon, 12 Mar 2018 13:35:33 -0600 Received: from gunthorp by cgy1-donard.priv.deltatee.com with local (Exim 4.89) (envelope-from ) id 1evTEH-0000lF-1Q; Mon, 12 Mar 2018 13:35:29 -0600 From: Logan Gunthorpe To: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org, linux-nvdimm@lists.01.org, linux-block@vger.kernel.org Cc: Stephen Bates , Christoph Hellwig , Jens Axboe , Keith Busch , Sagi Grimberg , Bjorn Helgaas , Jason Gunthorpe , Max Gurtovoy , Dan Williams , =?utf-8?b?SsOpcsO0bWUgR2xp?= =?utf-8?q?sse?= , Benjamin Herrenschmidt , Alex Williamson , Logan Gunthorpe Date: Mon, 12 Mar 2018 13:35:21 -0600 Message-Id: <20180312193525.2855-8-logang@deltatee.com> X-Mailer: git-send-email 2.11.0 In-Reply-To: <20180312193525.2855-1-logang@deltatee.com> References: <20180312193525.2855-1-logang@deltatee.com> X-SA-Exim-Connect-IP: 172.16.1.31 X-SA-Exim-Rcpt-To: linux-nvme@lists.infradead.org, linux-nvdimm@lists.01.org, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-rdma@vger.kernel.org, 
linux-block@vger.kernel.org, sbates@raithlin.com, hch@lst.de, axboe@kernel.dk, sagi@grimberg.me, bhelgaas@google.com, jgg@mellanox.com, maxg@mellanox.com, keith.busch@intel.com, dan.j.williams@intel.com, benh@kernel.crashing.org, jglisse@redhat.com, alex.williamson@redhat.com, logang@deltatee.com X-SA-Exim-Mail-From: gunthorp@deltatee.com X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on ale.deltatee.com X-Spam-Level: X-Spam-Status: No, score=-6.7 required=5.0 tests=ALL_TRUSTED,BAYES_00, MYRULES_NO_TEXT,T_RP_MATCHES_RCVD autolearn=no autolearn_force=no version=3.4.1 Subject: [PATCH v3 07/11] IB/core: Ensure we map P2P memory correctly in rdma_rw_ctx_[init|destroy]() X-SA-Exim-Version: 4.2.1 (built Tue, 02 Aug 2016 21:08:31 +0000) X-SA-Exim-Scanned: Yes (on ale.deltatee.com) Sender: linux-pci-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pci@vger.kernel.org

In order to use PCI P2P memory, the pci_p2pdma_[un]map_sg() functions
must be called to map the correct PCI bus address.

To do this, check the first page in the scatter list to see if it is P2P
memory or not. At the moment, scatter lists that contain P2P memory must
be homogeneous, so if the first page is P2P the entire SGL should be P2P.
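
The first-page check can be modelled in plain userspace C. This is only
an illustrative sketch, not kernel code: struct demo_page, struct demo_sg,
demo_pick_mapper() and demo_map_sg() are hypothetical names invented for
this example, and the two stub mappers merely stand in for
pci_p2pdma_map_sg() and ib_dma_map_sg().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct page and struct scatterlist. */
struct demo_page { bool is_p2pdma; };
struct demo_sg { struct demo_page *page; };

enum demo_mapper { DEMO_MAP_REGULAR, DEMO_MAP_P2PDMA };

/* Stub mappers: each pretends every entry was mapped and returns the count. */
static int demo_map_regular(struct demo_sg *sg, int nents) { (void)sg; return nents; }
static int demo_map_p2pdma(struct demo_sg *sg, int nents) { (void)sg; return nents; }

/*
 * Choose the mapper by inspecting only the first entry. This relies on
 * the rule that a list containing P2P memory is homogeneous.
 */
static enum demo_mapper demo_pick_mapper(const struct demo_sg *sg)
{
	assert(sg != NULL && sg->page != NULL);
	return sg->page->is_p2pdma ? DEMO_MAP_P2PDMA : DEMO_MAP_REGULAR;
}

/* Returns the number of mapped entries; 0 would indicate failure. */
static int demo_map_sg(struct demo_sg *sg, int nents, enum demo_mapper *used)
{
	*used = demo_pick_mapper(sg);
	if (*used == DEMO_MAP_P2PDMA)
		return demo_map_p2pdma(sg, nents);
	return demo_map_regular(sg, nents);
}
```

The unmap side would make the same first-page decision, so that map and
unmap always go through the same pair of functions.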
Signed-off-by: Logan Gunthorpe Reviewed-by: Christoph Hellwig --- drivers/infiniband/core/rw.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c index c8963e91f92a..f495e8a7f8ac 100644 --- a/drivers/infiniband/core/rw.c +++ b/drivers/infiniband/core/rw.c @@ -12,6 +12,7 @@ */ #include #include +#include #include #include @@ -280,7 +281,11 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num, struct ib_device *dev = qp->pd->device; int ret; - ret = ib_dma_map_sg(dev, sg, sg_cnt, dir); + if (is_pci_p2pdma_page(sg_page(sg))) + ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir); + else + ret = ib_dma_map_sg(dev, sg, sg_cnt, dir); + if (!ret) return -ENOMEM; sg_cnt = ret; @@ -602,7 +607,11 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num, break; } - ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir); + if (is_pci_p2pdma_page(sg_page(sg))) + pci_p2pdma_unmap_sg(qp->pd->device->dma_device, sg, + sg_cnt, dir); + else + ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir); } EXPORT_SYMBOL(rdma_rw_ctx_destroy); From patchwork Mon Mar 12 19:35:22 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Logan Gunthorpe X-Patchwork-Id: 884828 X-Patchwork-Delegate: bhelgaas@google.com Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-pci-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=deltatee.com Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 400Sw55Wn8z9sRX for ; Tue, 13 Mar 2018 06:38:45 +1100 (AEDT) Received: (majordomo@vger.kernel.org) 
by vger.kernel.org via listexpand id S1751748AbeCLThG (ORCPT ); Mon, 12 Mar 2018 15:37:06 -0400 Received: from ale.deltatee.com ([207.54.116.67]:54658 "EHLO ale.deltatee.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932315AbeCLTfj (ORCPT ); Mon, 12 Mar 2018 15:35:39 -0400 Received: from cgy1-donard.priv.deltatee.com ([172.16.1.31]) by ale.deltatee.com with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from ) id 1evTEJ-00066A-QD; Mon, 12 Mar 2018 13:35:33 -0600 Received: from gunthorp by cgy1-donard.priv.deltatee.com with local (Exim 4.89) (envelope-from ) id 1evTEH-0000lI-4x; Mon, 12 Mar 2018 13:35:29 -0600 From: Logan Gunthorpe To: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org, linux-nvdimm@lists.01.org, linux-block@vger.kernel.org Cc: Stephen Bates , Christoph Hellwig , Jens Axboe , Keith Busch , Sagi Grimberg , Bjorn Helgaas , Jason Gunthorpe , Max Gurtovoy , Dan Williams , =?utf-8?b?SsOpcsO0bWUgR2xp?= =?utf-8?q?sse?= , Benjamin Herrenschmidt , Alex Williamson , Logan Gunthorpe Date: Mon, 12 Mar 2018 13:35:22 -0600 Message-Id: <20180312193525.2855-9-logang@deltatee.com> X-Mailer: git-send-email 2.11.0 In-Reply-To: <20180312193525.2855-1-logang@deltatee.com> References: <20180312193525.2855-1-logang@deltatee.com> X-SA-Exim-Connect-IP: 172.16.1.31 X-SA-Exim-Rcpt-To: linux-nvme@lists.infradead.org, linux-nvdimm@lists.01.org, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-rdma@vger.kernel.org, linux-block@vger.kernel.org, sbates@raithlin.com, hch@lst.de, axboe@kernel.dk, sagi@grimberg.me, bhelgaas@google.com, jgg@mellanox.com, maxg@mellanox.com, keith.busch@intel.com, dan.j.williams@intel.com, benh@kernel.crashing.org, jglisse@redhat.com, alex.williamson@redhat.com, logang@deltatee.com X-SA-Exim-Mail-From: gunthorp@deltatee.com X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on ale.deltatee.com X-Spam-Level: X-Spam-Status: No, 
score=-6.5 required=5.0 tests=ALL_TRUSTED,BAYES_00, MYRULES_FREE,MYRULES_NO_TEXT,T_RP_MATCHES_RCVD autolearn=no autolearn_force=no version=3.4.1 Subject: [PATCH v3 08/11] nvme-pci: Use PCI p2pmem subsystem to manage the CMB X-SA-Exim-Version: 4.2.1 (built Tue, 02 Aug 2016 21:08:31 +0000) X-SA-Exim-Scanned: Yes (on ale.deltatee.com) Sender: linux-pci-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pci@vger.kernel.org Register the CMB buffer as p2pmem and use the appropriate allocation functions to create and destroy the IO SQ. If the CMB supports WDS and RDS, publish it for use as P2P memory by other devices. Signed-off-by: Logan Gunthorpe --- drivers/nvme/host/pci.c | 75 +++++++++++++++++++++++++++---------------------- 1 file changed, 41 insertions(+), 34 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index b6f43b738f03..1fb57fa42dd0 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -29,6 +29,7 @@ #include #include #include +#include #include "nvme.h" @@ -91,9 +92,8 @@ struct nvme_dev { struct work_struct remove_work; struct mutex shutdown_lock; bool subsystem; - void __iomem *cmb; - pci_bus_addr_t cmb_bus_addr; u64 cmb_size; + bool cmb_use_sqes; u32 cmbsz; u32 cmbloc; struct nvme_ctrl ctrl; @@ -148,7 +148,7 @@ struct nvme_queue { struct nvme_dev *dev; spinlock_t q_lock; struct nvme_command *sq_cmds; - struct nvme_command __iomem *sq_cmds_io; + bool sq_cmds_is_io; volatile struct nvme_completion *cqes; struct blk_mq_tags **tags; dma_addr_t sq_dma_addr; @@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq, { u16 tail = nvmeq->sq_tail; - if (nvmeq->sq_cmds_io) - memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd)); - else - memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd)); + memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd)); if (++tail == nvmeq->q_depth) tail = 0; @@ -1287,9 +1284,18 @@ static void nvme_free_queue(struct nvme_queue *nvmeq) { 
dma_free_coherent(nvmeq->q_dmadev, CQ_SIZE(nvmeq->q_depth), (void *)nvmeq->cqes, nvmeq->cq_dma_addr); - if (nvmeq->sq_cmds) - dma_free_coherent(nvmeq->q_dmadev, SQ_SIZE(nvmeq->q_depth), - nvmeq->sq_cmds, nvmeq->sq_dma_addr); + + if (nvmeq->sq_cmds) { + if (nvmeq->sq_cmds_is_io) + pci_free_p2pmem(to_pci_dev(nvmeq->q_dmadev), + nvmeq->sq_cmds, + SQ_SIZE(nvmeq->q_depth)); + else + dma_free_coherent(nvmeq->q_dmadev, + SQ_SIZE(nvmeq->q_depth), + nvmeq->sq_cmds, + nvmeq->sq_dma_addr); + } } static void nvme_free_queues(struct nvme_dev *dev, int lowest) @@ -1369,12 +1375,21 @@ static int nvme_cmb_qdepth(struct nvme_dev *dev, int nr_io_queues, static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq, int qid, int depth) { - /* CMB SQEs will be mapped before creation */ - if (qid && dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) - return 0; + struct pci_dev *pdev = to_pci_dev(dev->dev); + + if (qid && dev->cmb_use_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) { + nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth)); + nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev, + nvmeq->sq_cmds); + nvmeq->sq_cmds_is_io = true; + } + + if (!nvmeq->sq_cmds) { + nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth), + &nvmeq->sq_dma_addr, GFP_KERNEL); + nvmeq->sq_cmds_is_io = false; + } - nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth), - &nvmeq->sq_dma_addr, GFP_KERNEL); if (!nvmeq->sq_cmds) return -ENOMEM; return 0; @@ -1450,13 +1465,6 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid) struct nvme_dev *dev = nvmeq->dev; int result; - if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) { - unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth), - dev->ctrl.page_size); - nvmeq->sq_dma_addr = dev->cmb_bus_addr + offset; - nvmeq->sq_cmds_io = dev->cmb + offset; - } - nvmeq->cq_vector = qid - 1; result = adapter_alloc_cq(dev, qid, nvmeq); if (result < 0) @@ -1689,9 +1697,6 @@ static void nvme_map_cmb(struct 
nvme_dev *dev) return; dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC); - if (!use_cmb_sqes) - return; - size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev); offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc); bar = NVME_CMB_BIR(dev->cmbloc); @@ -1708,11 +1713,15 @@ static void nvme_map_cmb(struct nvme_dev *dev) if (size > bar_size - offset) size = bar_size - offset; - dev->cmb = ioremap_wc(pci_resource_start(pdev, bar) + offset, size); - if (!dev->cmb) + if (pci_p2pdma_add_resource(pdev, bar, size, offset)) return; - dev->cmb_bus_addr = pci_bus_address(pdev, bar) + offset; + dev->cmb_size = size; + dev->cmb_use_sqes = use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS); + + if ((dev->cmbsz & (NVME_CMBSZ_WDS | NVME_CMBSZ_RDS)) == + (NVME_CMBSZ_WDS | NVME_CMBSZ_RDS)) + pci_p2pmem_publish(pdev, true); if (sysfs_add_file_to_group(&dev->ctrl.device->kobj, &dev_attr_cmb.attr, NULL)) @@ -1722,12 +1731,10 @@ static void nvme_map_cmb(struct nvme_dev *dev) static inline void nvme_release_cmb(struct nvme_dev *dev) { - if (dev->cmb) { - iounmap(dev->cmb); - dev->cmb = NULL; + if (dev->cmb_size) { sysfs_remove_file_from_group(&dev->ctrl.device->kobj, &dev_attr_cmb.attr, NULL); - dev->cmbsz = 0; + dev->cmb_size = 0; } } @@ -1922,13 +1929,13 @@ static int nvme_setup_io_queues(struct nvme_dev *dev) if (nr_io_queues == 0) return 0; - if (dev->cmb && (dev->cmbsz & NVME_CMBSZ_SQS)) { + if (dev->cmb_use_sqes) { result = nvme_cmb_qdepth(dev, nr_io_queues, sizeof(struct nvme_command)); if (result > 0) dev->q_depth = result; else - nvme_release_cmb(dev); + dev->cmb_use_sqes = false; } do { From patchwork Mon Mar 12 19:35:23 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Logan Gunthorpe X-Patchwork-Id: 884821 X-Patchwork-Delegate: bhelgaas@google.com Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) 
smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-pci-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=deltatee.com Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 400Ssz2hXZz9sRX for ; Tue, 13 Mar 2018 06:36:55 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932338AbeCLTfn (ORCPT ); Mon, 12 Mar 2018 15:35:43 -0400 Received: from ale.deltatee.com ([207.54.116.67]:54644 "EHLO ale.deltatee.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932282AbeCLTfi (ORCPT ); Mon, 12 Mar 2018 15:35:38 -0400 Received: from cgy1-donard.priv.deltatee.com ([172.16.1.31]) by ale.deltatee.com with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from ) id 1evTEJ-00066B-QG; Mon, 12 Mar 2018 13:35:33 -0600 Received: from gunthorp by cgy1-donard.priv.deltatee.com with local (Exim 4.89) (envelope-from ) id 1evTEH-0000lL-8R; Mon, 12 Mar 2018 13:35:29 -0600 From: Logan Gunthorpe To: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org, linux-nvdimm@lists.01.org, linux-block@vger.kernel.org Cc: Stephen Bates , Christoph Hellwig , Jens Axboe , Keith Busch , Sagi Grimberg , Bjorn Helgaas , Jason Gunthorpe , Max Gurtovoy , Dan Williams , =?utf-8?b?SsOpcsO0bWUgR2xp?= =?utf-8?q?sse?= , Benjamin Herrenschmidt , Alex Williamson , Logan Gunthorpe Date: Mon, 12 Mar 2018 13:35:23 -0600 Message-Id: <20180312193525.2855-10-logang@deltatee.com> X-Mailer: git-send-email 2.11.0 In-Reply-To: <20180312193525.2855-1-logang@deltatee.com> References: <20180312193525.2855-1-logang@deltatee.com> X-SA-Exim-Connect-IP: 172.16.1.31 X-SA-Exim-Rcpt-To: linux-nvme@lists.infradead.org, linux-nvdimm@lists.01.org, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-rdma@vger.kernel.org, 
linux-block@vger.kernel.org, sbates@raithlin.com, hch@lst.de, axboe@kernel.dk, sagi@grimberg.me, bhelgaas@google.com, jgg@mellanox.com, maxg@mellanox.com, keith.busch@intel.com, dan.j.williams@intel.com, benh@kernel.crashing.org, jglisse@redhat.com, alex.williamson@redhat.com, logang@deltatee.com X-SA-Exim-Mail-From: gunthorp@deltatee.com X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on ale.deltatee.com X-Spam-Level: X-Spam-Status: No, score=-6.5 required=5.0 tests=ALL_TRUSTED,BAYES_00, MYRULES_FREE,MYRULES_NO_TEXT,T_RP_MATCHES_RCVD autolearn=no autolearn_force=no version=3.4.1 Subject: [PATCH v3 09/11] nvme-pci: Add support for P2P memory in requests X-SA-Exim-Version: 4.2.1 (built Tue, 02 Aug 2016 21:08:31 +0000) X-SA-Exim-Scanned: Yes (on ale.deltatee.com) Sender: linux-pci-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pci@vger.kernel.org

For P2P requests, we must use the pci_p2pdma_[un]map_sg() functions
instead of the dma_[un]map_sg() functions.

With that, we can then indicate PCI P2P support in the request queue.
For this, we create an NVME_F_PCI_P2PDMA flag which tells the core to
set QUEUE_FLAG_PCI_P2PDMA in the request queue.
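
How the controller flag ends up gating requests can be sketched in plain
userspace C. This is an illustrative model only: every DEMO_ constant,
struct demo_queue, demo_init_queue() and demo_request_supported() are
hypothetical names made up here, and the bit values are arbitrary rather
than the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values; the real definitions live in kernel headers. */
#define DEMO_F_PCI_P2PDMA      (1u << 2)   /* controller ops capability */
#define DEMO_QUEUE_PCI_P2PDMA  (1u << 30)  /* request queue flag */
#define DEMO_REQ_PCI_P2PDMA    (1u << 0)   /* per-request flag */

struct demo_queue { unsigned int flags; };

/* Model of namespace setup: copy the controller capability onto the queue. */
static void demo_init_queue(struct demo_queue *q, unsigned int ops_flags)
{
	assert(q != NULL);
	q->flags = 0;
	if (ops_flags & DEMO_F_PCI_P2PDMA)
		q->flags |= DEMO_QUEUE_PCI_P2PDMA;
}

/* A P2P request is accepted only if the queue advertises P2P support. */
static bool demo_request_supported(const struct demo_queue *q,
				   unsigned int req_flags)
{
	if ((req_flags & DEMO_REQ_PCI_P2PDMA) &&
	    !(q->flags & DEMO_QUEUE_PCI_P2PDMA))
		return false;
	return true;
}
```

Requests without the P2P flag are unaffected by the check, so a queue
that lacks the capability still accepts ordinary I/O.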
Signed-off-by: Logan Gunthorpe Reviewed-by: Sagi Grimberg Reviewed-by: Christoph Hellwig --- drivers/nvme/host/core.c | 4 ++++ drivers/nvme/host/nvme.h | 1 + drivers/nvme/host/pci.c | 19 +++++++++++++++---- 3 files changed, 20 insertions(+), 4 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 7aeca5db7916..c7c5de116720 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -2949,7 +2949,11 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid) ns->queue = blk_mq_init_queue(ctrl->tagset); if (IS_ERR(ns->queue)) goto out_free_ns; + queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue); + if (ctrl->ops->flags & NVME_F_PCI_P2PDMA) + queue_flag_set_unlocked(QUEUE_FLAG_PCI_P2PDMA, ns->queue); + ns->queue->queuedata = ns; ns->ctrl = ctrl; diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index d733b14ede9d..1fb2b6603d49 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -290,6 +290,7 @@ struct nvme_ctrl_ops { unsigned int flags; #define NVME_F_FABRICS (1 << 0) #define NVME_F_METADATA_SUPPORTED (1 << 1) +#define NVME_F_PCI_P2PDMA (1 << 2) int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val); int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val); int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val); diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 1fb57fa42dd0..0ebab7ab4d7e 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -796,8 +796,13 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req, goto out; ret = BLK_STS_RESOURCE; - nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, dma_dir, - DMA_ATTR_NO_WARN); + + if (REQ_IS_PCI_P2PDMA(req)) + nr_mapped = pci_p2pdma_map_sg(dev->dev, iod->sg, iod->nents, + dma_dir); + else + nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, + dma_dir, DMA_ATTR_NO_WARN); if (!nr_mapped) goto out; @@ -842,7 +847,12 @@ static void 
nvme_unmap_data(struct nvme_dev *dev, struct request *req) DMA_TO_DEVICE : DMA_FROM_DEVICE; if (iod->nents) { - dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir); + if (REQ_IS_PCI_P2PDMA(req)) + pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents, + dma_dir); + else + dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir); + if (blk_integrity_rq(req)) { if (req_op(req) == REQ_OP_READ) nvme_dif_remap(req, nvme_dif_complete); @@ -2426,7 +2436,8 @@ static int nvme_pci_reg_read64(struct nvme_ctrl *ctrl, u32 off, u64 *val) static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = { .name = "pcie", .module = THIS_MODULE, - .flags = NVME_F_METADATA_SUPPORTED, + .flags = NVME_F_METADATA_SUPPORTED | + NVME_F_PCI_P2PDMA, .reg_read32 = nvme_pci_reg_read32, .reg_write32 = nvme_pci_reg_write32, .reg_read64 = nvme_pci_reg_read64, From patchwork Mon Mar 12 19:35:24 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Logan Gunthorpe X-Patchwork-Id: 884822 X-Patchwork-Delegate: bhelgaas@google.com Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-pci-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=deltatee.com Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 400St526NBz9sRX for ; Tue, 13 Mar 2018 06:37:01 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932322AbeCLTfl (ORCPT ); Mon, 12 Mar 2018 15:35:41 -0400 Received: from ale.deltatee.com ([207.54.116.67]:54656 "EHLO ale.deltatee.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932289AbeCLTfi (ORCPT ); Mon, 12 Mar 2018 15:35:38 -0400 Received: from cgy1-donard.priv.deltatee.com 
([172.16.1.31]) by ale.deltatee.com with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from ) id 1evTEM-00066C-Hc; Mon, 12 Mar 2018 13:35:35 -0600 Received: from gunthorp by cgy1-donard.priv.deltatee.com with local (Exim 4.89) (envelope-from ) id 1evTEH-0000lO-BP; Mon, 12 Mar 2018 13:35:29 -0600 From: Logan Gunthorpe To: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org, linux-nvdimm@lists.01.org, linux-block@vger.kernel.org Cc: Stephen Bates , Christoph Hellwig , Jens Axboe , Keith Busch , Sagi Grimberg , Bjorn Helgaas , Jason Gunthorpe , Max Gurtovoy , Dan Williams , =?utf-8?b?SsOpcsO0bWUgR2xp?= =?utf-8?q?sse?= , Benjamin Herrenschmidt , Alex Williamson , Logan Gunthorpe Date: Mon, 12 Mar 2018 13:35:24 -0600 Message-Id: <20180312193525.2855-11-logang@deltatee.com> X-Mailer: git-send-email 2.11.0 In-Reply-To: <20180312193525.2855-1-logang@deltatee.com> References: <20180312193525.2855-1-logang@deltatee.com> X-SA-Exim-Connect-IP: 172.16.1.31 X-SA-Exim-Rcpt-To: linux-nvme@lists.infradead.org, linux-nvdimm@lists.01.org, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-rdma@vger.kernel.org, linux-block@vger.kernel.org, sbates@raithlin.com, hch@lst.de, axboe@kernel.dk, sagi@grimberg.me, bhelgaas@google.com, jgg@mellanox.com, maxg@mellanox.com, keith.busch@intel.com, dan.j.williams@intel.com, benh@kernel.crashing.org, jglisse@redhat.com, alex.williamson@redhat.com, logang@deltatee.com X-SA-Exim-Mail-From: gunthorp@deltatee.com X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on ale.deltatee.com X-Spam-Level: X-Spam-Status: No, score=-6.7 required=5.0 tests=ALL_TRUSTED,BAYES_00, MYRULES_NO_TEXT,T_RP_MATCHES_RCVD autolearn=no autolearn_force=no version=3.4.1 Subject: [PATCH v3 10/11] nvme-pci: Add a quirk for a pseudo CMB X-SA-Exim-Version: 4.2.1 (built Tue, 02 Aug 2016 21:08:31 +0000) X-SA-Exim-Scanned: Yes (on ale.deltatee.com) Sender: 
linux-pci-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pci@vger.kernel.org

Introduce a quirk to use CMB-like memory on older devices that have an
exposed BAR but do not advertise support for using CMBLOC and CMBSZ.

We'd like to use some of these older cards to test P2P memory.

Signed-off-by: Logan Gunthorpe Reviewed-by: Sagi Grimberg --- drivers/nvme/host/nvme.h | 7 +++++++ drivers/nvme/host/pci.c | 24 ++++++++++++++++++++---- 2 files changed, 27 insertions(+), 4 deletions(-) diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 1fb2b6603d49..d1381bfc40f1 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -83,6 +83,13 @@ enum nvme_quirks { * Supports the LighNVM command set if indicated in vs[1]. */ NVME_QUIRK_LIGHTNVM = (1 << 6), + + /* + * Pseudo CMB Support on BAR 4. For adapters like the Microsemi + * NVRAM that have CMB-like memory on a BAR but do not set + * CMBLOC or CMBSZ. + */ + NVME_QUIRK_PSEUDO_CMB_BAR4 = (1 << 7), }; /* diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 0ebab7ab4d7e..a798e08a07bc 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -1683,6 +1683,13 @@ static ssize_t nvme_cmb_show(struct device *dev, } static DEVICE_ATTR(cmb, S_IRUGO, nvme_cmb_show, NULL); +static u32 nvme_pseudo_cmbsz(struct pci_dev *pdev, int bar) +{ + return NVME_CMBSZ_WDS | NVME_CMBSZ_RDS | + (((ilog2(SZ_16M) - 12) / 4) << NVME_CMBSZ_SZU_SHIFT) | + ((pci_resource_len(pdev, bar) / SZ_16M) << NVME_CMBSZ_SZ_SHIFT); +} + static u64 nvme_cmb_size_unit(struct nvme_dev *dev) { u8 szu = (dev->cmbsz >> NVME_CMBSZ_SZU_SHIFT) & NVME_CMBSZ_SZU_MASK; @@ -1702,10 +1709,15 @@ static void nvme_map_cmb(struct nvme_dev *dev) struct pci_dev *pdev = to_pci_dev(dev->dev); int bar; - dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ); - if (!dev->cmbsz) - return; - dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC); + if (dev->ctrl.quirks & NVME_QUIRK_PSEUDO_CMB_BAR4) { + dev->cmbsz =
nvme_pseudo_cmbsz(pdev, 4); + dev->cmbloc = 4; + } else { + dev->cmbsz = readl(dev->bar + NVME_REG_CMBSZ); + if (!dev->cmbsz) + return; + dev->cmbloc = readl(dev->bar + NVME_REG_CMBLOC); + } size = nvme_cmb_size_unit(dev) * nvme_cmb_size(dev); offset = nvme_cmb_size_unit(dev) * NVME_CMB_OFST(dev->cmbloc); @@ -2719,6 +2731,10 @@ static const struct pci_device_id nvme_id_table[] = { .driver_data = NVME_QUIRK_LIGHTNVM, }, { PCI_DEVICE(0x1d1d, 0x2807), /* CNEX WL */ .driver_data = NVME_QUIRK_LIGHTNVM, }, + { PCI_DEVICE(0x11f8, 0xf117), /* Microsemi NVRAM adaptor */ + .driver_data = NVME_QUIRK_PSEUDO_CMB_BAR4, }, + { PCI_DEVICE(0x1db1, 0x0002), /* Everspin nvNitro adaptor */ + .driver_data = NVME_QUIRK_PSEUDO_CMB_BAR4, }, { PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) }, { PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001) }, { PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) }, From patchwork Mon Mar 12 19:35:25 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Logan Gunthorpe X-Patchwork-Id: 884829 X-Patchwork-Delegate: bhelgaas@google.com Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=linux-pci-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=deltatee.com Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 400SwP1LlFz9sSY for ; Tue, 13 Mar 2018 06:39:01 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751678AbeCLThE (ORCPT ); Mon, 12 Mar 2018 15:37:04 -0400 Received: from ale.deltatee.com ([207.54.116.67]:54646 "EHLO ale.deltatee.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932308AbeCLTfj (ORCPT ); Mon, 12 Mar 2018 15:35:39 -0400 
Received: from cgy1-donard.priv.deltatee.com ([172.16.1.31]) by ale.deltatee.com with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from ) id 1evTEK-00066D-3b; Mon, 12 Mar 2018 13:35:34 -0600 Received: from gunthorp by cgy1-donard.priv.deltatee.com with local (Exim 4.89) (envelope-from ) id 1evTEH-0000lR-ER; Mon, 12 Mar 2018 13:35:29 -0600 From: Logan Gunthorpe To: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org, linux-nvdimm@lists.01.org, linux-block@vger.kernel.org Cc: Stephen Bates , Christoph Hellwig , Jens Axboe , Keith Busch , Sagi Grimberg , Bjorn Helgaas , Jason Gunthorpe , Max Gurtovoy , Dan Williams , =?utf-8?b?SsOpcsO0bWUgR2xp?= =?utf-8?q?sse?= , Benjamin Herrenschmidt , Alex Williamson , Logan Gunthorpe , Steve Wise Date: Mon, 12 Mar 2018 13:35:25 -0600 Message-Id: <20180312193525.2855-12-logang@deltatee.com> X-Mailer: git-send-email 2.11.0 In-Reply-To: <20180312193525.2855-1-logang@deltatee.com> References: <20180312193525.2855-1-logang@deltatee.com> MIME-Version: 1.0 X-SA-Exim-Connect-IP: 172.16.1.31 X-SA-Exim-Rcpt-To: linux-nvme@lists.infradead.org, linux-nvdimm@lists.01.org, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-rdma@vger.kernel.org, linux-block@vger.kernel.org, sbates@raithlin.com, hch@lst.de, axboe@kernel.dk, sagi@grimberg.me, bhelgaas@google.com, jgg@mellanox.com, maxg@mellanox.com, keith.busch@intel.com, dan.j.williams@intel.com, benh@kernel.crashing.org, jglisse@redhat.com, alex.williamson@redhat.com, logang@deltatee.com, swise@opengridcomputing.com X-SA-Exim-Mail-From: gunthorp@deltatee.com X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on ale.deltatee.com X-Spam-Level: X-Spam-Status: No, score=-6.7 required=5.0 tests=ALL_TRUSTED,BAYES_00, MYRULES_FREE, T_RP_MATCHES_RCVD autolearn=no autolearn_force=no version=3.4.1 Subject: [PATCH v3 11/11] nvmet: Optionally use PCI P2P memory X-SA-Exim-Version: 4.2.1 (built Tue, 02 
Aug 2016 21:08:31 +0000) X-SA-Exim-Scanned: Yes (on ale.deltatee.com) Sender: linux-pci-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pci@vger.kernel.org We create a configfs attribute in each nvme-fabrics target port to enable p2p memory use. When enabled, the port will only use the p2p memory if a p2p memory device can be found which is behind the same switch as the RDMA port and all the block devices in use. If the user enabled it and no devices are found, then the system will silently fall back on using regular memory. If appropriate, that port will allocate memory for the RDMA buffers for queues from the p2pmem device, falling back to system memory should anything fail. Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would save an extra PCI transfer as the NVME card could just take the data out of its own memory. However, at this time, cards with CMB buffers don't seem to be available. Signed-off-by: Stephen Bates Signed-off-by: Steve Wise [hch: partial rewrite of the initial code] Signed-off-by: Christoph Hellwig Signed-off-by: Logan Gunthorpe --- drivers/nvme/target/configfs.c | 67 ++++++++++++++++++++++++++ drivers/nvme/target/core.c | 106 ++++++++++++++++++++++++++++++++++++++++- drivers/nvme/target/io-cmd.c | 3 ++ drivers/nvme/target/nvmet.h | 12 +++++ drivers/nvme/target/rdma.c | 32 +++++++++++-- 5 files changed, 214 insertions(+), 6 deletions(-) diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c index e6b2d2af81b6..6ca8c712f0d3 100644 --- a/drivers/nvme/target/configfs.c +++ b/drivers/nvme/target/configfs.c @@ -17,6 +17,8 @@ #include #include #include +#include +#include #include "nvmet.h" @@ -867,12 +869,77 @@ static void nvmet_port_release(struct config_item *item) kfree(port); } +#ifdef CONFIG_PCI_P2PDMA +static ssize_t nvmet_p2pmem_show(struct config_item *item, char *page) +{ + struct nvmet_port *port = to_nvmet_port(item); + + if (!port->use_p2pmem) + return sprintf(page,
"none\n"); + + if (!port->p2p_dev) + return sprintf(page, "auto\n"); + + return sprintf(page, "%s\n", pci_name(port->p2p_dev)); +} + +static ssize_t nvmet_p2pmem_store(struct config_item *item, + const char *page, size_t count) +{ + struct nvmet_port *port = to_nvmet_port(item); + struct device *dev; + struct pci_dev *p2p_dev = NULL; + bool use_p2pmem; + + switch (page[0]) { + case 'y': + case 'Y': + case 'a': + case 'A': + use_p2pmem = true; + break; + case 'n': + case 'N': + use_p2pmem = false; + break; + default: + dev = bus_find_device_by_name(&pci_bus_type, NULL, page); + if (!dev) { + pr_err("No such PCI device: %s\n", page); + return -ENODEV; + } + + use_p2pmem = true; + p2p_dev = to_pci_dev(dev); + + if (!pci_has_p2pmem(p2p_dev)) { + pr_err("PCI device has no peer-to-peer memory: %s\n", + page); + pci_dev_put(p2p_dev); + return -ENODEV; + } + } + + down_write(&nvmet_config_sem); + port->use_p2pmem = use_p2pmem; + pci_dev_put(port->p2p_dev); + port->p2p_dev = p2p_dev; + up_write(&nvmet_config_sem); + + return count; +} +CONFIGFS_ATTR(nvmet_, p2pmem); +#endif /* CONFIG_PCI_P2PDMA */ + static struct configfs_attribute *nvmet_port_attrs[] = { &nvmet_attr_addr_adrfam, &nvmet_attr_addr_treq, &nvmet_attr_addr_traddr, &nvmet_attr_addr_trsvcid, &nvmet_attr_addr_trtype, +#ifdef CONFIG_PCI_P2PDMA + &nvmet_attr_p2pmem, +#endif NULL, }; diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index a78029e4e5f4..ab3cc7135ae8 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -15,6 +15,7 @@ #include #include #include +#include #include "nvmet.h" @@ -271,6 +272,25 @@ void nvmet_put_namespace(struct nvmet_ns *ns) percpu_ref_put(&ns->ref); } +static int nvmet_p2pdma_add_client(struct nvmet_ctrl *ctrl, + struct nvmet_ns *ns) +{ + int ret; + + if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) { + pr_err("peer-to-peer DMA is not supported by %s\n", + ns->device_path); + return -EINVAL; + } + + ret = pci_p2pdma_add_client(&ctrl->p2p_clients, 
nvmet_ns_dev(ns)); + if (ret) + pr_err("failed to add peer-to-peer DMA client %s: %d\n", + ns->device_path, ret); + + return ret; +} + int nvmet_ns_enable(struct nvmet_ns *ns) { struct nvmet_subsys *subsys = ns->subsys; @@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns) if (ret) goto out_blkdev_put; + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) { + if (ctrl->p2p_dev) { + ret = nvmet_p2pdma_add_client(ctrl, ns); + if (ret) + goto out_remove_clients; + } + } + if (ns->nsid > subsys->max_nsid) subsys->max_nsid = ns->nsid; @@ -328,6 +356,9 @@ int nvmet_ns_enable(struct nvmet_ns *ns) out_unlock: mutex_unlock(&subsys->lock); return ret; +out_remove_clients: + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) + pci_p2pdma_remove_client(&ctrl->p2p_clients, nvmet_ns_dev(ns)); out_blkdev_put: blkdev_put(ns->bdev, FMODE_WRITE|FMODE_READ); ns->bdev = NULL; @@ -363,8 +394,10 @@ void nvmet_ns_disable(struct nvmet_ns *ns) percpu_ref_exit(&ns->ref); mutex_lock(&subsys->lock); - list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) { + pci_p2pdma_remove_client(&ctrl->p2p_clients, nvmet_ns_dev(ns)); nvmet_add_async_event(ctrl, NVME_AER_TYPE_NOTICE, 0, 0); + } if (ns->bdev) blkdev_put(ns->bdev, FMODE_WRITE|FMODE_READ); @@ -764,6 +797,74 @@ bool nvmet_host_allowed(struct nvmet_req *req, struct nvmet_subsys *subsys, return __nvmet_host_allowed(subsys, hostnqn); } +/* + * If use_p2pmem is set, we will try to use P2P memory for the SGL lists for + * I/O commands. This requires the PCI p2p device to be compatible with the + * backing device for every namespace on this controller.
+ */ +static void nvmet_setup_p2pmem(struct nvmet_ctrl *ctrl, struct nvmet_req *req) +{ + struct nvmet_ns *ns; + int ret; + + if (!req->port->use_p2pmem || !req->p2p_client) + return; + + mutex_lock(&ctrl->subsys->lock); + + ret = pci_p2pdma_add_client(&ctrl->p2p_clients, req->p2p_client); + if (ret) { + pr_err("failed adding peer-to-peer DMA client %s: %d\n", + dev_name(req->p2p_client), ret); + goto free_devices; + } + + list_for_each_entry_rcu(ns, &ctrl->subsys->namespaces, dev_link) { + ret = nvmet_p2pdma_add_client(ctrl, ns); + if (ret) + goto free_devices; + } + + if (req->port->p2p_dev) { + if (!pci_p2pdma_assign_provider(req->port->p2p_dev, + &ctrl->p2p_clients)) { + pr_info("peer-to-peer memory on %s is not supported\n", + pci_name(req->port->p2p_dev)); + goto free_devices; + } + ctrl->p2p_dev = pci_dev_get(req->port->p2p_dev); + } else { + ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients); + if (!ctrl->p2p_dev) { + pr_info("no supported peer-to-peer memory devices found\n"); + goto free_devices; + } + } + + mutex_unlock(&ctrl->subsys->lock); + + pr_info("using peer-to-peer memory on %s\n", pci_name(ctrl->p2p_dev)); + return; + +free_devices: + pci_p2pdma_client_list_free(&ctrl->p2p_clients); + mutex_unlock(&ctrl->subsys->lock); +} + +static void nvmet_release_p2pmem(struct nvmet_ctrl *ctrl) +{ + if (!ctrl->p2p_dev) + return; + + mutex_lock(&ctrl->subsys->lock); + + pci_p2pdma_client_list_free(&ctrl->p2p_clients); + pci_dev_put(ctrl->p2p_dev); + ctrl->p2p_dev = NULL; + + mutex_unlock(&ctrl->subsys->lock); +} + u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn, struct nvmet_req *req, u32 kato, struct nvmet_ctrl **ctrlp) { @@ -803,6 +904,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn, INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work); INIT_LIST_HEAD(&ctrl->async_events); + INIT_LIST_HEAD(&ctrl->p2p_clients); memcpy(ctrl->subsysnqn, subsysnqn, NVMF_NQN_SIZE); memcpy(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE); @@ 
-858,6 +960,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn, ctrl->kato = DIV_ROUND_UP(kato, 1000); } nvmet_start_keep_alive_timer(ctrl); + nvmet_setup_p2pmem(ctrl, req); mutex_lock(&subsys->lock); list_add_tail(&ctrl->subsys_entry, &subsys->ctrls); @@ -894,6 +997,7 @@ static void nvmet_ctrl_free(struct kref *ref) flush_work(&ctrl->async_event_work); cancel_work_sync(&ctrl->fatal_err_work); + nvmet_release_p2pmem(ctrl); ida_simple_remove(&cntlid_ida, ctrl->cntlid); kfree(ctrl->sqs); diff --git a/drivers/nvme/target/io-cmd.c b/drivers/nvme/target/io-cmd.c index 28bbdff4a88b..a213f8fc3bf3 100644 --- a/drivers/nvme/target/io-cmd.c +++ b/drivers/nvme/target/io-cmd.c @@ -56,6 +56,9 @@ static void nvmet_execute_rw(struct nvmet_req *req) op = REQ_OP_READ; } + if (is_pci_p2pdma_page(sg_page(req->sg))) + op_flags |= REQ_PCI_P2PDMA; + sector = le64_to_cpu(req->cmd->rw.slba); sector <<= (req->ns->blksize_shift - 9); diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h index 417f6c0331cc..e05afdbdaa10 100644 --- a/drivers/nvme/target/nvmet.h +++ b/drivers/nvme/target/nvmet.h @@ -64,6 +64,11 @@ static inline struct nvmet_ns *to_nvmet_ns(struct config_item *item) return container_of(to_config_group(item), struct nvmet_ns, group); } +static inline struct device *nvmet_ns_dev(struct nvmet_ns *ns) +{ + return disk_to_dev(ns->bdev->bd_disk); +} + struct nvmet_cq { u16 qid; u16 size; @@ -98,6 +103,8 @@ struct nvmet_port { struct list_head referrals; void *priv; bool enabled; + bool use_p2pmem; + struct pci_dev *p2p_dev; }; static inline struct nvmet_port *to_nvmet_port(struct config_item *item) @@ -131,6 +138,8 @@ struct nvmet_ctrl { struct work_struct fatal_err_work; struct nvmet_fabrics_ops *ops; + struct pci_dev *p2p_dev; + struct list_head p2p_clients; char subsysnqn[NVMF_NQN_FIELD_LEN]; char hostnqn[NVMF_NQN_FIELD_LEN]; @@ -232,6 +241,9 @@ struct nvmet_req { void (*execute)(struct nvmet_req *req); struct nvmet_fabrics_ops *ops; + + struct 
pci_dev *p2p_dev; + struct device *p2p_client; }; static inline void nvmet_set_status(struct nvmet_req *req, u16 status) diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c index 978e169c11bf..84db1022664f 100644 --- a/drivers/nvme/target/rdma.c +++ b/drivers/nvme/target/rdma.c @@ -23,6 +23,7 @@ #include #include #include +#include #include #include @@ -430,8 +431,13 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp) rsp->req.sg_cnt, nvmet_data_dir(&rsp->req)); } - if (rsp->req.sg != &rsp->cmd->inline_sg) - sgl_free(rsp->req.sg); + if (rsp->req.sg != &rsp->cmd->inline_sg) { + if (rsp->req.p2p_dev) + pci_p2pmem_free_sgl(rsp->req.p2p_dev, rsp->req.sg, + rsp->req.sg_cnt); + else + sgl_free(rsp->req.sg); + } if (unlikely(!list_empty_careful(&queue->rsp_wr_wait_list))) nvmet_rdma_process_wr_wait_list(queue); @@ -567,15 +573,29 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp, u64 addr = le64_to_cpu(sgl->addr); u32 len = get_unaligned_le24(sgl->length); u32 key = get_unaligned_le32(sgl->key); + struct pci_dev *p2p_dev = NULL; int ret; /* no data command? 
*/ if (!len) return 0; - rsp->req.sg = sgl_alloc(len, GFP_KERNEL, &rsp->req.sg_cnt); - if (!rsp->req.sg) - return NVME_SC_INTERNAL; + if (rsp->queue->nvme_sq.ctrl) + p2p_dev = rsp->queue->nvme_sq.ctrl->p2p_dev; + + rsp->req.p2p_dev = NULL; + if (rsp->queue->nvme_sq.qid && p2p_dev) { + ret = pci_p2pmem_alloc_sgl(p2p_dev, &rsp->req.sg, + &rsp->req.sg_cnt, len); + if (!ret) + rsp->req.p2p_dev = p2p_dev; + } + + if (!rsp->req.p2p_dev) { + rsp->req.sg = sgl_alloc(len, GFP_KERNEL, &rsp->req.sg_cnt); + if (!rsp->req.sg) + return NVME_SC_INTERNAL; + } ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num, rsp->req.sg, rsp->req.sg_cnt, 0, addr, key, @@ -658,6 +678,8 @@ static void nvmet_rdma_handle_command(struct nvmet_rdma_queue *queue, cmd->send_sge.addr, cmd->send_sge.length, DMA_TO_DEVICE); + cmd->req.p2p_client = &queue->dev->device->dev; + if (!nvmet_req_init(&cmd->req, &queue->nvme_cq, &queue->nvme_sq, &nvmet_rdma_ops)) return;
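The allocation path in nvmet_rdma_map_sgl_keyed() above tries the peer-to-peer provider first and falls back to system memory if that fails, and nvmet_rdma_release_rsp() uses req.p2p_dev to decide which free routine to call. A minimal user-space sketch of that pattern follows, for illustration only. The names p2p_pool, request, pool_alloc(), pool_free(), request_map() and request_release() are hypothetical and are not part of this series; plain heap memory stands in for BAR memory, and the queue-id check from the patch is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a peer-to-peer memory provider. */
struct p2p_pool {
	size_t capacity;	/* total bytes the pool may hand out */
	size_t used;		/* bytes currently handed out */
};

/* Hypothetical request; pool records where buf came from. */
struct request {
	void *buf;
	size_t len;
	struct p2p_pool *pool;	/* non-NULL only if buf came from the pool */
};

/* Returns 0 on success, -1 if the pool cannot satisfy the request. */
static int pool_alloc(struct p2p_pool *pool, void **buf, size_t len)
{
	if (len > pool->capacity - pool->used)
		return -1;

	*buf = malloc(len);	/* heap memory stands in for BAR memory */
	if (!*buf)
		return -1;

	pool->used += len;
	return 0;
}

static void pool_free(struct p2p_pool *pool, void *buf, size_t len)
{
	assert(pool->used >= len);
	free(buf);
	pool->used -= len;
}

/* Try the pool first; fall back to system memory if that fails. */
static int request_map(struct request *req, struct p2p_pool *pool, size_t len)
{
	req->pool = NULL;
	req->len = len;

	if (pool && pool_alloc(pool, &req->buf, len) == 0)
		req->pool = pool;

	if (!req->pool) {
		req->buf = malloc(len);
		if (!req->buf)
			return -1;
	}

	return 0;
}

/* Free through whichever allocator produced the buffer. */
static void request_release(struct request *req)
{
	if (req->pool)
		pool_free(req->pool, req->buf, req->len);
	else
		free(req->buf);

	req->buf = NULL;
	req->pool = NULL;
}
```

Recording the provider in the request at allocation time is what lets the release path stay correct even when some requests on the same queue fell back to system memory.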