From patchwork Thu Dec 20 08:23:31 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 1016622 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=kvm-ppc-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=ozlabs.ru Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43L4bx5Rksz9sMM for ; Thu, 20 Dec 2018 19:26:37 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730794AbeLTIYC (ORCPT ); Thu, 20 Dec 2018 03:24:02 -0500 Received: from ozlabs.ru ([107.173.13.209]:53954 "EHLO ozlabs.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727667AbeLTIYC (ORCPT ); Thu, 20 Dec 2018 03:24:02 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id BC72BAE8010F; Thu, 20 Dec 2018 03:23:57 -0500 (EST) From: Alexey Kardashevskiy To: linuxppc-dev@lists.ozlabs.org Cc: Alexey Kardashevskiy , David Gibson , kvm-ppc@vger.kernel.org, kvm@vger.kernel.org, Alistair Popple , Reza Arbab , Sam Bobroff , Piotr Jaroszynski , =?utf-8?q?Leonardo_Augusto_Guimar=C3=A3es_?= =?utf-8?q?Garcia?= , Jose Ricardo Ziviani , Daniel Henrique Barboza , Alex Williamson , Paul Mackerras , linux-kernel@vger.kernel.org, Christoph Hellwig Subject: [PATCH kernel v7 01/20] powerpc/ioda/npu: Call skiboot's hot reset hook when disabling NPU2 Date: Thu, 20 Dec 2018 19:23:31 +1100 Message-Id: <20181220082350.58113-2-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20181220082350.58113-1-aik@ozlabs.ru> References: <20181220082350.58113-1-aik@ozlabs.ru> Sender: kvm-ppc-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm-ppc@vger.kernel.org The skiboot firmware has a hot reset handler which fences the NVIDIA V100 GPU RAM on Witherspoons and makes accesses a no-op instead of throwing HMIs: https://github.com/open-power/skiboot/commit/fca2b2b839a67 We are now going to pass the V100 through via VFIO, which almost certainly involves KVM guests, and these are often terminated without getting a chance to offline GPU RAM, so we end up with a running machine with misconfigured memory. Accessing this memory produces hypervisor maintenance interrupts (HMIs) which bring the host down. To suppress HMIs, this wires the hot reset hook up to vfio_pci_disable() via pci_disable_device(), which switches the NPU2 to a safe mode and prevents HMIs. Signed-off-by: Alexey Kardashevskiy Acked-by: Alistair Popple Reviewed-by: David Gibson --- Changes: v2: * updated the commit log --- arch/powerpc/platforms/powernv/pci-ioda.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 9ee7a30..29c6837 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -3676,6 +3676,15 @@ static void pnv_pci_release_device(struct pci_dev *pdev) pnv_ioda_release_pe(pe); } +static void pnv_npu_disable_device(struct pci_dev *pdev) +{ + struct eeh_dev *edev = pci_dev_to_eeh_dev(pdev); + struct eeh_pe *eehpe = edev ?
edev->pe : NULL; + + if (eehpe && eeh_ops && eeh_ops->reset) + eeh_ops->reset(eehpe, EEH_RESET_HOT); +} + static void pnv_pci_ioda_shutdown(struct pci_controller *hose) { struct pnv_phb *phb = hose->private_data; @@ -3720,6 +3729,7 @@ static const struct pci_controller_ops pnv_npu_ioda_controller_ops = { .reset_secondary_bus = pnv_pci_reset_secondary_bus, .dma_set_mask = pnv_npu_dma_set_mask, .shutdown = pnv_pci_ioda_shutdown, + .disable_device = pnv_npu_disable_device, }; static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = { From patchwork Thu Dec 20 08:23:32 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 1016621 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=kvm-ppc-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=ozlabs.ru Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43L4bt4CK4z9sMr for ; Thu, 20 Dec 2018 19:26:34 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730726AbeLTIYG (ORCPT ); Thu, 20 Dec 2018 03:24:06 -0500 Received: from ozlabs.ru ([107.173.13.209]:54003 "EHLO ozlabs.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730755AbeLTIYG (ORCPT ); Thu, 20 Dec 2018 03:24:06 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id 7740BAE80110; Thu, 20 Dec 2018 03:24:01 -0500 (EST) From: Alexey Kardashevskiy To: linuxppc-dev@lists.ozlabs.org Cc: Alexey Kardashevskiy , David Gibson , kvm-ppc@vger.kernel.org, kvm@vger.kernel.org, Alistair Popple , Reza Arbab , Sam Bobroff , Piotr Jaroszynski , =?utf-8?q?Leonardo_Augusto_Guimar=C3=A3es_?= =?utf-8?q?Garcia?= , Jose Ricardo Ziviani , Daniel Henrique Barboza , Alex Williamson , Paul Mackerras , linux-kernel@vger.kernel.org, Christoph Hellwig Subject: [PATCH kernel v7 02/20] powerpc/mm/iommu/vfio_spapr_tce: Change mm_iommu_get to reference a region Date: Thu, 20 Dec 2018 19:23:32 +1100 Message-Id: <20181220082350.58113-3-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20181220082350.58113-1-aik@ozlabs.ru> References: <20181220082350.58113-1-aik@ozlabs.ru> Sender: kvm-ppc-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm-ppc@vger.kernel.org Normally mm_iommu_get() should add a reference and mm_iommu_put() should remove it. Historically, however, mm_iommu_find() only looked a region up without referencing it, while mm_iommu_get() did both allocation and referencing. We are going to add another helper to preregister device memory, so instead of having one helper which both pre-registers the normal memory and references the region, we need separate helpers for pre-registering and referencing. This renames: - mm_iommu_get to mm_iommu_new; - mm_iommu_find to mm_iommu_get. This changes mm_iommu_get() to reference the region so the name now reflects what it does. This removes the check for an exact match from mm_iommu_new() as we want it to fail on existing regions; mm_iommu_get() should be used instead.
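To illustrate the resulting API contract, a caller is now expected to pair the helpers as in this sketch (for review purposes only, not part of the patch; error handling trimmed):

	struct mm_iommu_table_group_mem_t *mem;
	long ret;

	/* Look up an existing preregistered region and take a reference */
	mem = mm_iommu_get(mm, ua, entries);
	if (!mem) {
		/* None found: preregister one; this also references it */
		ret = mm_iommu_new(mm, ua, entries, &mem);
		if (ret)
			return ret;
	}
	/* ... use the region ... */
	mm_iommu_put(mm, mem);	/* drops the reference taken above */

This mirrors what tce_iommu_register_pages() does in the vfio_iommu_spapr_tce.c hunk below.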
Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson --- Changes: v5: * fixed a bug with uninitialized @found in tce_iommu_unregister_pages() * reworded the commit log v4: * squashed "powerpc/mm/iommu: Make mm_iommu_new() fail on existing regions" into this v2: * merged 2 patches into one --- arch/powerpc/include/asm/mmu_context.h | 4 +-- arch/powerpc/mm/mmu_context_iommu.c | 19 +++++++------- drivers/vfio/vfio_iommu_spapr_tce.c | 35 +++++++++++++++++--------- 3 files changed, 34 insertions(+), 24 deletions(-) diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index c05efd2..268e112 100644 --- a/arch/powerpc/include/asm/mmu_context.h +++ b/arch/powerpc/include/asm/mmu_context.h @@ -21,7 +21,7 @@ struct mm_iommu_table_group_mem_t; extern int isolate_lru_page(struct page *page); /* from internal.h */ extern bool mm_iommu_preregistered(struct mm_struct *mm); -extern long mm_iommu_get(struct mm_struct *mm, +extern long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries, struct mm_iommu_table_group_mem_t **pmem); extern long mm_iommu_put(struct mm_struct *mm, @@ -32,7 +32,7 @@ extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct mm_struct *mm, unsigned long ua, unsigned long size); extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm( struct mm_struct *mm, unsigned long ua, unsigned long size); -extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm, +extern struct mm_iommu_table_group_mem_t *mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries); extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem, unsigned long ua, unsigned int pageshift, unsigned long *hpa); diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c index 0741d90..25a4b7f7 100644 --- a/arch/powerpc/mm/mmu_context_iommu.c +++ b/arch/powerpc/mm/mmu_context_iommu.c @@ -89,7 +89,7 @@ bool mm_iommu_preregistered(struct mm_struct *mm) } EXPORT_SYMBOL_GPL(mm_iommu_preregistered); -long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, +long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries, struct mm_iommu_table_group_mem_t **pmem) { struct mm_iommu_table_group_mem_t *mem; @@ -100,12 +100,6 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) { - if ((mem->ua == ua) && (mem->entries == entries)) { - ++mem->used; - *pmem = mem; - goto unlock_exit; - } - /* Overlap? 
*/ if ((mem->ua < (ua + (entries << PAGE_SHIFT))) && (ua < (mem->ua + @@ -192,7 +186,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries, return ret; } -EXPORT_SYMBOL_GPL(mm_iommu_get); +EXPORT_SYMBOL_GPL(mm_iommu_new); static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem) { @@ -308,21 +302,26 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(struct mm_struct *mm, return ret; } -struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm, +struct mm_iommu_table_group_mem_t *mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries) { struct mm_iommu_table_group_mem_t *mem, *ret = NULL; + mutex_lock(&mem_list_mutex); + list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) { if ((mem->ua == ua) && (mem->entries == entries)) { ret = mem; + ++mem->used; break; } } + mutex_unlock(&mem_list_mutex); + return ret; } -EXPORT_SYMBOL_GPL(mm_iommu_find); +EXPORT_SYMBOL_GPL(mm_iommu_get); long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem, unsigned long ua, unsigned int pageshift, unsigned long *hpa) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index ad63725..1d8b889 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -152,11 +152,12 @@ static long tce_iommu_unregister_pages(struct tce_container *container, struct mm_iommu_table_group_mem_t *mem; struct tce_iommu_prereg *tcemem; bool found = false; + long ret; if ((vaddr & ~PAGE_MASK) || (size & ~PAGE_MASK)) return -EINVAL; - mem = mm_iommu_find(container->mm, vaddr, size >> PAGE_SHIFT); + mem = mm_iommu_get(container->mm, vaddr, size >> PAGE_SHIFT); if (!mem) return -ENOENT; @@ -168,9 +169,13 @@ static long tce_iommu_unregister_pages(struct tce_container *container, } if (!found) - return -ENOENT; + ret = -ENOENT; + else + ret = tce_iommu_prereg_free(container, tcemem); - return tce_iommu_prereg_free(container, tcemem); + mm_iommu_put(container->mm, mem); + + return ret; } static long tce_iommu_register_pages(struct tce_container *container, @@ -185,22 +190,24 @@ static long tce_iommu_register_pages(struct tce_container *container, ((vaddr + size) < vaddr)) return -EINVAL; - mem = mm_iommu_find(container->mm, vaddr, entries); + mem = mm_iommu_get(container->mm, vaddr, entries); if (mem) { list_for_each_entry(tcemem, &container->prereg_list, next) { - if (tcemem->mem == mem) - return -EBUSY; + if (tcemem->mem == mem) { + ret = -EBUSY; + goto put_exit; + } } + } else { + ret = mm_iommu_new(container->mm, vaddr, entries, &mem); + if (ret) + return ret; } - ret = mm_iommu_get(container->mm, vaddr, entries, &mem); - if (ret) - return ret; - tcemem = kzalloc(sizeof(*tcemem), GFP_KERNEL); if (!tcemem) { - mm_iommu_put(container->mm, mem); - return -ENOMEM; + ret = -ENOMEM; + goto put_exit; } tcemem->mem = mem; @@ -209,6 +216,10 @@ static long tce_iommu_register_pages(struct tce_container *container, container->enabled = true; return 0; + +put_exit: + mm_iommu_put(container->mm, mem); + return ret; } static bool tce_page_is_contained(struct page *page, unsigned page_shift) From patchwork Thu Dec 20 08:23:33 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 1016603 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org 
(client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=kvm-ppc-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=ozlabs.ru Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43L4Y65SjVz9sDr for ; Thu, 20 Dec 2018 19:24:10 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730832AbeLTIYJ (ORCPT ); Thu, 20 Dec 2018 03:24:09 -0500 Received: from ozlabs.ru ([107.173.13.209]:54056 "EHLO ozlabs.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730814AbeLTIYJ (ORCPT ); Thu, 20 Dec 2018 03:24:09 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id 27A7FAE801F6; Thu, 20 Dec 2018 03:24:04 -0500 (EST) From: Alexey Kardashevskiy To: linuxppc-dev@lists.ozlabs.org Cc: Alexey Kardashevskiy , David Gibson , kvm-ppc@vger.kernel.org, kvm@vger.kernel.org, Alistair Popple , Reza Arbab , Sam Bobroff , Piotr Jaroszynski , =?utf-8?q?Leonardo_Augusto_Guimar=C3=A3es_?= =?utf-8?q?Garcia?= , Jose Ricardo Ziviani , Daniel Henrique Barboza , Alex Williamson , Paul Mackerras , linux-kernel@vger.kernel.org, Christoph Hellwig Subject: [PATCH kernel v7 03/20] powerpc/vfio/iommu/kvm: Do not pin device memory Date: Thu, 20 Dec 2018 19:23:33 +1100 Message-Id: <20181220082350.58113-4-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20181220082350.58113-1-aik@ozlabs.ru> References: <20181220082350.58113-1-aik@ozlabs.ru> Sender: kvm-ppc-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm-ppc@vger.kernel.org This new memory does not have page structs as it is not plugged into the host, so gup() will fail on it anyway. This adds 2 helpers: - mm_iommu_newdev() to preregister the "memory device" memory so the rest of the API can still be used; - mm_iommu_is_devmem() to know if a physical address is in one of these new regions, which must not be unpinned. This adds @mm to tce_page_is_contained() and iommu_tce_xchg() to test if the memory is device memory and so avoid pfn_to_page(). This adds a check for device memory in mm_iommu_ua_mark_dirty_rm() which does delayed page dirtying.
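The resulting caller pattern is visible in tce_page_is_contained() below; in sketch form:

	unsigned long size = 0;

	/* Device memory: contained only if it backs the whole IOMMU page */
	if (mm_iommu_is_devmem(mm, hpa, page_shift, &size))
		return size == (1UL << page_shift);

	/* Normal memory: a page struct exists, pfn_to_page() is safe */
	page = pfn_to_page(hpa >> PAGE_SHIFT);

This is why mm_iommu_is_devmem() reports the preregistered size starting at @hpa rather than a simple yes/no.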
Signed-off-by: Alexey Kardashevskiy Reviewed-by: Paul Mackerras --- Changes: v6: * added dummy mm_iommu_is_devmem() for !CONFIG_SPAPR_TCE_IOMMU * removed "extern" from c file v5: * mm_iommu_is_devmem() now returns the actual size which might be smaller than the pageshift so tce_page_is_contained() won't do pfn_to_page() if @hpa..@hpa+64K is preregistered but page_shift is bigger than 16 * removed David's r-by because of the change in mm_iommu_is_devmem v4: * added device memory check in the real mode path --- arch/powerpc/include/asm/iommu.h | 5 +- arch/powerpc/include/asm/mmu_context.h | 11 +++ arch/powerpc/kernel/iommu.c | 11 ++- arch/powerpc/kvm/book3s_64_vio.c | 18 ++--- arch/powerpc/mm/mmu_context_iommu.c | 93 +++++++++++++++++++++++--- drivers/vfio/vfio_iommu_spapr_tce.c | 29 +++++--- 6 files changed, 135 insertions(+), 32 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 35db0cb..a8aeac0 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -218,8 +218,9 @@ extern void iommu_register_group(struct iommu_table_group *table_group, extern int iommu_add_device(struct device *dev); extern void iommu_del_device(struct device *dev); extern int __init tce_iommu_bus_notifier_init(void); -extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry, - unsigned long *hpa, enum dma_data_direction *direction); +extern long iommu_tce_xchg(struct mm_struct *mm, struct iommu_table *tbl, + unsigned long entry, unsigned long *hpa, + enum dma_data_direction *direction); #else static inline void iommu_register_group(struct iommu_table_group *table_group, int pci_domain_number, diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index 268e112..c50bd6a 100644 --- a/arch/powerpc/include/asm/mmu_context.h +++ b/arch/powerpc/include/asm/mmu_context.h @@ -24,6 +24,9 @@ extern bool mm_iommu_preregistered(struct mm_struct *mm); extern long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries, struct mm_iommu_table_group_mem_t **pmem); +extern long mm_iommu_newdev(struct mm_struct *mm, unsigned long ua, + unsigned long entries, unsigned long dev_hpa, + struct mm_iommu_table_group_mem_t **pmem); extern long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem); extern void mm_iommu_init(struct mm_struct *mm); @@ -39,8 +42,16 @@ extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem, extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem, unsigned long ua, unsigned int pageshift, unsigned long *hpa); extern void mm_iommu_ua_mark_dirty_rm(struct mm_struct *mm, unsigned long ua); +extern bool mm_iommu_is_devmem(struct mm_struct *mm, unsigned long hpa, + unsigned int pageshift, unsigned long *size); extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem); extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem); +#else +static inline bool mm_iommu_is_devmem(struct mm_struct *mm, unsigned long hpa, + unsigned int pageshift, unsigned long *size) +{ + return false; +} #endif extern void switch_slb(struct task_struct *tsk, struct mm_struct *mm); extern void set_context(unsigned long id, pgd_t *pgd); diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index f0dc680..cbcc615 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -47,6 +47,7 @@ #include <asm/fadump.h> #include <asm/vio.h> #include <asm/tce.h> +#include <asm/mmu_context.h> #define DBG(...)
@@ -993,15 +994,19 @@ int iommu_tce_check_gpa(unsigned long page_shift, unsigned long gpa) } EXPORT_SYMBOL_GPL(iommu_tce_check_gpa); -long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry, - unsigned long *hpa, enum dma_data_direction *direction) +long iommu_tce_xchg(struct mm_struct *mm, struct iommu_table *tbl, + unsigned long entry, unsigned long *hpa, + enum dma_data_direction *direction) { long ret; + unsigned long size = 0; ret = tbl->it_ops->exchange(tbl, entry, hpa, direction); if (!ret && ((*direction == DMA_FROM_DEVICE) || - (*direction == DMA_BIDIRECTIONAL))) + (*direction == DMA_BIDIRECTIONAL)) && + !mm_iommu_is_devmem(mm, *hpa, tbl->it_page_shift, + &size)) SetPageDirty(pfn_to_page(*hpa >> PAGE_SHIFT)); /* if (unlikely(ret)) diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index 62a8d03..532ab797 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -397,12 +397,13 @@ static long kvmppc_tce_validate(struct kvmppc_spapr_tce_table *stt, return H_SUCCESS; } -static void kvmppc_clear_tce(struct iommu_table *tbl, unsigned long entry) +static void kvmppc_clear_tce(struct mm_struct *mm, struct iommu_table *tbl, + unsigned long entry) { unsigned long hpa = 0; enum dma_data_direction dir = DMA_NONE; - iommu_tce_xchg(tbl, entry, &hpa, &dir); + iommu_tce_xchg(mm, tbl, entry, &hpa, &dir); } static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm, @@ -433,7 +434,7 @@ static long kvmppc_tce_iommu_do_unmap(struct kvm *kvm, unsigned long hpa = 0; long ret; - if (WARN_ON_ONCE(iommu_tce_xchg(tbl, entry, &hpa, &dir))) + if (WARN_ON_ONCE(iommu_tce_xchg(kvm->mm, tbl, entry, &hpa, &dir))) return H_TOO_HARD; if (dir == DMA_NONE) @@ -441,7 +442,7 @@ static long kvmppc_tce_iommu_do_unmap(struct kvm *kvm, ret = kvmppc_tce_iommu_mapped_dec(kvm, tbl, entry); if (ret != H_SUCCESS) - iommu_tce_xchg(tbl, entry, &hpa, &dir); + iommu_tce_xchg(kvm->mm, tbl, entry, &hpa, &dir); return ret; } @@ -487,7 +488,7 @@ long kvmppc_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl, if (mm_iommu_mapped_inc(mem)) return H_TOO_HARD; - ret = iommu_tce_xchg(tbl, entry, &hpa, &dir); + ret = iommu_tce_xchg(kvm->mm, tbl, entry, &hpa, &dir); if (WARN_ON_ONCE(ret)) { mm_iommu_mapped_dec(mem); return H_TOO_HARD; @@ -566,7 +567,7 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, entry, ua, dir); if (ret != H_SUCCESS) { - kvmppc_clear_tce(stit->tbl, entry); + kvmppc_clear_tce(vcpu->kvm->mm, stit->tbl, entry); goto unlock_exit; } } @@ -655,7 +656,8 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, iommu_tce_direction(tce)); if (ret != H_SUCCESS) { - kvmppc_clear_tce(stit->tbl, entry); + kvmppc_clear_tce(vcpu->kvm->mm, stit->tbl, + entry); goto unlock_exit; } } @@ -704,7 +706,7 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu, return ret; WARN_ON_ONCE(1); - kvmppc_clear_tce(stit->tbl, entry); + kvmppc_clear_tce(vcpu->kvm->mm, stit->tbl, entry); } } diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c index 25a4b7f7..dd9f1ca 100644 --- a/arch/powerpc/mm/mmu_context_iommu.c +++ b/arch/powerpc/mm/mmu_context_iommu.c @@ -47,6 +47,8 @@ struct mm_iommu_table_group_mem_t { struct page **hpages; /* vmalloc'ed */ phys_addr_t *hpas; }; +#define MM_IOMMU_TABLE_INVALID_HPA ((uint64_t)-1) + u64 dev_hpa; /* Device memory base address */ }; static long mm_iommu_adjust_locked_vm(struct mm_struct *mm, @@ -89,7 +91,8 @@ bool mm_iommu_preregistered(struct mm_struct *mm) } 
EXPORT_SYMBOL_GPL(mm_iommu_preregistered); -long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries, +static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua, + unsigned long entries, unsigned long dev_hpa, struct mm_iommu_table_group_mem_t **pmem) { struct mm_iommu_table_group_mem_t *mem; @@ -110,11 +113,13 @@ long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries, } - ret = mm_iommu_adjust_locked_vm(mm, entries, true); - if (ret) - goto unlock_exit; + if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) { + ret = mm_iommu_adjust_locked_vm(mm, entries, true); + if (ret) + goto unlock_exit; - locked_entries = entries; + locked_entries = entries; + } mem = kzalloc(sizeof(*mem), GFP_KERNEL); if (!mem) { @@ -122,6 +127,13 @@ long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries, goto unlock_exit; } + if (dev_hpa != MM_IOMMU_TABLE_INVALID_HPA) { + mem->pageshift = __ffs(dev_hpa | (entries << PAGE_SHIFT)); + mem->dev_hpa = dev_hpa; + goto good_exit; + } + mem->dev_hpa = MM_IOMMU_TABLE_INVALID_HPA; + /* * For a starting point for a maximum page size calculation * we use @ua and @entries natural alignment to allow IOMMU pages @@ -170,6 +182,7 @@ long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries, } +good_exit: atomic64_set(&mem->mapped, 1); mem->used = 1; mem->ua = ua; @@ -186,13 +199,31 @@ long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries, return ret; } + +long mm_iommu_new(struct mm_struct *mm, unsigned long ua, unsigned long entries, + struct mm_iommu_table_group_mem_t **pmem) +{ + return mm_iommu_do_alloc(mm, ua, entries, MM_IOMMU_TABLE_INVALID_HPA, + pmem); +} EXPORT_SYMBOL_GPL(mm_iommu_new); +long mm_iommu_newdev(struct mm_struct *mm, unsigned long ua, + unsigned long entries, unsigned long dev_hpa, + struct mm_iommu_table_group_mem_t **pmem) +{ + return mm_iommu_do_alloc(mm, ua, entries, dev_hpa, pmem); +} +EXPORT_SYMBOL_GPL(mm_iommu_newdev); + static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem) { long i; struct page *page = NULL; + if (!mem->hpas) + return; + for (i = 0; i < mem->entries; ++i) { if (!mem->hpas[i]) continue; @@ -234,6 +265,7 @@ static void mm_iommu_release(struct mm_iommu_table_group_mem_t *mem) long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem) { long ret = 0; + unsigned long entries, dev_hpa; mutex_lock(&mem_list_mutex); @@ -255,9 +287,12 @@ long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem) } /* @mapped became 0 so now mappings are disabled, release the region */ + entries = mem->entries; + dev_hpa = mem->dev_hpa; mm_iommu_release(mem); - mm_iommu_adjust_locked_vm(mm, mem->entries, false); + if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) + mm_iommu_adjust_locked_vm(mm, entries, false); unlock_exit: mutex_unlock(&mem_list_mutex); @@ -327,7 +362,7 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem, unsigned long ua, unsigned int pageshift, unsigned long *hpa) { const long entry = (ua - mem->ua) >> PAGE_SHIFT; - u64 *va = &mem->hpas[entry]; + u64 *va; if (entry >= mem->entries) return -EFAULT; @@ -335,6 +370,12 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem, if (pageshift > mem->pageshift) return -EFAULT; + if (!mem->hpas) { + *hpa = mem->dev_hpa + (ua - mem->ua); + return 0; + } + + va = &mem->hpas[entry]; *hpa = (*va & MM_IOMMU_TABLE_GROUP_PAGE_MASK) | (ua & ~PAGE_MASK); return 0; @@ -345,7 +386,6 @@ long mm_iommu_ua_to_hpa_rm(struct 
mm_iommu_table_group_mem_t *mem, unsigned long ua, unsigned int pageshift, unsigned long *hpa) { const long entry = (ua - mem->ua) >> PAGE_SHIFT; - void *va = &mem->hpas[entry]; unsigned long *pa; if (entry >= mem->entries) return -EFAULT; @@ -354,7 +394,12 @@ long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem, if (pageshift > mem->pageshift) return -EFAULT; - pa = (void *) vmalloc_to_phys(va); + if (!mem->hpas) { + *hpa = mem->dev_hpa + (ua - mem->ua); + return 0; + } + + pa = (void *) vmalloc_to_phys(&mem->hpas[entry]); if (!pa) return -EFAULT; @@ -374,6 +419,9 @@ extern void mm_iommu_ua_mark_dirty_rm(struct mm_struct *mm, unsigned long ua) if (!mem) return; + if (mem->dev_hpa != MM_IOMMU_TABLE_INVALID_HPA) + return; + entry = (ua - mem->ua) >> PAGE_SHIFT; va = &mem->hpas[entry]; @@ -384,6 +432,33 @@ extern void mm_iommu_ua_mark_dirty_rm(struct mm_struct *mm, unsigned long ua) *pa |= MM_IOMMU_TABLE_GROUP_PAGE_DIRTY; } +bool mm_iommu_is_devmem(struct mm_struct *mm, unsigned long hpa, + unsigned int pageshift, unsigned long *size) +{ + struct mm_iommu_table_group_mem_t *mem; + unsigned long end; + + list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) { + if (mem->dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) + continue; + + end = mem->dev_hpa + (mem->entries << PAGE_SHIFT); + if ((mem->dev_hpa <= hpa) && (hpa < end)) { + /* + * Since the IOMMU page size might be bigger than + * PAGE_SIZE, the amount of preregistered memory + * starting from @hpa might be smaller than 1<<pageshift + * and the caller needs to distinguish this situation. + */ + *size = min(1UL << pageshift, end - hpa); + return true; + } + } + + return false; +} +EXPORT_SYMBOL_GPL(mm_iommu_is_devmem); + long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem) { if (atomic64_inc_not_zero(&mem->mapped)) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 1d8b889..c424913 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -222,8 +222,16 @@ static long tce_iommu_register_pages(struct tce_container *container, return ret; } -static bool tce_page_is_contained(struct page *page, unsigned page_shift) +static bool tce_page_is_contained(struct mm_struct *mm, unsigned long hpa, + unsigned int page_shift) { + struct page *page; + unsigned long size = 0; + + if (mm_iommu_is_devmem(mm, hpa, page_shift, &size)) + return size == (1UL << page_shift); + + page = pfn_to_page(hpa >> PAGE_SHIFT); /* * Check that the TCE table granularity is not bigger than the size of * a page we just found.
Otherwise the hardware can get access to @@ -499,7 +507,8 @@ static int tce_iommu_clear(struct tce_container *container, direction = DMA_NONE; oldhpa = 0; - ret = iommu_tce_xchg(tbl, entry, &oldhpa, &direction); + ret = iommu_tce_xchg(container->mm, tbl, entry, &oldhpa, + &direction); if (ret) continue; @@ -537,7 +546,6 @@ static long tce_iommu_build(struct tce_container *container, enum dma_data_direction direction) { long i, ret = 0; - struct page *page; unsigned long hpa; enum dma_data_direction dirtmp; @@ -548,15 +556,16 @@ static long tce_iommu_build(struct tce_container *container, if (ret) break; - page = pfn_to_page(hpa >> PAGE_SHIFT); - if (!tce_page_is_contained(page, tbl->it_page_shift)) { + if (!tce_page_is_contained(container->mm, hpa, + tbl->it_page_shift)) { ret = -EPERM; break; } hpa |= offset; dirtmp = direction; - ret = iommu_tce_xchg(tbl, entry + i, &hpa, &dirtmp); + ret = iommu_tce_xchg(container->mm, tbl, entry + i, &hpa, + &dirtmp); if (ret) { tce_iommu_unuse_page(container, hpa); pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%ld\n", @@ -583,7 +592,6 @@ static long tce_iommu_build_v2(struct tce_container *container, enum dma_data_direction direction) { long i, ret = 0; - struct page *page; unsigned long hpa; enum dma_data_direction dirtmp; @@ -596,8 +604,8 @@ static long tce_iommu_build_v2(struct tce_container *container, if (ret) break; - page = pfn_to_page(hpa >> PAGE_SHIFT); - if (!tce_page_is_contained(page, tbl->it_page_shift)) { + if (!tce_page_is_contained(container->mm, hpa, + tbl->it_page_shift)) { ret = -EPERM; break; } @@ -610,7 +618,8 @@ static long tce_iommu_build_v2(struct tce_container *container, if (mm_iommu_mapped_inc(mem)) break; - ret = iommu_tce_xchg(tbl, entry + i, &hpa, &dirtmp); + ret = iommu_tce_xchg(container->mm, tbl, entry + i, &hpa, + &dirtmp); if (ret) { /* dirtmp cannot be DMA_NONE here */ tce_iommu_unuse_page_v2(container, tbl, entry + i); From patchwork Thu Dec 20 08:23:34 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 1016620 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=kvm-ppc-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=ozlabs.ru Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43L4bk3pllz9sN4 for ; Thu, 20 Dec 2018 19:26:26 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730851AbeLTIYN (ORCPT ); Thu, 20 Dec 2018 03:24:13 -0500 Received: from ozlabs.ru ([107.173.13.209]:54119 "EHLO ozlabs.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730849AbeLTIYM (ORCPT ); Thu, 20 Dec 2018 03:24:12 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id CBD25AE80215; Thu, 20 Dec 2018 03:24:08 -0500 (EST) From: Alexey Kardashevskiy To: linuxppc-dev@lists.ozlabs.org Cc: Alexey Kardashevskiy , David Gibson , kvm-ppc@vger.kernel.org, kvm@vger.kernel.org, Alistair Popple , Reza Arbab , Sam Bobroff , Piotr Jaroszynski , =?utf-8?q?Leonardo_Augusto_Guimar=C3=A3es_?= =?utf-8?q?Garcia?= , Jose Ricardo Ziviani , Daniel Henrique Barboza , Alex Williamson , Paul Mackerras , linux-kernel@vger.kernel.org, Christoph 
Hellwig Subject: [PATCH kernel v7 04/20] powerpc/powernv: Move npu struct from pnv_phb to pci_controller Date: Thu, 20 Dec 2018 19:23:34 +1100 Message-Id: <20181220082350.58113-5-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20181220082350.58113-1-aik@ozlabs.ru> References: <20181220082350.58113-1-aik@ozlabs.ru> Sender: kvm-ppc-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm-ppc@vger.kernel.org The powernv PCI code stores NPU data in the pnv_phb struct. The latter is referenced by pci_controller::private_data. We are going to have NPU2 support in the pseries platform as well but it does not store any private_data in the pci_controller struct; and even if it did, it would be a different data structure. This makes npu a pointer and stores it one level higher in the pci_controller struct. Signed-off-by: Alexey Kardashevskiy --- Changes: v5: * removed !npu checks as this is out of scope of this patch * added WARN_ON_ONCE in WARN_ON_ONCE(pnv_npu2_init(phb)) v4: * changed subj from "powerpc/powernv: Detach npu struct from pnv_phb" * got rid of global list of npus - store them now in pci_controller * got rid of npdev_to_npu() helper --- arch/powerpc/include/asm/pci-bridge.h | 1 + arch/powerpc/platforms/powernv/pci.h | 16 ----- arch/powerpc/platforms/powernv/npu-dma.c | 74 +++++++++++++++++------ arch/powerpc/platforms/powernv/pci-ioda.c | 2 +- 4 files changed, 58 insertions(+), 35 deletions(-) diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h index 94d4490..aee4fcc 100644 --- a/arch/powerpc/include/asm/pci-bridge.h +++ b/arch/powerpc/include/asm/pci-bridge.h @@ -129,6 +129,7 @@ struct pci_controller { #endif /* CONFIG_PPC64 */ void *private_data; + struct npu *npu; }; /* These are used for config access before all the PCI probing diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h index 2131373..f2d50974 100644 --- a/arch/powerpc/platforms/powernv/pci.h +++ b/arch/powerpc/platforms/powernv/pci.h @@ -8,9 +8,6 @@ struct pci_dn; -/* Maximum possible number of ATSD MMIO registers per NPU */ -#define NV_NMMU_ATSD_REGS 8 - enum pnv_phb_type { PNV_PHB_IODA1 = 0, PNV_PHB_IODA2 = 1, @@ -176,19 +173,6 @@ struct pnv_phb { unsigned int diag_data_size; u8 *diag_data; - /* Nvlink2 data */ - struct npu { - int index; - __be64 *mmio_atsd_regs[NV_NMMU_ATSD_REGS]; - unsigned int mmio_atsd_count; - - /* Bitmask for MMIO register usage */ - unsigned long mmio_atsd_usage; - - /* Do we need to explicitly flush the nest mmu? */ - bool nmmu_flush; - } npu; - int p2p_target_count; }; diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index 91d488f..5e66439 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -327,6 +327,25 @@ struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe) return gpe; } +/* + * NPU2 ATS + */ /* Maximum possible number of ATSD MMIO registers per NPU */ +#define NV_NMMU_ATSD_REGS 8 + +/* An NPU descriptor, valid for POWER9 only */ +struct npu { + int index; + __be64 *mmio_atsd_regs[NV_NMMU_ATSD_REGS]; + unsigned int mmio_atsd_count; + + /* Bitmask for MMIO register usage */ + unsigned long mmio_atsd_usage; + + /* Do we need to explicitly flush the nest mmu?
*/ + bool nmmu_flush; +}; + /* Maximum number of nvlinks per npu */ #define NV_MAX_LINKS 6 @@ -478,7 +497,6 @@ static void acquire_atsd_reg(struct npu_context *npu_context, int i, j; struct npu *npu; struct pci_dev *npdev; - struct pnv_phb *nphb; for (i = 0; i <= max_npu2_index; i++) { mmio_atsd_reg[i].reg = -1; @@ -493,8 +511,7 @@ static void acquire_atsd_reg(struct npu_context *npu_context, if (!npdev) continue; - nphb = pci_bus_to_host(npdev->bus)->private_data; - npu = &nphb->npu; + npu = pci_bus_to_host(npdev->bus)->npu; mmio_atsd_reg[i].npu = npu; mmio_atsd_reg[i].reg = get_mmio_atsd_reg(npu); while (mmio_atsd_reg[i].reg < 0) { @@ -662,6 +679,7 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev, struct pnv_phb *nphb; struct npu *npu; struct npu_context *npu_context; + struct pci_controller *hose; /* * At present we don't support GPUs connected to multiple NPUs and I'm @@ -689,8 +707,9 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev, return ERR_PTR(-EINVAL); } - nphb = pci_bus_to_host(npdev->bus)->private_data; - npu = &nphb->npu; + hose = pci_bus_to_host(npdev->bus); + nphb = hose->private_data; + npu = hose->npu; /* * Setup the NPU context table for a particular GPU. These need to be @@ -764,7 +783,7 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev, */ WRITE_ONCE(npu_context->npdev[npu->index][nvlink_index], npdev); - if (!nphb->npu.nmmu_flush) { + if (!npu->nmmu_flush) { /* * If we're not explicitly flushing ourselves we need to mark * the thread for global flushes @@ -802,6 +821,7 @@ void pnv_npu2_destroy_context(struct npu_context *npu_context, struct pci_dev *npdev = pnv_pci_get_npu_dev(gpdev, 0); struct device_node *nvlink_dn; u32 nvlink_index; + struct pci_controller *hose; if (WARN_ON(!npdev)) return; @@ -809,8 +829,9 @@ void pnv_npu2_destroy_context(struct npu_context *npu_context, if (!firmware_has_feature(FW_FEATURE_OPAL)) return; - nphb = pci_bus_to_host(npdev->bus)->private_data; - npu = &nphb->npu; + hose = pci_bus_to_host(npdev->bus); + nphb = hose->private_data; + npu = hose->npu; nvlink_dn = of_parse_phandle(npdev->dev.of_node, "ibm,nvlink", 0); if (WARN_ON(of_property_read_u32(nvlink_dn, "ibm,npu-link-index", &nvlink_index))) @@ -888,9 +909,15 @@ int pnv_npu2_init(struct pnv_phb *phb) struct pci_dev *gpdev; static int npu_index; uint64_t rc = 0; + struct pci_controller *hose = phb->hose; + struct npu *npu; + int ret; - phb->npu.nmmu_flush = - of_property_read_bool(phb->hose->dn, "ibm,nmmu-flush"); + npu = kzalloc(sizeof(*npu), GFP_KERNEL); + if (!npu) + return -ENOMEM; + + npu->nmmu_flush = of_property_read_bool(hose->dn, "ibm,nmmu-flush"); for_each_child_of_node(phb->hose->dn, dn) { gpdev = pnv_pci_get_gpu_dev(get_pci_dev(dn)); if (gpdev) { @@ -904,18 +931,29 @@ int pnv_npu2_init(struct pnv_phb *phb) } } - for (i = 0; !of_property_read_u64_index(phb->hose->dn, "ibm,mmio-atsd", + for (i = 0; !of_property_read_u64_index(hose->dn, "ibm,mmio-atsd", i, &mmio_atsd); i++) - phb->npu.mmio_atsd_regs[i] = ioremap(mmio_atsd, 32); + npu->mmio_atsd_regs[i] = ioremap(mmio_atsd, 32); - pr_info("NPU%lld: Found %d MMIO ATSD registers", phb->opal_id, i); - phb->npu.mmio_atsd_count = i; - phb->npu.mmio_atsd_usage = 0; + pr_info("NPU%d: Found %d MMIO ATSD registers", hose->global_number, i); + npu->mmio_atsd_count = i; + npu->mmio_atsd_usage = 0; npu_index++; - if (WARN_ON(npu_index >= NV_MAX_NPUS)) - return -ENOSPC; + if (WARN_ON(npu_index >= NV_MAX_NPUS)) { + ret = -ENOSPC; + goto fail_exit; + } max_npu2_index = npu_index; - 
phb->npu.index = npu_index; + npu->index = npu_index; + hose->npu = npu; return 0; + +fail_exit: + for (i = 0; i < npu->mmio_atsd_count; ++i) + iounmap(npu->mmio_atsd_regs[i]); + + kfree(npu); + + return ret; } diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 29c6837..8528fb9 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1283,7 +1283,7 @@ static void pnv_pci_ioda_setup_PEs(void) pnv_ioda_reserve_pe(phb, 0); pnv_ioda_setup_npu_PEs(hose->bus); if (phb->model == PNV_PHB_MODEL_NPU2) - pnv_npu2_init(phb); + WARN_ON_ONCE(pnv_npu2_init(phb)); } if (phb->type == PNV_PHB_NPU_OCAPI) { bus = hose->bus; From patchwork Thu Dec 20 08:23:35 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 1016619 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=kvm-ppc-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=ozlabs.ru Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43L4bh4YRZz9sN9 for ; Thu, 20 Dec 2018 19:26:24 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730874AbeLTIYS (ORCPT ); Thu, 20 Dec 2018 03:24:18 -0500 Received: from ozlabs.ru ([107.173.13.209]:54180 "EHLO ozlabs.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730870AbeLTIYQ (ORCPT ); Thu, 20 Dec 2018 03:24:16 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id 54AF9AE80219; Thu, 20 Dec 2018 03:24:12 -0500 (EST) From: Alexey Kardashevskiy To: linuxppc-dev@lists.ozlabs.org Cc: Alexey Kardashevskiy , David Gibson , kvm-ppc@vger.kernel.org, kvm@vger.kernel.org, Alistair Popple , Reza Arbab , Sam Bobroff , Piotr Jaroszynski , =?utf-8?q?Leonardo_Augusto_Guimar=C3=A3es_?= =?utf-8?q?Garcia?= , Jose Ricardo Ziviani , Daniel Henrique Barboza , Alex Williamson , Paul Mackerras , linux-kernel@vger.kernel.org, Christoph Hellwig Subject: [PATCH kernel v7 05/20] powerpc/powernv/npu: Move OPAL calls away from context manipulation Date: Thu, 20 Dec 2018 19:23:35 +1100 Message-Id: <20181220082350.58113-6-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20181220082350.58113-1-aik@ozlabs.ru> References: <20181220082350.58113-1-aik@ozlabs.ru> Sender: kvm-ppc-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm-ppc@vger.kernel.org When introduced, the NPU context init/destroy helpers called OPAL which enabled/disabled PID (a userspace memory context ID) filtering in an NPU, per GPU; this was a requirement for P9 DD1.0. However, newer chip revisions added PID wildcard support, so there is no longer a need to call OPAL every time a new context is initialized. Also, since the PID wildcard support was added, skiboot does not clear wildcard entries in the NPU, so these remain in the hardware until the system is rebooted. This moves LPID and wildcard programming to the PE setup code which executes once during boot, so NPU2 context init/destroy won't need to do additional configuration.
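In sketch form (condensed from pnv_npu2_map_lpar_dev() added later in this patch; error handling trimmed), the per-GPU programming is now just two OPAL calls made once at PE setup time:

	/* Tell the NPU which LPAR the GPU's PCI device belongs to */
	ret = opal_npu_map_lpar(nphb->opal_id,
			PCI_DEVID(gpdev->bus->number, gpdev->devfn),
			lparid, 0 /* LPCR bits; 0 as only radix is supported */);
	if (!ret)
		/* Program the MSR mask once, relying on the PID wildcard */
		ret = opal_npu_init_context(nphb->opal_id, 0 /* unused */, msr,
				PCI_DEVID(gpdev->bus->number, gpdev->devfn));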
This replaces the check for FW_FEATURE_OPAL with a check for npu!=NULL as this is the way to tell if the NPU support is present and configured. This moves pnv_npu2_init() declaration as pseries should be able to use it. This keeps pnv_npu2_map_lpar() in powernv as pseries is not allowed to call that. This exports pnv_npu2_map_lpar_dev() as following patches will use it from the VFIO driver. While at it, replace redundant list_for_each_entry_safe() with a simpler list_for_each_entry(). Signed-off-by: Alexey Kardashevskiy --- Changes: v5: * add few checks for npu!=NULL v4: * add flags check in pnv_npu2_init_context() --- arch/powerpc/include/asm/pci.h | 3 + arch/powerpc/platforms/powernv/pci.h | 2 +- arch/powerpc/platforms/powernv/npu-dma.c | 111 ++++++++++++---------- arch/powerpc/platforms/powernv/pci-ioda.c | 15 ++- 4 files changed, 77 insertions(+), 54 deletions(-) diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h index 2af9ded..baf2886 100644 --- a/arch/powerpc/include/asm/pci.h +++ b/arch/powerpc/include/asm/pci.h @@ -129,5 +129,8 @@ extern void pcibios_scan_phb(struct pci_controller *hose); extern struct pci_dev *pnv_pci_get_gpu_dev(struct pci_dev *npdev); extern struct pci_dev *pnv_pci_get_npu_dev(struct pci_dev *gpdev, int index); +extern int pnv_npu2_init(struct pci_controller *hose); +extern int pnv_npu2_map_lpar_dev(struct pci_dev *gpdev, unsigned int lparid, + unsigned long msr); #endif /* __ASM_POWERPC_PCI_H */ diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h index f2d50974..ddb4f02 100644 --- a/arch/powerpc/platforms/powernv/pci.h +++ b/arch/powerpc/platforms/powernv/pci.h @@ -190,6 +190,7 @@ extern void pnv_pci_init_ioda_hub(struct device_node *np); extern void pnv_pci_init_ioda2_phb(struct device_node *np); extern void pnv_pci_init_npu_phb(struct device_node *np); extern void pnv_pci_init_npu2_opencapi_phb(struct device_node *np); +extern void pnv_npu2_map_lpar(struct pnv_ioda_pe *gpe, unsigned long msr); extern void pnv_pci_reset_secondary_bus(struct pci_dev *dev); extern int pnv_eeh_phb_reset(struct pci_controller *hose, int option); @@ -220,7 +221,6 @@ extern long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num, extern long pnv_npu_unset_window(struct pnv_ioda_pe *npe, int num); extern void pnv_npu_take_ownership(struct pnv_ioda_pe *npe); extern void pnv_npu_release_ownership(struct pnv_ioda_pe *npe); -extern int pnv_npu2_init(struct pnv_phb *phb); /* pci-ioda-tce.c */ #define POWERNV_IOMMU_DEFAULT_LEVELS 1 diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index 5e66439..ef1457f 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -512,6 +512,9 @@ static void acquire_atsd_reg(struct npu_context *npu_context, continue; npu = pci_bus_to_host(npdev->bus)->npu; + if (!npu) + continue; + mmio_atsd_reg[i].npu = npu; mmio_atsd_reg[i].reg = get_mmio_atsd_reg(npu); while (mmio_atsd_reg[i].reg < 0) { @@ -676,7 +679,6 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev, u32 nvlink_index; struct device_node *nvlink_dn; struct mm_struct *mm = current->mm; - struct pnv_phb *nphb; struct npu *npu; struct npu_context *npu_context; struct pci_controller *hose; @@ -687,13 +689,14 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev, */ struct pci_dev *npdev = pnv_pci_get_npu_dev(gpdev, 0); - if (!firmware_has_feature(FW_FEATURE_OPAL)) - return ERR_PTR(-ENODEV); - if (!npdev) /* No nvlink 
associated with this GPU device */ return ERR_PTR(-ENODEV); + /* We only support DR/PR/HV in pnv_npu2_map_lpar_dev() */ + if (flags & ~(MSR_DR | MSR_PR | MSR_HV)) + return ERR_PTR(-EINVAL); + nvlink_dn = of_parse_phandle(npdev->dev.of_node, "ibm,nvlink", 0); if (WARN_ON(of_property_read_u32(nvlink_dn, "ibm,npu-link-index", &nvlink_index))) @@ -708,20 +711,9 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev, } hose = pci_bus_to_host(npdev->bus); - nphb = hose->private_data; npu = hose->npu; - - /* - * Setup the NPU context table for a particular GPU. These need to be - * per-GPU as we need the tables to filter ATSDs when there are no - * active contexts on a particular GPU. It is safe for these to be - * called concurrently with destroy as the OPAL call takes appropriate - * locks and refcounts on init/destroy. - */ - rc = opal_npu_init_context(nphb->opal_id, mm->context.id, flags, - PCI_DEVID(gpdev->bus->number, gpdev->devfn)); - if (rc < 0) - return ERR_PTR(-ENOSPC); + if (!npu) + return ERR_PTR(-ENODEV); /* * We store the npu pci device so we can more easily get at the @@ -733,9 +725,6 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev, if (npu_context->release_cb != cb || npu_context->priv != priv) { spin_unlock(&npu_context_lock); - opal_npu_destroy_context(nphb->opal_id, mm->context.id, - PCI_DEVID(gpdev->bus->number, - gpdev->devfn)); return ERR_PTR(-EINVAL); } @@ -761,9 +750,6 @@ struct npu_context *pnv_npu2_init_context(struct pci_dev *gpdev, if (rc) { kfree(npu_context); - opal_npu_destroy_context(nphb->opal_id, mm->context.id, - PCI_DEVID(gpdev->bus->number, - gpdev->devfn)); return ERR_PTR(rc); } @@ -816,7 +802,6 @@ void pnv_npu2_destroy_context(struct npu_context *npu_context, struct pci_dev *gpdev) { int removed; - struct pnv_phb *nphb; struct npu *npu; struct pci_dev *npdev = pnv_pci_get_npu_dev(gpdev, 0); struct device_node *nvlink_dn; @@ -826,19 +811,15 @@ void pnv_npu2_destroy_context(struct npu_context *npu_context, if (WARN_ON(!npdev)) return; - if (!firmware_has_feature(FW_FEATURE_OPAL)) - return; - hose = pci_bus_to_host(npdev->bus); - nphb = hose->private_data; npu = hose->npu; + if (!npu) + return; nvlink_dn = of_parse_phandle(npdev->dev.of_node, "ibm,nvlink", 0); if (WARN_ON(of_property_read_u32(nvlink_dn, "ibm,npu-link-index", &nvlink_index))) return; WRITE_ONCE(npu_context->npdev[npu->index][nvlink_index], NULL); - opal_npu_destroy_context(nphb->opal_id, npu_context->mm->context.id, - PCI_DEVID(gpdev->bus->number, gpdev->devfn)); spin_lock(&npu_context_lock); removed = kref_put(&npu_context->kref, pnv_npu2_release_context); spin_unlock(&npu_context_lock); @@ -870,9 +851,6 @@ int pnv_npu2_handle_fault(struct npu_context *context, uintptr_t *ea, /* mmap_sem should be held so the struct_mm must be present */ struct mm_struct *mm = context->mm; - if (!firmware_has_feature(FW_FEATURE_OPAL)) - return -ENODEV; - WARN_ON(!rwsem_is_locked(&mm->mmap_sem)); for (i = 0; i < count; i++) { @@ -901,15 +879,11 @@ int pnv_npu2_handle_fault(struct npu_context *context, uintptr_t *ea, } EXPORT_SYMBOL(pnv_npu2_handle_fault); -int pnv_npu2_init(struct pnv_phb *phb) +int pnv_npu2_init(struct pci_controller *hose) { unsigned int i; u64 mmio_atsd; - struct device_node *dn; - struct pci_dev *gpdev; static int npu_index; - uint64_t rc = 0; - struct pci_controller *hose = phb->hose; struct npu *npu; int ret; @@ -918,18 +892,6 @@ int pnv_npu2_init(struct pnv_phb *phb) return -ENOMEM; npu->nmmu_flush = of_property_read_bool(hose->dn, "ibm,nmmu-flush"); - 
for_each_child_of_node(phb->hose->dn, dn) { - gpdev = pnv_pci_get_gpu_dev(get_pci_dev(dn)); - if (gpdev) { - rc = opal_npu_map_lpar(phb->opal_id, - PCI_DEVID(gpdev->bus->number, gpdev->devfn), - 0, 0); - if (rc) - dev_err(&gpdev->dev, - "Error %lld mapping device to LPAR\n", - rc); - } - } for (i = 0; !of_property_read_u64_index(hose->dn, "ibm,mmio-atsd", i, &mmio_atsd); i++) @@ -957,3 +919,52 @@ int pnv_npu2_init(struct pnv_phb *phb) return ret; } + +int pnv_npu2_map_lpar_dev(struct pci_dev *gpdev, unsigned int lparid, + unsigned long msr) +{ + int ret; + struct pci_dev *npdev = pnv_pci_get_npu_dev(gpdev, 0); + struct pci_controller *hose; + struct pnv_phb *nphb; + + if (!npdev) + return -ENODEV; + + hose = pci_bus_to_host(npdev->bus); + nphb = hose->private_data; + + dev_dbg(&gpdev->dev, "Map LPAR opalid=%llu lparid=%u\n", + nphb->opal_id, lparid); + /* + * Currently we only support radix and non-zero LPCR only makes sense + * for hash tables so skiboot expects the LPCR parameter to be a zero. + */ + ret = opal_npu_map_lpar(nphb->opal_id, + PCI_DEVID(gpdev->bus->number, gpdev->devfn), lparid, + 0 /* LPCR bits */); + if (ret) { + dev_err(&gpdev->dev, "Error %d mapping device to LPAR\n", ret); + return ret; + } + + dev_dbg(&gpdev->dev, "init context opalid=%llu msr=%lx\n", + nphb->opal_id, msr); + ret = opal_npu_init_context(nphb->opal_id, 0/*__unused*/, msr, + PCI_DEVID(gpdev->bus->number, gpdev->devfn)); + if (ret < 0) + dev_err(&gpdev->dev, "Failed to init context: %d\n", ret); + else + ret = 0; + + return 0; +} +EXPORT_SYMBOL_GPL(pnv_npu2_map_lpar_dev); + +void pnv_npu2_map_lpar(struct pnv_ioda_pe *gpe, unsigned long msr) +{ + struct pci_dev *gpdev; + + list_for_each_entry(gpdev, &gpe->pbus->devices, bus_list) + pnv_npu2_map_lpar_dev(gpdev, 0, msr); +} diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 8528fb9..ec4ce35 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1271,19 +1271,20 @@ static void pnv_ioda_setup_npu_PEs(struct pci_bus *bus) static void pnv_pci_ioda_setup_PEs(void) { - struct pci_controller *hose, *tmp; + struct pci_controller *hose; struct pnv_phb *phb; struct pci_bus *bus; struct pci_dev *pdev; + struct pnv_ioda_pe *pe; - list_for_each_entry_safe(hose, tmp, &hose_list, list_node) { + list_for_each_entry(hose, &hose_list, list_node) { phb = hose->private_data; if (phb->type == PNV_PHB_NPU_NVLINK) { /* PE#0 is needed for error reporting */ pnv_ioda_reserve_pe(phb, 0); pnv_ioda_setup_npu_PEs(hose->bus); if (phb->model == PNV_PHB_MODEL_NPU2) - WARN_ON_ONCE(pnv_npu2_init(phb)); + WARN_ON_ONCE(pnv_npu2_init(hose)); } if (phb->type == PNV_PHB_NPU_OCAPI) { bus = hose->bus; @@ -1291,6 +1292,14 @@ static void pnv_pci_ioda_setup_PEs(void) pnv_ioda_setup_dev_PE(pdev); } } + list_for_each_entry(hose, &hose_list, list_node) { + phb = hose->private_data; + if (phb->type != PNV_PHB_IODA2) + continue; + + list_for_each_entry(pe, &phb->ioda.pe_list, list) + pnv_npu2_map_lpar(pe, MSR_DR | MSR_PR | MSR_HV); + } } #ifdef CONFIG_PCI_IOV From patchwork Thu Dec 20 08:23:36 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 1016618 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; 
envelope-from=kvm-ppc-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=ozlabs.ru Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43L4bX1m2dz9sNq for ; Thu, 20 Dec 2018 19:26:16 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1725766AbeLTI0I (ORCPT ); Thu, 20 Dec 2018 03:26:08 -0500 Received: from ozlabs.ru ([107.173.13.209]:54238 "EHLO ozlabs.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730882AbeLTIYT (ORCPT ); Thu, 20 Dec 2018 03:24:19 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id D3A1CAE802F9; Thu, 20 Dec 2018 03:24:15 -0500 (EST) From: Alexey Kardashevskiy To: linuxppc-dev@lists.ozlabs.org Cc: Alexey Kardashevskiy , David Gibson , kvm-ppc@vger.kernel.org, kvm@vger.kernel.org, Alistair Popple , Reza Arbab , Sam Bobroff , Piotr Jaroszynski , =?utf-8?q?Leonardo_Augusto_Guimar=C3=A3es_?= =?utf-8?q?Garcia?= , Jose Ricardo Ziviani , Daniel Henrique Barboza , Alex Williamson , Paul Mackerras , linux-kernel@vger.kernel.org, Christoph Hellwig Subject: [PATCH kernel v7 06/20] powerpc/pseries/iommu: Use memory@ nodes in max RAM address calculation Date: Thu, 20 Dec 2018 19:23:36 +1100 Message-Id: <20181220082350.58113-7-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20181220082350.58113-1-aik@ozlabs.ru> References: <20181220082350.58113-1-aik@ozlabs.ru> Sender: kvm-ppc-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm-ppc@vger.kernel.org We might have memory@ nodes with "linux,usable-memory" set to zero (for example, to replicate powernv's behaviour for GPU coherent memory). Such memory needs extra initialization before it can be used, but since it is usable afterwards, the pseries platform will try mapping it for DMA, so the DMA window needs to cover those memory regions too; if the window cannot cover new memory regions, memory onlining fails. This walks through the memory@ nodes to find the highest RAM address so that a huge DMA window can cover it too, in case this memory gets onlined later.
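For illustration, a node like the following hypothetical device tree fragment (not from real firmware; assumes two address cells and two size cells) would now be taken into account:

	memory@20000000000 {
		device_type = "memory";
		/* 2TB base address, 2GB size */
		reg = <0x200 0x00000000 0x0 0x80000000>;
		/* zero usable size: the memory is not onlined yet */
		linux,usable-memory = <0x200 0x00000000 0x0 0x0>;
	};

Even though nothing in the region is usable yet, ddw_memory_hotplug_max() raises the returned maximum to at least 0x20080000000 (the end of the region), so the DMA window created now can cover this memory if it is onlined later.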
Signed-off-by: Alexey Kardashevskiy --- Changes: v4: * uses of_read_number directly instead of cut-n-pasted read_n_cells --- arch/powerpc/platforms/pseries/iommu.c | 33 +++++++++++++++++++++++++- 1 file changed, 32 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index 06f0296..cbcc8ce 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -964,6 +964,37 @@ struct failed_ddw_pdn { static LIST_HEAD(failed_ddw_pdn_list); +static phys_addr_t ddw_memory_hotplug_max(void) +{ + phys_addr_t max_addr = memory_hotplug_max(); + struct device_node *memory; + + for_each_node_by_type(memory, "memory") { + unsigned long start, size; + int ranges, n_mem_addr_cells, n_mem_size_cells, len; + const __be32 *memcell_buf; + + memcell_buf = of_get_property(memory, "reg", &len); + if (!memcell_buf || len <= 0) + continue; + + n_mem_addr_cells = of_n_addr_cells(memory); + n_mem_size_cells = of_n_size_cells(memory); + + /* ranges in cell */ + ranges = (len >> 2) / (n_mem_addr_cells + n_mem_size_cells); + + start = of_read_number(memcell_buf, n_mem_addr_cells); + memcell_buf += n_mem_addr_cells; + size = of_read_number(memcell_buf, n_mem_size_cells); + memcell_buf += n_mem_size_cells; + + max_addr = max_t(phys_addr_t, max_addr, start + size); + } + + return max_addr; +} + /* * If the PE supports dynamic dma windows, and there is space for a table * that can map all pages in a linear offset, then setup such a table, @@ -1053,7 +1084,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn) } /* verify the window * number of ptes will map the partition */ /* check largest block * page size > max memory hotplug addr */ - max_addr = memory_hotplug_max(); + max_addr = ddw_memory_hotplug_max(); if (query.largest_available_block < (max_addr >> page_shift)) { dev_dbg(&dev->dev, "can't map partition max 0x%llx with %u " "%llu-sized pages\n", max_addr, query.largest_available_block, From patchwork Thu Dec 20 08:23:37 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 1016617 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=kvm-ppc-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=ozlabs.ru Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43L4bM5Ck0z9sNj for ; Thu, 20 Dec 2018 19:26:07 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730861AbeLTIYX (ORCPT ); Thu, 20 Dec 2018 03:24:23 -0500 Received: from ozlabs.ru ([107.173.13.209]:54307 "EHLO ozlabs.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730870AbeLTIYX (ORCPT ); Thu, 20 Dec 2018 03:24:23 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id 5CDBBAE802FA; Thu, 20 Dec 2018 03:24:19 -0500 (EST) From: Alexey Kardashevskiy To: linuxppc-dev@lists.ozlabs.org Cc: Alexey Kardashevskiy , David Gibson , kvm-ppc@vger.kernel.org, kvm@vger.kernel.org, Alistair Popple , Reza Arbab , Sam Bobroff , Piotr Jaroszynski , =?utf-8?q?Leonardo_Augusto_Guimar=C3=A3es_?= =?utf-8?q?Garcia?= , Jose Ricardo Ziviani , Daniel Henrique 
Barboza , Alex Williamson , Paul Mackerras , linux-kernel@vger.kernel.org, Christoph Hellwig Subject: [PATCH kernel v7 07/20] powerpc/pseries/npu: Enable platform support Date: Thu, 20 Dec 2018 19:23:37 +1100 Message-Id: <20181220082350.58113-8-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20181220082350.58113-1-aik@ozlabs.ru> References: <20181220082350.58113-1-aik@ozlabs.ru> Sender: kvm-ppc-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm-ppc@vger.kernel.org We already changed the NPU API for GPUs not to call OPAL and the remaining bit is initializing NPU structures. This searches for POWER9 NVLinks attached to any device on a PHB and initializes an NPU structure if any are found. Signed-off-by: Alexey Kardashevskiy --- Changes: v5: * added WARN_ON_ONCE v4: * dropped "IBM,npu-vphb" compatible type on PHB and use the type of NVLink --- arch/powerpc/platforms/pseries/pci.c | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/arch/powerpc/platforms/pseries/pci.c b/arch/powerpc/platforms/pseries/pci.c index 41d8a4d..7725825 100644 --- a/arch/powerpc/platforms/pseries/pci.c +++ b/arch/powerpc/platforms/pseries/pci.c @@ -29,6 +29,7 @@ #include #include #include +#include #include "pseries.h" #if 0 @@ -237,6 +238,8 @@ static void __init pSeries_request_regions(void) void __init pSeries_final_fixup(void) { + struct pci_controller *hose; + pSeries_request_regions(); eeh_probe_devices(); @@ -246,6 +249,25 @@ void __init pSeries_final_fixup(void) ppc_md.pcibios_sriov_enable = pseries_pcibios_sriov_enable; ppc_md.pcibios_sriov_disable = pseries_pcibios_sriov_disable; #endif + list_for_each_entry(hose, &hose_list, list_node) { + struct device_node *dn = hose->dn, *nvdn; + + while (1) { + dn = of_find_all_nodes(dn); + if (!dn) + break; + nvdn = of_parse_phandle(dn, "ibm,nvlink", 0); + if (!nvdn) + continue; + if (!of_device_is_compatible(nvdn, "ibm,npu-link")) + continue; + if (!of_device_is_compatible(nvdn->parent, + "ibm,power9-npu")) + continue; + WARN_ON_ONCE(pnv_npu2_init(hose)); + break; + } + } } /* From patchwork Thu Dec 20 08:23:38 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 1016616 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=kvm-ppc-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=ozlabs.ru Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43L4bK1j2hz9sNj for ; Thu, 20 Dec 2018 19:26:05 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730920AbeLTIY1 (ORCPT ); Thu, 20 Dec 2018 03:24:27 -0500 Received: from ozlabs.ru ([107.173.13.209]:54358 "EHLO ozlabs.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730870AbeLTIY1 (ORCPT ); Thu, 20 Dec 2018 03:24:27 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id DAA7EAE802FB; Thu, 20 Dec 2018 03:24:22 -0500 (EST) From: Alexey Kardashevskiy To: linuxppc-dev@lists.ozlabs.org Cc: Alexey Kardashevskiy , David Gibson , kvm-ppc@vger.kernel.org, kvm@vger.kernel.org, Alistair Popple , Reza Arbab , Sam Bobroff , Piotr Jaroszynski ,
=?utf-8?q?Leonardo_Augusto_Guimar=C3=A3es_?= =?utf-8?q?Garcia?= , Jose Ricardo Ziviani , Daniel Henrique Barboza , Alex Williamson , Paul Mackerras , linux-kernel@vger.kernel.org, Christoph Hellwig Subject: [PATCH kernel v7 08/20] powerpc/pseries: Remove IOMMU API support for non-LPAR systems Date: Thu, 20 Dec 2018 19:23:38 +1100 Message-Id: <20181220082350.58113-9-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20181220082350.58113-1-aik@ozlabs.ru> References: <20181220082350.58113-1-aik@ozlabs.ru> Sender: kvm-ppc-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm-ppc@vger.kernel.org The pci_dma_bus_setup_pSeries and pci_dma_dev_setup_pSeries hooks are registered for the pseries platform which does not have FW_FEATURE_LPAR; these would be pre-powernv platforms which we never supported PCI pass through for anyway so remove it. Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson --- arch/powerpc/platforms/pseries/iommu.c | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index cbcc8ce..2783cb7 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -645,7 +645,6 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus) iommu_table_setparms(pci->phb, dn, tbl); tbl->it_ops = &iommu_table_pseries_ops; iommu_init_table(tbl, pci->phb->node); - iommu_register_group(pci->table_group, pci_domain_nr(bus), 0); /* Divide the rest (1.75GB) among the children */ pci->phb->dma_window_size = 0x80000000ul; @@ -756,10 +755,7 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev) iommu_table_setparms(phb, dn, tbl); tbl->it_ops = &iommu_table_pseries_ops; iommu_init_table(tbl, phb->node); - iommu_register_group(PCI_DN(dn)->table_group, - pci_domain_nr(phb->bus), 0); set_iommu_table_base(&dev->dev, tbl); - iommu_add_device(&dev->dev); return; } @@ -770,11 +766,10 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev) while (dn && PCI_DN(dn) && PCI_DN(dn)->table_group == NULL) dn = dn->parent; - if (dn && PCI_DN(dn)) { + if (dn && PCI_DN(dn)) set_iommu_table_base(&dev->dev, PCI_DN(dn)->table_group->tables[0]); - iommu_add_device(&dev->dev); - } else + else printk(KERN_WARNING "iommu: Device %s has no iommu table\n", pci_name(dev)); } From patchwork Thu Dec 20 08:23:39 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 1016604 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=kvm-ppc-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=ozlabs.ru Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43L4YZ1Lrdz9sNk for ; Thu, 20 Dec 2018 19:24:34 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730925AbeLTIYd (ORCPT ); Thu, 20 Dec 2018 03:24:33 -0500 Received: from ozlabs.ru ([107.173.13.209]:54431 "EHLO ozlabs.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730870AbeLTIYb (ORCPT ); Thu, 20 Dec 2018 03:24:31 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id 64CA8AE8030E; Thu, 20 Dec 2018 
03:24:26 -0500 (EST) From: Alexey Kardashevskiy To: linuxppc-dev@lists.ozlabs.org Cc: Alexey Kardashevskiy , David Gibson , kvm-ppc@vger.kernel.org, kvm@vger.kernel.org, Alistair Popple , Reza Arbab , Sam Bobroff , Piotr Jaroszynski , Leonardo Augusto Guimarães Garcia , Jose Ricardo Ziviani , Daniel Henrique Barboza , Alex Williamson , Paul Mackerras , linux-kernel@vger.kernel.org, Christoph Hellwig Subject: [PATCH kernel v7 09/20] powerpc/powernv/pseries: Rework device adding to IOMMU groups Date: Thu, 20 Dec 2018 19:23:39 +1100 Message-Id: <20181220082350.58113-10-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20181220082350.58113-1-aik@ozlabs.ru> References: <20181220082350.58113-1-aik@ozlabs.ru> Sender: kvm-ppc-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm-ppc@vger.kernel.org The powernv platform registers IOMMU groups and adds devices to them from the pci_controller_ops::setup_bridge() hook, except one case when virtual functions (SRIOV VFs) are added from a bus notifier. The pseries platform registers IOMMU groups from the pci_controller_ops::dma_bus_setup() hook and adds devices from the pci_controller_ops::dma_dev_setup() hook. The very same bus notifier used for powernv does not add devices for pseries though, as __of_scan_bus() adds devices first and only then does the bus/dev DMA setup. Both platforms use iommu_add_device() which takes a device and expects it to have a valid IOMMU table struct with an iommu_table_group pointer which in turn points to the iommu_group struct (which represents an IOMMU group). Although the helper seems easy to use, it relies on some pre-existing device configuration and associated data structures which it does not really need. This simplifies iommu_add_device() to take the table_group pointer directly. Pseries already has a table_group pointer handy and the bus notifier is not used there anyway. For powernv, this copies the existing bus notifier and makes it work for powernv only, which gives an easy way of getting to the table_group pointer. This was tested on VFs but should also support physical PCI hotplug. Since iommu_add_device() receives the table_group pointer directly, and since pseries neither does TCE cache invalidation (the hypervisor does) nor allows multiple groups per VFIO container (in other words, sharing an IOMMU table between partitionable endpoints), this removes iommu_table_group_link from pseries.
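Distilled, the calling convention changes as follows; this is a sketch of the before/after shape, not a verbatim quote of the diff below:

	/* before: the group is dug out of the device's DMA setup */
	set_iommu_table_base(&dev->dev, tbl);	/* prerequisite */
	iommu_add_device(&dev->dev);

	/* after: the caller passes the table_group it already holds */
	iommu_add_device(&pe->table_group, &dev->dev);

The helper no longer walks tbl->it_group_list to find the group and no longer requires the device to be fully configured for DMA first.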
Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson --- arch/powerpc/include/asm/iommu.h | 12 ++--- arch/powerpc/kernel/iommu.c | 58 ++--------------------- arch/powerpc/platforms/powernv/pci-ioda.c | 10 +--- arch/powerpc/platforms/powernv/pci.c | 43 ++++++++++++++++- arch/powerpc/platforms/pseries/iommu.c | 46 +++++++++--------- 5 files changed, 74 insertions(+), 95 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index a8aeac0..e847ff6 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -215,9 +215,9 @@ struct iommu_table_group { extern void iommu_register_group(struct iommu_table_group *table_group, int pci_domain_number, unsigned long pe_num); -extern int iommu_add_device(struct device *dev); +extern int iommu_add_device(struct iommu_table_group *table_group, + struct device *dev); extern void iommu_del_device(struct device *dev); -extern int __init tce_iommu_bus_notifier_init(void); extern long iommu_tce_xchg(struct mm_struct *mm, struct iommu_table *tbl, unsigned long entry, unsigned long *hpa, enum dma_data_direction *direction); @@ -228,7 +228,8 @@ static inline void iommu_register_group(struct iommu_table_group *table_group, { } -static inline int iommu_add_device(struct device *dev) +static inline int iommu_add_device(struct iommu_table_group *table_group, + struct device *dev) { return 0; } @@ -236,11 +237,6 @@ static inline int iommu_add_device(struct device *dev) static inline void iommu_del_device(struct device *dev) { } - -static inline int __init tce_iommu_bus_notifier_init(void) -{ - return 0; -} #endif /* !CONFIG_IOMMU_API */ int dma_iommu_mapping_error(struct device *dev, dma_addr_t dma_addr); diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index cbcc615..9d5d109 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1078,11 +1078,8 @@ void iommu_release_ownership(struct iommu_table *tbl) } EXPORT_SYMBOL_GPL(iommu_release_ownership); -int iommu_add_device(struct device *dev) +int iommu_add_device(struct iommu_table_group *table_group, struct device *dev) { - struct iommu_table *tbl; - struct iommu_table_group_link *tgl; - /* * The sysfs entries should be populated before * binding IOMMU group. 
If sysfs entries isn't @@ -1098,32 +1095,10 @@ int iommu_add_device(struct device *dev) return -EBUSY; } - tbl = get_iommu_table_base(dev); - if (!tbl) { - pr_debug("%s: Skipping device %s with no tbl\n", - __func__, dev_name(dev)); - return 0; - } - - tgl = list_first_entry_or_null(&tbl->it_group_list, - struct iommu_table_group_link, next); - if (!tgl) { - pr_debug("%s: Skipping device %s with no group\n", - __func__, dev_name(dev)); - return 0; - } pr_debug("%s: Adding %s to iommu group %d\n", - __func__, dev_name(dev), - iommu_group_id(tgl->table_group->group)); + __func__, dev_name(dev), iommu_group_id(table_group->group)); - if (PAGE_SIZE < IOMMU_PAGE_SIZE(tbl)) { - pr_err("%s: Invalid IOMMU page size %lx (%lx) on %s\n", - __func__, IOMMU_PAGE_SIZE(tbl), - PAGE_SIZE, dev_name(dev)); - return -EINVAL; - } - - return iommu_group_add_device(tgl->table_group->group, dev); + return iommu_group_add_device(table_group->group, dev); } EXPORT_SYMBOL_GPL(iommu_add_device); @@ -1143,31 +1118,4 @@ void iommu_del_device(struct device *dev) iommu_group_remove_device(dev); } EXPORT_SYMBOL_GPL(iommu_del_device); - -static int tce_iommu_bus_notifier(struct notifier_block *nb, - unsigned long action, void *data) -{ - struct device *dev = data; - - switch (action) { - case BUS_NOTIFY_ADD_DEVICE: - return iommu_add_device(dev); - case BUS_NOTIFY_DEL_DEVICE: - if (dev->iommu_group) - iommu_del_device(dev); - return 0; - default: - return 0; - } -} - -static struct notifier_block tce_iommu_bus_nb = { - .notifier_call = tce_iommu_bus_notifier, -}; - -int __init tce_iommu_bus_notifier_init(void) -{ - bus_register_notifier(&pci_bus_type, &tce_iommu_bus_nb); - return 0; -} #endif /* CONFIG_IOMMU_API */ diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index ec4ce35..b86a6e0 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1940,7 +1940,7 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, set_iommu_table_base(&dev->dev, pe->table_group.tables[0]); set_dma_offset(&dev->dev, pe->tce_bypass_base); if (add_to_group) - iommu_add_device(&dev->dev); + iommu_add_device(&pe->table_group, &dev->dev); if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate) pnv_ioda_setup_bus_dma(pe, dev->subordinate, @@ -2526,14 +2526,6 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe) if (!pnv_iommu_bypass_disabled) pnv_pci_ioda2_set_bypass(pe, true); - /* - * Setting table base here only for carrying iommu_group - * further down to let iommu_add_device() do the job. - * pnv_pci_ioda_dma_dev_setup will override it later anyway. 
- */ - if (pe->flags & PNV_IODA_PE_DEV) - set_iommu_table_base(&pe->pdev->dev, tbl); - return 0; } diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index db230a35..5121fb8 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -1127,4 +1127,45 @@ void __init pnv_pci_init(void) set_pci_dma_ops(&dma_iommu_ops); } -machine_subsys_initcall_sync(powernv, tce_iommu_bus_notifier_init); +static int pnv_tce_iommu_bus_notifier(struct notifier_block *nb, + unsigned long action, void *data) +{ + struct device *dev = data; + struct pci_dev *pdev; + struct pci_dn *pdn; + struct pnv_ioda_pe *pe; + struct pci_controller *hose; + struct pnv_phb *phb; + + switch (action) { + case BUS_NOTIFY_ADD_DEVICE: + pdev = to_pci_dev(dev); + pdn = pci_get_pdn(pdev); + hose = pci_bus_to_host(pdev->bus); + phb = hose->private_data; + + WARN_ON_ONCE(!phb); + if (!pdn || pdn->pe_number == IODA_INVALID_PE || !phb) + return 0; + + pe = &phb->ioda.pe_array[pdn->pe_number]; + iommu_add_device(&pe->table_group, dev); + return 0; + case BUS_NOTIFY_DEL_DEVICE: + iommu_del_device(dev); + return 0; + default: + return 0; + } +} + +static struct notifier_block pnv_tce_iommu_bus_nb = { + .notifier_call = pnv_tce_iommu_bus_notifier, +}; + +static int __init pnv_tce_iommu_bus_notifier_init(void) +{ + bus_register_notifier(&pci_bus_type, &pnv_tce_iommu_bus_nb); + return 0; +} +machine_subsys_initcall_sync(powernv, pnv_tce_iommu_bus_notifier_init); diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index 2783cb7..8fc8fe0 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -57,7 +57,6 @@ static struct iommu_table_group *iommu_pseries_alloc_group(int node) { struct iommu_table_group *table_group; struct iommu_table *tbl; - struct iommu_table_group_link *tgl; table_group = kzalloc_node(sizeof(struct iommu_table_group), GFP_KERNEL, node); @@ -68,22 +67,13 @@ static struct iommu_table_group *iommu_pseries_alloc_group(int node) if (!tbl) goto free_group; - tgl = kzalloc_node(sizeof(struct iommu_table_group_link), GFP_KERNEL, - node); - if (!tgl) - goto free_table; - INIT_LIST_HEAD_RCU(&tbl->it_group_list); kref_init(&tbl->it_kref); - tgl->table_group = table_group; - list_add_rcu(&tgl->next, &tbl->it_group_list); table_group->tables[0] = tbl; return table_group; -free_table: - kfree(tbl); free_group: kfree(table_group); return NULL; @@ -93,23 +83,12 @@ static void iommu_pseries_free_group(struct iommu_table_group *table_group, const char *node_name) { struct iommu_table *tbl; -#ifdef CONFIG_IOMMU_API - struct iommu_table_group_link *tgl; -#endif if (!table_group) return; tbl = table_group->tables[0]; #ifdef CONFIG_IOMMU_API - tgl = list_first_entry_or_null(&tbl->it_group_list, - struct iommu_table_group_link, next); - - WARN_ON_ONCE(!tgl); - if (tgl) { - list_del_rcu(&tgl->next); - kfree(tgl); - } if (table_group->group) { iommu_group_put(table_group->group); BUG_ON(table_group->group); @@ -1216,7 +1195,7 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev) } set_iommu_table_base(&dev->dev, pci->table_group->tables[0]); - iommu_add_device(&dev->dev); + iommu_add_device(pci->table_group, &dev->dev); } static int dma_set_mask_pSeriesLP(struct device *dev, u64 dma_mask) @@ -1421,4 +1400,27 @@ static int __init disable_multitce(char *str) __setup("multitce=", disable_multitce); +static int tce_iommu_bus_notifier(struct notifier_block *nb, + unsigned long action, void 
*data) +{ + struct device *dev = data; + + switch (action) { + case BUS_NOTIFY_DEL_DEVICE: + iommu_del_device(dev); + return 0; + default: + return 0; + } +} + +static struct notifier_block tce_iommu_bus_nb = { + .notifier_call = tce_iommu_bus_notifier, +}; + +static int __init tce_iommu_bus_notifier_init(void) +{ + bus_register_notifier(&pci_bus_type, &tce_iommu_bus_nb); + return 0; +} machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init); From patchwork Thu Dec 20 08:23:40 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 1016615 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=kvm-ppc-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=ozlabs.ru Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43L4bB3YS4z9sLw for ; Thu, 20 Dec 2018 19:25:58 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730954AbeLTIYf (ORCPT ); Thu, 20 Dec 2018 03:24:35 -0500 Received: from ozlabs.ru ([107.173.13.209]:54479 "EHLO ozlabs.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730940AbeLTIYd (ORCPT ); Thu, 20 Dec 2018 03:24:33 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id 1956FAE8032F; Thu, 20 Dec 2018 03:24:29 -0500 (EST) From: Alexey Kardashevskiy To: linuxppc-dev@lists.ozlabs.org Cc: Alexey Kardashevskiy , David Gibson , kvm-ppc@vger.kernel.org, kvm@vger.kernel.org, Alistair Popple , Reza Arbab , Sam Bobroff , Piotr Jaroszynski , Leonardo Augusto Guimarães Garcia , Jose Ricardo Ziviani , Daniel Henrique Barboza , Alex Williamson , Paul Mackerras , linux-kernel@vger.kernel.org, Christoph Hellwig Subject: [PATCH kernel v7 10/20] powerpc/iommu_api: Move IOMMU groups setup to a single place Date: Thu, 20 Dec 2018 19:23:40 +1100 Message-Id: <20181220082350.58113-11-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20181220082350.58113-1-aik@ozlabs.ru> References: <20181220082350.58113-1-aik@ozlabs.ru> Sender: kvm-ppc-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm-ppc@vger.kernel.org Registering new IOMMU groups and adding devices to them are separated in code, and the latter is buried in the DMA setup code, where it does not really belong. This moves IOMMU groups setup to a separate helper which registers a group and adds devices as before. This does not make a difference as IOMMU groups are not used anyway; the only dependency here is that iommu_add_device() requires a valid pointer to an iommu_table (set by set_iommu_table_base()). To keep the old behaviour, this does not add new IOMMU groups for PEs with no DMA weight and also skips NVLink bridges which do not have pci_controller_ops::setup_bridge (the normal way of adding PEs).
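The resulting helper, reflowed from the diff below for readability, shows both constraints in one place:

	static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe)
	{
		/* keep the old behaviour: no group for PEs with no DMA weight */
		if (!pnv_pci_ioda_pe_dma_weight(pe))
			return;

		iommu_register_group(&pe->table_group,
				pe->phb->hose->global_number, pe->pe_number);

		/* set_iommu_table_base() should have been called by now */
		if (pe->flags & PNV_IODA_PE_DEV)
			iommu_add_device(&pe->table_group, &pe->pdev->dev);
		else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))
			pnv_ioda_setup_bus_iommu_group_add_devices(pe, pe->pbus);
	}

while the DMA setup path keeps only set_iommu_table_base() and set_dma_offset().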
Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson --- Changes: v5: * fixed compile with defined but not used pnv_ioda_setup_bus_iommu_group(); unfortunately defining a dummy version looks uglier than #ifdef --- arch/powerpc/platforms/powernv/pci-ioda.c | 82 +++++++++++++++++++---- 1 file changed, 68 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index b86a6e0..f6ab13d 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1538,6 +1538,9 @@ void pnv_pci_sriov_disable(struct pci_dev *pdev) static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe); +#ifdef CONFIG_IOMMU_API +static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe); +#endif static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs) { struct pci_bus *bus; @@ -1591,6 +1594,9 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs) mutex_unlock(&phb->ioda.pe_list_mutex); pnv_pci_ioda2_setup_dma_pe(phb, pe); +#ifdef CONFIG_IOMMU_API + pnv_ioda_setup_bus_iommu_group(pe); +#endif } } @@ -1930,21 +1936,16 @@ static u64 pnv_pci_ioda_dma_get_required_mask(struct pci_dev *pdev) return mask; } -static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, - struct pci_bus *bus, - bool add_to_group) +static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus) { struct pci_dev *dev; list_for_each_entry(dev, &bus->devices, bus_list) { set_iommu_table_base(&dev->dev, pe->table_group.tables[0]); set_dma_offset(&dev->dev, pe->tce_bypass_base); - if (add_to_group) - iommu_add_device(&pe->table_group, &dev->dev); if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate) - pnv_ioda_setup_bus_dma(pe, dev->subordinate, - add_to_group); + pnv_ioda_setup_bus_dma(pe, dev->subordinate); } } @@ -2374,7 +2375,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb, iommu_init_table(tbl, phb->hose->node); if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) - pnv_ioda_setup_bus_dma(pe, pe->pbus, true); + pnv_ioda_setup_bus_dma(pe, pe->pbus); return; fail: @@ -2607,7 +2608,7 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group) pnv_pci_ioda2_set_bypass(pe, false); pnv_pci_ioda2_unset_window(&pe->table_group, 0); if (pe->pbus) - pnv_ioda_setup_bus_dma(pe, pe->pbus, false); + pnv_ioda_setup_bus_dma(pe, pe->pbus); iommu_tce_table_put(tbl); } @@ -2618,7 +2619,7 @@ static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group) pnv_pci_ioda2_setup_default_config(pe); if (pe->pbus) - pnv_ioda_setup_bus_dma(pe, pe->pbus, false); + pnv_ioda_setup_bus_dma(pe, pe->pbus); } static struct iommu_table_group_ops pnv_pci_ioda2_ops = { @@ -2735,12 +2736,68 @@ static struct iommu_table_group_ops pnv_pci_ioda2_npu_ops = { .release_ownership = pnv_ioda2_release_ownership, }; +static void pnv_ioda_setup_bus_iommu_group_add_devices(struct pnv_ioda_pe *pe, + struct pci_bus *bus) +{ + struct pci_dev *dev; + + list_for_each_entry(dev, &bus->devices, bus_list) { + iommu_add_device(&pe->table_group, &dev->dev); + + if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate) + pnv_ioda_setup_bus_iommu_group_add_devices(pe, + dev->subordinate); + } +} + +static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe) +{ + if (!pnv_pci_ioda_pe_dma_weight(pe)) + return; + + iommu_register_group(&pe->table_group, pe->phb->hose->global_number, + pe->pe_number); + + /* + * set_iommu_table_base(&pe->pdev->dev, tbl) 
should have been called + * by now + */ + if (pe->flags & PNV_IODA_PE_DEV) + iommu_add_device(&pe->table_group, &pe->pdev->dev); + else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) + pnv_ioda_setup_bus_iommu_group_add_devices(pe, pe->pbus); +} + static void pnv_pci_ioda_setup_iommu_api(void) { struct pci_controller *hose, *tmp; struct pnv_phb *phb; struct pnv_ioda_pe *pe, *gpe; + /* + * There are 4 types of PEs: + * - PNV_IODA_PE_BUS: a downstream port with an adapter, + * created from pnv_pci_setup_bridge(); + * - PNV_IODA_PE_BUS_ALL: a PCI-PCIX bridge with devices behind it, + * created from pnv_pci_setup_bridge(); + * - PNV_IODA_PE_VF: a SRIOV virtual function, + * created from pnv_pcibios_sriov_enable(); + * - PNV_IODA_PE_DEV: an NPU or OCAPI device, + * created from pnv_pci_ioda_fixup(). + * + * Normally a PE is represented by an IOMMU group, however for + * devices with side channels the groups need to be more strict. + */ + list_for_each_entry(hose, &hose_list, list_node) { + phb = hose->private_data; + + if (phb->type == PNV_PHB_NPU_NVLINK) + continue; + + list_for_each_entry(pe, &phb->ioda.pe_list, list) + pnv_ioda_setup_bus_iommu_group(pe); + } + /* * Now we have all PHBs discovered, time to add NPU devices to * the corresponding IOMMU groups. @@ -2801,9 +2858,6 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, /* TVE #1 is selected by PCI address bit 59 */ pe->tce_bypass_base = 1ull << 59; - iommu_register_group(&pe->table_group, phb->hose->global_number, - pe->pe_number); - /* The PE will reserve all possible 32-bits space */ pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n", phb->ioda.m32_pci_base); @@ -2824,7 +2878,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, return; if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) - pnv_ioda_setup_bus_dma(pe, pe->pbus, true); + pnv_ioda_setup_bus_dma(pe, pe->pbus); } #ifdef CONFIG_PCI_MSI From patchwork Thu Dec 20 08:23:41 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 1016614 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=kvm-ppc-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=ozlabs.ru Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43L4b83GDpz9sNk for ; Thu, 20 Dec 2018 19:25:56 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730974AbeLTIYi (ORCPT ); Thu, 20 Dec 2018 03:24:38 -0500 Received: from ozlabs.ru ([107.173.13.209]:54528 "EHLO ozlabs.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730963AbeLTIYh (ORCPT ); Thu, 20 Dec 2018 03:24:37 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id 98520AE8001E; Thu, 20 Dec 2018 03:24:33 -0500 (EST) From: Alexey Kardashevskiy To: linuxppc-dev@lists.ozlabs.org Cc: Alexey Kardashevskiy , David Gibson , kvm-ppc@vger.kernel.org, kvm@vger.kernel.org, Alistair Popple , Reza Arbab , Sam Bobroff , Piotr Jaroszynski , =?utf-8?q?Leonardo_Augusto_Guimar=C3=A3es_?= =?utf-8?q?Garcia?= , Jose Ricardo Ziviani , Daniel Henrique Barboza , Alex Williamson , Paul Mackerras , linux-kernel@vger.kernel.org, Christoph 
Hellwig Subject: [PATCH kernel v7 11/20] powerpc/powernv: Reference iommu_table while it is linked to a group Date: Thu, 20 Dec 2018 19:23:41 +1100 Message-Id: <20181220082350.58113-12-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20181220082350.58113-1-aik@ozlabs.ru> References: <20181220082350.58113-1-aik@ozlabs.ru> Sender: kvm-ppc-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm-ppc@vger.kernel.org The iommu_table pointer stored in iommu_table_group may get stale by accident; this adds referencing and removes a redundant comment about it. Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson --- arch/powerpc/platforms/powernv/pci-ioda-tce.c | 3 ++- arch/powerpc/platforms/powernv/pci-ioda.c | 4 ---- 2 files changed, 2 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c index 7639b21..697449a 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c +++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c @@ -368,6 +368,7 @@ void pnv_pci_unlink_table_and_group(struct iommu_table *tbl, found = false; for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) { if (table_group->tables[i] == tbl) { + iommu_tce_table_put(tbl); table_group->tables[i] = NULL; found = true; break; @@ -393,7 +394,7 @@ long pnv_pci_link_table_and_group(int node, int num, tgl->table_group = table_group; list_add_rcu(&tgl->next, &tbl->it_group_list); - table_group->tables[num] = tbl; + table_group->tables[num] = iommu_tce_table_get(tbl); return 0; } diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index f6ab13d..a5879ab 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -2719,10 +2719,6 @@ static void pnv_ioda2_npu_take_ownership(struct iommu_table_group *table_group) { - /* - * Detach NPU first as pnv_ioda2_take_ownership() will destroy - * the iommu_table if 32bit DMA is enabled.
- */ pnv_npu_take_ownership(gpe_table_group_to_npe(table_group)); pnv_ioda2_take_ownership(table_group); } From patchwork Thu Dec 20 08:23:42 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 1016613 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=kvm-ppc-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=ozlabs.ru Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43L4b3514Zz9sNj for ; Thu, 20 Dec 2018 19:25:51 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730963AbeLTIYn (ORCPT ); Thu, 20 Dec 2018 03:24:43 -0500 Received: from ozlabs.ru ([107.173.13.209]:54572 "EHLO ozlabs.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730982AbeLTIYl (ORCPT ); Thu, 20 Dec 2018 03:24:41 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id 23C12AE80336; Thu, 20 Dec 2018 03:24:36 -0500 (EST) From: Alexey Kardashevskiy To: linuxppc-dev@lists.ozlabs.org Cc: Alexey Kardashevskiy , David Gibson , kvm-ppc@vger.kernel.org, kvm@vger.kernel.org, Alistair Popple , Reza Arbab , Sam Bobroff , Piotr Jaroszynski , Leonardo Augusto Guimarães Garcia , Jose Ricardo Ziviani , Daniel Henrique Barboza , Alex Williamson , Paul Mackerras , linux-kernel@vger.kernel.org, Christoph Hellwig Subject: [PATCH kernel v7 12/20] powerpc/powernv/npu: Move single TVE handling to NPU PE Date: Thu, 20 Dec 2018 19:23:42 +1100 Message-Id: <20181220082350.58113-13-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20181220082350.58113-1-aik@ozlabs.ru> References: <20181220082350.58113-1-aik@ozlabs.ru> Sender: kvm-ppc-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm-ppc@vger.kernel.org Normal PCI PEs have 2 TVEs, one per DMA window; however an NPU PE has only one, which points to one of the two tables of the corresponding PCI PE. So whenever a new DMA window is programmed to the PEs, the NPU PE needs to release the old table in order to use the new one. Commit d41ce7b1bcc3e ("powerpc/powernv/npu: Do not try invalidating 32bit table when 64bit table is enabled") did just that, but in pci-ioda.c while it actually belongs to npu-dma.c. This moves the single TVE handling to npu-dma.c. This does not implement restoring though, as it is highly unlikely that we can set the table on the PCI PE but cannot on the NPU PE; and if that fails, we could only set the 32bit table on the NPU PE, and this configuration is not really supported or wanted.
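The core of the move is this invariant, reflowed from the diff below for readability: the NPU PE owns a single TVE, so installing window 'num' must first drop the other window:

	int num2 = (num == 0) ? 1 : 0;

	/* NPU has just one TVE so if there is another table, remove it first */
	if (npe->table_group.tables[num2])
		pnv_npu_unset_window(npe, num2);

pnv_npu_unset_window() in turn returns early when the window is already clear, so callers may invoke it unconditionally.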
Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/platforms/powernv/npu-dma.c | 8 +++++++ arch/powerpc/platforms/powernv/pci-ioda.c | 27 +++-------------------- 2 files changed, 11 insertions(+), 24 deletions(-) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index ef1457f..26063fb 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -130,6 +130,11 @@ long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num, tbl->it_level_size : tbl->it_size; const __u64 start_addr = tbl->it_offset << tbl->it_page_shift; const __u64 win_size = tbl->it_size << tbl->it_page_shift; + int num2 = (num == 0) ? 1 : 0; + + /* NPU has just one TVE so if there is another table, remove it first */ + if (npe->table_group.tables[num2]) + pnv_npu_unset_window(npe, num2); pe_info(npe, "Setting up window %llx..%llx pg=%lx\n", start_addr, start_addr + win_size - 1, @@ -160,6 +165,9 @@ long pnv_npu_unset_window(struct pnv_ioda_pe *npe, int num) struct pnv_phb *phb = npe->phb; int64_t rc; + if (!npe->table_group.tables[num]) + return 0; + pe_info(npe, "Removing DMA window\n"); rc = opal_pci_map_pe_dma_window(phb->opal_id, npe->pe_number, diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index a5879ab..1ee3c5d6 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -2672,23 +2672,14 @@ static struct pnv_ioda_pe *gpe_table_group_to_npe( static long pnv_pci_ioda2_npu_set_window(struct iommu_table_group *table_group, int num, struct iommu_table *tbl) { - struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group); - int num2 = (num == 0) ? 1 : 0; long ret = pnv_pci_ioda2_set_window(table_group, num, tbl); if (ret) return ret; - if (table_group->tables[num2]) - pnv_npu_unset_window(npe, num2); - - ret = pnv_npu_set_window(npe, num, tbl); - if (ret) { + ret = pnv_npu_set_window(gpe_table_group_to_npe(table_group), num, tbl); + if (ret) pnv_pci_ioda2_unset_window(table_group, num); - if (table_group->tables[num2]) - pnv_npu_set_window(npe, num2, - table_group->tables[num2]); - } return ret; } @@ -2697,24 +2688,12 @@ static long pnv_pci_ioda2_npu_unset_window( struct iommu_table_group *table_group, int num) { - struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group); - int num2 = (num == 0) ? 
1 : 0; long ret = pnv_pci_ioda2_unset_window(table_group, num); if (ret) return ret; - if (!npe->table_group.tables[num]) - return 0; - - ret = pnv_npu_unset_window(npe, num); - if (ret) - return ret; - - if (table_group->tables[num2]) - ret = pnv_npu_set_window(npe, num2, table_group->tables[num2]); - - return ret; + return pnv_npu_unset_window(gpe_table_group_to_npe(table_group), num); } static void pnv_ioda2_npu_take_ownership(struct iommu_table_group *table_group) From patchwork Thu Dec 20 08:23:43 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 1016612 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=kvm-ppc-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=ozlabs.ru Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43L4Zz33WZz9sLw for ; Thu, 20 Dec 2018 19:25:47 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731015AbeLTIYo (ORCPT ); Thu, 20 Dec 2018 03:24:44 -0500 Received: from ozlabs.ru ([107.173.13.209]:54618 "EHLO ozlabs.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730982AbeLTIYo (ORCPT ); Thu, 20 Dec 2018 03:24:44 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id A2404AE80337; Thu, 20 Dec 2018 03:24:40 -0500 (EST) From: Alexey Kardashevskiy To: linuxppc-dev@lists.ozlabs.org Cc: Alexey Kardashevskiy , David Gibson , kvm-ppc@vger.kernel.org, kvm@vger.kernel.org, Alistair Popple , Reza Arbab , Sam Bobroff , Piotr Jaroszynski , Leonardo Augusto Guimarães Garcia , Jose Ricardo Ziviani , Daniel Henrique Barboza , Alex Williamson , Paul Mackerras , linux-kernel@vger.kernel.org, Christoph Hellwig Subject: [PATCH kernel v7 13/20] powerpc/powernv/npu: Convert NPU IOMMU helpers to iommu_table_group_ops Date: Thu, 20 Dec 2018 19:23:43 +1100 Message-Id: <20181220082350.58113-14-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20181220082350.58113-1-aik@ozlabs.ru> References: <20181220082350.58113-1-aik@ozlabs.ru> Sender: kvm-ppc-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm-ppc@vger.kernel.org At the moment the NPU IOMMU is manipulated directly from the IODA2 PCI PE code; the PCI PE acts as a master to the NPU PE. Soon we will have compound IOMMU groups with several PEs from several different PHBs (such as interconnected GPUs and NPUs), so there will be no single master but one big IOMMU group. This makes the first step and converts an NPU PE with a set of extern functions into a table group. This should cause no behavioral change. Note that pnv_npu_release_ownership() has never been implemented.
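The conversion uses the usual container_of() recovery pattern so the previously extern helpers can be plugged into iommu_table_group_ops; reflowed from the diff below:

	static long pnv_npu_set_window(struct iommu_table_group *table_group,
			int num, struct iommu_table *tbl)
	{
		struct pnv_ioda_pe *npe = container_of(table_group,
				struct pnv_ioda_pe, table_group);
		/* ... program the NPU TVE as before ... */
	}

	static struct iommu_table_group_ops pnv_pci_npu_ops = {
		.set_window	= pnv_npu_set_window,
		.unset_window	= pnv_npu_unset_window,
		.take_ownership	= pnv_npu_take_ownership,
	};

The helpers become static and are reached only via the ops table.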
Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson --- arch/powerpc/platforms/powernv/pci.h | 5 ---- arch/powerpc/platforms/powernv/npu-dma.c | 34 ++++++++++++++++++----- arch/powerpc/platforms/powernv/pci-ioda.c | 10 +++++-- 3 files changed, 34 insertions(+), 15 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h index ddb4f02..cf9f748 100644 --- a/arch/powerpc/platforms/powernv/pci.h +++ b/arch/powerpc/platforms/powernv/pci.h @@ -216,11 +216,6 @@ extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level, extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass); extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm); extern struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe); -extern long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num, - struct iommu_table *tbl); -extern long pnv_npu_unset_window(struct pnv_ioda_pe *npe, int num); -extern void pnv_npu_take_ownership(struct pnv_ioda_pe *npe); -extern void pnv_npu_release_ownership(struct pnv_ioda_pe *npe); /* pci-ioda-tce.c */ #define POWERNV_IOMMU_DEFAULT_LEVELS 1 diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index 26063fb..dc629ee 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -121,9 +121,14 @@ static struct pnv_ioda_pe *get_gpu_pci_dev_and_pe(struct pnv_ioda_pe *npe, return pe; } -long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num, +static long pnv_npu_unset_window(struct iommu_table_group *table_group, + int num); + +static long pnv_npu_set_window(struct iommu_table_group *table_group, int num, struct iommu_table *tbl) { + struct pnv_ioda_pe *npe = container_of(table_group, struct pnv_ioda_pe, + table_group); struct pnv_phb *phb = npe->phb; int64_t rc; const unsigned long size = tbl->it_indirect_levels ? @@ -134,7 +139,7 @@ long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num, /* NPU has just one TVE so if there is another table, remove it first */ if (npe->table_group.tables[num2]) - pnv_npu_unset_window(npe, num2); + pnv_npu_unset_window(&npe->table_group, num2); pe_info(npe, "Setting up window %llx..%llx pg=%lx\n", start_addr, start_addr + win_size - 1, @@ -160,8 +165,10 @@ long pnv_npu_set_window(struct pnv_ioda_pe *npe, int num, return 0; } -long pnv_npu_unset_window(struct pnv_ioda_pe *npe, int num) +static long pnv_npu_unset_window(struct iommu_table_group *table_group, int num) { + struct pnv_ioda_pe *npe = container_of(table_group, struct pnv_ioda_pe, + table_group); struct pnv_phb *phb = npe->phb; int64_t rc; @@ -206,7 +213,8 @@ static void pnv_npu_dma_set_32(struct pnv_ioda_pe *npe) if (!gpe) return; - rc = pnv_npu_set_window(npe, 0, gpe->table_group.tables[0]); + rc = pnv_npu_set_window(&npe->table_group, 0, + gpe->table_group.tables[0]); /* * NVLink devices use the same TCE table configuration as @@ -231,7 +239,7 @@ static int pnv_npu_dma_set_bypass(struct pnv_ioda_pe *npe) if (phb->type != PNV_PHB_NPU_NVLINK || !npe->pdev) return -EINVAL; - rc = pnv_npu_unset_window(npe, 0); + rc = pnv_npu_unset_window(&npe->table_group, 0); if (rc != OPAL_SUCCESS) return rc; @@ -284,9 +292,12 @@ void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass) } } +#ifdef CONFIG_IOMMU_API /* Switch ownership from platform code to external user (e.g. 
VFIO) */ -void pnv_npu_take_ownership(struct pnv_ioda_pe *npe) +static void pnv_npu_take_ownership(struct iommu_table_group *table_group) { + struct pnv_ioda_pe *npe = container_of(table_group, struct pnv_ioda_pe, + table_group); struct pnv_phb *phb = npe->phb; int64_t rc; @@ -297,7 +308,7 @@ void pnv_npu_take_ownership(struct pnv_ioda_pe *npe) * if it was enabled at the moment of ownership change. */ if (npe->table_group.tables[0]) { - pnv_npu_unset_window(npe, 0); + pnv_npu_unset_window(&npe->table_group, 0); return; } @@ -312,6 +323,12 @@ void pnv_npu_take_ownership(struct pnv_ioda_pe *npe) pnv_pci_ioda2_tce_invalidate_entire(npe->phb, false); } +static struct iommu_table_group_ops pnv_pci_npu_ops = { + .set_window = pnv_npu_set_window, + .unset_window = pnv_npu_unset_window, + .take_ownership = pnv_npu_take_ownership, +}; + struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe) { struct pnv_phb *phb = npe->phb; @@ -322,6 +339,8 @@ struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe) if (!gpe || !gpdev) return NULL; + npe->table_group.ops = &pnv_pci_npu_ops; + list_for_each_entry(npdev, &pbus->devices, bus_list) { gptmp = pnv_pci_get_gpu_dev(npdev); @@ -334,6 +353,7 @@ struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe) return gpe; } +#endif /* !CONFIG_IOMMU_API */ /* * NPU2 ATS diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 1ee3c5d6..6972054 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -2672,12 +2672,13 @@ static struct pnv_ioda_pe *gpe_table_group_to_npe( static long pnv_pci_ioda2_npu_set_window(struct iommu_table_group *table_group, int num, struct iommu_table *tbl) { + struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group); long ret = pnv_pci_ioda2_set_window(table_group, num, tbl); if (ret) return ret; - ret = pnv_npu_set_window(gpe_table_group_to_npe(table_group), num, tbl); + ret = npe->table_group.ops->set_window(&npe->table_group, num, tbl); if (ret) pnv_pci_ioda2_unset_window(table_group, num); @@ -2688,17 +2689,20 @@ static long pnv_pci_ioda2_npu_unset_window( struct iommu_table_group *table_group, int num) { + struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group); long ret = pnv_pci_ioda2_unset_window(table_group, num); if (ret) return ret; - return pnv_npu_unset_window(gpe_table_group_to_npe(table_group), num); + return npe->table_group.ops->unset_window(&npe->table_group, num); } static void pnv_ioda2_npu_take_ownership(struct iommu_table_group *table_group) { - pnv_npu_take_ownership(gpe_table_group_to_npe(table_group)); + struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group); + + npe->table_group.ops->take_ownership(&npe->table_group); pnv_ioda2_take_ownership(table_group); } From patchwork Thu Dec 20 08:23:44 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 1016611 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=kvm-ppc-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=ozlabs.ru Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43L4Zv4Rprz9sLw for ; Thu, 20 Dec 
2018 19:25:43 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731032AbeLTIYt (ORCPT ); Thu, 20 Dec 2018 03:24:49 -0500 Received: from ozlabs.ru ([107.173.13.209]:54661 "EHLO ozlabs.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730982AbeLTIYs (ORCPT ); Thu, 20 Dec 2018 03:24:48 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id 2CB80AE8035D; Thu, 20 Dec 2018 03:24:43 -0500 (EST) From: Alexey Kardashevskiy To: linuxppc-dev@lists.ozlabs.org Cc: Alexey Kardashevskiy , David Gibson , kvm-ppc@vger.kernel.org, kvm@vger.kernel.org, Alistair Popple , Reza Arbab , Sam Bobroff , Piotr Jaroszynski , Leonardo Augusto Guimarães Garcia , Jose Ricardo Ziviani , Daniel Henrique Barboza , Alex Williamson , Paul Mackerras , linux-kernel@vger.kernel.org, Christoph Hellwig Subject: [PATCH kernel v7 14/20] powerpc/powernv/npu: Add compound IOMMU groups Date: Thu, 20 Dec 2018 19:23:44 +1100 Message-Id: <20181220082350.58113-15-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20181220082350.58113-1-aik@ozlabs.ru> References: <20181220082350.58113-1-aik@ozlabs.ru> Sender: kvm-ppc-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm-ppc@vger.kernel.org At the moment the powernv platform registers an IOMMU group for each PE. There is an exception though: an NVLink bridge which is attached to the corresponding GPU's IOMMU group, making the GPU a master. Now we have POWER9 systems with GPUs connected to each other directly, bypassing PCI. At the moment we do not control the state of these links so we have to put such interconnected GPUs into one IOMMU group, which means that the old scheme with one GPU as a master won't work - there will be up to 3 GPUs in such a group. This introduces an npu_comp struct which represents a compound IOMMU group made of multiple PEs - PCI PEs (for GPUs) and NPU PEs (for NVLink bridges). This converts the existing NVLink1 code to use the new scheme. From now on, each PE must have a valid iommu_table_group_ops which will either be called directly (for a single-PE group) or indirectly from the compound group handlers. This moves IOMMU group registration for NVLink-connected GPUs to npu-dma.c. For POWER8, this stores a new compound group pointer in the PE (so a GPU is still a master); for POWER9 the new group pointer is stored in an NPU (which is allocated per PCI host controller).
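The shape of the new compound group, condensed from the diff below: npu_comp embeds the shared iommu_table_group plus an array of member PEs, and every ops callback fans out over those members. A simplified sketch with the error unwinding omitted (the real pnv_npu_peers_set_window() rolls back the windows already set on earlier members if one of them fails):

	struct npu_comp {
		struct iommu_table_group table_group;
		int pe_num;
		struct pnv_ioda_pe *pe[NV_NPU_MAX_PE_NUM];
	};

	static long pnv_npu_peers_set_window(struct iommu_table_group *table_group,
			int num, struct iommu_table *tbl)
	{
		struct npu_comp *npucomp = container_of(table_group,
				struct npu_comp, table_group);
		int i;

		for (i = 0; i < npucomp->pe_num; ++i) {
			struct pnv_ioda_pe *pe = npucomp->pe[i];

			/* fan the window out to every member PE */
			if (pe->table_group.ops->set_window)
				pe->table_group.ops->set_window(&pe->table_group,
						num, tbl);
		}

		return 0;
	}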
Signed-off-by: Alexey Kardashevskiy --- Changes: v7: * fixed compiler warning in pnv_try_setup_npu_table_group() v5: * now read page sizes from PHB NVLink to narrow down what the compound PE can actually support (hint: 4K/64K only) --- arch/powerpc/include/asm/pci.h | 1 + arch/powerpc/platforms/powernv/pci.h | 7 + arch/powerpc/platforms/powernv/npu-dma.c | 291 ++++++++++++++++++++-- arch/powerpc/platforms/powernv/pci-ioda.c | 159 ++++-------- 4 files changed, 322 insertions(+), 136 deletions(-) diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h index baf2886..0c72f18 100644 --- a/arch/powerpc/include/asm/pci.h +++ b/arch/powerpc/include/asm/pci.h @@ -132,5 +132,6 @@ extern struct pci_dev *pnv_pci_get_npu_dev(struct pci_dev *gpdev, int index); extern int pnv_npu2_init(struct pci_controller *hose); extern int pnv_npu2_map_lpar_dev(struct pci_dev *gpdev, unsigned int lparid, unsigned long msr); +extern int pnv_npu2_unmap_lpar_dev(struct pci_dev *gpdev); #endif /* __ASM_POWERPC_PCI_H */ diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h index cf9f748..aef4bb5 100644 --- a/arch/powerpc/platforms/powernv/pci.h +++ b/arch/powerpc/platforms/powernv/pci.h @@ -62,6 +62,7 @@ struct pnv_ioda_pe { /* "Base" iommu table, ie, 4K TCEs, 32-bit DMA */ struct iommu_table_group table_group; + struct npu_comp *npucomp; /* 64-bit TCE bypass region */ bool tce_bypass_enabled; @@ -201,6 +202,8 @@ extern void pnv_teardown_msi_irqs(struct pci_dev *pdev); extern struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev); extern void pnv_set_msi_irq_chip(struct pnv_phb *phb, unsigned int virq); extern void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable); +extern unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift, + __u64 window_size, __u32 levels); extern int pnv_eeh_post_init(void); extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level, @@ -216,6 +219,10 @@ extern void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level, extern void pnv_npu_try_dma_set_bypass(struct pci_dev *gpdev, bool bypass); extern void pnv_pci_ioda2_tce_invalidate_entire(struct pnv_phb *phb, bool rm); extern struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe); +extern struct iommu_table_group *pnv_try_setup_npu_table_group( + struct pnv_ioda_pe *pe); +extern struct iommu_table_group *pnv_npu_compound_attach( + struct pnv_ioda_pe *pe); /* pci-ioda-tce.c */ #define POWERNV_IOMMU_DEFAULT_LEVELS 1 diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index dc629ee..d93a2cd 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -328,31 +328,6 @@ static struct iommu_table_group_ops pnv_pci_npu_ops = { .unset_window = pnv_npu_unset_window, .take_ownership = pnv_npu_take_ownership, }; - -struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe) -{ - struct pnv_phb *phb = npe->phb; - struct pci_bus *pbus = phb->hose->bus; - struct pci_dev *npdev, *gpdev = NULL, *gptmp; - struct pnv_ioda_pe *gpe = get_gpu_pci_dev_and_pe(npe, &gpdev); - - if (!gpe || !gpdev) - return NULL; - - npe->table_group.ops = &pnv_pci_npu_ops; - - list_for_each_entry(npdev, &pbus->devices, bus_list) { - gptmp = pnv_pci_get_gpu_dev(npdev); - - if (gptmp != gpdev) - continue; - - pe_info(gpe, "Attached NPU %s\n", dev_name(&npdev->dev)); - iommu_group_add_device(gpe->table_group.group, &npdev->dev); - } - - return gpe; -} #endif /* !CONFIG_IOMMU_API
*/ /* @@ -360,6 +335,17 @@ struct pnv_ioda_pe *pnv_pci_npu_setup_iommu(struct pnv_ioda_pe *npe) */ /* Maximum possible number of ATSD MMIO registers per NPU */ #define NV_NMMU_ATSD_REGS 8 +#define NV_NPU_MAX_PE_NUM 16 + +/* + * A compound NPU IOMMU group which might consist of 1 GPU + 2xNPUs (POWER8) or + * up to 3 x (GPU + 2xNPUs) (POWER9). + */ +struct npu_comp { + struct iommu_table_group table_group; + int pe_num; + struct pnv_ioda_pe *pe[NV_NPU_MAX_PE_NUM]; +}; /* An NPU descriptor, valid for POWER9 only */ struct npu { @@ -372,8 +358,263 @@ struct npu { /* Do we need to explicitly flush the nest mmu? */ bool nmmu_flush; + + struct npu_comp npucomp; }; +#ifdef CONFIG_IOMMU_API +static long pnv_npu_peers_create_table_userspace( + struct iommu_table_group *table_group, + int num, __u32 page_shift, __u64 window_size, __u32 levels, + struct iommu_table **ptbl) +{ + struct npu_comp *npucomp = container_of(table_group, struct npu_comp, + table_group); + + if (!npucomp->pe_num || !npucomp->pe[0] || + !npucomp->pe[0]->table_group.ops || + !npucomp->pe[0]->table_group.ops->create_table) + return -EFAULT; + + return npucomp->pe[0]->table_group.ops->create_table( + &npucomp->pe[0]->table_group, num, page_shift, + window_size, levels, ptbl); +} + +static long pnv_npu_peers_set_window(struct iommu_table_group *table_group, + int num, struct iommu_table *tbl) +{ + int i, j; + long ret = 0; + struct npu_comp *npucomp = container_of(table_group, struct npu_comp, + table_group); + + for (i = 0; i < npucomp->pe_num; ++i) { + struct pnv_ioda_pe *pe = npucomp->pe[i]; + + if (!pe->table_group.ops->set_window) + continue; + + ret = pe->table_group.ops->set_window(&pe->table_group, + num, tbl); + if (ret) + break; + } + + if (ret) { + for (j = 0; j < i; ++j) { + struct pnv_ioda_pe *pe = npucomp->pe[j]; + + if (!pe->table_group.ops->unset_window) + continue; + + ret = pe->table_group.ops->unset_window( + &pe->table_group, num); + if (ret) + break; + } + } else { + table_group->tables[num] = iommu_tce_table_get(tbl); + } + + return ret; +} + +static long pnv_npu_peers_unset_window(struct iommu_table_group *table_group, + int num) +{ + int i, j; + long ret = 0; + struct npu_comp *npucomp = container_of(table_group, struct npu_comp, + table_group); + + for (i = 0; i < npucomp->pe_num; ++i) { + struct pnv_ioda_pe *pe = npucomp->pe[i]; + + WARN_ON(npucomp->table_group.tables[num] != + table_group->tables[num]); + if (!npucomp->table_group.tables[num]) + continue; + + if (!pe->table_group.ops->unset_window) + continue; + + ret = pe->table_group.ops->unset_window(&pe->table_group, num); + if (ret) + break; + } + + if (ret) { + for (j = 0; j < i; ++j) { + struct pnv_ioda_pe *pe = npucomp->pe[j]; + + if (!npucomp->table_group.tables[num]) + continue; + + if (!pe->table_group.ops->set_window) + continue; + + ret = pe->table_group.ops->set_window(&pe->table_group, + num, table_group->tables[num]); + if (ret) + break; + } + } else if (table_group->tables[num]) { + iommu_tce_table_put(table_group->tables[num]); + table_group->tables[num] = NULL; + } + + return ret; +} + +static void pnv_npu_peers_take_ownership(struct iommu_table_group *table_group) +{ + int i; + struct npu_comp *npucomp = container_of(table_group, struct npu_comp, + table_group); + + for (i = 0; i < npucomp->pe_num; ++i) { + struct pnv_ioda_pe *pe = npucomp->pe[i]; + + if (!pe->table_group.ops->take_ownership) + continue; + pe->table_group.ops->take_ownership(&pe->table_group); + } +} + +static void pnv_npu_peers_release_ownership( + struct 
iommu_table_group *table_group) +{ + int i; + struct npu_comp *npucomp = container_of(table_group, struct npu_comp, + table_group); + + for (i = 0; i < npucomp->pe_num; ++i) { + struct pnv_ioda_pe *pe = npucomp->pe[i]; + + if (!pe->table_group.ops->release_ownership) + continue; + pe->table_group.ops->release_ownership(&pe->table_group); + } +} + +static struct iommu_table_group_ops pnv_npu_peers_ops = { + .get_table_size = pnv_pci_ioda2_get_table_size, + .create_table = pnv_npu_peers_create_table_userspace, + .set_window = pnv_npu_peers_set_window, + .unset_window = pnv_npu_peers_unset_window, + .take_ownership = pnv_npu_peers_take_ownership, + .release_ownership = pnv_npu_peers_release_ownership, +}; + +static void pnv_comp_attach_table_group(struct npu_comp *npucomp, + struct pnv_ioda_pe *pe) +{ + if (WARN_ON(npucomp->pe_num == NV_NPU_MAX_PE_NUM)) + return; + + npucomp->pe[npucomp->pe_num] = pe; + ++npucomp->pe_num; +} + +struct iommu_table_group *pnv_try_setup_npu_table_group(struct pnv_ioda_pe *pe) +{ + struct iommu_table_group *table_group; + struct npu_comp *npucomp; + struct pci_dev *gpdev = NULL; + struct pci_controller *hose; + struct pci_dev *npdev = NULL; + + list_for_each_entry(gpdev, &pe->pbus->devices, bus_list) { + npdev = pnv_pci_get_npu_dev(gpdev, 0); + if (npdev) + break; + } + + if (!npdev) + /* It is not an NPU attached device, skip */ + return NULL; + + hose = pci_bus_to_host(npdev->bus); + + if (hose->npu) { + table_group = &hose->npu->npucomp.table_group; + + if (!table_group->group) { + table_group->ops = &pnv_npu_peers_ops; + iommu_register_group(table_group, + hose->global_number, + pe->pe_number); + } + } else { + /* Create a group for 1 GPU and attached NPUs for POWER8 */ + pe->npucomp = kzalloc(sizeof(pe->npucomp), GFP_KERNEL); + table_group = &pe->npucomp->table_group; + table_group->ops = &pnv_npu_peers_ops; + iommu_register_group(table_group, hose->global_number, + pe->pe_number); + } + + /* Steal capabilities from a GPU PE */ + table_group->max_dynamic_windows_supported = + pe->table_group.max_dynamic_windows_supported; + table_group->tce32_start = pe->table_group.tce32_start; + table_group->tce32_size = pe->table_group.tce32_size; + table_group->max_levels = pe->table_group.max_levels; + if (!table_group->pgsizes) + table_group->pgsizes = pe->table_group.pgsizes; + + npucomp = container_of(table_group, struct npu_comp, table_group); + pnv_comp_attach_table_group(npucomp, pe); + + return table_group; +} + +struct iommu_table_group *pnv_npu_compound_attach(struct pnv_ioda_pe *pe) +{ + struct iommu_table_group *table_group; + struct npu_comp *npucomp; + struct pci_dev *gpdev = NULL; + struct pci_dev *npdev; + struct pnv_ioda_pe *gpe = get_gpu_pci_dev_and_pe(pe, &gpdev); + + WARN_ON(!(pe->flags & PNV_IODA_PE_DEV)); + if (!gpe) + return NULL; + + /* + * IODA2 bridges get this set up from pci_controller_ops::setup_bridge + * but NPU bridges do not have this hook defined so we do it here. + * We do not setup other table group parameters as they won't be used + * anyway - NVLink bridges are subordinate PEs. + */ + pe->table_group.ops = &pnv_pci_npu_ops; + + table_group = iommu_group_get_iommudata( + iommu_group_get(&gpdev->dev)); + + /* + * On P9 NPU PHB and PCI PHB support different page sizes, + * keep only matching. We expect here that NVLink bridge PE pgsizes is + * initialized by the caller. 
+ */ + table_group->pgsizes &= pe->table_group.pgsizes; + npucomp = container_of(table_group, struct npu_comp, table_group); + pnv_comp_attach_table_group(npucomp, pe); + + list_for_each_entry(npdev, &pe->phb->hose->bus->devices, bus_list) { + struct pci_dev *gpdevtmp = pnv_pci_get_gpu_dev(npdev); + + if (gpdevtmp != gpdev) + continue; + + iommu_add_device(table_group, &npdev->dev); + } + + return table_group; +} +#endif /* CONFIG_IOMMU_API */ + /* Maximum number of nvlinks per npu */ #define NV_MAX_LINKS 6 diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 6972054..6fe8907 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -190,7 +190,8 @@ static void pnv_ioda_free_pe(struct pnv_ioda_pe *pe) unsigned int pe_num = pe->pe_number; WARN_ON(pe->pdev); - + WARN_ON(pe->npucomp); /* NPUs are not supposed to be freed */ + kfree(pe->npucomp); memset(pe, 0, sizeof(struct pnv_ioda_pe)); clear_bit(pe_num, phb->ioda.pe_alloc); } @@ -1539,7 +1540,9 @@ void pnv_pci_sriov_disable(struct pci_dev *pdev) static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe); #ifdef CONFIG_IOMMU_API -static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe); +static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe, + struct iommu_table_group *table_group, struct pci_bus *bus); + #endif static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs) { @@ -1595,7 +1598,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs) pnv_pci_ioda2_setup_dma_pe(phb, pe); #ifdef CONFIG_IOMMU_API - pnv_ioda_setup_bus_iommu_group(pe); + pnv_ioda_setup_bus_iommu_group(pe, &pe->table_group, NULL); #endif } } @@ -2557,7 +2560,7 @@ static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group, #endif #ifdef CONFIG_IOMMU_API -static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift, +unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift, __u64 window_size, __u32 levels) { unsigned long bytes = 0; @@ -2631,127 +2634,40 @@ static struct iommu_table_group_ops pnv_pci_ioda2_ops = { .release_ownership = pnv_ioda2_release_ownership, }; -static int gpe_table_group_to_npe_cb(struct device *dev, void *opaque) -{ - struct pci_controller *hose; - struct pnv_phb *phb; - struct pnv_ioda_pe **ptmppe = opaque; - struct pci_dev *pdev = container_of(dev, struct pci_dev, dev); - struct pci_dn *pdn = pci_get_pdn(pdev); - - if (!pdn || pdn->pe_number == IODA_INVALID_PE) - return 0; - - hose = pci_bus_to_host(pdev->bus); - phb = hose->private_data; - if (phb->type != PNV_PHB_NPU_NVLINK) - return 0; - - *ptmppe = &phb->ioda.pe_array[pdn->pe_number]; - - return 1; -} - -/* - * This returns PE of associated NPU. - * This assumes that NPU is in the same IOMMU group with GPU and there is - * no other PEs. 
- */ -static struct pnv_ioda_pe *gpe_table_group_to_npe( - struct iommu_table_group *table_group) -{ - struct pnv_ioda_pe *npe = NULL; - int ret = iommu_group_for_each_dev(table_group->group, &npe, - gpe_table_group_to_npe_cb); - - BUG_ON(!ret || !npe); - - return npe; -} - -static long pnv_pci_ioda2_npu_set_window(struct iommu_table_group *table_group, - int num, struct iommu_table *tbl) -{ - struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group); - long ret = pnv_pci_ioda2_set_window(table_group, num, tbl); - - if (ret) - return ret; - - ret = npe->table_group.ops->set_window(&npe->table_group, num, tbl); - if (ret) - pnv_pci_ioda2_unset_window(table_group, num); - - return ret; -} - -static long pnv_pci_ioda2_npu_unset_window( - struct iommu_table_group *table_group, - int num) -{ - struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group); - long ret = pnv_pci_ioda2_unset_window(table_group, num); - - if (ret) - return ret; - - return npe->table_group.ops->unset_window(&npe->table_group, num); -} - -static void pnv_ioda2_npu_take_ownership(struct iommu_table_group *table_group) -{ - struct pnv_ioda_pe *npe = gpe_table_group_to_npe(table_group); - - npe->table_group.ops->take_ownership(&npe->table_group); - pnv_ioda2_take_ownership(table_group); -} - -static struct iommu_table_group_ops pnv_pci_ioda2_npu_ops = { - .get_table_size = pnv_pci_ioda2_get_table_size, - .create_table = pnv_pci_ioda2_create_table_userspace, - .set_window = pnv_pci_ioda2_npu_set_window, - .unset_window = pnv_pci_ioda2_npu_unset_window, - .take_ownership = pnv_ioda2_npu_take_ownership, - .release_ownership = pnv_ioda2_release_ownership, -}; - static void pnv_ioda_setup_bus_iommu_group_add_devices(struct pnv_ioda_pe *pe, + struct iommu_table_group *table_group, struct pci_bus *bus) { struct pci_dev *dev; list_for_each_entry(dev, &bus->devices, bus_list) { - iommu_add_device(&pe->table_group, &dev->dev); + iommu_add_device(table_group, &dev->dev); if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate) pnv_ioda_setup_bus_iommu_group_add_devices(pe, - dev->subordinate); + table_group, dev->subordinate); } } -static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe) +static void pnv_ioda_setup_bus_iommu_group(struct pnv_ioda_pe *pe, + struct iommu_table_group *table_group, struct pci_bus *bus) { - if (!pnv_pci_ioda_pe_dma_weight(pe)) - return; - iommu_register_group(&pe->table_group, pe->phb->hose->global_number, - pe->pe_number); - - /* - * set_iommu_table_base(&pe->pdev->dev, tbl) should have been called - * by now - */ if (pe->flags & PNV_IODA_PE_DEV) - iommu_add_device(&pe->table_group, &pe->pdev->dev); - else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) - pnv_ioda_setup_bus_iommu_group_add_devices(pe, pe->pbus); + iommu_add_device(table_group, &pe->pdev->dev); + + if ((pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) || bus) + pnv_ioda_setup_bus_iommu_group_add_devices(pe, table_group, + bus); } +static unsigned long pnv_ioda_parse_tce_sizes(struct pnv_phb *phb); + static void pnv_pci_ioda_setup_iommu_api(void) { - struct pci_controller *hose, *tmp; + struct pci_controller *hose; struct pnv_phb *phb; - struct pnv_ioda_pe *pe, *gpe; + struct pnv_ioda_pe *pe; /* * There are 4 types of PEs: @@ -2773,24 +2689,45 @@ static void pnv_pci_ioda_setup_iommu_api(void) if (phb->type == PNV_PHB_NPU_NVLINK) continue; - list_for_each_entry(pe, &phb->ioda.pe_list, list) - pnv_ioda_setup_bus_iommu_group(pe); + list_for_each_entry(pe, &phb->ioda.pe_list, list) { + struct iommu_table_group 
*table_group; + + table_group = pnv_try_setup_npu_table_group(pe); + if (!table_group) { + if (!pnv_pci_ioda_pe_dma_weight(pe)) + continue; + + table_group = &pe->table_group; + iommu_register_group(&pe->table_group, + pe->phb->hose->global_number, + pe->pe_number); + } + pnv_ioda_setup_bus_iommu_group(pe, table_group, + pe->pbus); + } } /* * Now we have all PHBs discovered, time to add NPU devices to * the corresponding IOMMU groups. */ - list_for_each_entry_safe(hose, tmp, &hose_list, list_node) { + list_for_each_entry(hose, &hose_list, list_node) { + unsigned long pgsizes; + phb = hose->private_data; if (phb->type != PNV_PHB_NPU_NVLINK) continue; + pgsizes = pnv_ioda_parse_tce_sizes(phb); list_for_each_entry(pe, &phb->ioda.pe_list, list) { - gpe = pnv_pci_npu_setup_iommu(pe); - if (gpe) - gpe->table_group.ops = &pnv_pci_ioda2_npu_ops; + /* + * IODA2 bridges get this set up from + * pci_controller_ops::setup_bridge but NPU bridges + * do not have this hook defined so we do it here. + */ + pe->table_group.pgsizes = pgsizes; + pnv_npu_compound_attach(pe); } } } From patchwork Thu Dec 20 08:23:45 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 1016610 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=kvm-ppc-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=ozlabs.ru Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43L4Zp2BFFz9sNj for ; Thu, 20 Dec 2018 19:25:38 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731051AbeLTIYw (ORCPT ); Thu, 20 Dec 2018 03:24:52 -0500 Received: from ozlabs.ru ([107.173.13.209]:54711 "EHLO ozlabs.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730982AbeLTIYv (ORCPT ); Thu, 20 Dec 2018 03:24:51 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id B15F9AE8035E; Thu, 20 Dec 2018 03:24:47 -0500 (EST) From: Alexey Kardashevskiy To: linuxppc-dev@lists.ozlabs.org Cc: Alexey Kardashevskiy , David Gibson , kvm-ppc@vger.kernel.org, kvm@vger.kernel.org, Alistair Popple , Reza Arbab , Sam Bobroff , Piotr Jaroszynski , =?utf-8?q?Leonardo_Augusto_Guimar=C3=A3es_?= =?utf-8?q?Garcia?= , Jose Ricardo Ziviani , Daniel Henrique Barboza , Alex Williamson , Paul Mackerras , linux-kernel@vger.kernel.org, Christoph Hellwig Subject: [PATCH kernel v7 15/20] powerpc/powernv/npu: Add release_ownership hook Date: Thu, 20 Dec 2018 19:23:45 +1100 Message-Id: <20181220082350.58113-16-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20181220082350.58113-1-aik@ozlabs.ru> References: <20181220082350.58113-1-aik@ozlabs.ru> Sender: kvm-ppc-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm-ppc@vger.kernel.org In order to make ATS work and translate addresses for arbitrary LPID and PID, we need to program an NPU with LPID and allow PID wildcard matching with a specific MSR mask. This implements a helper to assign a GPU to LPAR and program the NPU with a wildcard for PID and a helper to do clean-up. The helper takes MSR (only DR/HV/PR/SF bits are allowed) to program them into NPU2 for ATS checkout requests support. 
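(Illustration only, not part of the patch: a minimal sketch of how a caller could pair the two helpers around guest ownership of one GPU. example_assign_gpu_to_guest() is a hypothetical name, and the MSR validation is an assumption derived from the "only DR/HV/PR/SF bits are allowed" rule above.)

static int example_assign_gpu_to_guest(struct pci_dev *gpdev,
				       unsigned int lparid)
{
	/* The only MSR bits NPU2 ATS checkout may match on, per above */
	const unsigned long allowed = MSR_DR | MSR_HV | MSR_PR | MSR_SF;
	unsigned long msr = MSR_DR | MSR_PR;	/* guest user context */
	int ret;

	if (WARN_ON(msr & ~allowed))
		return -EINVAL;

	ret = pnv_npu2_map_lpar_dev(gpdev, lparid, msr);
	if (ret)
		return ret;

	/* ... guest runs; the GPU issues ATS requests for this LPID ... */

	return pnv_npu2_unmap_lpar_dev(gpdev);
}
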
This exports pnv_npu2_unmap_lpar_dev() as following patches will use it from the VFIO driver. Signed-off-by: Alexey Kardashevskiy --- Changes: v5: * removed opal_purge_cache as it is a part of reset in skiboot now --- arch/powerpc/platforms/powernv/npu-dma.c | 51 ++++++++++++++++++++++++ 1 file changed, 51 insertions(+) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index d93a2cd..e06043b 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -300,6 +300,7 @@ static void pnv_npu_take_ownership(struct iommu_table_group *table_group) table_group); struct pnv_phb *phb = npe->phb; int64_t rc; + struct pci_dev *gpdev = NULL; /* * Note: NPU has just a single TVE in the hardware which means that @@ -321,12 +322,28 @@ static void pnv_npu_take_ownership(struct iommu_table_group *table_group) return; } pnv_pci_ioda2_tce_invalidate_entire(npe->phb, false); + + get_gpu_pci_dev_and_pe(npe, &gpdev); + if (gpdev) + pnv_npu2_unmap_lpar_dev(gpdev); +} + +static void pnv_npu_release_ownership(struct iommu_table_group *table_group) +{ + struct pnv_ioda_pe *npe = container_of(table_group, struct pnv_ioda_pe, + table_group); + struct pci_dev *gpdev = NULL; + + get_gpu_pci_dev_and_pe(npe, &gpdev); + if (gpdev) + pnv_npu2_map_lpar_dev(gpdev, 0, MSR_DR | MSR_PR | MSR_HV); } static struct iommu_table_group_ops pnv_pci_npu_ops = { .set_window = pnv_npu_set_window, .unset_window = pnv_npu_unset_window, .take_ownership = pnv_npu_take_ownership, + .release_ownership = pnv_npu_release_ownership, }; #endif /* !CONFIG_IOMMU_API */ @@ -1237,3 +1254,37 @@ void pnv_npu2_map_lpar(struct pnv_ioda_pe *gpe, unsigned long msr) list_for_each_entry(gpdev, &gpe->pbus->devices, bus_list) pnv_npu2_map_lpar_dev(gpdev, 0, msr); } + +int pnv_npu2_unmap_lpar_dev(struct pci_dev *gpdev) +{ + int ret; + struct pci_dev *npdev = pnv_pci_get_npu_dev(gpdev, 0); + struct pci_controller *hose; + struct pnv_phb *nphb; + + if (!npdev) + return -ENODEV; + + hose = pci_bus_to_host(npdev->bus); + nphb = hose->private_data; + + dev_dbg(&gpdev->dev, "destroy context opalid=%llu\n", + nphb->opal_id); + ret = opal_npu_destroy_context(nphb->opal_id, 0/*__unused*/, + PCI_DEVID(gpdev->bus->number, gpdev->devfn)); + if (ret < 0) { + dev_err(&gpdev->dev, "Failed to destroy context: %d\n", ret); + return ret; + } + + /* Set LPID to 0 anyway, just to be safe */ + dev_dbg(&gpdev->dev, "Map LPAR opalid=%llu lparid=0\n", nphb->opal_id); + ret = opal_npu_map_lpar(nphb->opal_id, + PCI_DEVID(gpdev->bus->number, gpdev->devfn), 0 /*LPID*/, + 0 /* LPCR bits */); + if (ret) + dev_err(&gpdev->dev, "Error %d mapping device to LPAR\n", ret); + + return ret; +} +EXPORT_SYMBOL_GPL(pnv_npu2_unmap_lpar_dev); From patchwork Thu Dec 20 08:23:46 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 1016609 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=kvm-ppc-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=ozlabs.ru Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43L4Zl1n8Pz9sNw for ; Thu, 20 Dec 2018 19:25:35 +1100 (AEDT) Received: (majordomo@vger.kernel.org) 
by vger.kernel.org via listexpand id S1731070AbeLTIYz (ORCPT ); Thu, 20 Dec 2018 03:24:55 -0500 Received: from ozlabs.ru ([107.173.13.209]:54764 "EHLO ozlabs.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1731067AbeLTIYz (ORCPT ); Thu, 20 Dec 2018 03:24:55 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id 38ADAAE8035F; Thu, 20 Dec 2018 03:24:51 -0500 (EST) From: Alexey Kardashevskiy To: linuxppc-dev@lists.ozlabs.org Cc: Alexey Kardashevskiy , David Gibson , kvm-ppc@vger.kernel.org, kvm@vger.kernel.org, Alistair Popple , Reza Arbab , Sam Bobroff , Piotr Jaroszynski , =?utf-8?q?Leonardo_Augusto_Guimar=C3=A3es_?= =?utf-8?q?Garcia?= , Jose Ricardo Ziviani , Daniel Henrique Barboza , Alex Williamson , Paul Mackerras , linux-kernel@vger.kernel.org, Christoph Hellwig Subject: [PATCH kernel v7 16/20] powerpc/powernv/npu: Check mmio_atsd array bounds when populating Date: Thu, 20 Dec 2018 19:23:46 +1100 Message-Id: <20181220082350.58113-17-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20181220082350.58113-1-aik@ozlabs.ru> References: <20181220082350.58113-1-aik@ozlabs.ru> Sender: kvm-ppc-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm-ppc@vger.kernel.org A broken device tree might contain more than 8 values and introduce a hard-to-debug memory corruption bug. This adds a bounds check. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/platforms/powernv/npu-dma.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index e06043b..c6163b9 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -1179,8 +1179,9 @@ int pnv_npu2_init(struct pci_controller *hose) npu->nmmu_flush = of_property_read_bool(hose->dn, "ibm,nmmu-flush"); - for (i = 0; !of_property_read_u64_index(hose->dn, "ibm,mmio-atsd", - i, &mmio_atsd); i++) + for (i = 0; i < ARRAY_SIZE(npu->mmio_atsd_regs) && + !of_property_read_u64_index(hose->dn, "ibm,mmio-atsd", + i, &mmio_atsd); i++) npu->mmio_atsd_regs[i] = ioremap(mmio_atsd, 32); pr_info("NPU%d: Found %d MMIO ATSD registers", hose->global_number, i); From patchwork Thu Dec 20 08:23:47 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 1016608 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=kvm-ppc-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=ozlabs.ru Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43L4Zh4CdPz9sNw for ; Thu, 20 Dec 2018 19:25:32 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729456AbeLTIY7 (ORCPT ); Thu, 20 Dec 2018 03:24:59 -0500 Received: from ozlabs.ru ([107.173.13.209]:54840 "EHLO ozlabs.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1731085AbeLTIY7 (ORCPT ); Thu, 20 Dec 2018 03:24:59 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id B5A2DAE80381; Thu, 20 Dec 2018 03:24:54 -0500 (EST) From: Alexey Kardashevskiy
, David Gibson , kvm-ppc@vger.kernel.org, kvm@vger.kernel.org, Alistair Popple , Reza Arbab , Sam Bobroff , Piotr Jaroszynski , =?utf-8?q?Leonardo_Augusto_Guimar=C3=A3es_?= =?utf-8?q?Garcia?= , Jose Ricardo Ziviani , Daniel Henrique Barboza , Alex Williamson , Paul Mackerras , linux-kernel@vger.kernel.org, Christoph Hellwig Subject: [PATCH kernel v7 17/20] powerpc/powernv/npu: Fault user page into the hypervisor's pagetable Date: Thu, 20 Dec 2018 19:23:47 +1100 Message-Id: <20181220082350.58113-18-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20181220082350.58113-1-aik@ozlabs.ru> References: <20181220082350.58113-1-aik@ozlabs.ru> Sender: kvm-ppc-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm-ppc@vger.kernel.org When a page fault happens in a GPU, the GPU signals the OS and the GPU driver calls the fault handler which populates a page table; this allows the GPU to complete an ATS request. On bare metal, get_user_pages() is enough as it adds a pte to the kernel page table, but under KVM the partition scope tree does not get updated so ATS will still fail. This reads a byte from an effective address, which causes an HV storage interrupt and makes KVM update the partition scope tree. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/platforms/powernv/npu-dma.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c index c6163b9..12b8421 100644 --- a/arch/powerpc/platforms/powernv/npu-dma.c +++ b/arch/powerpc/platforms/powernv/npu-dma.c @@ -1133,6 +1133,8 @@ int pnv_npu2_handle_fault(struct npu_context *context, uintptr_t *ea, u64 rc = 0, result = 0; int i, is_write; struct page *page[1]; + const char __user *u; + char c; /* mmap_sem should be held so the struct_mm must be present */ struct mm_struct *mm = context->mm; @@ -1145,18 +1147,17 @@ int pnv_npu2_handle_fault(struct npu_context *context, uintptr_t *ea, is_write ? FOLL_WRITE : 0, page, NULL, NULL); - /* - * To support virtualised environments we will have to do an - * access to the page to ensure it gets faulted into the - * hypervisor. For the moment virtualisation is not supported in - * other areas so leave the access out.
- */ if (rc != 1) { status[i] = rc; result = -EFAULT; continue; } + /* Make sure partition scoped tree gets a pte */ + u = page_address(page[0]); + if (__get_user(c, u)) + result = -EFAULT; + status[i] = 0; put_page(page[0]); } From patchwork Thu Dec 20 08:23:48 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 1016607 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=kvm-ppc-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=ozlabs.ru Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43L4Zb6G06z9sNL for ; Thu, 20 Dec 2018 19:25:27 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731103AbeLTIZD (ORCPT ); Thu, 20 Dec 2018 03:25:03 -0500 Received: from ozlabs.ru ([107.173.13.209]:54908 "EHLO ozlabs.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1731097AbeLTIZD (ORCPT ); Thu, 20 Dec 2018 03:25:03 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id 40C26AE80490; Thu, 20 Dec 2018 03:24:58 -0500 (EST) From: Alexey Kardashevskiy To: linuxppc-dev@lists.ozlabs.org Cc: Alexey Kardashevskiy , David Gibson , kvm-ppc@vger.kernel.org, kvm@vger.kernel.org, Alistair Popple , Reza Arbab , Sam Bobroff , Piotr Jaroszynski , =?utf-8?q?Leonardo_Augusto_Guimar=C3=A3es_?= =?utf-8?q?Garcia?= , Jose Ricardo Ziviani , Daniel Henrique Barboza , Alex Williamson , Paul Mackerras , linux-kernel@vger.kernel.org, Christoph Hellwig Subject: [PATCH kernel v7 18/20] vfio_pci: Allow mapping extra regions Date: Thu, 20 Dec 2018 19:23:48 +1100 Message-Id: <20181220082350.58113-19-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20181220082350.58113-1-aik@ozlabs.ru> References: <20181220082350.58113-1-aik@ozlabs.ru> Sender: kvm-ppc-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm-ppc@vger.kernel.org So far we only allowed mapping of MMIO BARs to the userspace. However there are GPUs with on-board coherent RAM accessible via side channels which we also want to map to the userspace. The first client for this is NVIDIA V100 GPU with NVLink2 direct links to a POWER9 NPU-enabled CPU; such GPUs have 16GB RAM which is coherently mapped to the system address space, we are going to export these as an extra PCI region. We already support extra PCI regions and this adds support for mapping them to the userspace. 
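(A sketch of how a subdriver might wire the new hook up, mirroring typical remap_pfn_range() usage. It is illustrative only; all foo_* names are hypothetical.)

struct foo_region_data {
	unsigned long base_hpa;	/* host physical address of the window */
};

static int foo_region_mmap(struct vfio_pci_device *vdev,
		struct vfio_pci_region *region, struct vm_area_struct *vma)
{
	struct foo_region_data *data = region->data;
	unsigned long req_len = vma->vm_end - vma->vm_start;

	if (req_len > region->size)
		return -EINVAL;

	/* Back the VMA with raw pfns of the device memory */
	vma->vm_flags |= VM_PFNMAP;

	return remap_pfn_range(vma, vma->vm_start,
			data->base_hpa >> PAGE_SHIFT, req_len,
			vma->vm_page_prot);
}

static const struct vfio_pci_regops foo_regops = {
	.mmap = foo_region_mmap,	/* the hook added by this patch */
	/* .rw and .release omitted for brevity */
};
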
Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson Acked-by: Alex Williamson --- Changes: v2: * reverted one of mistakenly removed error checks --- drivers/vfio/pci/vfio_pci_private.h | 3 +++ drivers/vfio/pci/vfio_pci.c | 9 +++++++++ 2 files changed, 12 insertions(+) diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h index cde3b5d..86aab05 100644 --- a/drivers/vfio/pci/vfio_pci_private.h +++ b/drivers/vfio/pci/vfio_pci_private.h @@ -59,6 +59,9 @@ struct vfio_pci_regops { size_t count, loff_t *ppos, bool iswrite); void (*release)(struct vfio_pci_device *vdev, struct vfio_pci_region *region); + int (*mmap)(struct vfio_pci_device *vdev, + struct vfio_pci_region *region, + struct vm_area_struct *vma); }; struct vfio_pci_region { diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index fef5002..4a6f7c0 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -1130,6 +1130,15 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma) return -EINVAL; if ((vma->vm_flags & VM_SHARED) == 0) return -EINVAL; + if (index >= VFIO_PCI_NUM_REGIONS) { + int regnum = index - VFIO_PCI_NUM_REGIONS; + struct vfio_pci_region *region = vdev->region + regnum; + + if (region && region->ops && region->ops->mmap && + (region->flags & VFIO_REGION_INFO_FLAG_MMAP)) + return region->ops->mmap(vdev, region, vma); + return -EINVAL; + } if (index >= VFIO_PCI_ROM_REGION_INDEX) return -EINVAL; if (!vdev->bar_mmap_supported[index]) From patchwork Thu Dec 20 08:23:49 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 1016606 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=kvm-ppc-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=ozlabs.ru Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43L4ZZ10JVz9sDr for ; Thu, 20 Dec 2018 19:25:26 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731117AbeLTIZH (ORCPT ); Thu, 20 Dec 2018 03:25:07 -0500 Received: from ozlabs.ru ([107.173.13.209]:54979 "EHLO ozlabs.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1731110AbeLTIZH (ORCPT ); Thu, 20 Dec 2018 03:25:07 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id BE537AE80491; Thu, 20 Dec 2018 03:25:01 -0500 (EST) From: Alexey Kardashevskiy To: linuxppc-dev@lists.ozlabs.org Cc: Alexey Kardashevskiy , David Gibson , kvm-ppc@vger.kernel.org, kvm@vger.kernel.org, Alistair Popple , Reza Arbab , Sam Bobroff , Piotr Jaroszynski , =?utf-8?q?Leonardo_Augusto_Guimar=C3=A3es_?= =?utf-8?q?Garcia?= , Jose Ricardo Ziviani , Daniel Henrique Barboza , Alex Williamson , Paul Mackerras , linux-kernel@vger.kernel.org, Christoph Hellwig Subject: [PATCH kernel v7 19/20] vfio_pci: Allow regions to add own capabilities Date: Thu, 20 Dec 2018 19:23:49 +1100 Message-Id: <20181220082350.58113-20-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20181220082350.58113-1-aik@ozlabs.ru> References: <20181220082350.58113-1-aik@ozlabs.ru> Sender: kvm-ppc-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: 
kvm-ppc@vger.kernel.org VFIO regions already support region capabilities with a limited set of fields. However the subdriver might have to report to the userspace additional bits. This adds an add_capability() hook to vfio_pci_regops. Signed-off-by: Alexey Kardashevskiy Acked-by: Alex Williamson --- Changes: v3: * removed confusing rationale for the patch, the next patch makes use of it anyway --- drivers/vfio/pci/vfio_pci_private.h | 3 +++ drivers/vfio/pci/vfio_pci.c | 6 ++++++ 2 files changed, 9 insertions(+) diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h index 86aab05..93c1738 100644 --- a/drivers/vfio/pci/vfio_pci_private.h +++ b/drivers/vfio/pci/vfio_pci_private.h @@ -62,6 +62,9 @@ struct vfio_pci_regops { int (*mmap)(struct vfio_pci_device *vdev, struct vfio_pci_region *region, struct vm_area_struct *vma); + int (*add_capability)(struct vfio_pci_device *vdev, + struct vfio_pci_region *region, + struct vfio_info_cap *caps); }; struct vfio_pci_region { diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index 4a6f7c0..6cb70cf 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -763,6 +763,12 @@ static long vfio_pci_ioctl(void *device_data, if (ret) return ret; + if (vdev->region[i].ops->add_capability) { + ret = vdev->region[i].ops->add_capability(vdev, + &vdev->region[i], &caps); + if (ret) + return ret; + } } } From patchwork Thu Dec 20 08:23:50 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Alexey Kardashevskiy X-Patchwork-Id: 1016605 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=kvm-ppc-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=ozlabs.ru Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 43L4ZT0FXwz9sN8 for ; Thu, 20 Dec 2018 19:25:21 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731126AbeLTIZM (ORCPT ); Thu, 20 Dec 2018 03:25:12 -0500 Received: from ozlabs.ru ([107.173.13.209]:55046 "EHLO ozlabs.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1731133AbeLTIZL (ORCPT ); Thu, 20 Dec 2018 03:25:11 -0500 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id 44ED5AE80492; Thu, 20 Dec 2018 03:25:05 -0500 (EST) From: Alexey Kardashevskiy To: linuxppc-dev@lists.ozlabs.org Cc: Alexey Kardashevskiy , David Gibson , kvm-ppc@vger.kernel.org, kvm@vger.kernel.org, Alistair Popple , Reza Arbab , Sam Bobroff , Piotr Jaroszynski , =?utf-8?q?Leonardo_Augusto_Guimar=C3=A3es_?= =?utf-8?q?Garcia?= , Jose Ricardo Ziviani , Daniel Henrique Barboza , Alex Williamson , Paul Mackerras , linux-kernel@vger.kernel.org, Christoph Hellwig Subject: [PATCH kernel v7 20/20] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver Date: Thu, 20 Dec 2018 19:23:50 +1100 Message-Id: <20181220082350.58113-21-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20181220082350.58113-1-aik@ozlabs.ru> References: <20181220082350.58113-1-aik@ozlabs.ru> Sender: kvm-ppc-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm-ppc@vger.kernel.org POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not pluggable 
PCIe devices but still have PCIe links which are used for config space and MMIO. In addition to that, the GPUs have 6 NVLinks which are connected to other GPUs and the POWER9 CPU. POWER9 chips have a special unit on a die called an NPU which is an NVLink2 host bus adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each. These systems also support ATS (address translation services) which is a part of the NVLink2 protocol. Such GPUs also share on-board RAM (16GB or 32GB) with the system via the same NVLink2 so a CPU has cache-coherent access to the GPU RAM.

This exports GPU RAM to the userspace as a new VFIO device region. This preregisters the new memory as device memory as it might be used for DMA. This inserts pfns from the fault handler, as the GPU memory is not onlined until the vendor driver is loaded and has trained the NVLinks; doing this earlier causes low level errors which are fenced in the firmware so they do not hurt the host system, but they are still better avoided. For the same reason this does not map GPU RAM into the host kernel (which would otherwise be the usual thing for emulated access).

This exports an ATSD (Address Translation Shootdown) register of the NPU which allows TLB invalidations inside the GPU for an operating system. The register conveniently occupies a single 64k page. It is also presented to the userspace as a new VFIO device region. One NPU has 8 ATSD registers; each of them can be used for TLB invalidation in a GPU linked to this NPU. This allocates one ATSD register per NVLink bridge, allowing up to 6 registers to be passed. Due to a host firmware bug (just recently fixed), only 1 ATSD register per NPU was actually advertised to the host system, so this passes that lone register via the first NVLink bridge device in the group; that is still enough as QEMU collects them all back and presents them to the guest via a vPHB to mimic the emulated NPU PHB on the host.

In order to provide the userspace with the information about GPU-to-NVLink connections, this exports an additional capability called "tgt" (which is an abbreviated host system bus address). The "tgt" property tells the GPU its own system address and allows the guest driver to conglomerate the routing information so each GPU knows how to get directly to the other GPUs.

For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to know the LPID (a logical partition ID, or a KVM guest hardware ID in other words) and the PID (a memory context ID of a userspace process, not to be confused with a Linux pid). This assigns a GPU to an LPID in the NPU, which is why this adds a listener for KVM on an IOMMU group. A PID comes via NVLink from a GPU and the NPU uses a PID wildcard to pass it through.

This requires coherent memory and ATSD to be available on the host, as the GPU vendor only supports configurations with both features enabled; other configurations are known not to work. Because of this, and because of the way the features are advertised to the host system (a device tree with very platform specific properties), this requires the POWERNV platform to be enabled.

The V100 GPUs do not advertise any of these capabilities via the config space, and there is more than one device ID, so this relies on the platform to tell whether these GPUs have special abilities such as NVLinks.
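(To make the uapi additions concrete: a hedged userspace sketch, not part of the patch, which probes one region index of a vfio-pci device fd and extracts the "tgt" value the way a QEMU-like consumer could. It assumes a <linux/vfio.h> carrying the definitions added below; read_nvlink2_tgt() is a hypothetical name.)

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Return 0 and fill *tgt, or -1 if the region carries no SSATGT cap */
static int read_nvlink2_tgt(int device_fd, uint32_t index, uint64_t *tgt)
{
	struct {
		struct vfio_region_info info;
		char pad[2048];		/* room for the capability chain */
	} buf;
	struct vfio_info_cap_header *hdr;
	uint32_t off;

	memset(&buf, 0, sizeof(buf));
	buf.info.argsz = sizeof(buf);
	buf.info.index = index;
	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &buf.info))
		return -1;
	if (!(buf.info.flags & VFIO_REGION_INFO_FLAG_CAPS))
		return -1;

	/* Capabilities are chained by byte offsets from &buf.info */
	for (off = buf.info.cap_offset; off; off = hdr->next) {
		hdr = (struct vfio_info_cap_header *)((char *)&buf + off);
		if (hdr->id != VFIO_REGION_INFO_CAP_NVLINK2_SSATGT)
			continue;
		*tgt = ((struct vfio_region_info_cap_nvlink2_ssatgt *)
				hdr)->tgt;
		return 0;
	}

	return -1;
}
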
Signed-off-by: Alexey Kardashevskiy Acked-by: Alex Williamson --- Changes: v6.1: * fixed outdated comment about VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD v6: * reworked capabilities - tgt for nvlink and gpu and link-speed for nvlink only v5: * do not memremap GPU RAM for emulation, map it only when it is needed * allocate 1 ATSD register per NVLink bridge, if none left, then expose the region with a zero size * separate caps per device type * addressed AW review comments v4: * added nvlink-speed to the NPU bridge capability as this turned out to be not a constant value * instead of looking at the exact device ID (which also changes from system to system), now this (indirectly) looks at the device tree to know if GPU and NPU support NVLink v3: * reworded the commit log about tgt * added tracepoints (do we want them enabled for entire vfio-pci?) * added code comments * added write|mmap flags to the new regions * auto enabled VFIO_PCI_NVLINK2 config option * added 'tgt' capability to a GPU so QEMU can recreate ibm,npu and ibm,gpu references; there are required by the NVIDIA driver * keep notifier registered only for short time --- drivers/vfio/pci/Makefile | 1 + drivers/vfio/pci/trace.h | 102 ++++++ drivers/vfio/pci/vfio_pci_private.h | 14 + include/uapi/linux/vfio.h | 37 +++ drivers/vfio/pci/vfio_pci.c | 27 +- drivers/vfio/pci/vfio_pci_nvlink2.c | 482 ++++++++++++++++++++++++++++ drivers/vfio/pci/Kconfig | 6 + 7 files changed, 667 insertions(+), 2 deletions(-) create mode 100644 drivers/vfio/pci/trace.h create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index 76d8ec0..9662c06 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -1,5 +1,6 @@ vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o obj-$(CONFIG_VFIO_PCI) += vfio-pci.o diff --git a/drivers/vfio/pci/trace.h b/drivers/vfio/pci/trace.h new file mode 100644 index 0000000..b80d2d3 --- /dev/null +++ b/drivers/vfio/pci/trace.h @@ -0,0 +1,102 @@ +/* SPDX-License-Identifier: GPL-2.0+ */ +/* + * VFIO PCI mmap/mmap_fault tracepoints + * + * Copyright (C) 2018 IBM Corp. All rights reserved. + * Author: Alexey Kardashevskiy + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. 
+ */ + +#undef TRACE_SYSTEM +#define TRACE_SYSTEM vfio_pci + +#if !defined(_TRACE_VFIO_PCI_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_VFIO_PCI_H + +#include + +TRACE_EVENT(vfio_pci_nvgpu_mmap_fault, + TP_PROTO(struct pci_dev *pdev, unsigned long hpa, unsigned long ua, + vm_fault_t ret), + TP_ARGS(pdev, hpa, ua, ret), + + TP_STRUCT__entry( + __field(const char *, name) + __field(unsigned long, hpa) + __field(unsigned long, ua) + __field(int, ret) + ), + + TP_fast_assign( + __entry->name = dev_name(&pdev->dev), + __entry->hpa = hpa; + __entry->ua = ua; + __entry->ret = ret; + ), + + TP_printk("%s: %lx -> %lx ret=%d", __entry->name, __entry->hpa, + __entry->ua, __entry->ret) +); + +TRACE_EVENT(vfio_pci_nvgpu_mmap, + TP_PROTO(struct pci_dev *pdev, unsigned long hpa, unsigned long ua, + unsigned long size, int ret), + TP_ARGS(pdev, hpa, ua, size, ret), + + TP_STRUCT__entry( + __field(const char *, name) + __field(unsigned long, hpa) + __field(unsigned long, ua) + __field(unsigned long, size) + __field(int, ret) + ), + + TP_fast_assign( + __entry->name = dev_name(&pdev->dev), + __entry->hpa = hpa; + __entry->ua = ua; + __entry->size = size; + __entry->ret = ret; + ), + + TP_printk("%s: %lx -> %lx size=%lx ret=%d", __entry->name, __entry->hpa, + __entry->ua, __entry->size, __entry->ret) +); + +TRACE_EVENT(vfio_pci_npu2_mmap, + TP_PROTO(struct pci_dev *pdev, unsigned long hpa, unsigned long ua, + unsigned long size, int ret), + TP_ARGS(pdev, hpa, ua, size, ret), + + TP_STRUCT__entry( + __field(const char *, name) + __field(unsigned long, hpa) + __field(unsigned long, ua) + __field(unsigned long, size) + __field(int, ret) + ), + + TP_fast_assign( + __entry->name = dev_name(&pdev->dev), + __entry->hpa = hpa; + __entry->ua = ua; + __entry->size = size; + __entry->ret = ret; + ), + + TP_printk("%s: %lx -> %lx size=%lx ret=%d", __entry->name, __entry->hpa, + __entry->ua, __entry->size, __entry->ret) +); + +#endif /* _TRACE_SUBSYS_H */ + +#undef TRACE_INCLUDE_PATH +#define TRACE_INCLUDE_PATH . +#undef TRACE_INCLUDE_FILE +#define TRACE_INCLUDE_FILE trace + +/* This part must be outside protection */ +#include diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h index 93c1738..127071b 100644 --- a/drivers/vfio/pci/vfio_pci_private.h +++ b/drivers/vfio/pci/vfio_pci_private.h @@ -163,4 +163,18 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev) return -ENODEV; } #endif +#ifdef CONFIG_VFIO_PCI_NVLINK2 +extern int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev); +extern int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev); +#else +static inline int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev) +{ + return -ENODEV; +} + +static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev) +{ + return -ENODEV; +} +#endif #endif /* VFIO_PCI_PRIVATE_H */ diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 8131028..5562587 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -353,6 +353,21 @@ struct vfio_region_gfx_edid { #define VFIO_DEVICE_GFX_LINK_STATE_DOWN 2 }; +/* + * 10de vendor sub-type + * + * NVIDIA GPU NVlink2 RAM is coherent RAM mapped onto the host address space. + */ +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM (1) + +/* + * 1014 vendor sub-type + * + * IBM NPU NVlink2 ATSD (Address Translation Shootdown) register of NPU + * to do TLB invalidation on a GPU. 
+ */ +#define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD (1) + /* * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped * which allows direct access to non-MSIX registers which happened to be within @@ -363,6 +378,28 @@ struct vfio_region_gfx_edid { */ #define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3 +/* + * Capability with compressed real address (aka SSA - small system address) + * where GPU RAM is mapped on a system bus. Used by a GPU for DMA routing. + */ +#define VFIO_REGION_INFO_CAP_NVLINK2_SSATGT 4 + +struct vfio_region_info_cap_nvlink2_ssatgt { + struct vfio_info_cap_header header; + __u64 tgt; +}; + +/* + * Capability with an NVLink link speed. + */ +#define VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD 5 + +struct vfio_region_info_cap_nvlink2_lnkspd { + struct vfio_info_cap_header header; + __u32 link_speed; + __u32 __pad; +}; + /** * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9, * struct vfio_irq_info) diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index 6cb70cf..67c03f2 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -302,14 +302,37 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev) if (ret) { dev_warn(&vdev->pdev->dev, "Failed to setup Intel IGD regions\n"); - vfio_pci_disable(vdev); - return ret; + goto disable_exit; + } + } + + if (pdev->vendor == PCI_VENDOR_ID_NVIDIA && + IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) { + ret = vfio_pci_nvdia_v100_nvlink2_init(vdev); + if (ret && ret != -ENODEV) { + dev_warn(&vdev->pdev->dev, + "Failed to setup NVIDIA NV2 RAM region\n"); + goto disable_exit; + } + } + + if (pdev->vendor == PCI_VENDOR_ID_IBM && + IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) { + ret = vfio_pci_ibm_npu2_init(vdev); + if (ret && ret != -ENODEV) { + dev_warn(&vdev->pdev->dev, + "Failed to setup NVIDIA NV2 ATSD region\n"); + goto disable_exit; } } vfio_pci_probe_mmaps(vdev); return 0; + +disable_exit: + vfio_pci_disable(vdev); + return ret; } static void vfio_pci_disable(struct vfio_pci_device *vdev) diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c new file mode 100644 index 0000000..054a2cf --- /dev/null +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c @@ -0,0 +1,482 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* + * VFIO PCI NVIDIA Whitherspoon GPU support a.k.a. NVLink2. + * + * Copyright (C) 2018 IBM Corp. All rights reserved. + * Author: Alexey Kardashevskiy + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * Register an on-GPU RAM region for cacheable access. + * + * Derived from original vfio_pci_igd.c: + * Copyright (C) 2016 Red Hat, Inc. All rights reserved. + * Author: Alex Williamson + */ + +#include +#include +#include +#include +#include +#include +#include +#include "vfio_pci_private.h" + +#define CREATE_TRACE_POINTS +#include "trace.h" + +EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_nvgpu_mmap_fault); +EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_nvgpu_mmap); +EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_npu2_mmap); + +struct vfio_pci_nvgpu_data { + unsigned long gpu_hpa; /* GPU RAM physical address */ + unsigned long gpu_tgt; /* TGT address of corresponding GPU RAM */ + unsigned long useraddr; /* GPU RAM userspace address */ + unsigned long size; /* Size of the GPU RAM window (usually 128GB) */ + struct mm_struct *mm; + struct mm_iommu_table_group_mem_t *mem; /* Pre-registered RAM descr. 
*/ + struct pci_dev *gpdev; + struct notifier_block group_notifier; +}; + +static size_t vfio_pci_nvgpu_rw(struct vfio_pci_device *vdev, + char __user *buf, size_t count, loff_t *ppos, bool iswrite) +{ + unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS; + struct vfio_pci_nvgpu_data *data = vdev->region[i].data; + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK; + loff_t posaligned = pos & PAGE_MASK, posoff = pos & ~PAGE_MASK; + size_t sizealigned; + void __iomem *ptr; + + if (pos >= vdev->region[i].size) + return -EINVAL; + + count = min(count, (size_t)(vdev->region[i].size - pos)); + + /* + * We map only a bit of GPU RAM for a short time instead of mapping it + * for the guest lifetime as: + * + * 1) we do not know GPU RAM size, only aperture which is 4-8 times + * bigger than actual RAM size (16/32GB RAM vs. 128GB aperture); + * 2) mapping GPU RAM allows CPU to prefetch and if this happens + * before NVLink bridge is reset (which fences GPU RAM), + * hardware management interrupts (HMI) might happen, this + * will freeze NVLink bridge. + * + * This is not fast path anyway. + */ + sizealigned = _ALIGN_UP(posoff + count, PAGE_SIZE); + ptr = ioremap_cache(data->gpu_hpa + posaligned, sizealigned); + if (!ptr) + return -EFAULT; + + if (iswrite) { + if (copy_from_user(ptr + posoff, buf, count)) + count = -EFAULT; + else + *ppos += count; + } else { + if (copy_to_user(buf, ptr + posoff, count)) + count = -EFAULT; + else + *ppos += count; + } + + iounmap(ptr); + + return count; +} + +static void vfio_pci_nvgpu_release(struct vfio_pci_device *vdev, + struct vfio_pci_region *region) +{ + struct vfio_pci_nvgpu_data *data = region->data; + long ret; + + /* If there were any mappings at all... */ + if (data->mm) { + ret = mm_iommu_put(data->mm, data->mem); + WARN_ON(ret); + + mmdrop(data->mm); + } + + vfio_unregister_notifier(&data->gpdev->dev, VFIO_GROUP_NOTIFY, + &data->group_notifier); + + pnv_npu2_unmap_lpar_dev(data->gpdev); + + kfree(data); +} + +static vm_fault_t vfio_pci_nvgpu_mmap_fault(struct vm_fault *vmf) +{ + vm_fault_t ret; + struct vm_area_struct *vma = vmf->vma; + struct vfio_pci_region *region = vma->vm_private_data; + struct vfio_pci_nvgpu_data *data = region->data; + unsigned long vmf_off = (vmf->address - vma->vm_start) >> PAGE_SHIFT; + unsigned long nv2pg = data->gpu_hpa >> PAGE_SHIFT; + unsigned long vm_pgoff = vma->vm_pgoff & + ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1); + unsigned long pfn = nv2pg + vm_pgoff + vmf_off; + + ret = vmf_insert_pfn(vma, vmf->address, pfn); + trace_vfio_pci_nvgpu_mmap_fault(data->gpdev, pfn << PAGE_SHIFT, + vmf->address, ret); + + return ret; +} + +static const struct vm_operations_struct vfio_pci_nvgpu_mmap_vmops = { + .fault = vfio_pci_nvgpu_mmap_fault, +}; + +static int vfio_pci_nvgpu_mmap(struct vfio_pci_device *vdev, + struct vfio_pci_region *region, struct vm_area_struct *vma) +{ + int ret; + struct vfio_pci_nvgpu_data *data = region->data; + + if (data->useraddr) + return -EPERM; + + if (vma->vm_end - vma->vm_start > data->size) + return -EINVAL; + + vma->vm_private_data = region; + vma->vm_flags |= VM_PFNMAP; + vma->vm_ops = &vfio_pci_nvgpu_mmap_vmops; + + /* + * Calling mm_iommu_newdev() here once as the region is not + * registered yet and therefore right initialization will happen now. + * Other places will use mm_iommu_find() which returns + * registered @mem and does not go gup(). 
+ */ + data->useraddr = vma->vm_start; + data->mm = current->mm; + + atomic_inc(&data->mm->mm_count); + ret = (int) mm_iommu_newdev(data->mm, data->useraddr, + (vma->vm_end - vma->vm_start) >> PAGE_SHIFT, + data->gpu_hpa, &data->mem); + + trace_vfio_pci_nvgpu_mmap(vdev->pdev, data->gpu_hpa, data->useraddr, + vma->vm_end - vma->vm_start, ret); + + return ret; +} + +static int vfio_pci_nvgpu_add_capability(struct vfio_pci_device *vdev, + struct vfio_pci_region *region, struct vfio_info_cap *caps) +{ + struct vfio_pci_nvgpu_data *data = region->data; + struct vfio_region_info_cap_nvlink2_ssatgt cap = { 0 }; + + cap.header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT; + cap.header.version = 1; + cap.tgt = data->gpu_tgt; + + return vfio_info_add_capability(caps, &cap.header, sizeof(cap)); +} + +static const struct vfio_pci_regops vfio_pci_nvgpu_regops = { + .rw = vfio_pci_nvgpu_rw, + .release = vfio_pci_nvgpu_release, + .mmap = vfio_pci_nvgpu_mmap, + .add_capability = vfio_pci_nvgpu_add_capability, +}; + +static int vfio_pci_nvgpu_group_notifier(struct notifier_block *nb, + unsigned long action, void *opaque) +{ + struct kvm *kvm = opaque; + struct vfio_pci_nvgpu_data *data = container_of(nb, + struct vfio_pci_nvgpu_data, + group_notifier); + + if (action == VFIO_GROUP_NOTIFY_SET_KVM && kvm && + pnv_npu2_map_lpar_dev(data->gpdev, + kvm->arch.lpid, MSR_DR | MSR_PR)) + return NOTIFY_BAD; + + return NOTIFY_OK; +} + +int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev) +{ + int ret; + u64 reg[2]; + u64 tgt = 0; + struct device_node *npu_node, *mem_node; + struct pci_dev *npu_dev; + struct vfio_pci_nvgpu_data *data; + uint32_t mem_phandle = 0; + unsigned long events = VFIO_GROUP_NOTIFY_SET_KVM; + + /* + * PCI config space does not tell us about NVLink presense but + * platform does, use this. + */ + npu_dev = pnv_pci_get_npu_dev(vdev->pdev, 0); + if (!npu_dev) + return -ENODEV; + + npu_node = pci_device_to_OF_node(npu_dev); + if (!npu_node) + return -EINVAL; + + if (of_property_read_u32(npu_node, "memory-region", &mem_phandle)) + return -EINVAL; + + mem_node = of_find_node_by_phandle(mem_phandle); + if (!mem_node) + return -EINVAL; + + if (of_property_read_variable_u64_array(mem_node, "reg", reg, + ARRAY_SIZE(reg), ARRAY_SIZE(reg)) != + ARRAY_SIZE(reg)) + return -EINVAL; + + if (of_property_read_u64(npu_node, "ibm,device-tgt-addr", &tgt)) { + dev_warn(&vdev->pdev->dev, "No ibm,device-tgt-addr found\n"); + return -EFAULT; + } + + data = kzalloc(sizeof(*data), GFP_KERNEL); + if (!data) + return -ENOMEM; + + data->gpu_hpa = reg[0]; + data->gpu_tgt = tgt; + data->size = reg[1]; + + dev_dbg(&vdev->pdev->dev, "%lx..%lx\n", data->gpu_hpa, + data->gpu_hpa + data->size - 1); + + data->gpdev = vdev->pdev; + data->group_notifier.notifier_call = vfio_pci_nvgpu_group_notifier; + + ret = vfio_register_notifier(&data->gpdev->dev, VFIO_GROUP_NOTIFY, + &events, &data->group_notifier); + if (ret) + goto free_exit; + + /* + * We have just set KVM, we do not need the listener anymore. + * Also, keeping it registered means that if more than one GPU is + * assigned, we will get several similar notifiers notifying about + * the same device again which does not help with anything. 
+ */ + vfio_unregister_notifier(&data->gpdev->dev, VFIO_GROUP_NOTIFY, + &data->group_notifier); + + ret = vfio_pci_register_dev_region(vdev, + PCI_VENDOR_ID_NVIDIA | VFIO_REGION_TYPE_PCI_VENDOR_TYPE, + VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM, + &vfio_pci_nvgpu_regops, + data->size, + VFIO_REGION_INFO_FLAG_READ | + VFIO_REGION_INFO_FLAG_WRITE | + VFIO_REGION_INFO_FLAG_MMAP, + data); + if (ret) + goto free_exit; + + return 0; +free_exit: + kfree(data); + + return ret; +} + +/* + * IBM NPU2 bridge + */ +struct vfio_pci_npu2_data { + void *base; /* ATSD register virtual address, for emulated access */ + unsigned long mmio_atsd; /* ATSD physical address */ + unsigned long gpu_tgt; /* TGT address of corresponding GPU RAM */ + unsigned int link_speed; /* The link speed from DT's ibm,nvlink-speed */ +}; + +static size_t vfio_pci_npu2_rw(struct vfio_pci_device *vdev, + char __user *buf, size_t count, loff_t *ppos, bool iswrite) +{ + unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS; + struct vfio_pci_npu2_data *data = vdev->region[i].data; + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK; + + if (pos >= vdev->region[i].size) + return -EINVAL; + + count = min(count, (size_t)(vdev->region[i].size - pos)); + + if (iswrite) { + if (copy_from_user(data->base + pos, buf, count)) + return -EFAULT; + } else { + if (copy_to_user(buf, data->base + pos, count)) + return -EFAULT; + } + *ppos += count; + + return count; +} + +static int vfio_pci_npu2_mmap(struct vfio_pci_device *vdev, + struct vfio_pci_region *region, struct vm_area_struct *vma) +{ + int ret; + struct vfio_pci_npu2_data *data = region->data; + unsigned long req_len = vma->vm_end - vma->vm_start; + + if (req_len != PAGE_SIZE) + return -EINVAL; + + vma->vm_flags |= VM_PFNMAP; + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + + ret = remap_pfn_range(vma, vma->vm_start, data->mmio_atsd >> PAGE_SHIFT, + req_len, vma->vm_page_prot); + trace_vfio_pci_npu2_mmap(vdev->pdev, data->mmio_atsd, vma->vm_start, + vma->vm_end - vma->vm_start, ret); + + return ret; +} + +static void vfio_pci_npu2_release(struct vfio_pci_device *vdev, + struct vfio_pci_region *region) +{ + struct vfio_pci_npu2_data *data = region->data; + + memunmap(data->base); + kfree(data); +} + +static int vfio_pci_npu2_add_capability(struct vfio_pci_device *vdev, + struct vfio_pci_region *region, struct vfio_info_cap *caps) +{ + struct vfio_pci_npu2_data *data = region->data; + struct vfio_region_info_cap_nvlink2_ssatgt captgt = { 0 }; + struct vfio_region_info_cap_nvlink2_lnkspd capspd = { 0 }; + int ret; + + captgt.header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT; + captgt.header.version = 1; + captgt.tgt = data->gpu_tgt; + + capspd.header.id = VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD; + capspd.header.version = 1; + capspd.link_speed = data->link_speed; + + ret = vfio_info_add_capability(caps, &captgt.header, sizeof(captgt)); + if (ret) + return ret; + + return vfio_info_add_capability(caps, &capspd.header, sizeof(capspd)); +} + +static const struct vfio_pci_regops vfio_pci_npu2_regops = { + .rw = vfio_pci_npu2_rw, + .mmap = vfio_pci_npu2_mmap, + .release = vfio_pci_npu2_release, + .add_capability = vfio_pci_npu2_add_capability, +}; + +int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev) +{ + int ret; + struct vfio_pci_npu2_data *data; + struct device_node *nvlink_dn; + u32 nvlink_index = 0; + struct pci_dev *npdev = vdev->pdev; + struct device_node *npu_node = pci_device_to_OF_node(npdev); + struct pci_controller *hose = pci_bus_to_host(npdev->bus); + u64 
mmio_atsd = 0; + u64 tgt = 0; + u32 link_speed = 0xff; + + /* + * PCI config space does not tell us about NVLink presense but + * platform does, use this. + */ + if (!pnv_pci_get_gpu_dev(vdev->pdev)) + return -ENODEV; + + /* + * NPU2 normally has 8 ATSD registers (for concurrency) and 6 links + * so we can allocate one register per link, using nvlink index as + * a key. + * There is always at least one ATSD register so as long as at least + * NVLink bridge #0 is passed to the guest, ATSD will be available. + */ + nvlink_dn = of_parse_phandle(npdev->dev.of_node, "ibm,nvlink", 0); + if (WARN_ON(of_property_read_u32(nvlink_dn, "ibm,npu-link-index", + &nvlink_index))) + return -ENODEV; + + if (of_property_read_u64_index(hose->dn, "ibm,mmio-atsd", nvlink_index, + &mmio_atsd)) { + dev_warn(&vdev->pdev->dev, "No available ATSD found\n"); + mmio_atsd = 0; + } + + if (of_property_read_u64(npu_node, "ibm,device-tgt-addr", &tgt)) { + dev_warn(&vdev->pdev->dev, "No ibm,device-tgt-addr found\n"); + return -EFAULT; + } + + if (of_property_read_u32(npu_node, "ibm,nvlink-speed", &link_speed)) { + dev_warn(&vdev->pdev->dev, "No ibm,nvlink-speed found\n"); + return -EFAULT; + } + + data = kzalloc(sizeof(*data), GFP_KERNEL); + if (!data) + return -ENOMEM; + + data->mmio_atsd = mmio_atsd; + data->gpu_tgt = tgt; + data->link_speed = link_speed; + if (data->mmio_atsd) { + data->base = memremap(data->mmio_atsd, SZ_64K, MEMREMAP_WT); + if (!data->base) { + ret = -ENOMEM; + goto free_exit; + } + } + + /* + * We want to expose the capability even if this specific NVLink + * did not get its own ATSD register because capabilities + * belong to VFIO regions and normally there will be ATSD register + * assigned to the NVLink bridge. + */ + ret = vfio_pci_register_dev_region(vdev, + PCI_VENDOR_ID_IBM | + VFIO_REGION_TYPE_PCI_VENDOR_TYPE, + VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD, + &vfio_pci_npu2_regops, + data->mmio_atsd ? PAGE_SIZE : 0, + VFIO_REGION_INFO_FLAG_READ | + VFIO_REGION_INFO_FLAG_WRITE | + VFIO_REGION_INFO_FLAG_MMAP, + data); + if (ret) + goto free_exit; + + return 0; + +free_exit: + kfree(data); + + return ret; +} diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig index 42dc1d3..d0f8e4f 100644 --- a/drivers/vfio/pci/Kconfig +++ b/drivers/vfio/pci/Kconfig @@ -38,3 +38,9 @@ config VFIO_PCI_IGD and LPC bridge config space. To enable Intel IGD assignment through vfio-pci, say Y. + +config VFIO_PCI_NVLINK2 + def_bool y + depends on VFIO_PCI && PPC_POWERNV + help + VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
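
(Usage note, illustrative only: once userspace has located the NVLink2 RAM region, e.g. with the capability probe sketched earlier, mapping it is plain vfio-pci region mmap using the kernel-reported offset and size; vfio_pci_nvgpu_mmap_fault() above then inserts the GPU RAM pfns on demand. map_gpu_ram() is a hypothetical name.)

#include <sys/mman.h>
#include <linux/vfio.h>

/* Hedged sketch: map the GPU RAM region described by a previously
 * filled vfio_region_info; nothing is hardcoded as both offset and
 * size come from the kernel. */
static void *map_gpu_ram(int device_fd, const struct vfio_region_info *info)
{
	return mmap(NULL, info->size, PROT_READ | PROT_WRITE,
			MAP_SHARED, device_fd, info->offset);
}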