From patchwork Wed Feb 27 08:51:47 2019
X-Patchwork-Submitter: Alexey Kardashevskiy
X-Patchwork-Id: 1048760
From: Alexey Kardashevskiy
To: qemu-devel@nongnu.org
Cc: Jose Ricardo Ziviani, Alexey Kardashevskiy, Daniel Henrique Barboza,
 Alex Williamson, Sam Bobroff, Piotr Jaroszynski, qemu-ppc@nongnu.org,
 Leonardo Augusto Guimarães Garcia, David Gibson
Date: Wed, 27 Feb 2019 19:51:47 +1100
Message-Id: <20190227085149.38596-5-aik@ozlabs.ru>
In-Reply-To: <20190227085149.38596-1-aik@ozlabs.ru>
References: <20190227085149.38596-1-aik@ozlabs.ru>
Subject: [Qemu-devel] [PATCH qemu v3 4/6] spapr_iommu: Do not replay mappings
 from just created DMA window

On sPAPR, vfio_listener_region_add() is called in 2 situations:
1. a new listener is registered from vfio_connect_container();
2. a new IOMMU Memory Region is added from rtas_ibm_create_pe_dma_window().

In both cases vfio_listener_region_add() calls
memory_region_iommu_replay() to notify newly registered IOMMU notifiers
about existing mappings, which is totally desirable for case 1.

However, for case 2 it is nothing but a no-op: the window has just been
created and has no valid mappings, so replaying them does not do anything.
It is barely noticeable with usual guests, but if the window happens to be
really big, such a no-op replay might take minutes and trigger RCU stall
warnings in the guest.

For example, an upcoming GPU RAM memory region mapped at 64TiB (right
after SPAPR_PCI_LIMIT) causes a 64bit DMA window to be at least 128TiB,
which is (128<<40)/0x10000=2,147,483,648 TCEs to replay.

This mitigates the problem by adding a "skipping_replay" flag to
sPAPRTCETable and defining sPAPR's own IOMMU MR replay() hook, which does
exactly the same thing as the generic one except that it returns early if
@skipping_replay==true.

When "ibm,create-pe-dma-window" is complete, the guest will map only
the required regions of the huge DMA window.

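As a quick check of the TCE count quoted above, the arithmetic can be
sketched in C. The helper name tce_count() is hypothetical and is not part
of QEMU; the 128TiB window size and the 64KiB (0x10000) IOMMU page size
are the figures from the example above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a QEMU function): number of TCEs needed to
 * cover a DMA window of window_size bytes with pages of 1 << page_shift
 * bytes.
 */
static uint64_t tce_count(uint64_t window_size, unsigned page_shift)
{
    /* A shift by 64 or more would be undefined behaviour. */
    assert(page_shift < 64);
    return window_size >> page_shift;
}

/* Figures taken from the example: a 128TiB window and 64KiB pages. */
static uint64_t example_tce_count(void)
{
    const uint64_t window_size = 128ULL << 40; /* 128TiB */
    const unsigned page_shift = 16;            /* 0x10000 == 1 << 16 */

    return tce_count(window_size, page_shift);
}
```

Under these assumptions the result is 2^47 / 2^16 = 2^31 entries, each of
which the generic replay would translate once even though none is mapped.
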
Signed-off-by: Alexey Kardashevskiy
---
 include/hw/ppc/spapr.h  |  1 +
 hw/ppc/spapr_iommu.c    | 31 +++++++++++++++++++++++++++++++
 hw/ppc/spapr_rtas_ddw.c |  7 +++++++
 3 files changed, 39 insertions(+)

diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 86b0488..358bb38 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -727,6 +727,7 @@ struct sPAPRTCETable {
     uint64_t *mig_table;
     bool bypass;
     bool need_vfio;
+    bool skipping_replay;
     int fd;
     MemoryRegion root;
     IOMMUMemoryRegion iommu;
diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 37e98f9..8f23179 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -141,6 +141,36 @@ static IOMMUTLBEntry spapr_tce_translate_iommu(IOMMUMemoryRegion *iommu,
     return ret;
 }
 
+static void spapr_tce_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
+{
+    MemoryRegion *mr = MEMORY_REGION(iommu_mr);
+    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_GET_CLASS(iommu_mr);
+    hwaddr addr, granularity;
+    IOMMUTLBEntry iotlb;
+    sPAPRTCETable *tcet = container_of(iommu_mr, sPAPRTCETable, iommu);
+
+    if (tcet->skipping_replay) {
+        return;
+    }
+
+    granularity = memory_region_iommu_get_min_page_size(iommu_mr);
+
+    for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
+        iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
+        if (iotlb.perm != IOMMU_NONE) {
+            n->notify(n, &iotlb);
+        }
+
+        /*
+         * if (2^64 - MR size) < granularity, it's possible to get an
+         * infinite loop here. This should catch such a wraparound.
+         */
+        if ((addr + granularity) < addr) {
+            break;
+        }
+    }
+}
+
 static int spapr_tce_table_pre_save(void *opaque)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(opaque);
@@ -659,6 +689,7 @@ static void spapr_iommu_memory_region_class_init(ObjectClass *klass, void *data)
     IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_CLASS(klass);
 
     imrc->translate = spapr_tce_translate_iommu;
+    imrc->replay = spapr_tce_replay;
     imrc->get_min_page_size = spapr_tce_get_min_page_size;
     imrc->notify_flag_changed = spapr_tce_notify_flag_changed;
     imrc->get_attr = spapr_tce_get_attr;
diff --git a/hw/ppc/spapr_rtas_ddw.c b/hw/ppc/spapr_rtas_ddw.c
index cb8a410..9cc020d 100644
--- a/hw/ppc/spapr_rtas_ddw.c
+++ b/hw/ppc/spapr_rtas_ddw.c
@@ -171,8 +171,15 @@ static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
     }
 
     win_addr = (windows == 0) ? sphb->dma_win_addr : sphb->dma64_win_addr;
+    /*
+     * We have just created a window, so we know for a fact that it is empty;
+     * use a hack to avoid iterating over the table as it is quite possible
+     * to have billions of TCEs, all empty.
+     */
+    tcet->skipping_replay = true;
     spapr_tce_table_enable(tcet, page_shift, win_addr,
                            1ULL << (window_shift - page_shift));
+    tcet->skipping_replay = false;
     if (!tcet->nb_table) {
         goto hw_error_exit;
     }
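
The flag-guarded replay in this patch can be illustrated outside QEMU with
a minimal sketch. Every name below (toy_tce_table, toy_replay,
toy_enable_new_window, toy_count_notify) is hypothetical and only stands
in for the real sPAPRTCETable, replay() hook and notifier; the sketch
assumes that a zero entry means "no mapping" and that the replay is
triggered synchronously while the window is being enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for sPAPRTCETable; only what the sketch needs. */
struct toy_tce_table {
    const uint64_t *entries;  /* assumption: 0 means "no mapping" */
    size_t nb_entries;
    bool skipping_replay;
    size_t visited;           /* entries examined by the last replay */
};

typedef void (*toy_notify_fn)(size_t index, uint64_t tce, void *opaque);

/* Walk the table and report valid entries unless the skip flag is set. */
static void toy_replay(struct toy_tce_table *t, toy_notify_fn notify,
                       void *opaque)
{
    assert(t != NULL && notify != NULL);

    t->visited = 0;
    if (t->skipping_replay) {
        return; /* the caller promised there is nothing to replay */
    }

    for (size_t i = 0; i < t->nb_entries; i++) {
        t->visited++;
        if (t->entries[i] != 0) {
            notify(i, t->entries[i], opaque);
        }
    }
}

/*
 * Hypothetical counterpart of enabling a just-created window: the flag is
 * raised only around the call that would trigger the replay and is
 * cleared again afterwards, so later replays walk the table as usual.
 */
static void toy_enable_new_window(struct toy_tce_table *t,
                                  toy_notify_fn notify, void *opaque)
{
    t->skipping_replay = true;
    toy_replay(t, notify, opaque); /* stands in for the listener replay */
    t->skipping_replay = false;
}

/* Example notifier: counts how many valid entries were reported. */
static void toy_count_notify(size_t index, uint64_t tce, void *opaque)
{
    (void)index;
    (void)tce;
    (*(size_t *)opaque)++;
}
```

In this sketch a replay issued through toy_enable_new_window() examines no
entries, while a later toy_replay() on the same table visits every entry
and notifies only for the non-zero ones. The flag is a transient hint set
by the one caller that knows the table is empty, not a persistent property
of the table.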