From patchwork Mon Mar 11 16:50:44 2019
X-Patchwork-Submitter: Leonardo Bras
X-Patchwork-Id: 1054585
From: Leonardo Bras
To: kernel-team@lists.ubuntu.com
Cc: Paul Mackerras, Nicholas Piggin
Subject: [PATCH 13/20] KVM: PPC: Book3S HV: Recursively unmap all page table entries when unmapping
Date: Mon, 11 Mar 2019 13:50:44 -0300
Message-Id: <20190311165051.1149-14-leonardo@linux.ibm.com>
In-Reply-To: <20190311165051.1149-1-leonardo@linux.ibm.com>
References: <20190311165051.1149-1-leonardo@linux.ibm.com>
X-Mailer: git-send-email 2.20.1

From: Nicholas Piggin

When partition scope mappings are unmapped with kvm_unmap_radix, the pte
is cleared, but the page table structure is left in place. If the next
page fault requests a different page table geometry (e.g., due to THP
promotion or split), kvmppc_create_pte is responsible for changing the
page tables.

When a page table entry is to be converted to a large pte, the page table
entry is cleared, the PWC flushed, then the page table it points to freed.
This will cause pte page tables to leak when a 1GB page is to replace a
pud entry which points to a pmd table with pte tables under it: the pmd
table will be freed, but its pte tables will be missed.

Fix this by replacing the simple clear and free code with one that walks
down the page tables and frees children. Care must be taken to clear the
root entry being unmapped, then flush the PWC, before freeing any page
tables, as explained in the comments. This requires the PWC flush to
logically become a flush-all-PWC (which it already is in hardware, but
the KVM API needs to be changed to avoid confusion).
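As an illustration of that ordering (this sketch is not part of the patch;
it mirrors the kvmppc_unmap_free_pud_entry_table helper added below, and
the wrapper name here is made up):

    /* Sketch only: replace a pud entry that points to a pmd table. */
    static void example_replace_pud_with_huge_page(struct kvm *kvm, pud_t *pud)
    {
            pmd_t *pmd = pmd_offset(pud, 0);        /* remember the child table   */

            pud_clear(pud);                         /* 1. detach the subtree      */
            kvmppc_radix_flush_pwc(kvm);            /* 2. flush-all-PWC so the MMU
                                                     *    drops cached walks into
                                                     *    the detached tables     */
            kvmppc_unmap_free_pmd(kvm, pmd, false); /* 3. free the children; no
                                                     *    further PWC flush needed */
    }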
This code also checks that no unexpected pte entries exist in any page
table being freed, and unmaps those and emits a WARN. This is an expensive
operation for the pte page level, but partition scope changes are rare, so
it's unconditional for now to iron out bugs. It can be put under a CONFIG
option or removed after some time.

Signed-off-by: Nicholas Piggin
Signed-off-by: Paul Mackerras
Signed-off-by: Leonardo Bras
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 192 ++++++++++++++++++-------
 1 file changed, 138 insertions(+), 54 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 2c49b31ec7fb..e514370ab5ae 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -165,7 +165,7 @@ static void kvmppc_radix_tlbie_page(struct kvm *kvm, unsigned long addr,
 	asm volatile("eieio ; tlbsync ; ptesync": : :"memory");
 }
 
-static void kvmppc_radix_flush_pwc(struct kvm *kvm, unsigned long addr)
+static void kvmppc_radix_flush_pwc(struct kvm *kvm)
 {
 	unsigned long rb = 0x2 << PPC_BITLSHIFT(53); /* IS = 2 */
 
@@ -247,6 +247,139 @@ static void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte,
 	}
 }
 
+/*
+ * kvmppc_free_p?d are used to free existing page tables, and recursively
+ * descend and clear and free children.
+ * Callers are responsible for flushing the PWC.
+ *
+ * When page tables are being unmapped/freed as part of page fault path
+ * (full == false), ptes are not expected. There is code to unmap them
+ * and emit a warning if encountered, but there may already be data
+ * corruption due to the unexpected mappings.
+ */
+static void kvmppc_unmap_free_pte(struct kvm *kvm, pte_t *pte, bool full)
+{
+	if (full) {
+		memset(pte, 0, sizeof(long) << PTE_INDEX_SIZE);
+	} else {
+		pte_t *p = pte;
+		unsigned long it;
+
+		for (it = 0; it < PTRS_PER_PTE; ++it, ++p) {
+			if (pte_val(*p) == 0)
+				continue;
+			WARN_ON_ONCE(1);
+			kvmppc_unmap_pte(kvm, p,
+					 pte_pfn(*p) << PAGE_SHIFT,
+					 PAGE_SHIFT);
+		}
+	}
+
+	kvmppc_pte_free(pte);
+}
+
+static void kvmppc_unmap_free_pmd(struct kvm *kvm, pmd_t *pmd, bool full)
+{
+	unsigned long im;
+	pmd_t *p = pmd;
+
+	for (im = 0; im < PTRS_PER_PMD; ++im, ++p) {
+		if (!pmd_present(*p))
+			continue;
+		if (pmd_is_leaf(*p)) {
+			if (full) {
+				pmd_clear(p);
+			} else {
+				WARN_ON_ONCE(1);
+				kvmppc_unmap_pte(kvm, (pte_t *)p,
+					 pte_pfn(*(pte_t *)p) << PAGE_SHIFT,
+					 PMD_SHIFT);
+			}
+		} else {
+			pte_t *pte;
+
+			pte = pte_offset_map(p, 0);
+			kvmppc_unmap_free_pte(kvm, pte, full);
+			pmd_clear(p);
+		}
+	}
+	kvmppc_pmd_free(pmd);
+}
+
+static void kvmppc_unmap_free_pud(struct kvm *kvm, pud_t *pud)
+{
+	unsigned long iu;
+	pud_t *p = pud;
+
+	for (iu = 0; iu < PTRS_PER_PUD; ++iu, ++p) {
+		if (!pud_present(*p))
+			continue;
+		if (pud_huge(*p)) {
+			pud_clear(p);
+		} else {
+			pmd_t *pmd;
+
+			pmd = pmd_offset(p, 0);
+			kvmppc_unmap_free_pmd(kvm, pmd, true);
+			pud_clear(p);
+		}
+	}
+	pud_free(kvm->mm, pud);
+}
+
+void kvmppc_free_radix(struct kvm *kvm)
+{
+	unsigned long ig;
+	pgd_t *pgd;
+
+	if (!kvm->arch.pgtable)
+		return;
+	pgd = kvm->arch.pgtable;
+	for (ig = 0; ig < PTRS_PER_PGD; ++ig, ++pgd) {
+		pud_t *pud;
+
+		if (!pgd_present(*pgd))
+			continue;
+		pud = pud_offset(pgd, 0);
+		kvmppc_unmap_free_pud(kvm, pud);
+		pgd_clear(pgd);
+	}
+	pgd_free(kvm->mm, kvm->arch.pgtable);
+	kvm->arch.pgtable = NULL;
+}
+
+static void kvmppc_unmap_free_pmd_entry_table(struct kvm *kvm, pmd_t *pmd,
+					unsigned long gpa)
+{
+	pte_t *pte = pte_offset_kernel(pmd, 0);
+
+	/*
+	 * Clearing the pmd entry then flushing the PWC ensures that the pte
+	 * page no longer be cached by the MMU, so can be freed without
+	 * flushing the PWC again.
+	 */
+	pmd_clear(pmd);
+	kvmppc_radix_flush_pwc(kvm);
+
+	kvmppc_unmap_free_pte(kvm, pte, false);
+}
+
+static void kvmppc_unmap_free_pud_entry_table(struct kvm *kvm, pud_t *pud,
+					unsigned long gpa)
+{
+	pmd_t *pmd = pmd_offset(pud, 0);
+
+	/*
+	 * Clearing the pud entry then flushing the PWC ensures that the pmd
+	 * page and any children pte pages will no longer be cached by the MMU,
+	 * so can be freed without flushing the PWC again.
+	 */
+	pud_clear(pud);
+	kvmppc_radix_flush_pwc(kvm);
+
+	kvmppc_unmap_free_pmd(kvm, pmd, false);
+}
+
 static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa,
 			     unsigned int level, unsigned long mmu_seq)
 {
@@ -312,11 +445,9 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa,
 			/*
 			 * There's a page table page here, but we wanted to
 			 * install a large page, so remove and free the page
-			 * table page.  new_pmd will be NULL since level == 2.
+			 * table page.
 			 */
-			new_pmd = pmd_offset(pud, 0);
-			pud_clear(pud);
-			kvmppc_radix_flush_pwc(kvm, gpa);
+			kvmppc_unmap_free_pud_entry_table(kvm, pud, gpa);
 		}
 		kvmppc_radix_set_pte_at(kvm, gpa, (pte_t *)pud, pte);
 		ret = 0;
@@ -353,11 +484,9 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa,
 			/*
 			 * There's a page table page here, but we wanted to
 			 * install a large page, so remove and free the page
-			 * table page.  new_ptep will be NULL since level == 1.
+			 * table page.
 			 */
-			new_ptep = pte_offset_kernel(pmd, 0);
-			pmd_clear(pmd);
-			kvmppc_radix_flush_pwc(kvm, gpa);
+			kvmppc_unmap_free_pmd_entry_table(kvm, pmd, gpa);
 		}
 		kvmppc_radix_set_pte_at(kvm, gpa, pmdp_ptep(pmd), pte);
 		ret = 0;
@@ -734,51 +863,6 @@ int kvmppc_init_vm_radix(struct kvm *kvm)
 	return 0;
 }
 
-void kvmppc_free_radix(struct kvm *kvm)
-{
-	unsigned long ig, iu, im;
-	pte_t *pte;
-	pmd_t *pmd;
-	pud_t *pud;
-	pgd_t *pgd;
-
-	if (!kvm->arch.pgtable)
-		return;
-	pgd = kvm->arch.pgtable;
-	for (ig = 0; ig < PTRS_PER_PGD; ++ig, ++pgd) {
-		if (!pgd_present(*pgd))
-			continue;
-		pud = pud_offset(pgd, 0);
-		for (iu = 0; iu < PTRS_PER_PUD; ++iu, ++pud) {
-			if (!pud_present(*pud))
-				continue;
-			if (pud_huge(*pud)) {
-				pud_clear(pud);
-				continue;
-			}
-			pmd = pmd_offset(pud, 0);
-			for (im = 0; im < PTRS_PER_PMD; ++im, ++pmd) {
-				if (pmd_is_leaf(*pmd)) {
-					pmd_clear(pmd);
-					continue;
-				}
-				if (!pmd_present(*pmd))
-					continue;
-				pte = pte_offset_map(pmd, 0);
-				memset(pte, 0, sizeof(long) << PTE_INDEX_SIZE);
-				kvmppc_pte_free(pte);
-				pmd_clear(pmd);
-			}
-			kvmppc_pmd_free(pmd_offset(pud, 0));
-			pud_clear(pud);
-		}
-		pud_free(kvm->mm, pud_offset(pgd, 0));
-		pgd_clear(pgd);
-	}
-	pgd_free(kvm->mm, kvm->arch.pgtable);
-	kvm->arch.pgtable = NULL;
-}
-
 static void pte_ctor(void *addr)
 {
 	memset(addr, 0, RADIX_PTE_TABLE_SIZE);