From patchwork Mon Apr 20 12:44:34 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Aneesh Kumar K V X-Patchwork-Id: 1273323 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=kvm-ppc-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=linux.ibm.com Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 495RSp6l8Xz9sWN for ; Mon, 20 Apr 2020 22:53:14 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728973AbgDTMxN (ORCPT ); Mon, 20 Apr 2020 08:53:13 -0400 Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:38988 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726759AbgDTMqo (ORCPT ); Mon, 20 Apr 2020 08:46:44 -0400 Received: from pps.filterd (m0098404.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id 03KCWOG8146713; Mon, 20 Apr 2020 08:46:30 -0400 Received: from pps.reinject (localhost [127.0.0.1]) by mx0a-001b2d01.pphosted.com with ESMTP id 30gmu6wtcd-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 20 Apr 2020 08:46:29 -0400 Received: from m0098404.ppops.net (m0098404.ppops.net [127.0.0.1]) by pps.reinject (8.16.0.36/8.16.0.36) with SMTP id 03KCWQnE146897; Mon, 20 Apr 2020 08:46:29 -0400 Received: from ppma01dal.us.ibm.com (83.d6.3fa9.ip4.static.sl-reverse.com [169.63.214.131]) by mx0a-001b2d01.pphosted.com with ESMTP id 30gmu6wtbv-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 20 Apr 2020 08:46:29 -0400 Received: from pps.filterd (ppma01dal.us.ibm.com [127.0.0.1]) by ppma01dal.us.ibm.com (8.16.0.27/8.16.0.27) with SMTP id 03KChTUO010741; Mon, 20 Apr 2020 12:46:28 GMT Received: from b03cxnp08025.gho.boulder.ibm.com (b03cxnp08025.gho.boulder.ibm.com [9.17.130.17]) by ppma01dal.us.ibm.com with ESMTP id 30fs66cqs9-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 20 Apr 2020 12:46:28 +0000 Received: from b03ledav002.gho.boulder.ibm.com (b03ledav002.gho.boulder.ibm.com [9.17.130.233]) by b03cxnp08025.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 03KCkQFP43909446 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 20 Apr 2020 12:46:26 GMT Received: from b03ledav002.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id A0D47136053; Mon, 20 Apr 2020 12:46:26 +0000 (GMT) Received: from b03ledav002.gho.boulder.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id D4BE1136051; Mon, 20 Apr 2020 12:46:22 +0000 (GMT) Received: from skywalker.ibmuc.com (unknown [9.199.51.43]) by b03ledav002.gho.boulder.ibm.com (Postfix) with ESMTP; Mon, 20 Apr 2020 12:46:22 +0000 (GMT) From: "Aneesh Kumar K.V" To: linuxppc-dev@lists.ozlabs.org, mpe@ellerman.id.au, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kvm-ppc@vger.kernel.org Cc: npiggin@gmail.com, paulus@ozlabs.org, leonardo@linux.ibm.com, kirill@shutemov.name, "Aneesh Kumar K.V" Subject: [PATCH v3 22/22] powerpc/mm/book3s64: Fix MADV_DONTNEED and parallel page fault race Date: Mon, 20 Apr 2020 18:14:34 +0530 Message-Id: <20200420124434.47330-23-aneesh.kumar@linux.ibm.com> X-Mailer: git-send-email 2.25.3 In-Reply-To: <20200420124434.47330-1-aneesh.kumar@linux.ibm.com> References: <20200420124434.47330-1-aneesh.kumar@linux.ibm.com> MIME-Version: 1.0 X-TM-AS-GCONF: 00 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.138,18.0.676 definitions=2020-04-20_03:2020-04-20,2020-04-20 signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 bulkscore=0 priorityscore=1501 spamscore=0 lowpriorityscore=0 adultscore=0 impostorscore=0 mlxscore=0 malwarescore=0 suspectscore=0 mlxlogscore=999 clxscore=1015 phishscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2003020000 definitions=main-2004200107 Sender: kvm-ppc-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: kvm-ppc@vger.kernel.org MADV_DONTNEED holds mmap_sem in read mode and that implies a parallel page fault is possible and the kernel can end up with a level 1 PTE entry (THP entry) converted to a level 0 PTE entry without flushing the THP TLB entry. Most architectures including POWER have issues with kernel instantiating a level 0 PTE entry while holding level 1 TLB entries. The code sequence I am looking at is down_read(mmap_sem) down_read(mmap_sem) zap_pmd_range() zap_huge_pmd() pmd lock held pmd_cleared table details added to mmu_gather pmd_unlock() insert a level 0 PTE entry() tlb_finish_mmu(). Fix this by forcing a tlb flush before releasing pmd lock if this is not a fullmm invalidate. We can safely skip this invalidate for task exit case (fullmm invalidate) because in that case we are sure there can be no parallel fault handlers. This do change the Qemu guest RAM del/unplug time as below 128 core, 496GB guest: Without patch: munmap start: timer = 196449 ms, PID=6681 munmap finish: timer = 196488 ms, PID=6681 - delta = 39ms With patch: munmap start: timer = 196345 ms, PID=6879 munmap finish: timer = 196714 ms, PID=6879 - delta = 369ms Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/pgtable.h | 5 +++++ arch/powerpc/mm/book3s64/pgtable.c | 18 ++++++++++++++++++ 2 files changed, 23 insertions(+) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 03521a8b0292..e1f551159f7d 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -1265,6 +1265,11 @@ static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, } #define pmdp_collapse_flush pmdp_collapse_flush +#define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR_FULL +pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma, + unsigned long addr, + pmd_t *pmdp, int full); + #define __HAVE_ARCH_PGTABLE_DEPOSIT static inline void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp, pgtable_t pgtable) diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c index 127325ead505..54b6d6d103ea 100644 --- a/arch/powerpc/mm/book3s64/pgtable.c +++ b/arch/powerpc/mm/book3s64/pgtable.c @@ -112,6 +112,24 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address, return __pmd(old_pmd); } +pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma, + unsigned long addr, pmd_t *pmdp, int full) +{ + pmd_t pmd; + VM_BUG_ON(addr & ~HPAGE_PMD_MASK); + VM_BUG_ON((pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) && + !pmd_devmap(*pmdp)) || !pmd_present(*pmdp)); + pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp); + /* + * if it not a fullmm flush, then we can possibly end up converting + * this PMD pte entry to a regular level 0 PTE by a parallel page fault. + * Make sure we flush the tlb in this case. + */ + if (!full) + flush_pmd_tlb_range(vma, addr, addr + HPAGE_PMD_SIZE); + return pmd; +} + static pmd_t pmd_set_protbits(pmd_t pmd, pgprot_t pgprot) { return __pmd(pmd_val(pmd) | pgprot_val(pgprot));