diff mbox

[RFC] mm/thp: Use new function to clear pmd before THP splitting

Message ID 1430760556-28137-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com (mailing list archive)
State Not Applicable
Headers show

Commit Message

Aneesh Kumar K.V May 4, 2015, 5:29 p.m. UTC
Archs like ppc64 require pte_t * to remain stable in some code path.
They use local_irq_disable to prevent a parallel split. Generic code
clear pmd instead of marking it _PAGE_SPLITTING in code path
where we can afford to mark pmd none before splitting. Use a
variant of pmdp_splitting_clear_notify that arch can override.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pgtable-ppc64.h |  4 ++++
 arch/powerpc/mm/pgtable_64.c             | 11 +++++++++++
 include/asm-generic/pgtable.h            |  4 ++++
 mm/huge_memory.c                         |  4 ++--
 4 files changed, 21 insertions(+), 2 deletions(-)

Comments

Kirill A. Shutemov May 6, 2015, 12:11 a.m. UTC | #1
On Mon, May 04, 2015 at 10:59:16PM +0530, Aneesh Kumar K.V wrote:
> Archs like ppc64 require pte_t * to remain stable in some code path.
> They use local_irq_disable to prevent a parallel split. Generic code
> clear pmd instead of marking it _PAGE_SPLITTING in code path
> where we can afford to mark pmd none before splitting. Use a
> variant of pmdp_splitting_clear_notify that arch can override.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Sorry, I still try wrap my head around this problem.

So, Power has __find_linux_pte_or_hugepte() which does lock-less lookup in
page tables with local interrupts disabled. For huge pages it casts pmd_t
to pte_t. Since format of pte_t is different from pmd_t we want to prevent
transit from pmd pointing to page table to pmd pinging to huge page (and
back) while interrupts are disabled.

The complication for Power is that it doesn't do implicit IPI on tlb
flush.

Is it correct?

For THP, split_huge_page() and collapse sides are covered. This patch
should address two cases of splitting PMD, but not compound page in
current upstream.

But I think there's still *big* problem for Power -- zap_huge_pmd().

For instance: other CPU can shoot out a THP PMD with MADV_DONTNEED and
fault in small pages instead. IIUC, for __find_linux_pte_or_hugepte(),
it's equivalent of splitting.

I don't see how this can be fixed without kick_all_cpus_sync() in all
pmdp_clear_flush() on Power.
Aneesh Kumar K.V May 6, 2015, 8:48 a.m. UTC | #2
"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> On Mon, May 04, 2015 at 10:59:16PM +0530, Aneesh Kumar K.V wrote:
>> Archs like ppc64 require pte_t * to remain stable in some code path.
>> They use local_irq_disable to prevent a parallel split. Generic code
>> clear pmd instead of marking it _PAGE_SPLITTING in code path
>> where we can afford to mark pmd none before splitting. Use a
>> variant of pmdp_splitting_clear_notify that arch can override.
>> 
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>
> Sorry, I still try wrap my head around this problem.
>
> So, Power has __find_linux_pte_or_hugepte() which does lock-less lookup in
> page tables with local interrupts disabled. For huge pages it casts pmd_t
> to pte_t. Since format of pte_t is different from pmd_t we want to prevent
> transit from pmd pointing to page table to pmd pinging to huge page (and
> back) while interrupts are disabled.
>
> The complication for Power is that it doesn't do implicit IPI on tlb
> flush.
>

s/doesn't do/doesn't need to do/


> Is it correct?

that is correct. I will add more info to the commit message of the patch
I will end up doing.

>
> For THP, split_huge_page() and collapse sides are covered. This patch
> should address two cases of splitting PMD, but not compound page in
> current upstream.
>
> But I think there's still *big* problem for Power -- zap_huge_pmd().
>
> For instance: other CPU can shoot out a THP PMD with MADV_DONTNEED and
> fault in small pages instead. IIUC, for __find_linux_pte_or_hugepte(),
> it's equivalent of splitting.
>
> I don't see how this can be fixed without kick_all_cpus_sync() in all
> pmdp_clear_flush() on Power.
>


Yes we could run into issue with that. Thanks for catching this. Now i
am not sure whether we want to do the kick_all_cpus_sync in
pmdp_get_and_clear. We do use that function while updating huge pte. The
one i am looking at is change_huge_pmd. We don't need a IPI there
and we would really like to avoid the IPI. Any idea why we follow
the sequence of pmd_clear and set_pmd, instead of pmd_update there ?

I looked at code paths we are clearing pmd where we would not
require an IPI. Listing them

move_huge_pmd
do_huge_pmd_wp_page
migrate_misplace_transhuge_page
change_huge_pmd.

Of this IIUC change_huge_pmd may be called more frequently and hence we
may want to avoid doing kick_all_cpus_sync there ?

One way to fix that would be switch change_huge_pmd to pmd_update and
then we could do a kick_all_cpus_sync in pmdp_get_and_clear.

-aneesh
diff mbox

Patch

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 43e6ad424c7f..307e705cbead 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -576,6 +576,10 @@  static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr,
 extern void pmdp_splitting_flush(struct vm_area_struct *vma,
 				 unsigned long address, pmd_t *pmdp);
 
+#define __HAVE_ARCH_PMDP_SPLITTING_CLEAR_NOTIFY
+extern void pmdp_splitting_clear_notify(struct vm_area_struct *vma,
+					unsigned long address, pmd_t *pmdp);
+
 #define __HAVE_ARCH_PGTABLE_DEPOSIT
 extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 				       pgtable_t pgtable);
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 59daa5eeec25..c30b15894edf 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -36,6 +36,7 @@ 
 #include <linux/memblock.h>
 #include <linux/slab.h>
 #include <linux/hugetlb.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgalloc.h>
 #include <asm/page.h>
@@ -598,6 +599,16 @@  pmd_t pmdp_clear_flush(struct vm_area_struct *vma, unsigned long address,
 	return pmd;
 }
 
+void pmdp_splitting_clear_notify(struct vm_area_struct *vma,
+				 unsigned long address, pmd_t *pmdp)
+{
+	pmdp_clear_flush_notify(vma, address, pmdp);
+	/*
+	 * Serialize against find_linux_pte_or_hugepte
+	 */
+	kick_all_cpus_sync();
+}
+
 int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 			      unsigned long address, pmd_t *pmdp)
 {
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 39f1d6a2b04d..83ed56c8594a 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -189,6 +189,10 @@  extern void pmdp_splitting_flush(struct vm_area_struct *vma,
 				 unsigned long address, pmd_t *pmdp);
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_SPLITTING_CLEAR_NOTIFY
+#define pmdp_splitting_clear_notify pmdp_clear_flush_notify
+#endif
+
 #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
 extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 				       pgtable_t pgtable);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 078832cf3636..f596312980f9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1031,7 +1031,7 @@  static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		goto out_free_pages;
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
-	pmdp_clear_flush_notify(vma, haddr, pmd);
+	pmdp_splitting_clear_notify(vma, haddr, pmd);
 	/* leave pmd empty until pte is filled */
 
 	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
@@ -2865,7 +2865,7 @@  static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	pmd_t _pmd;
 	int i;
 
-	pmdp_clear_flush_notify(vma, haddr, pmd);
+	pmdp_splitting_clear_notify(vma, haddr, pmd);
 	/* leave pmd empty until pte is filled */
 
 	pgtable = pgtable_trans_huge_withdraw(mm, pmd);