Message ID | 1430760556-28137-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
On Mon, May 04, 2015 at 10:59:16PM +0530, Aneesh Kumar K.V wrote: > Archs like ppc64 require pte_t * to remain stable in some code path. > They use local_irq_disable to prevent a parallel split. Generic code > clear pmd instead of marking it _PAGE_SPLITTING in code path > where we can afford to mark pmd none before splitting. Use a > variant of pmdp_splitting_clear_notify that arch can override. > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Sorry, I still try wrap my head around this problem. So, Power has __find_linux_pte_or_hugepte() which does lock-less lookup in page tables with local interrupts disabled. For huge pages it casts pmd_t to pte_t. Since format of pte_t is different from pmd_t we want to prevent transit from pmd pointing to page table to pmd pinging to huge page (and back) while interrupts are disabled. The complication for Power is that it doesn't do implicit IPI on tlb flush. Is it correct? For THP, split_huge_page() and collapse sides are covered. This patch should address two cases of splitting PMD, but not compound page in current upstream. But I think there's still *big* problem for Power -- zap_huge_pmd(). For instance: other CPU can shoot out a THP PMD with MADV_DONTNEED and fault in small pages instead. IIUC, for __find_linux_pte_or_hugepte(), it's equivalent of splitting. I don't see how this can be fixed without kick_all_cpus_sync() in all pmdp_clear_flush() on Power.
"Kirill A. Shutemov" <kirill@shutemov.name> writes: > On Mon, May 04, 2015 at 10:59:16PM +0530, Aneesh Kumar K.V wrote: >> Archs like ppc64 require pte_t * to remain stable in some code path. >> They use local_irq_disable to prevent a parallel split. Generic code >> clear pmd instead of marking it _PAGE_SPLITTING in code path >> where we can afford to mark pmd none before splitting. Use a >> variant of pmdp_splitting_clear_notify that arch can override. >> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> > > Sorry, I still try wrap my head around this problem. > > So, Power has __find_linux_pte_or_hugepte() which does lock-less lookup in > page tables with local interrupts disabled. For huge pages it casts pmd_t > to pte_t. Since format of pte_t is different from pmd_t we want to prevent > transit from pmd pointing to page table to pmd pinging to huge page (and > back) while interrupts are disabled. > > The complication for Power is that it doesn't do implicit IPI on tlb > flush. > s/doesn't do/doesn't need to do/ > Is it correct? that is correct. I will add more info to the commit message of the patch I will end up doing. > > For THP, split_huge_page() and collapse sides are covered. This patch > should address two cases of splitting PMD, but not compound page in > current upstream. > > But I think there's still *big* problem for Power -- zap_huge_pmd(). > > For instance: other CPU can shoot out a THP PMD with MADV_DONTNEED and > fault in small pages instead. IIUC, for __find_linux_pte_or_hugepte(), > it's equivalent of splitting. > > I don't see how this can be fixed without kick_all_cpus_sync() in all > pmdp_clear_flush() on Power. > Yes we could run into issue with that. Thanks for catching this. Now i am not sure whether we want to do the kick_all_cpus_sync in pmdp_get_and_clear. We do use that function while updating huge pte. The one i am looking at is change_huge_pmd. We don't need a IPI there and we would really like to avoid the IPI. Any idea why we follow the sequence of pmd_clear and set_pmd, instead of pmd_update there ? I looked at code paths we are clearing pmd where we would not require an IPI. Listing them move_huge_pmd do_huge_pmd_wp_page migrate_misplace_transhuge_page change_huge_pmd. Of this IIUC change_huge_pmd may be called more frequently and hence we may want to avoid doing kick_all_cpus_sync there ? One way to fix that would be switch change_huge_pmd to pmd_update and then we could do a kick_all_cpus_sync in pmdp_get_and_clear. -aneesh
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h index 43e6ad424c7f..307e705cbead 100644 --- a/arch/powerpc/include/asm/pgtable-ppc64.h +++ b/arch/powerpc/include/asm/pgtable-ppc64.h @@ -576,6 +576,10 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr, extern void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp); +#define __HAVE_ARCH_PMDP_SPLITTING_CLEAR_NOTIFY +extern void pmdp_splitting_clear_notify(struct vm_area_struct *vma, + unsigned long address, pmd_t *pmdp); + #define __HAVE_ARCH_PGTABLE_DEPOSIT extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp, pgtable_t pgtable); diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c index 59daa5eeec25..c30b15894edf 100644 --- a/arch/powerpc/mm/pgtable_64.c +++ b/arch/powerpc/mm/pgtable_64.c @@ -36,6 +36,7 @@ #include <linux/memblock.h> #include <linux/slab.h> #include <linux/hugetlb.h> +#include <linux/mmu_notifier.h> #include <asm/pgalloc.h> #include <asm/page.h> @@ -598,6 +599,16 @@ pmd_t pmdp_clear_flush(struct vm_area_struct *vma, unsigned long address, return pmd; } +void pmdp_splitting_clear_notify(struct vm_area_struct *vma, + unsigned long address, pmd_t *pmdp) +{ + pmdp_clear_flush_notify(vma, address, pmdp); + /* + * Serialize against find_linux_pte_or_hugepte + */ + kick_all_cpus_sync(); +} + int pmdp_test_and_clear_young(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp) { diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index 39f1d6a2b04d..83ed56c8594a 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -189,6 +189,10 @@ extern void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp); #endif +#ifndef __HAVE_ARCH_PMDP_SPLITTING_CLEAR_NOTIFY +#define pmdp_splitting_clear_notify pmdp_clear_flush_notify +#endif + #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp, pgtable_t pgtable); diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 078832cf3636..f596312980f9 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1031,7 +1031,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm, goto out_free_pages; VM_BUG_ON_PAGE(!PageHead(page), page); - pmdp_clear_flush_notify(vma, haddr, pmd); + pmdp_splitting_clear_notify(vma, haddr, pmd); /* leave pmd empty until pte is filled */ pgtable = pgtable_trans_huge_withdraw(mm, pmd); @@ -2865,7 +2865,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, pmd_t _pmd; int i; - pmdp_clear_flush_notify(vma, haddr, pmd); + pmdp_splitting_clear_notify(vma, haddr, pmd); /* leave pmd empty until pte is filled */ pgtable = pgtable_trans_huge_withdraw(mm, pmd);
Archs like ppc64 require pte_t * to remain stable in some code path. They use local_irq_disable to prevent a parallel split. Generic code clear pmd instead of marking it _PAGE_SPLITTING in code path where we can afford to mark pmd none before splitting. Use a variant of pmdp_splitting_clear_notify that arch can override. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> --- arch/powerpc/include/asm/pgtable-ppc64.h | 4 ++++ arch/powerpc/mm/pgtable_64.c | 11 +++++++++++ include/asm-generic/pgtable.h | 4 ++++ mm/huge_memory.c | 4 ++-- 4 files changed, 21 insertions(+), 2 deletions(-)