Message ID | 20190920195047.7703-1-leonardo@linux.ibm.com (mailing list archive)
---|---
Series | Introduces new count-based method for monitoring lockless pagetable walks
On Fri, 2019-09-20 at 16:50 -0300, Leonardo Bras wrote:
> *** BLURB HERE ***
Sorry, something went terribly wrong with my cover letter.
I will try to find it and resend it here, or rewrite it.
Best regards,
If a process (qemu) with a lot of CPUs (128) tries to munmap() a large chunk of memory (496GB) mapped with THP, it takes an average of 275 seconds, which can cause a lot of problems for the workload (in the qemu case, the guest will lock up for this time).

Trying to find the source of this bug, I found out most of this time is spent in serialize_against_pte_lookup(). This function will take a lot of time in smp_call_function_many() if there is more than a couple of CPUs running the user process. Since this has to be done for every mapped THP, it will take a very long time for large amounts of memory.

By the docs, serialize_against_pte_lookup() is needed in order to prevent the pmd_t to pte_t casting inside find_current_mm_pte(), or any lockless pagetable walk, from happening concurrently with THP splitting/collapsing. It does so by calling do_nothing() on each CPU in mm->cpu_bitmap[], returning only after interrupts are re-enabled on each of them. Since interrupts are (usually) disabled during a lockless pagetable walk, and serialize_against_pte_lookup() will only return after interrupts are enabled, the walk is protected.

So, by what I could understand, if there is no lockless pagetable walk running, there is no need to call serialize_against_pte_lookup().

To avoid the cost of running serialize_against_pte_lookup(), I propose a counter that keeps track of how many find_current_mm_pte() calls are currently running, and if there are none, just skip smp_call_function_many().

The related functions are:

 start_lockless_pgtbl_walk(mm)
	Insert before starting any lockless pgtable walk

 end_lockless_pgtbl_walk(mm)
	Insert after the end of any lockless pgtable walk
	(mostly after the ptep is last used)

 running_lockless_pgtbl_walk(mm)
	Returns the number of lockless pgtable walks running

On my workload (qemu), I could see munmap's time reduced from 275 seconds to 418ms.
> Leonardo Bras (11):
>   powerpc/mm: Adds counting method to monitor lockless pgtable walks
>   asm-generic/pgtable: Adds dummy functions to monitor lockless pgtable walks
>   mm/gup: Applies counting method to monitor gup_pgd_range
>   powerpc/mce_power: Applies counting method to monitor lockless pgtbl walks
>   powerpc/perf: Applies counting method to monitor lockless pgtbl walks
>   powerpc/mm/book3s64/hash: Applies counting method to monitor lockless pgtbl walks
>   powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl walks
>   powerpc/kvm/book3s_hv: Applies counting method to monitor lockless pgtbl walks
>   powerpc/kvm/book3s_64: Applies counting method to monitor lockless pgtbl walks
>   powerpc/book3s_64: Enables counting method to monitor lockless pgtbl walk
>   powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing
>
>  arch/powerpc/include/asm/book3s/64/mmu.h     |  3 +++
>  arch/powerpc/include/asm/book3s/64/pgtable.h |  5 +++++
>  arch/powerpc/kernel/mce_power.c              | 13 ++++++++++---
>  arch/powerpc/kvm/book3s_64_mmu_hv.c          |  2 ++
>  arch/powerpc/kvm/book3s_64_mmu_radix.c       | 20 ++++++++++++++++++--
>  arch/powerpc/kvm/book3s_64_vio_hv.c          |  4 ++++
>  arch/powerpc/kvm/book3s_hv_nested.c          |  8 ++++++++
>  arch/powerpc/kvm/book3s_hv_rm_mmu.c          |  9 ++++++++-
>  arch/powerpc/kvm/e500_mmu_host.c             |  4 ++++
>  arch/powerpc/mm/book3s64/hash_tlb.c          |  2 ++
>  arch/powerpc/mm/book3s64/hash_utils.c        |  7 +++++++
>  arch/powerpc/mm/book3s64/mmu_context.c       |  1 +
>  arch/powerpc/mm/book3s64/pgtable.c           | 20 +++++++++++++++++++-
>  arch/powerpc/perf/callchain.c                |  5 ++++-
>  include/asm-generic/pgtable.h                |  9 +++++++++
>  mm/gup.c                                     |  4 ++++
>  16 files changed, 108 insertions(+), 8 deletions(-)
On 9/20/19 1:12 PM, Leonardo Bras wrote:
> If a process (qemu) with a lot of CPUs (128) try to munmap() a large
> chunk of memory (496GB) mapped with THP, it takes an average of 275
> seconds, which can cause a lot of problems to the load (in qemu case,
> the guest will lock for this time).
>
> Trying to find the source of this bug, I found out most of this time is
> spent on serialize_against_pte_lookup(). This function will take a lot
> of time in smp_call_function_many() if there is more than a couple CPUs
> running the user process. Since it has to happen to all THP mapped, it
> will take a very long time for large amounts of memory.
>
> By the docs, serialize_against_pte_lookup() is needed in order to avoid
> pmd_t to pte_t casting inside find_current_mm_pte(), or any lockless
> pagetable walk, to happen concurrently with THP splitting/collapsing.
>
> It does so by calling a do_nothing() on each CPU in mm->cpu_bitmap[],
> after interrupts are re-enabled.
> Since, interrupts are (usually) disabled during lockless pagetable
> walk, and serialize_against_pte_lookup will only return after
> interrupts are enabled, it is protected.
>
> So, by what I could understand, if there is no lockless pagetable walk
> running, there is no need to call serialize_against_pte_lookup().
>
> So, to avoid the cost of running serialize_against_pte_lookup(), I
> propose a counter that keeps track of how many find_current_mm_pte()
> are currently running, and if there is none, just skip
> smp_call_function_many().

Just noticed that this really should also include linux-mm; maybe it's best to repost the patchset with them included?

In particular, there is likely to be some feedback about adding more calls, in addition to local_irq_disable/enable, around the gup_fast() path, separately from my questions about the synchronization cases in ppc.

thanks,
On 9/20/19 1:12 PM, Leonardo Bras wrote:
...
>> arch/powerpc/include/asm/book3s/64/mmu.h     |  3 +++
>> arch/powerpc/include/asm/book3s/64/pgtable.h |  5 +++++
>> arch/powerpc/kernel/mce_power.c              | 13 ++++++++++---
>> arch/powerpc/kvm/book3s_64_mmu_hv.c          |  2 ++
>> arch/powerpc/kvm/book3s_64_mmu_radix.c       | 20 ++++++++++++++++++--
>> arch/powerpc/kvm/book3s_64_vio_hv.c          |  4 ++++
>> arch/powerpc/kvm/book3s_hv_nested.c          |  8 ++++++++
>> arch/powerpc/kvm/book3s_hv_rm_mmu.c          |  9 ++++++++-
>> arch/powerpc/kvm/e500_mmu_host.c             |  4 ++++
>> arch/powerpc/mm/book3s64/hash_tlb.c          |  2 ++
>> arch/powerpc/mm/book3s64/hash_utils.c        |  7 +++++++
>> arch/powerpc/mm/book3s64/mmu_context.c       |  1 +
>> arch/powerpc/mm/book3s64/pgtable.c           | 20 +++++++++++++++++++-
>> arch/powerpc/perf/callchain.c                |  5 ++++-
>> include/asm-generic/pgtable.h                |  9 +++++++++
>> mm/gup.c                                     |  4 ++++
>> 16 files changed, 108 insertions(+), 8 deletions(-)

Also, which tree do these patches apply to, please?

thanks,
On Mon, 2019-09-23 at 13:51 -0700, John Hubbard wrote:
> Also, which tree do these patches apply to, please?
>
> thanks,

They should apply on top of v5.3 + one patch:
https://patchwork.ozlabs.org/patch/1164925/

I was working on top of this patch because I thought it would be merged quickly. But since I got no feedback, it was not merged, and the present patchset became broken. :(

I will rebase v3 on top of plain v5.3.

Thanks,
Leonardo Bras