Message ID | 20190916205527.1859-1-leonardo@linux.ibm.com (mailing list archive)
---|---
State | Superseded, archived
Series | [1/1] powerpc: mm: Check if serialize_against_pte_lookup() really needs to run
Context | Check | Description |
---|---|---|
snowpatch_ozlabs/apply_patch | success | Successfully applied on branch next (31f571deea2bc840fbe52c6385d6723b4e69a15c) |
snowpatch_ozlabs/build-ppc64le | success | Build succeeded |
snowpatch_ozlabs/build-ppc64be | success | Build succeeded |
snowpatch_ozlabs/build-ppc64e | success | Build succeeded |
snowpatch_ozlabs/build-pmac32 | success | Build succeeded |
snowpatch_ozlabs/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 30 lines checked |
On 9/17/19 2:25 AM, Leonardo Bras wrote:
> If a process (qemu) with a lot of CPUs (128) tries to munmap() a large
> chunk of memory (496GB) mapped with THP, it takes an average of 275
> seconds, which can cause a lot of problems for the load (in the qemu
> case, the guest will lock up for this time).
>
> Trying to find the source of this bug, I found out most of this time is
> spent on serialize_against_pte_lookup(). This function will take a lot
> of time in smp_call_function_many() if there is more than a couple of
> CPUs running the user process. Since it has to happen for every mapped
> THP, it will take a very long time for large amounts of memory.
>
> By the docs, serialize_against_pte_lookup() is needed in order to avoid
> the pmd_t to pte_t casting inside find_current_mm_pte() happening
> concurrently with the next part of the functions it's called in.
> It does so by calling do_nothing() on each CPU in mm->cpu_bitmap[].
>
> So, by what I could understand, if there is no find_current_mm_pte()
> running, there is no need to call serialize_against_pte_lookup().
>
> So, to avoid the cost of running serialize_against_pte_lookup(), I
> propose a counter that keeps track of how many find_current_mm_pte()
> are currently running, and if there is none, just skip
> smp_call_function_many().
>
> On my workload (qemu), I could see munmap's time drop from 275 seconds
> to 418 ms.
>
> Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>

We could possibly avoid that serialize for a full task exit unmap, i.e.,
when tlb->fullmm == 1. But that won't help the Qemu case, because it does
an unmap of the guest RAM range for which tlb->fullmm != 1.

>
> ---
> I need more experienced people's help in order to understand if this is
> really a valid improvement, and if mm_struct is the best place to put
> such a counter.
>
> Thanks!
> ---
>  arch/powerpc/include/asm/pte-walk.h | 3 +++
>  arch/powerpc/mm/book3s64/pgtable.c  | 2 ++
>  include/linux/mm_types.h            | 1 +
>  3 files changed, 6 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/pte-walk.h b/arch/powerpc/include/asm/pte-walk.h
> index 33fa5dd8ee6a..3b82cb3bd563 100644
> --- a/arch/powerpc/include/asm/pte-walk.h
> +++ b/arch/powerpc/include/asm/pte-walk.h
> @@ -40,6 +40,8 @@ static inline pte_t *find_current_mm_pte(pgd_t *pgdir, unsigned long ea,
>  {
>  	pte_t *pte;
>
> +	atomic64_inc(&current->mm->lockless_ptlookup_count);
> +
>  	VM_WARN(!arch_irqs_disabled(), "%s called with irq enabled\n", __func__);
>  	VM_WARN(pgdir != current->mm->pgd,
>  		"%s lock less page table lookup called on wrong mm\n", __func__);
> @@ -53,6 +55,7 @@ static inline pte_t *find_current_mm_pte(pgd_t *pgdir, unsigned long ea,
>  	if (hshift)
>  		WARN_ON(*hshift);
>  #endif
> +	atomic64_dec(&current->mm->lockless_ptlookup_count);
>  	return pte;
>  }

That is not correct. We need to keep the count updated across the
local_irq_disable()/local_irq_enable(). That is, we mostly should have a
variant like start_lockless_pgtbl_walk()/end_lockless_pgtbl_walk().

Also, there is hash_page_mm, which we need to make sure we are protected
against w.r.t. pmd collapse.
> diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
> index 7d0e0d0d22c4..8f6fc2f80071 100644
> --- a/arch/powerpc/mm/book3s64/pgtable.c
> +++ b/arch/powerpc/mm/book3s64/pgtable.c
> @@ -95,6 +95,8 @@ static void do_nothing(void *unused)
>  void serialize_against_pte_lookup(struct mm_struct *mm)
>  {
>  	smp_mb();
> +	if (atomic64_read(&mm->lockless_ptlookup_count) == 0)
> +		return;
>  	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
>  }
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6a7a1083b6fb..97fb2545e967 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -518,6 +518,7 @@ struct mm_struct {
>  #endif
>  } __randomize_layout;
>
> +	atomic64_t lockless_ptlookup_count;
>  	/*
>  	 * The mm_cpumask needs to be at the end of mm_struct, because it
>  	 * is dynamically sized based on nr_cpu_ids.

Move that to the ppc64-specific mm_context_t.

-aneesh
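To make the suggestion concrete, here is a minimal sketch of what such a
start/end pair could look like, assuming the counter has been moved into
the ppc64-specific mm_context_t as requested above. The helper names, the
field name lockless_pgtbl_walk_count, and the exact barrier placement are
illustrative assumptions, not the interface that eventually landed:

static inline void start_lockless_pgtbl_walk(struct mm_struct *mm)
{
	/*
	 * Advertise this walker before disabling IRQs, so a racing
	 * serialize_against_pte_lookup() either sees a nonzero count
	 * or has already finished changing the page table.
	 * Assumes an atomic64_t lockless_pgtbl_walk_count field was
	 * added to the ppc64 mm_context_t.
	 */
	atomic64_inc(&mm->context.lockless_pgtbl_walk_count);
	/* Pairs with the smp_mb() in serialize_against_pte_lookup(). */
	smp_mb();
	/* Blocking IRQs holds off the IPI until the walk is done. */
	local_irq_disable();
}

static inline void end_lockless_pgtbl_walk(struct mm_struct *mm)
{
	local_irq_enable();
	/* Ensure the walk's loads complete before dropping the count. */
	smp_mb();
	atomic64_dec(&mm->context.lockless_pgtbl_walk_count);
}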
Hello Aneesh, thanks for the feedback!

On Tue, 2019-09-17 at 08:26 +0530, Aneesh Kumar K.V wrote:
> We could possibly avoid that serialize for a full task exit unmap, i.e.,
> when tlb->fullmm == 1. But that won't help the Qemu case, because it
> does an unmap of the guest RAM range for which tlb->fullmm != 1.

Yes, in qemu we can hot-add memory to guests using memory allocated by
qemu using mmap(). If we want/need to free this memory to other processes
(or guests), we remove it from this guest and do a common munmap(). This
happens without exiting qemu or pausing the guest.

> That is not correct. We need to keep the count updated across the
> local_irq_disable()/local_irq_enable().

So, by that you mean incrementing the counter before
local_irq_{disable,save} and decrementing it only after
local_irq_{enable,restore}?

> That is, we mostly should have a variant like
> start_lockless_pgtbl_walk()/end_lockless_pgtbl_walk().

And by that, you mean replacing the current local_irq_{disable,save} with
a start_lockless_pgtbl_walk() (that contains increment + irq disable),
and doing similarly with local_irq_{enable,restore} and
end_lockless_pgtbl_walk()?

> Also, there is hash_page_mm, which we need to make sure we are protected
> against w.r.t. pmd collapse.

I did not get what I need to do here. Is it a new thing to protect? If
so, I need to better understand how to protect it. If not, I would like
to understand how it's protected by the current behavior.

> Move that to the ppc64-specific mm_context_t.

Ok, fixed! I will send that in v2.

Best regards,
Leonardo Bras
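As a sketch of the call-site pattern being discussed here, assuming the
start/end helpers from the sketch above: walk_example() is a hypothetical
caller, it assumes mm == current->mm (as find_current_mm_pte() requires),
and it relies on find_current_mm_pte() accepting NULL for its is_thp and
hshift out-parameters:

static bool walk_example(struct mm_struct *mm, unsigned long ea)
{
	pte_t *ptep;
	bool found;

	/* Replaces the bare local_irq_disable() the callers use today. */
	start_lockless_pgtbl_walk(mm);

	ptep = find_current_mm_pte(mm->pgd, ea, NULL, NULL);
	/*
	 * With the count nonzero, serialize_against_pte_lookup() falls
	 * back to the IPI broadcast, and that IPI cannot be serviced on
	 * this CPU until IRQs are re-enabled, so the entry behind ptep
	 * stays stable for the duration of the walk.
	 */
	found = (ptep != NULL);

	/* Replaces the bare local_irq_enable(). */
	end_lockless_pgtbl_walk(mm);

	return found;
}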
diff --git a/arch/powerpc/include/asm/pte-walk.h b/arch/powerpc/include/asm/pte-walk.h
index 33fa5dd8ee6a..3b82cb3bd563 100644
--- a/arch/powerpc/include/asm/pte-walk.h
+++ b/arch/powerpc/include/asm/pte-walk.h
@@ -40,6 +40,8 @@ static inline pte_t *find_current_mm_pte(pgd_t *pgdir, unsigned long ea,
 {
 	pte_t *pte;
 
+	atomic64_inc(&current->mm->lockless_ptlookup_count);
+
 	VM_WARN(!arch_irqs_disabled(), "%s called with irq enabled\n", __func__);
 	VM_WARN(pgdir != current->mm->pgd,
 		"%s lock less page table lookup called on wrong mm\n", __func__);
@@ -53,6 +55,7 @@ static inline pte_t *find_current_mm_pte(pgd_t *pgdir, unsigned long ea,
 	if (hshift)
 		WARN_ON(*hshift);
 #endif
+	atomic64_dec(&current->mm->lockless_ptlookup_count);
 	return pte;
 }
 
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 7d0e0d0d22c4..8f6fc2f80071 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -95,6 +95,8 @@ static void do_nothing(void *unused)
 void serialize_against_pte_lookup(struct mm_struct *mm)
 {
 	smp_mb();
+	if (atomic64_read(&mm->lockless_ptlookup_count) == 0)
+		return;
 	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6a7a1083b6fb..97fb2545e967 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -518,6 +518,7 @@ struct mm_struct {
 #endif
 } __randomize_layout;
 
+	atomic64_t lockless_ptlookup_count;
 	/*
 	 * The mm_cpumask needs to be at the end of mm_struct, because it
 	 * is dynamically sized based on nr_cpu_ids.
If a process (qemu) with a lot of CPUs (128) tries to munmap() a large
chunk of memory (496GB) mapped with THP, it takes an average of 275
seconds, which can cause a lot of problems for the load (in the qemu
case, the guest will lock up for this time).

Trying to find the source of this bug, I found out most of this time is
spent on serialize_against_pte_lookup(). This function will take a lot of
time in smp_call_function_many() if there is more than a couple of CPUs
running the user process. Since it has to happen for every mapped THP, it
will take a very long time for large amounts of memory.

By the docs, serialize_against_pte_lookup() is needed in order to avoid
the pmd_t to pte_t casting inside find_current_mm_pte() happening
concurrently with the next part of the functions it's called in. It does
so by calling do_nothing() on each CPU in mm->cpu_bitmap[].

So, by what I could understand, if there is no find_current_mm_pte()
running, there is no need to call serialize_against_pte_lookup().

So, to avoid the cost of running serialize_against_pte_lookup(), I
propose a counter that keeps track of how many find_current_mm_pte() are
currently running, and if there is none, just skip
smp_call_function_many().

On my workload (qemu), I could see munmap's time drop from 275 seconds to
418 ms.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
I need more experienced people's help in order to understand if this is
really a valid improvement, and if mm_struct is the best place to put
such a counter.

Thanks!
---
 arch/powerpc/include/asm/pte-walk.h | 3 +++
 arch/powerpc/mm/book3s64/pgtable.c  | 2 ++
 include/linux/mm_types.h            | 1 +
 3 files changed, 6 insertions(+)