
[1/1] powerpc: mm: Check if serialize_against_pte_lookup() really needs to run

Message ID 20190916205527.1859-1-leonardo@linux.ibm.com (mailing list archive)
State Superseded, archived
Series [1/1] powerpc: mm: Check if serialize_against_pte_lookup() really needs to run

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success Successfully applied on branch next (31f571deea2bc840fbe52c6385d6723b4e69a15c)
snowpatch_ozlabs/build-ppc64le success Build succeeded
snowpatch_ozlabs/build-ppc64be success Build succeeded
snowpatch_ozlabs/build-ppc64e success Build succeeded
snowpatch_ozlabs/build-pmac32 success Build succeeded
snowpatch_ozlabs/checkpatch success total: 0 errors, 0 warnings, 0 checks, 30 lines checked

Commit Message

Leonardo Bras Sept. 16, 2019, 8:55 p.m. UTC
If a process (qemu) with a lot of CPUs (128) tries to munmap() a large
chunk of memory (496GB) mapped with THP, it takes an average of 275
seconds, which can cause a lot of problems for the workload (in qemu's
case, the guest will lock up for this long).

Tracking down the source of this slowdown, I found that most of this
time is spent in serialize_against_pte_lookup(). This function spends a
lot of time in smp_call_function_many() if there are more than a couple
of CPUs running the user process. Since it has to run for every mapped
THP, it takes a very long time for large amounts of memory.

According to the docs, serialize_against_pte_lookup() is needed to keep
the pmd_t to pte_t casting inside find_current_mm_pte() from happening
concurrently with the rest of the functions it's called from.
It does so by calling do_nothing() on each CPU in mm->cpu_bitmap[].

So, as far as I can tell, if no find_current_mm_pte() is running, there
is no need to call serialize_against_pte_lookup().

To avoid that cost, I propose a counter that keeps track of how many
find_current_mm_pte() calls are currently running, and if there are
none, skips smp_call_function_many() entirely.

On my workload (qemu), I saw munmap()'s time drop from 275 seconds to
418ms.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>

---
I need more experienced people's help to understand whether this is
really a valid improvement, and whether mm_struct is the best place to
put such a counter.

Thanks!
---
 arch/powerpc/include/asm/pte-walk.h | 3 +++
 arch/powerpc/mm/book3s64/pgtable.c  | 2 ++
 include/linux/mm_types.h            | 1 +
 3 files changed, 6 insertions(+)

Comments

Aneesh Kumar K V Sept. 17, 2019, 2:56 a.m. UTC | #1
On 9/17/19 2:25 AM, Leonardo Bras wrote:
> If a process (qemu) with a lot of CPUs (128) tries to munmap() a large
> chunk of memory (496GB) mapped with THP, it takes an average of 275
> seconds, which can cause a lot of problems for the workload (in qemu's
> case, the guest will lock up for this long).
> 
> Tracking down the source of this slowdown, I found that most of this
> time is spent in serialize_against_pte_lookup(). This function spends a
> lot of time in smp_call_function_many() if there are more than a couple
> of CPUs running the user process. Since it has to run for every mapped
> THP, it takes a very long time for large amounts of memory.
> 
> According to the docs, serialize_against_pte_lookup() is needed to keep
> the pmd_t to pte_t casting inside find_current_mm_pte() from happening
> concurrently with the rest of the functions it's called from.
> It does so by calling do_nothing() on each CPU in mm->cpu_bitmap[].
> 
> So, as far as I can tell, if no find_current_mm_pte() is running, there
> is no need to call serialize_against_pte_lookup().
> 
> To avoid that cost, I propose a counter that keeps track of how many
> find_current_mm_pte() calls are currently running, and if there are
> none, skips smp_call_function_many() entirely.
> 
> On my workload (qemu), I saw munmap()'s time drop from 275 seconds to
> 418ms.
> 
> Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>

We could possibly avoid that serialize for a full task-exit unmap,
i.e., when tlb->fullmm == 1. But that won't help the Qemu case, because
it does an unmap of the guest ram range, for which tlb->fullmm != 1.
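
A minimal sketch of that idea (the wrapper name and call site here are
hypothetical, not something in this patch):

/*
 * Sketch only: skip the serialize when the whole address space is
 * being torn down (task exit), since no lockless walker can still
 * be using this mm by then.
 */
static inline void serialize_for_unmap(struct mmu_gather *tlb)
{
	if (tlb->fullmm)	/* full task-exit unmap */
		return;
	serialize_against_pte_lookup(tlb->mm);
}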

> 
> ---
> I need more experienced people's help to understand whether this is
> really a valid improvement, and whether mm_struct is the best place to
> put such a counter.
> 
> Thanks!
> ---
>   arch/powerpc/include/asm/pte-walk.h | 3 +++
>   arch/powerpc/mm/book3s64/pgtable.c  | 2 ++
>   include/linux/mm_types.h            | 1 +
>   3 files changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/pte-walk.h b/arch/powerpc/include/asm/pte-walk.h
> index 33fa5dd8ee6a..3b82cb3bd563 100644
> --- a/arch/powerpc/include/asm/pte-walk.h
> +++ b/arch/powerpc/include/asm/pte-walk.h
> @@ -40,6 +40,8 @@ static inline pte_t *find_current_mm_pte(pgd_t *pgdir, unsigned long ea,
>   {
>   	pte_t *pte;
>   
> +	atomic64_inc(&current->mm->lockless_ptlookup_count);
> +
>   	VM_WARN(!arch_irqs_disabled(), "%s called with irq enabled\n", __func__);
>   	VM_WARN(pgdir != current->mm->pgd,
>   		"%s lock less page table lookup called on wrong mm\n", __func__);
> @@ -53,6 +55,7 @@ static inline pte_t *find_current_mm_pte(pgd_t *pgdir, unsigned long ea,
>   	if (hshift)
>   		WARN_ON(*hshift);
>   #endif
> +	atomic64_dec(&current->mm->lockless_ptlookup_count);
>   	return pte;
>   }



That is not correct. We need to keep the count elevated across the
local_irq_disable()/local_irq_enable() region. That is, we should
mostly have a pair of variants like
start_lockless_pgtbl_walk()/end_lockless_pgtbl_walk(). There is also
hash_page_mm, which we need to make sure is protected w.r.t. a pmd
collapse.
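
A rough sketch of what such a pair could look like (the names follow
the suggestion above; the barrier placement is an assumption, meant to
pair with the smp_mb() in serialize_against_pte_lookup()):

static inline void start_lockless_pgtbl_walk(struct mm_struct *mm)
{
	atomic64_inc(&mm->lockless_ptlookup_count);
	/* Publish the increment before starting the walk. */
	smp_mb__after_atomic();
	local_irq_disable();
}

static inline void end_lockless_pgtbl_walk(struct mm_struct *mm)
{
	local_irq_enable();
	smp_mb__before_atomic();
	atomic64_dec(&mm->lockless_ptlookup_count);
}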

>   
> diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
> index 7d0e0d0d22c4..8f6fc2f80071 100644
> --- a/arch/powerpc/mm/book3s64/pgtable.c
> +++ b/arch/powerpc/mm/book3s64/pgtable.c
> @@ -95,6 +95,8 @@ static void do_nothing(void *unused)
>   void serialize_against_pte_lookup(struct mm_struct *mm)
>   {
>   	smp_mb();
> +	if (atomic64_read(&mm->lockless_ptlookup_count) == 0)
> +		return;
>   	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
>   }
>   
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6a7a1083b6fb..97fb2545e967 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -518,6 +518,7 @@ struct mm_struct {
>   #endif
>   	} __randomize_layout;
>   
> +	atomic64_t lockless_ptlookup_count;
>   	/*
>   	 * The mm_cpumask needs to be at the end of mm_struct, because it
>   	 * is dynamically sized based on nr_cpu_ids.
> 

Move that to ppc64 specific mm_context_t.


-aneesh
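
For illustration, moving the counter into the ppc64-specific context
could look something like this sketch (the field placement and the
helper are assumptions, not the actual v2):

/* arch/powerpc/include/asm/book3s/64/mmu.h -- sketch only */
typedef struct {
	/* ... existing book3s64 mm_context_t fields ... */
	atomic64_t lockless_ptlookup_count;
} mm_context_t;

/* The counter would then be reached through mm->context, e.g.: */
static inline bool mm_has_lockless_walkers(struct mm_struct *mm)
{
	return atomic64_read(&mm->context.lockless_ptlookup_count) != 0;
}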
Leonardo Bras Sept. 17, 2019, 9:21 p.m. UTC | #2
Hello Aneesh, thanks for the feedback!

On Tue, 2019-09-17 at 08:26 +0530, Aneesh Kumar K.V wrote:
> We could possibly avoid that serialize for a full task-exit unmap,
> i.e., when tlb->fullmm == 1. But that won't help the Qemu case,
> because it does an unmap of the guest ram range, for which
> tlb->fullmm != 1.
Yes, in qemu we can hot-add memory to guests, backed by memory that
qemu allocates with mmap(). If we want/need to free this memory for
other processes (or guests), we remove it from the guest and do a plain
munmap(). This happens without exiting qemu or pausing the guest.


> That is not correct. We need to keep the count elevated across the
> local_irq_disable()/local_irq_enable() region.
So, by that you mean incrementing the counter before
local_irq_{disable,save} and decrementing only after
local_irq_{enable,restore}?

> That is, we should mostly have a pair of variants like
> start_lockless_pgtbl_walk()/end_lockless_pgtbl_walk().
And by that, you mean replacing the current local_irq_{disable,save}
with a start_lockless_pgtbl_walk() (that does the increment + irq
disable), and likewise replacing local_irq_{enable,restore} with
end_lockless_pgtbl_walk()?
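Something like this sketch, perhaps (the call site is hypothetical,
just to illustrate the question; find_current_mm_pte()'s is_thp/hshift
arguments are passed as NULL):

/* before: */
local_irq_disable();
ptep = find_current_mm_pte(mm->pgd, ea, NULL, NULL);
/* ... dereference ptep while irqs are off ... */
local_irq_enable();

/* after: */
start_lockless_pgtbl_walk(mm);	/* counter increment + irq disable */
ptep = find_current_mm_pte(mm->pgd, ea, NULL, NULL);
/* ... dereference ptep ... */
end_lockless_pgtbl_walk(mm);	/* irq enable + counter decrement */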

> There is also hash_page_mm, which we need to make sure is protected
> w.r.t. a pmd collapse.
I did not quite get what I need to do here.
Is this a new thing to protect?
If so, I need to better understand how to protect it.
If not, I would like to understand how the current behavior already
protects it.

> Move that to ppc64 specific mm_context_t.
Ok, fixed! I will send that in v2.

Best regards,
Leonardo Bras

Patch

diff --git a/arch/powerpc/include/asm/pte-walk.h b/arch/powerpc/include/asm/pte-walk.h
index 33fa5dd8ee6a..3b82cb3bd563 100644
--- a/arch/powerpc/include/asm/pte-walk.h
+++ b/arch/powerpc/include/asm/pte-walk.h
@@ -40,6 +40,8 @@ static inline pte_t *find_current_mm_pte(pgd_t *pgdir, unsigned long ea,
 {
 	pte_t *pte;
 
+	atomic64_inc(&current->mm->lockless_ptlookup_count);
+
 	VM_WARN(!arch_irqs_disabled(), "%s called with irq enabled\n", __func__);
 	VM_WARN(pgdir != current->mm->pgd,
 		"%s lock less page table lookup called on wrong mm\n", __func__);
@@ -53,6 +55,7 @@ static inline pte_t *find_current_mm_pte(pgd_t *pgdir, unsigned long ea,
 	if (hshift)
 		WARN_ON(*hshift);
 #endif
+	atomic64_dec(&current->mm->lockless_ptlookup_count);
 	return pte;
 }
 
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 7d0e0d0d22c4..8f6fc2f80071 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -95,6 +95,8 @@ static void do_nothing(void *unused)
 void serialize_against_pte_lookup(struct mm_struct *mm)
 {
 	smp_mb();
+	if (atomic64_read(&mm->lockless_ptlookup_count) == 0)
+		return;
 	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6a7a1083b6fb..97fb2545e967 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -518,6 +518,7 @@ struct mm_struct {
 #endif
 	} __randomize_layout;
 
+	atomic64_t lockless_ptlookup_count;
 	/*
 	 * The mm_cpumask needs to be at the end of mm_struct, because it
 	 * is dynamically sized based on nr_cpu_ids.