[SRU,B/E/F,1/1] hugetlbfs: take read_lock on i_mmap for PMD sharing

Message ID 20200604093931.30669-2-gavin.guo@canonical.com
State New
Series LP:#1882039 - The thread level parallelism would be a bottleneck when searching for the shared pmd by using hugetlbfs

Commit Message

Gavin Guo June 4, 2020, 9:39 a.m. UTC
From: Waiman Long <longman@redhat.com>

BugLink: https://bugs.launchpad.net/bugs/1882039

A customer with large SMP systems (up to 16 sockets) running an
application that uses a large amount of static hugepages (~500-1500GB)
is experiencing random multisecond delays.  These delays were caused by
the long time it took to scan the VMA interval tree with mmap_sem held.

The sharing of huge PMD does not require changes to the i_mmap at all.
Therefore, we can just take the read lock and let other threads
searching for the right VMA share it in parallel.  Once the right VMA is
found, either the PMD lock (2M huge page for x86-64) or the
mm->page_table_lock will be acquired to perform the actual PMD sharing.

Lock contention, if present, will happen in the spinlock.  That is much
better than contention in the rwsem where the time needed to scan the
interval tree is indeterminate.

With this patch applied, the customer is seeing significant performance
improvement over the unpatched kernel.

Link: http://lkml.kernel.org/r/20191107211809.9539-1-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 930668c34408ba983049322e04f13f03b6f1fafa)
Signed-off-by: Gavin Guo <gavin.guo@canonical.com>
---
 mm/hugetlb.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
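
To make the contention argument concrete, here is a userspace analogy (an
illustrative sketch only -- not the kernel code or the customer workload):
NTHREADS threads each perform a simulated 2 ms "interval tree scan" under a
pthread rwlock.  Behind the write lock the scans serialize (~2 ms per
thread, ~8 ms total for 4 threads); behind the read lock they overlap and
finish in roughly 2 ms overall.  Build with `gcc -pthread`.
==================================================
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define NTHREADS 4

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
static int use_read_lock;

static void *scanner(void *arg)
{
	(void)arg;
	if (use_read_lock)
		pthread_rwlock_rdlock(&lock);	/* patched behaviour */
	else
		pthread_rwlock_wrlock(&lock);	/* unpatched behaviour */
	usleep(2000);		/* simulated 2 ms VMA interval tree scan */
	pthread_rwlock_unlock(&lock);
	return NULL;
}

static double run_ms(int readers)
{
	pthread_t t[NTHREADS];
	struct timespec a, b;
	int i;

	use_read_lock = readers;
	clock_gettime(CLOCK_MONOTONIC, &a);
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, scanner, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &b);
	return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
	printf("wrlock (serialized): %.1f ms\n", run_ms(0));
	printf("rdlock (parallel):   %.1f ms\n", run_ms(1));
	return 0;
}
==================================================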

Comments

Colin Ian King June 4, 2020, 9:57 a.m. UTC | #1
On 04/06/2020 10:39, Gavin Guo wrote:
> From: Waiman Long <longman@redhat.com>
> 
> BugLink: https://bugs.launchpad.net/bugs/1882039
> 
> A customer with large SMP systems (up to 16 sockets) running an
> application that uses a large amount of static hugepages (~500-1500GB)
> is experiencing random multisecond delays.  These delays were caused by
> the long time it took to scan the VMA interval tree with mmap_sem held.
> 
> The sharing of huge PMD does not require changes to the i_mmap at all.
> Therefore, we can just take the read lock and let other threads
> searching for the right VMA share it in parallel.  Once the right VMA is
> found, either the PMD lock (2M huge page for x86-64) or the
> mm->page_table_lock will be acquired to perform the actual PMD sharing.
> 
> Lock contention, if present, will happen in the spinlock.  That is much
> better than contention in the rwsem where the time needed to scan the
> interval tree is indeterminate.
> 
> With this patch applied, the customer is seeing significant performance
> improvement over the unpatched kernel.
> 
> Link: http://lkml.kernel.org/r/20191107211809.9539-1-longman@redhat.com
> Signed-off-by: Waiman Long <longman@redhat.com>
> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> (cherry picked from commit 930668c34408ba983049322e04f13f03b6f1fafa)
> Signed-off-by: Gavin Guo <gavin.guo@canonical.com>
> ---
>  mm/hugetlb.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 2af1831596f2..9c871b47ef86 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4891,7 +4891,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  	if (!vma_shareable(vma, addr))
>  		return (pte_t *)pmd_alloc(mm, pud, addr);
>  
> -	i_mmap_lock_write(mapping);
> +	i_mmap_lock_read(mapping);
>  	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
>  		if (svma == vma)
>  			continue;
> @@ -4921,7 +4921,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  	spin_unlock(ptl);
>  out:
>  	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> -	i_mmap_unlock_write(mapping);
> +	i_mmap_unlock_read(mapping);
>  	return pte;
>  }
>  

I don't see any testing evidence of this working on B/E/F. Mind you, this
looks straightforward, and that's what concerns me as a good win with
minimal change. What testing has been performed on this fix?

Colin
Gavin Guo June 4, 2020, 10:12 a.m. UTC | #2
Hi Colin,

On Thu, Jun 4, 2020 at 5:57 PM Colin Ian King <colin.king@canonical.com> wrote:
>
> On 04/06/2020 10:39, Gavin Guo wrote:
> > From: Waiman Long <longman@redhat.com>
> >
> > BugLink: https://bugs.launchpad.net/bugs/1882039
> >
> > A customer with large SMP systems (up to 16 sockets) running an
> > application that uses a large amount of static hugepages (~500-1500GB)
> > is experiencing random multisecond delays.  These delays were caused by
> > the long time it took to scan the VMA interval tree with mmap_sem held.
> >
> > The sharing of huge PMD does not require changes to the i_mmap at all.
> > Therefore, we can just take the read lock and let other threads
> > searching for the right VMA share it in parallel.  Once the right VMA is
> > found, either the PMD lock (2M huge page for x86-64) or the
> > mm->page_table_lock will be acquired to perform the actual PMD sharing.
> >
> > Lock contention, if present, will happen in the spinlock.  That is much
> > better than contention in the rwsem where the time needed to scan the
> > interval tree is indeterminate.
> >
> > With this patch applied, the customer is seeing significant performance
> > improvement over the unpatched kernel.
> >
> > Link: http://lkml.kernel.org/r/20191107211809.9539-1-longman@redhat.com
> > Signed-off-by: Waiman Long <longman@redhat.com>
> > Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
> > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> > Cc: Davidlohr Bueso <dave@stgolabs.net>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Will Deacon <will.deacon@arm.com>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> > (cherry picked from commit 930668c34408ba983049322e04f13f03b6f1fafa)
> > Signed-off-by: Gavin Guo <gavin.guo@canonical.com>
> > ---
> >  mm/hugetlb.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 2af1831596f2..9c871b47ef86 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -4891,7 +4891,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> >       if (!vma_shareable(vma, addr))
> >               return (pte_t *)pmd_alloc(mm, pud, addr);
> >
> > -     i_mmap_lock_write(mapping);
> > +     i_mmap_lock_read(mapping);
> >       vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
> >               if (svma == vma)
> >                       continue;
> > @@ -4921,7 +4921,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> >       spin_unlock(ptl);
> >  out:
> >       pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > -     i_mmap_unlock_write(mapping);
> > +     i_mmap_unlock_read(mapping);
> >       return pte;
> >  }
> >
>
> I don't see any testing evidence of this working on B/E/F. Mind you, this
> looks straightforward, and that's what concerns me as a good win with
> minimal change. What testing has been performed on this fix?

I agree that this is really straightforward. However, I have asked the
customer many times for detailed data, and this is all I have.
Currently, I cannot reproduce it locally. I'll keep trying, and also
ask the customer whether they can give me more details on how they
performed the testing.

>
> Colin
Stefan Bader June 10, 2020, 7:29 a.m. UTC | #3
On 04.06.20 11:39, Gavin Guo wrote:
> From: Waiman Long <longman@redhat.com>
> 
> BugLink: https://bugs.launchpad.net/bugs/1882039
> 
> A customer with large SMP systems (up to 16 sockets) running an
> application that uses a large amount of static hugepages (~500-1500GB)
> is experiencing random multisecond delays.  These delays were caused by
> the long time it took to scan the VMA interval tree with mmap_sem held.
> 
> The sharing of huge PMD does not require changes to the i_mmap at all.
> Therefore, we can just take the read lock and let other threads
> searching for the right VMA share it in parallel.  Once the right VMA is
> found, either the PMD lock (2M huge page for x86-64) or the
> mm->page_table_lock will be acquired to perform the actual PMD sharing.
> 
> Lock contention, if present, will happen in the spinlock.  That is much
> better than contention in the rwsem where the time needed to scan the
> interval tree is indeterminate.
> 
> With this patch applied, the customer is seeing significant performance
> improvement over the unpatched kernel.
> 
> Link: http://lkml.kernel.org/r/20191107211809.9539-1-longman@redhat.com
> Signed-off-by: Waiman Long <longman@redhat.com>
> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> (cherry picked from commit 930668c34408ba983049322e04f13f03b6f1fafa)
> Signed-off-by: Gavin Guo <gavin.guo@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
> ---

I made some adjustments to the bug tasks. For future submissions, please
try to set the status and importance to something sensible.

Thanks,
Stefan
>  mm/hugetlb.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 2af1831596f2..9c871b47ef86 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4891,7 +4891,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  	if (!vma_shareable(vma, addr))
>  		return (pte_t *)pmd_alloc(mm, pud, addr);
>  
> -	i_mmap_lock_write(mapping);
> +	i_mmap_lock_read(mapping);
>  	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
>  		if (svma == vma)
>  			continue;
> @@ -4921,7 +4921,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  	spin_unlock(ptl);
>  out:
>  	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> -	i_mmap_unlock_write(mapping);
> +	i_mmap_unlock_read(mapping);
>  	return pte;
>  }
>  
>
Gerald Yang June 22, 2020, 3:51 p.m. UTC | #4
Hi Colin,

The steps to simulate the issue are as follows:
1. add a kprobe to huge_pmd_share, with its offset set between taking the
write lock on the mapping and vma_interval_tree_foreach:
==================================================
i_mmap_lock_write(mapping);
*kprobe*
vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
==================================================
and add a 2 ms delay in the kprobe to simulate a long search time (a
minimal sketch of such a module follows these steps)

2. use a BPF tool to insert a kprobe before the above kprobe, and record
the timestamp when a process reaches this point

3. use libhugetlbfs (https://github.com/libhugetlbfs/libhugetlbfs.git) to
create 4 processes and trigger VMA interval tree scans simultaneously

This way we can observe how much time each process takes to search the
VMA interval tree.
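
The delay module referenced in step 1, as a hypothetical sketch (the module
actually used is kprobe_hugepmd.c, linked later in this thread): the offset
value below is a placeholder that must be found by disassembling
huge_pmd_share for the running kernel, and must land on an instruction
boundary between taking the i_mmap lock and the interval tree walk.
==================================================
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/delay.h>

/* placeholder offset into huge_pmd_share(); find via objdump */
static unsigned int offset = 0x40;
module_param(offset, uint, 0444);

/* busy-wait 2 ms to magnify the interval tree scan window */
static int delay_pre(struct kprobe *p, struct pt_regs *regs)
{
	mdelay(2);
	return 0;
}

static struct kprobe kp = {
	.symbol_name = "huge_pmd_share",
	.pre_handler = delay_pre,
};

static int __init delay_init(void)
{
	kp.offset = offset;
	return register_kprobe(&kp);
}

static void __exit delay_exit(void)
{
	unregister_kprobe(&kp);
}

module_init(delay_init);
module_exit(delay_exit);
MODULE_LICENSE("GPL");
==================================================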

Without this patch, each process needs to take the write lock before
searching the tree, so only one process can search the tree at a time,
and each search takes around 2 ms to complete. The following is the
output of the bcc tool:

root@testvm:~/kprobe# /usr/share/bcc/tools/sharepmd
kernel function: huge_pmd_share
  2 233713.406179438 11499  shm-fork 0        ffff93d606742910        0                0 HUGEPMD
  3 233715.392836926 11500  shm-fork 0        ffff93d606742910        0 ffffb05bc20e7c90 HUGEPMD
  0 233717.378148059 11497  shm-fork 0        ffff93d606742910        0 ffffb05bc1fbfc90 HUGEPMD
  1 233719.366618585 11498  shm-fork 0        ffff93d606742910        0 ffffb05bc1ac7c80 HUGEPMD
  2 233721.589174146 11499  shm-fork 0        ffff93d606742910        0                0 HUGEPMD
  0 233723.577443027 11497  shm-fork 0        ffff93d606742910        0 ffffb05bc1b3fc90 HUGEPMD
  3 233725.565996868 11500  shm-fork 0        ffff93d606742910        0 ffffb05bc1fbfc80 HUGEPMD
  1 233727.788305070 11498  shm-fork 0        ffff93d606742910        0                0 HUGEPMD
  0 233729.773605019 11497  shm-fork 0        ffff93d606742910        0 ffffb05bc1ac7c90 HUGEPMD
  2 233731.761853316 11499  shm-fork 0        ffff93d606742910        0 ffffb05bc1b3fc80 HUGEPMD
  1 233733.988826353 11498  shm-fork 0        ffff93d606742910        0                0 HUGEPMD
  3 233735.977068987 11500  shm-fork 0        ffff93d606742910        0 ffffb05bc20e7c90 HUGEPMD
  0 233737.965477337 11497  shm-fork 0        ffff93d606742910        0 ffffb05bc1ac7c80 HUGEPMD
  1 233740.186784801 11498  shm-fork 0        ffff93d606742910        0                0 HUGEPMD
  3 233742.174938960 11500  shm-fork 0        ffff93d606742910        0 ffffb05bc1ac7c90 HUGEPMD
  2 233744.163177097 11499  shm-fork 0        ffff93d606742910        0 ffffb05bc20e7c80 HUGEPMD
  1 233746.382554150 11498  shm-fork 0        ffff93d606742910        0                0 HUGEPMD
  0 233748.370968889 11497  shm-fork 0        ffff93d606742910        0 ffffb05bc1b3fc90 HUGEPMD
  3 233750.359224507 11500  shm-fork 0        ffff93d606742910        0 ffffb05bc1ac7c80 HUGEPMD
  2 233752.563115019 11499  shm-fork 0        ffff93d606742910        0                0 HUGEPMD

With the patch, all processes take the read lock before searching the
VMA interval tree, so they can search it simultaneously, as shown below:

root@testvm:~/kprobe# /usr/share/bcc/tools/sharepmd
kernel function: pcibios_pm_ops
  6 471.183414969 1873   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
  5 471.183428598 1875   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
  4 471.183414908 1872   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
  3 471.183414962 1871   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
  6 476.111138155 1873   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
  4 476.111007350 1872   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
  1 476.111408005 1871   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
  5 476.111808625 1875   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
  5 480.421521033 1875   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
  4 480.421618975 1872   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
  1 480.421988461 1871   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
  6 480.422808252 1873   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
  4 484.732804127 1872   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
  6 484.733568351 1873   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
  5 484.733698494 1875   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
  1 484.733473733 1871   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
  4 489.037550955 1872   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
  6 489.037880018 1873   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
  5 489.038004825 1875   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
  1 489.038070483 1871   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD

On Thu, Jun 4, 2020 at 6:13 PM Gavin Guo <gavin.guo@canonical.com> wrote:

> Hi Colin,
>
> On Thu, Jun 4, 2020 at 5:57 PM Colin Ian King <colin.king@canonical.com>
> wrote:
> >
> > On 04/06/2020 10:39, Gavin Guo wrote:
> > > From: Waiman Long <longman@redhat.com>
> > >
> > > BugLink: https://bugs.launchpad.net/bugs/1882039
> > >
> > > A customer with large SMP systems (up to 16 sockets) running an
> > > application that uses a large amount of static hugepages (~500-1500GB)
> > > is experiencing random multisecond delays.  These delays were caused by
> > > the long time it took to scan the VMA interval tree with mmap_sem held.
> > >
> > > The sharing of huge PMD does not require changes to the i_mmap at all.
> > > Therefore, we can just take the read lock and let other threads
> > > searching for the right VMA share it in parallel.  Once the right VMA
> is
> > > found, either the PMD lock (2M huge page for x86-64) or the
> > > mm->page_table_lock will be acquired to perform the actual PMD sharing.
> > >
> > > Lock contention, if present, will happen in the spinlock.  That is much
> > > better than contention in the rwsem where the time needed to scan the
> > > interval tree is indeterminate.
> > >
> > > With this patch applied, the customer is seeing significant performance
> > > improvement over the unpatched kernel.
> > >
> > > Link: http://lkml.kernel.org/r/20191107211809.9539-1-longman@redhat.com
> > > Signed-off-by: Waiman Long <longman@redhat.com>
> > > Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
> > > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> > > Cc: Davidlohr Bueso <dave@stgolabs.net>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Cc: Ingo Molnar <mingo@redhat.com>
> > > Cc: Will Deacon <will.deacon@arm.com>
> > > Cc: Matthew Wilcox <willy@infradead.org>
> > > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> > > (cherry picked from commit 930668c34408ba983049322e04f13f03b6f1fafa)
> > > Signed-off-by: Gavin Guo <gavin.guo@canonical.com>
> > > ---
> > >  mm/hugetlb.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 2af1831596f2..9c871b47ef86 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -4891,7 +4891,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> > >       if (!vma_shareable(vma, addr))
> > >               return (pte_t *)pmd_alloc(mm, pud, addr);
> > >
> > > -     i_mmap_lock_write(mapping);
> > > +     i_mmap_lock_read(mapping);
> > >       vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
> > >               if (svma == vma)
> > >                       continue;
> > > @@ -4921,7 +4921,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> > >       spin_unlock(ptl);
> > >  out:
> > >       pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > > -     i_mmap_unlock_write(mapping);
> > > +     i_mmap_unlock_read(mapping);
> > >       return pte;
> > >  }
> > >
> >
> > I don't see any testing evidence of this working on B/E/F. Mind you, this
> > looks straightforward, and that's what concerns me as a good win with
> > minimal change. What testing has been performed on this fix?
>
> I agree that this is really straightforward. However, I have asked the
> customer many times for detailed data, and this is all I have.
> Currently, I cannot reproduce it locally. I'll keep trying, and also
> ask the customer whether they can give me more details on how they
> performed the testing.
>
> >
> > Colin
>
> --
> kernel-team mailing list
> kernel-team@lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team
>
Gavin Guo June 22, 2020, 11:02 p.m. UTC | #5
On Mon, Jun 22, 2020 at 11:51 PM Gerald Yang <gerald.yang@canonical.com> wrote:
>
> Hi Colin,
>
> The steps to simulate the issue are as follows:
> 1. add a kprobe to huge_pmd_share, with its offset set between taking the write lock on the mapping
> and vma_interval_tree_foreach:
> ==================================================
> i_mmap_lock_write(mapping);
> *kprobe*
> vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
> ==================================================
> and add a 2 ms delay in the kprobe to simulate a long search time
>
> 2. use a BPF tool to insert a kprobe before the above kprobe, and record the timestamp when a process reaches this point
>
> 3. use libhugetlbfs (https://github.com/libhugetlbfs/libhugetlbfs.git) to create 4 processes and trigger VMA interval tree scans simultaneously
>
> This way we can observe how much time each process takes to search the VMA interval tree.
>
> Without this patch, each process needs to take the write lock before searching the tree,
> so only one process can search the tree at a time, and each search takes around 2 ms to complete. The following is the output of the bcc tool:
>
> root@testvm:~/kprobe# /usr/share/bcc/tools/sharepmd
> kernel function: huge_pmd_share
>   2 233713.406179438 11499  shm-fork 0        ffff93d606742910        0                0 HUGEPMD
>   3 233715.392836926 11500  shm-fork 0        ffff93d606742910        0 ffffb05bc20e7c90 HUGEPMD
>   0 233717.378148059 11497  shm-fork 0        ffff93d606742910        0 ffffb05bc1fbfc90 HUGEPMD
>   1 233719.366618585 11498  shm-fork 0        ffff93d606742910        0 ffffb05bc1ac7c80 HUGEPMD
>   2 233721.589174146 11499  shm-fork 0        ffff93d606742910        0                0 HUGEPMD
>   0 233723.577443027 11497  shm-fork 0        ffff93d606742910        0 ffffb05bc1b3fc90 HUGEPMD
>   3 233725.565996868 11500  shm-fork 0        ffff93d606742910        0 ffffb05bc1fbfc80 HUGEPMD
>   1 233727.788305070 11498  shm-fork 0        ffff93d606742910        0                0 HUGEPMD
>   0 233729.773605019 11497  shm-fork 0        ffff93d606742910        0 ffffb05bc1ac7c90 HUGEPMD
>   2 233731.761853316 11499  shm-fork 0        ffff93d606742910        0 ffffb05bc1b3fc80 HUGEPMD
>   1 233733.988826353 11498  shm-fork 0        ffff93d606742910        0                0 HUGEPMD
>   3 233735.977068987 11500  shm-fork 0        ffff93d606742910        0 ffffb05bc20e7c90 HUGEPMD
>   0 233737.965477337 11497  shm-fork 0        ffff93d606742910        0 ffffb05bc1ac7c80 HUGEPMD
>   1 233740.186784801 11498  shm-fork 0        ffff93d606742910        0                0 HUGEPMD
>   3 233742.174938960 11500  shm-fork 0        ffff93d606742910        0 ffffb05bc1ac7c90 HUGEPMD
>   2 233744.163177097 11499  shm-fork 0        ffff93d606742910        0 ffffb05bc20e7c80 HUGEPMD
>   1 233746.382554150 11498  shm-fork 0        ffff93d606742910        0                0 HUGEPMD
>   0 233748.370968889 11497  shm-fork 0        ffff93d606742910        0 ffffb05bc1b3fc90 HUGEPMD
>   3 233750.359224507 11500  shm-fork 0        ffff93d606742910        0 ffffb05bc1ac7c80 HUGEPMD
>   2 233752.563115019 11499  shm-fork 0        ffff93d606742910        0                0 HUGEPMD
>
> With the patch, all processes take the read lock before searching the VMA interval tree,
> so they can search it simultaneously, as shown below:
>
> root@testvm:~/kprobe# /usr/share/bcc/tools/sharepmd
> kernel function: pcibios_pm_ops
>   6 471.183414969 1873   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   5 471.183428598 1875   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   4 471.183414908 1872   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   3 471.183414962 1871   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   6 476.111138155 1873   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   4 476.111007350 1872   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   1 476.111408005 1871   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   5 476.111808625 1875   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   5 480.421521033 1875   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   4 480.421618975 1872   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   1 480.421988461 1871   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   6 480.422808252 1873   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   4 484.732804127 1872   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   6 484.733568351 1873   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   5 484.733698494 1875   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   1 484.733473733 1871   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   4 489.037550955 1872   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   6 489.037880018 1873   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   5 489.038004825 1875   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>   1 489.038070483 1871   shm-fork 0        ffff896a271f6e00        0     7fea18c00000 HUGEPMD
>
> On Thu, Jun 4, 2020 at 6:13 PM Gavin Guo <gavin.guo@canonical.com> wrote:
>>
>> Hi Colin,
>>
>> On Thu, Jun 4, 2020 at 5:57 PM Colin Ian King <colin.king@canonical.com> wrote:
>> >
>> > On 04/06/2020 10:39, Gavin Guo wrote:
>> > > From: Waiman Long <longman@redhat.com>
>> > >
>> > > BugLink: https://bugs.launchpad.net/bugs/1882039
>> > >
>> > > A customer with large SMP systems (up to 16 sockets) running an
>> > > application that uses a large amount of static hugepages (~500-1500GB)
>> > > is experiencing random multisecond delays.  These delays were caused by
>> > > the long time it took to scan the VMA interval tree with mmap_sem held.
>> > >
>> > > The sharing of huge PMD does not require changes to the i_mmap at all.
>> > > Therefore, we can just take the read lock and let other threads
>> > > searching for the right VMA share it in parallel.  Once the right VMA is
>> > > found, either the PMD lock (2M huge page for x86-64) or the
>> > > mm->page_table_lock will be acquired to perform the actual PMD sharing.
>> > >
>> > > Lock contention, if present, will happen in the spinlock.  That is much
>> > > better than contention in the rwsem where the time needed to scan the
>> > > interval tree is indeterminate.
>> > >
>> > > With this patch applied, the customer is seeing significant performance
>> > > improvement over the unpatched kernel.
>> > >
>> > > Link: http://lkml.kernel.org/r/20191107211809.9539-1-longman@redhat.com
>> > > Signed-off-by: Waiman Long <longman@redhat.com>
>> > > Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
>> > > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
>> > > Cc: Davidlohr Bueso <dave@stgolabs.net>
>> > > Cc: Peter Zijlstra <peterz@infradead.org>
>> > > Cc: Ingo Molnar <mingo@redhat.com>
>> > > Cc: Will Deacon <will.deacon@arm.com>
>> > > Cc: Matthew Wilcox <willy@infradead.org>
>> > > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>> > > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>> > > (cherry picked from commit 930668c34408ba983049322e04f13f03b6f1fafa)
>> > > Signed-off-by: Gavin Guo <gavin.guo@canonical.com>
>> > > ---
>> > >  mm/hugetlb.c | 4 ++--
>> > >  1 file changed, 2 insertions(+), 2 deletions(-)
>> > >
>> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> > > index 2af1831596f2..9c871b47ef86 100644
>> > > --- a/mm/hugetlb.c
>> > > +++ b/mm/hugetlb.c
>> > > @@ -4891,7 +4891,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>> > >       if (!vma_shareable(vma, addr))
>> > >               return (pte_t *)pmd_alloc(mm, pud, addr);
>> > >
>> > > -     i_mmap_lock_write(mapping);
>> > > +     i_mmap_lock_read(mapping);
>> > >       vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
>> > >               if (svma == vma)
>> > >                       continue;
>> > > @@ -4921,7 +4921,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>> > >       spin_unlock(ptl);
>> > >  out:
>> > >       pte = (pte_t *)pmd_alloc(mm, pud, addr);
>> > > -     i_mmap_unlock_write(mapping);
>> > > +     i_mmap_unlock_read(mapping);
>> > >       return pte;
>> > >  }
>> > >
>> >
>> > I don't see any testing evidence of this working on B/E/F. Mind you, this
>> > looks straightforward, and that's what concerns me as a good win with
>> > minimal change. What testing has been performed on this fix?
>>
>> I agree that this is really straightforward. However, I have asked the
>> customer many times for detailed data, and this is all I have.
>> Currently, I cannot reproduce it locally. I'll keep trying, and also
>> ask the customer whether they can give me more details on how they
>> performed the testing.
>>
>> >
>> > Colin
>>
>> --
>> kernel-team mailing list
>> kernel-team@lists.ubuntu.com
>> https://lists.ubuntu.com/mailman/listinfo/kernel-team

As a supplement to Gerald's steps (thanks to Gerald for explaining the
idea), the reproducer source code can be found at:

shm-fork.c (To start the processes that use hugepages via the shared
memory mechanism; a minimal sketch of its core follows this list)
https://github.com/libhugetlbfs/libhugetlbfs/blob/master/tests/shm-fork.c

kprobe_hugepmd.c (To simulate the 2s search in the VMA tree)
http://paste.ubuntu.com/p/t3WqCQKgsR/

shm-fork.sh (To start the reproduction with 4 processes going through
1024 2MB hugepages)
http://paste.ubuntu.com/p/DVmS239KZk/

bcc tool sharepmd (To monitor the contention timing between the
processes)
http://paste.ubuntu.com/p/J5DfdTnfNJ/
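
For reference, here is a minimal sketch of what the shm-fork reproducer
boils down to (assumptions: 4 workers and 1024 preallocated 2MB hugepages,
matching shm-fork.sh above; see the real shm-fork.c for the full test):
every forked child attaches the same SHM_HUGETLB segment and touches each
hugepage, so the concurrent faults all reach huge_pmd_share() on the
shared mapping.
==================================================
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000	/* segment backed by hugetlb pages */
#endif

#define NPROCS   4
#define NPAGES   1024
#define HPAGE_SZ (2UL * 1024 * 1024)	/* 2MB hugepages */

int main(void)
{
	size_t size = NPAGES * HPAGE_SZ;
	int shmid = shmget(IPC_PRIVATE, size,
			   IPC_CREAT | SHM_HUGETLB | 0600);
	int i;

	if (shmid < 0) {
		perror("shmget");
		return 1;
	}
	for (i = 0; i < NPROCS; i++) {
		if (fork() == 0) {
			char *p = shmat(shmid, NULL, 0);
			size_t off;

			if (p == (void *)-1)
				_exit(1);
			/* fault in every hugepage; the concurrent
			 * faults contend on i_mmap in huge_pmd_share() */
			for (off = 0; off < size; off += HPAGE_SZ)
				p[off] = 1;
			shmdt(p);
			_exit(0);
		}
	}
	for (i = 0; i < NPROCS; i++)
		wait(NULL);
	shmctl(shmid, IPC_RMID, NULL);	/* remove the segment */
	return 0;
}
==================================================
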
Juerg Haefliger July 1, 2020, 7:24 a.m. UTC | #6
On Thu,  4 Jun 2020 17:39:31 +0800
Gavin Guo <gavin.guo@canonical.com> wrote:

> From: Waiman Long <longman@redhat.com>
> 
> BugLink: https://bugs.launchpad.net/bugs/1882039
> 
> A customer with large SMP systems (up to 16 sockets) running an
> application that uses a large amount of static hugepages (~500-1500GB)
> is experiencing random multisecond delays.  These delays were caused by
> the long time it took to scan the VMA interval tree with mmap_sem held.
> 
> The sharing of huge PMD does not require changes to the i_mmap at all.
> Therefore, we can just take the read lock and let other threads
> searching for the right VMA share it in parallel.  Once the right VMA is
> found, either the PMD lock (2M huge page for x86-64) or the
> mm->page_table_lock will be acquired to perform the actual PMD sharing.
> 
> Lock contention, if present, will happen in the spinlock.  That is much
> better than contention in the rwsem where the time needed to scan the
> interval tree is indeterminate.
> 
> With this patch applied, the customer is seeing significant performance
> improvement over the unpatched kernel.
> 
> Link: http://lkml.kernel.org/r/20191107211809.9539-1-longman@redhat.com
> Signed-off-by: Waiman Long <longman@redhat.com>
> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> (cherry picked from commit 930668c34408ba983049322e04f13f03b6f1fafa)
> Signed-off-by: Gavin Guo <gavin.guo@canonical.com>

Acked-by: Juerg Haefliger <juergh@canonical.com>


> ---
>  mm/hugetlb.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 2af1831596f2..9c871b47ef86 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4891,7 +4891,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  	if (!vma_shareable(vma, addr))
>  		return (pte_t *)pmd_alloc(mm, pud, addr);
>  
> -	i_mmap_lock_write(mapping);
> +	i_mmap_lock_read(mapping);
>  	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
>  		if (svma == vma)
>  			continue;
> @@ -4921,7 +4921,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>  	spin_unlock(ptl);
>  out:
>  	pte = (pte_t *)pmd_alloc(mm, pud, addr);
> -	i_mmap_unlock_write(mapping);
> +	i_mmap_unlock_read(mapping);
>  	return pte;
>  }
>

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2af1831596f2..9c871b47ef86 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4891,7 +4891,7 @@  pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	if (!vma_shareable(vma, addr))
 		return (pte_t *)pmd_alloc(mm, pud, addr);
 
-	i_mmap_lock_write(mapping);
+	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -4921,7 +4921,7 @@  pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	spin_unlock(ptl);
 out:
 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
-	i_mmap_unlock_write(mapping);
+	i_mmap_unlock_read(mapping);
 	return pte;
 }