[B,aws] UBUNTU SAUCE: mm: swap: improve swap readahead heuristic

Message ID 20191203105859.GA13534@xps-13
State New

Commit Message

Andrea Righi Dec. 3, 2019, 10:58 a.m. UTC
BugLink: https://bugs.launchpad.net/bugs/1831940

Apply a more aggressive swapin readahead policy to improve swapoff
performance.

The idea is to start with no readahead (only read one page) and linearly
increment the amount of readahead pages each time swapin_readahead() is
called, up to the maximum cluster size (defined by vm.page-cluster),
then go back to one page to give the disk enough time to prefetch the
requested pages and avoid re-requesting them multiple times.

Also increase the default vm.page-cluster size to 8 (that seems to work
better with this new heuristic).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
---
 mm/swap.c       |  2 +-
 mm/swap_state.c | 60 ++++++++-----------------------------------------
 2 files changed, 10 insertions(+), 52 deletions(-)

Comments

Connor Kuehl Dec. 6, 2019, 8:53 p.m. UTC | #1
On 12/3/19 2:58 AM, Andrea Righi wrote:
> BugLink: https://bugs.launchpad.net/bugs/1831940
> 
> Apply a more aggressive swapin readahead policy to improve swapoff
> performance.
> 
> The idea is to start with no readahead (only read one page) and linearly
> increment the amount of readahead pages each time swapin_readahead() is
> called, up to the maximum cluster size (defined by vm.page-cluster),
> then go back to one page to give the disk enough time to prefetch the
> requested pages and avoid re-requesting them multiple times.
> 
> Also increase the default vm.page-cluster size to 8 (that seems to work
> better with this new heuristic).
> 
> Signed-off-by: Andrea Righi <andrea.righi@canonical.com>

Was this patch included in the AMI that received positive test results? 
I remember a brief chat on IRC about a recent round of testing for this, 
but I wanted the mailing list to be informed of the positive results 
too :-)

Acked-by: Connor Kuehl <connor.kuehl@canonical.com>
Andrea Righi Dec. 6, 2019, 9:05 p.m. UTC | #2
On Fri, Dec 06, 2019 at 12:53:19PM -0800, Connor Kuehl wrote:
> On 12/3/19 2:58 AM, Andrea Righi wrote:
> > BugLink: https://bugs.launchpad.net/bugs/1831940
> > 
> > Apply a more aggressive swapin readahead policy to improve swapoff
> > performance.
> > 
> > The idea is to start with no readahead (only read one page) and linearly
> > increment the amount of readahead pages each time swapin_readahead() is
> > called, up to the maximum cluster size (defined by vm.page-cluster),
> > then go back to one page to give the disk enough time to prefetch the
> > requested pages and avoid re-requesting them multiple times.
> > 
> > Also increase the default vm.page-cluster size to 8 (that seems to work
> > better with this new heuristic).
> > 
> > Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
> 
> Was this patch included in the AMI that received positive test results?
> I remember a brief chat on IRC about a recent round of testing for this,
> but I wanted the mailing list to be informed of the positive results
> too :-)
> 
> Acked-by: Connor Kuehl <connor.kuehl@canonical.com>

Thanks Connor, good point, I should have mentioned it in the commit message.
This patch has been tested extensively and it is included in the AMI that
received the positive results.

-Andrea
Kamal Mostafa Dec. 11, 2019, 8:10 p.m. UTC | #3
LGTM.  As we've discussed, I suggest also submitting this upstream, maybe
enabled by a heuristic-selector switch.

Acked-by: Kamal Mostafa <kamal@canonical.com>

 -Kamal

On Tue, Dec 03, 2019 at 11:58:59AM +0100, Andrea Righi wrote:
> BugLink: https://bugs.launchpad.net/bugs/1831940
> 
> Apply a more aggressive swapin readahead policy to improve swapoff
> performance.
> 
> The idea is to start with no readahead (only read one page) and linearly
> increment the amount of readahead pages each time swapin_readahead() is
> called, up to the maximum cluster size (defined by vm.page-cluster),
> then go back to one page to give the disk enough time to prefetch the
> requested pages and avoid re-requesting them multiple times.
> 
> Also increase the default vm.page-cluster size to 8 (that seems to work
> better with this new heuristic).
> 
> Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
> [...]
Sultan Alsawaf Dec. 11, 2019, 9 p.m. UTC | #4
On Tue, Dec 03, 2019 at 11:58:59AM +0100, Andrea Righi wrote:
> BugLink: https://bugs.launchpad.net/bugs/1831940
> 
> [...]
> 
> +	pages = READ_ONCE(prev_pages) + 1;
> +	if (pages > max_pages) {
> +		WRITE_ONCE(prev_pages, 0);
> +		pages = max_pages;
> +	} else
> +		WRITE_ONCE(prev_pages, pages);
> 
> [...]

Something is not quite right in swapin_nr_pages(). The use of *_ONCE() suggests
that swapin_nr_pages() can be executed concurrently or in parallel on different
CPUs, in which cases there are major synchronization issues.

In the case of this function running concurrently, reads and writes of
`prev_pages` can be interleaved, which is probably not desired. In the case of
this function running in parallel on different CPUs, the updated value in
`prev_pages` on one CPU will not be reflected on another CPU due to a lack of
explicit memory barriers to guarantee multicopy atomicity. The atomic ops that
imply a memory barrier, guaranteeing multicopy atomicity, are the ones which
return a value (like atomic_xchg() and atomic_cmpxchg()); the others (like
atomic_read() and atomic_set()) do not.

If swapin_nr_pages() never executes concurrently or in parallel, then this patch
is safe as-is and the use of *_ONCE() should be removed. Otherwise, the body of
swapin_nr_pages() should either be converted into an atomic_cmpxchg() loop, or a
spin lock should be used.

Nacked-by: Sultan Alsawaf <sultan.alsawaf@canonical.com>
Sultan Alsawaf Dec. 11, 2019, 9:06 p.m. UTC | #5
On Wed, Dec 11, 2019 at 01:00:02PM -0800, Sultan Alsawaf wrote:
> On Tue, Dec 03, 2019 at 11:58:59AM +0100, Andrea Righi wrote:
> > [...]
> 
> Something is not quite right in swapin_nr_pages(). The use of *_ONCE() suggests
> that swapin_nr_pages() can be executed concurrently or in parallel on different
> CPUs, in which cases there are major synchronization issues.
> 
> In the case of this function running concurrently, reads and writes of
> `prev_pages` can be interleaved, which is probably not desired. In the case of
> this function running in parallel on different CPUs, the updated value in
> `prev_pages` on one CPU will not be reflected on another CPU due to a lack of
> explicit memory barriers to guarantee multicopy atomicity. The atomic ops that
> imply a memory barrier, guaranteeing multicopy atomicity, are the ones which
> return a value (like atomic_xchg() and atomic_cmpxchg()); the others (like
> atomic_read() and atomic_set()) do not.
> 
> If swapin_nr_pages() never executes concurrently or in parallel, then this patch
> is safe as-is and the use of *_ONCE() should be removed. Otherwise, the body of
> swapin_nr_pages() should either be converted into an atomic_cmpxchg() loop, or a
> spin lock should be used.
> 
> Nacked-by: Sultan Alsawaf <sultan.alsawaf@canonical.com>

Forgot to mention that the *_ONCE() macros do not imply memory barriers. They
alone are not enough for multicopy atomicity. They are great for deterring
compiler mischief, where the compiler might insert multiple reads/writes when
only one is desired, but I don't think that use case applies here...

Sultan
Andrea Righi Dec. 12, 2019, 7:21 a.m. UTC | #6
On Wed, Dec 11, 2019 at 01:06:16PM -0800, Sultan Alsawaf wrote:
> On Wed, Dec 11, 2019 at 01:00:02PM -0800, Sultan Alsawaf wrote:
> > On Tue, Dec 03, 2019 at 11:58:59AM +0100, Andrea Righi wrote:
> > > [...]
> > 
> > Something is not quite right in swapin_nr_pages(). The use of *_ONCE() suggests
> > that swapin_nr_pages() can be executed concurrently or in parallel on different
> > CPUs, in which cases there are major synchronization issues.
> > 
> > In the case of this function running concurrently, reads and writes of
> > `prev_pages` can be interleaved, which is probably not desired. In the case of
> > this function running in parallel on different CPUs, the updated value in
> > `prev_pages` on one CPU will not be reflected on another CPU due to a lack of
> > explicit memory barriers to guarantee multicopy atomicity. The atomic ops that
> > imply a memory barrier, guaranteeing multicopy atomicity, are the ones which
> > return a value (like atomic_xchg() and atomic_cmpxchg()); the others (like
> > atomic_read() and atomic_set()) do not.
> > 
> > If swapin_nr_pages() never executes concurrently or in parallel, then this patch
> > is safe as-is and the use of *_ONCE() should be removed. Otherwise, the body of
> > swapin_nr_pages() should either be converted into an atomic_cmpxchg() loop, or a
> > spin lock should be used.
> > 
> > Nacked-by: Sultan Alsawaf <sultan.alsawaf@canonical.com>
> 
> > Forgot to mention that the *_ONCE() macros do not imply memory barriers. They
> > alone are not enough for multicopy atomicity. They are great for deterring
> > compiler mischief, where the compiler might insert multiple reads/writes when
> > only one is desired, but I don't think that use case applies here...

As already mentioned on IRC (repeating here for completeness), we don't
really need to be extra precise here; in the end it's just a heuristic to
determine the number of pages to prefetch, and the idea is simply to make
prefetching more aggressive.

The idea behind the *_ONCE() was to make sure the individual reads/writes of
prev_pages were consistent, but I don't expect surprises if we just drop
them.

So I'll re-run the tests with those removed, and if everything works as
expected I'll send a v2.

Thanks for the review!
-Andrea

Patch

diff --git a/mm/swap.c b/mm/swap.c
index abc82e6c14d1..5603bc987ef0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1022,7 +1022,7 @@  void __init swap_setup(void)
 	if (megs < 16)
 		page_cluster = 2;
 	else
-		page_cluster = 3;
+		page_cluster = 8;
 	/*
 	 * Right now other parts of the system means that we
 	 * _really_ don't want to cluster much more
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 6dac8c6ee6d9..a2246bcebc77 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -472,62 +472,21 @@  struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	return retpage;
 }
 
-static unsigned int __swapin_nr_pages(unsigned long prev_offset,
-				      unsigned long offset,
-				      int hits,
-				      int max_pages,
-				      int prev_win)
-{
-	unsigned int pages, last_ra;
-
-	/*
-	 * This heuristic has been found to work well on both sequential and
-	 * random loads, swapping to hard disk or to SSD: please don't ask
-	 * what the "+ 2" means, it just happens to work well, that's all.
-	 */
-	pages = hits + 2;
-	if (pages == 2) {
-		/*
-		 * We can have no readahead hits to judge by: but must not get
-		 * stuck here forever, so check for an adjacent offset instead
-		 * (and don't even bother to check whether swap type is same).
-		 */
-		if (offset != prev_offset + 1 && offset != prev_offset - 1)
-			pages = 1;
-	} else {
-		unsigned int roundup = 4;
-		while (roundup < pages)
-			roundup <<= 1;
-		pages = roundup;
-	}
-
-	if (pages > max_pages)
-		pages = max_pages;
-
-	/* Don't shrink readahead too fast */
-	last_ra = prev_win / 2;
-	if (pages < last_ra)
-		pages = last_ra;
-
-	return pages;
-}
-
 static unsigned long swapin_nr_pages(unsigned long offset)
 {
-	static unsigned long prev_offset;
-	unsigned int hits, pages, max_pages;
-	static atomic_t last_readahead_pages;
+	static unsigned int prev_pages;
+	unsigned long pages, max_pages;
 
 	max_pages = 1 << READ_ONCE(page_cluster);
 	if (max_pages <= 1)
 		return 1;
 
-	hits = atomic_xchg(&swapin_readahead_hits, 0);
-	pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages,
-				  atomic_read(&last_readahead_pages));
-	if (!hits)
-		prev_offset = offset;
-	atomic_set(&last_readahead_pages, pages);
+	pages = READ_ONCE(prev_pages) + 1;
+	if (pages > max_pages) {
+		WRITE_ONCE(prev_pages, 0);
+		pages = max_pages;
+	} else
+		WRITE_ONCE(prev_pages, pages);
 
 	return pages;
 }
@@ -684,8 +643,7 @@  struct page *swap_readahead_detect(struct vm_fault *vmf,
 	pfn = PFN_DOWN(SWAP_RA_ADDR(swap_ra_info));
 	prev_win = SWAP_RA_WIN(swap_ra_info);
 	hits = SWAP_RA_HITS(swap_ra_info);
-	swap_ra->win = win = __swapin_nr_pages(pfn, fpfn, hits,
-					       max_win, prev_win);
+	swap_ra->win = win = swapin_nr_pages(fpfn);
 	atomic_long_set(&vma->swap_readahead_info,
 			SWAP_RA_VAL(faddr, win, 0));