Message ID | 20191203105859.GA13534@xps-13 |
---|---|
State | New |
Headers | show |
Series | [B,aws] UBUNTU SAUCE: mm: swap: improve swap readahead heuristic | expand |
On 12/3/19 2:58 AM, Andrea Righi wrote: > BugLink: https://bugs.launchpad.net/bugs/1831940 > > Apply a more aggressive swapin readahead policy to improve swapoff > performance. > > The idea is to start with no readahead (only read one page) and linearly > increment the amount of readahead pages each time swapin_readahead() is > called, up to the maximum cluster size (defined by vm.page-cluster), > then go back to one page to give the disk enough time to prefetch the > requested pages and avoid re-requesting them multiple times. > > Also increase the default vm.page-cluster size to 8 (that seems to work > better with this new heuristic). > > Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Was this patch was included in the AMI that received positive test results? I remember having a small chat about a recent round of testing for this on IRC but also wanted the mailing list to be informed of the positive results too :-) Acked-by: Connor Kuehl <connor.kuehl@canonical.com>
On Fri, Dec 06, 2019 at 12:53:19PM -0800, Connor Kuehl wrote: > On 12/3/19 2:58 AM, Andrea Righi wrote: > > BugLink: https://bugs.launchpad.net/bugs/1831940 > > > > Apply a more aggressive swapin readahead policy to improve swapoff > > performance. > > > > The idea is to start with no readahead (only read one page) and linearly > > increment the amount of readahead pages each time swapin_readahead() is > > called, up to the maximum cluster size (defined by vm.page-cluster), > > then go back to one page to give the disk enough time to prefetch the > > requested pages and avoid re-requesting them multiple times. > > > > Also increase the default vm.page-cluster size to 8 (that seems to work > > better with this new heuristic). > > > > Signed-off-by: Andrea Righi <andrea.righi@canonical.com> > > Was this patch was included in the AMI that received positive test results? > I remember having a small chat about a recent round of testing for this on > IRC but also wanted the mailing list to be informed of the positive results > too :-) > > Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Thanks Connor, good point, I should have mentioned it in the comment. This patch has been tested a lot and it is included in the AMI that received positive results. -Andrea
LGTM. As we've discussed, I suggest also submitting this upstream, maybe enabled by a heuristic-selector switch. Acked-by: Kamal Mostafa <kamal@canonical.com> -Kamal On Tue, Dec 03, 2019 at 11:58:59AM +0100, Andrea Righi wrote: > BugLink: https://bugs.launchpad.net/bugs/1831940 > > Apply a more aggressive swapin readahead policy to improve swapoff > performance. > > The idea is to start with no readahead (only read one page) and linearly > increment the amount of readahead pages each time swapin_readahead() is > called, up to the maximum cluster size (defined by vm.page-cluster), > then go back to one page to give the disk enough time to prefetch the > requested pages and avoid re-requesting them multiple times. > > Also increase the default vm.page-cluster size to 8 (that seems to work > better with this new heuristic). > > Signed-off-by: Andrea Righi <andrea.righi@canonical.com> > --- > mm/swap.c | 2 +- > mm/swap_state.c | 60 ++++++++----------------------------------------- > 2 files changed, 10 insertions(+), 52 deletions(-) > > diff --git a/mm/swap.c b/mm/swap.c > index abc82e6c14d1..5603bc987ef0 100644 > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -1022,7 +1022,7 @@ void __init swap_setup(void) > if (megs < 16) > page_cluster = 2; > else > - page_cluster = 3; > + page_cluster = 8; > /* > * Right now other parts of the system means that we > * _really_ don't want to cluster much more > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 6dac8c6ee6d9..a2246bcebc77 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -472,62 +472,21 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > return retpage; > } > > -static unsigned int __swapin_nr_pages(unsigned long prev_offset, > - unsigned long offset, > - int hits, > - int max_pages, > - int prev_win) > -{ > - unsigned int pages, last_ra; > - > - /* > - * This heuristic has been found to work well on both sequential and > - * random loads, swapping to hard disk or to SSD: please don't ask > - * what the "+ 2" means, it just happens to work well, that's all. > - */ > - pages = hits + 2; > - if (pages == 2) { > - /* > - * We can have no readahead hits to judge by: but must not get > - * stuck here forever, so check for an adjacent offset instead > - * (and don't even bother to check whether swap type is same). > - */ > - if (offset != prev_offset + 1 && offset != prev_offset - 1) > - pages = 1; > - } else { > - unsigned int roundup = 4; > - while (roundup < pages) > - roundup <<= 1; > - pages = roundup; > - } > - > - if (pages > max_pages) > - pages = max_pages; > - > - /* Don't shrink readahead too fast */ > - last_ra = prev_win / 2; > - if (pages < last_ra) > - pages = last_ra; > - > - return pages; > -} > - > static unsigned long swapin_nr_pages(unsigned long offset) > { > - static unsigned long prev_offset; > - unsigned int hits, pages, max_pages; > - static atomic_t last_readahead_pages; > + static unsigned int prev_pages; > + unsigned long pages, max_pages; > > max_pages = 1 << READ_ONCE(page_cluster); > if (max_pages <= 1) > return 1; > > - hits = atomic_xchg(&swapin_readahead_hits, 0); > - pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages, > - atomic_read(&last_readahead_pages)); > - if (!hits) > - prev_offset = offset; > - atomic_set(&last_readahead_pages, pages); > + pages = READ_ONCE(prev_pages) + 1; > + if (pages > max_pages) { > + WRITE_ONCE(prev_pages, 0); > + pages = max_pages; > + } else > + WRITE_ONCE(prev_pages, pages); > > return pages; > } > @@ -684,8 +643,7 @@ struct page *swap_readahead_detect(struct vm_fault *vmf, > pfn = PFN_DOWN(SWAP_RA_ADDR(swap_ra_info)); > prev_win = SWAP_RA_WIN(swap_ra_info); > hits = SWAP_RA_HITS(swap_ra_info); > - swap_ra->win = win = __swapin_nr_pages(pfn, fpfn, hits, > - max_win, prev_win); > + swap_ra->win = win = swapin_nr_pages(fpfn); > atomic_long_set(&vma->swap_readahead_info, > SWAP_RA_VAL(faddr, win, 0)); > > -- > 2.17.1 > > > -- > kernel-team mailing list > kernel-team@lists.ubuntu.com > https://lists.ubuntu.com/mailman/listinfo/kernel-team
On Tue, Dec 03, 2019 at 11:58:59AM +0100, Andrea Righi wrote: > BugLink: https://bugs.launchpad.net/bugs/1831940 > > Apply a more aggressive swapin readahead policy to improve swapoff > performance. > > The idea is to start with no readahead (only read one page) and linearly > increment the amount of readahead pages each time swapin_readahead() is > called, up to the maximum cluster size (defined by vm.page-cluster), > then go back to one page to give the disk enough time to prefetch the > requested pages and avoid re-requesting them multiple times. > > Also increase the default vm.page-cluster size to 8 (that seems to work > better with this new heuristic). > > Signed-off-by: Andrea Righi <andrea.righi@canonical.com> > --- > mm/swap.c | 2 +- > mm/swap_state.c | 60 ++++++++----------------------------------------- > 2 files changed, 10 insertions(+), 52 deletions(-) > > diff --git a/mm/swap.c b/mm/swap.c > index abc82e6c14d1..5603bc987ef0 100644 > --- a/mm/swap.c > +++ b/mm/swap.c > @@ -1022,7 +1022,7 @@ void __init swap_setup(void) > if (megs < 16) > page_cluster = 2; > else > - page_cluster = 3; > + page_cluster = 8; > /* > * Right now other parts of the system means that we > * _really_ don't want to cluster much more > diff --git a/mm/swap_state.c b/mm/swap_state.c > index 6dac8c6ee6d9..a2246bcebc77 100644 > --- a/mm/swap_state.c > +++ b/mm/swap_state.c > @@ -472,62 +472,21 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > return retpage; > } > > -static unsigned int __swapin_nr_pages(unsigned long prev_offset, > - unsigned long offset, > - int hits, > - int max_pages, > - int prev_win) > -{ > - unsigned int pages, last_ra; > - > - /* > - * This heuristic has been found to work well on both sequential and > - * random loads, swapping to hard disk or to SSD: please don't ask > - * what the "+ 2" means, it just happens to work well, that's all. > - */ > - pages = hits + 2; > - if (pages == 2) { > - /* > - * We can have no readahead hits to judge by: but must not get > - * stuck here forever, so check for an adjacent offset instead > - * (and don't even bother to check whether swap type is same). > - */ > - if (offset != prev_offset + 1 && offset != prev_offset - 1) > - pages = 1; > - } else { > - unsigned int roundup = 4; > - while (roundup < pages) > - roundup <<= 1; > - pages = roundup; > - } > - > - if (pages > max_pages) > - pages = max_pages; > - > - /* Don't shrink readahead too fast */ > - last_ra = prev_win / 2; > - if (pages < last_ra) > - pages = last_ra; > - > - return pages; > -} > - > static unsigned long swapin_nr_pages(unsigned long offset) > { > - static unsigned long prev_offset; > - unsigned int hits, pages, max_pages; > - static atomic_t last_readahead_pages; > + static unsigned int prev_pages; > + unsigned long pages, max_pages; > > max_pages = 1 << READ_ONCE(page_cluster); > if (max_pages <= 1) > return 1; > > - hits = atomic_xchg(&swapin_readahead_hits, 0); > - pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages, > - atomic_read(&last_readahead_pages)); > - if (!hits) > - prev_offset = offset; > - atomic_set(&last_readahead_pages, pages); > + pages = READ_ONCE(prev_pages) + 1; > + if (pages > max_pages) { > + WRITE_ONCE(prev_pages, 0); > + pages = max_pages; > + } else > + WRITE_ONCE(prev_pages, pages); > > return pages; > } > @@ -684,8 +643,7 @@ struct page *swap_readahead_detect(struct vm_fault *vmf, > pfn = PFN_DOWN(SWAP_RA_ADDR(swap_ra_info)); > prev_win = SWAP_RA_WIN(swap_ra_info); > hits = SWAP_RA_HITS(swap_ra_info); > - swap_ra->win = win = __swapin_nr_pages(pfn, fpfn, hits, > - max_win, prev_win); > + swap_ra->win = win = swapin_nr_pages(fpfn); > atomic_long_set(&vma->swap_readahead_info, > SWAP_RA_VAL(faddr, win, 0)); > > -- > 2.17.1 > > > -- > kernel-team mailing list > kernel-team@lists.ubuntu.com > https://lists.ubuntu.com/mailman/listinfo/kernel-team Something is not quite right in swapin_nr_pages(). The use of *_ONCE() suggests that swapin_nr_pages() can be executed concurrently or in parallel on different CPUs, in which cases there are major synchronization issues. In the case of this function running concurrently, reads and writes of `prev_pages` can be interleaved, which is probably not desired. In the case of this function running in parallel on different CPUs, the updated value in `prev_pages` on one CPU will not be reflected on another CPU due to a lack of explicit memory barriers to guarantee multicopy atomicity. The atomic ops that imply a memory barrier, guaranteeing multicopy atomicity, are the ones which return a value (like atomic_xchg() and atomic_cmpxchg()); the others (like atomic_read() and atomic_set()) do not. If swapin_nr_pages() never executes concurrently or in parallel, then this patch is safe as-is and the use of *_ONCE() should be removed. Otherwise, the body of swapin_nr_pages() should either be converted into an atomic_cmpxchg() loop, or a spin lock should be used. Nacked-by: Sultan Alsawaf <sultan.alsawaf@canonical.com>
On Wed, Dec 11, 2019 at 01:00:02PM -0800, Sultan Alsawaf wrote: > On Tue, Dec 03, 2019 at 11:58:59AM +0100, Andrea Righi wrote: > > BugLink: https://bugs.launchpad.net/bugs/1831940 > > > > Apply a more aggressive swapin readahead policy to improve swapoff > > performance. > > > > The idea is to start with no readahead (only read one page) and linearly > > increment the amount of readahead pages each time swapin_readahead() is > > called, up to the maximum cluster size (defined by vm.page-cluster), > > then go back to one page to give the disk enough time to prefetch the > > requested pages and avoid re-requesting them multiple times. > > > > Also increase the default vm.page-cluster size to 8 (that seems to work > > better with this new heuristic). > > > > Signed-off-by: Andrea Righi <andrea.righi@canonical.com> > > --- > > mm/swap.c | 2 +- > > mm/swap_state.c | 60 ++++++++----------------------------------------- > > 2 files changed, 10 insertions(+), 52 deletions(-) > > > > diff --git a/mm/swap.c b/mm/swap.c > > index abc82e6c14d1..5603bc987ef0 100644 > > --- a/mm/swap.c > > +++ b/mm/swap.c > > @@ -1022,7 +1022,7 @@ void __init swap_setup(void) > > if (megs < 16) > > page_cluster = 2; > > else > > - page_cluster = 3; > > + page_cluster = 8; > > /* > > * Right now other parts of the system means that we > > * _really_ don't want to cluster much more > > diff --git a/mm/swap_state.c b/mm/swap_state.c > > index 6dac8c6ee6d9..a2246bcebc77 100644 > > --- a/mm/swap_state.c > > +++ b/mm/swap_state.c > > @@ -472,62 +472,21 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > > return retpage; > > } > > > > -static unsigned int __swapin_nr_pages(unsigned long prev_offset, > > - unsigned long offset, > > - int hits, > > - int max_pages, > > - int prev_win) > > -{ > > - unsigned int pages, last_ra; > > - > > - /* > > - * This heuristic has been found to work well on both sequential and > > - * random loads, swapping to hard disk or to SSD: please don't ask > > - * what the "+ 2" means, it just happens to work well, that's all. > > - */ > > - pages = hits + 2; > > - if (pages == 2) { > > - /* > > - * We can have no readahead hits to judge by: but must not get > > - * stuck here forever, so check for an adjacent offset instead > > - * (and don't even bother to check whether swap type is same). > > - */ > > - if (offset != prev_offset + 1 && offset != prev_offset - 1) > > - pages = 1; > > - } else { > > - unsigned int roundup = 4; > > - while (roundup < pages) > > - roundup <<= 1; > > - pages = roundup; > > - } > > - > > - if (pages > max_pages) > > - pages = max_pages; > > - > > - /* Don't shrink readahead too fast */ > > - last_ra = prev_win / 2; > > - if (pages < last_ra) > > - pages = last_ra; > > - > > - return pages; > > -} > > - > > static unsigned long swapin_nr_pages(unsigned long offset) > > { > > - static unsigned long prev_offset; > > - unsigned int hits, pages, max_pages; > > - static atomic_t last_readahead_pages; > > + static unsigned int prev_pages; > > + unsigned long pages, max_pages; > > > > max_pages = 1 << READ_ONCE(page_cluster); > > if (max_pages <= 1) > > return 1; > > > > - hits = atomic_xchg(&swapin_readahead_hits, 0); > > - pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages, > > - atomic_read(&last_readahead_pages)); > > - if (!hits) > > - prev_offset = offset; > > - atomic_set(&last_readahead_pages, pages); > > + pages = READ_ONCE(prev_pages) + 1; > > + if (pages > max_pages) { > > + WRITE_ONCE(prev_pages, 0); > > + pages = max_pages; > > + } else > > + WRITE_ONCE(prev_pages, pages); > > > > return pages; > > } > > @@ -684,8 +643,7 @@ struct page *swap_readahead_detect(struct vm_fault *vmf, > > pfn = PFN_DOWN(SWAP_RA_ADDR(swap_ra_info)); > > prev_win = SWAP_RA_WIN(swap_ra_info); > > hits = SWAP_RA_HITS(swap_ra_info); > > - swap_ra->win = win = __swapin_nr_pages(pfn, fpfn, hits, > > - max_win, prev_win); > > + swap_ra->win = win = swapin_nr_pages(fpfn); > > atomic_long_set(&vma->swap_readahead_info, > > SWAP_RA_VAL(faddr, win, 0)); > > > > -- > > 2.17.1 > > > > > > -- > > kernel-team mailing list > > kernel-team@lists.ubuntu.com > > https://lists.ubuntu.com/mailman/listinfo/kernel-team > > Something is not quite right in swapin_nr_pages(). The use of *_ONCE() suggests > that swapin_nr_pages() can be executed concurrently or in parallel on different > CPUs, in which cases there are major synchronization issues. > > In the case of this function running concurrently, reads and writes of > `prev_pages` can be interleaved, which is probably not desired. In the case of > this function running in parallel on different CPUs, the updated value in > `prev_pages` on one CPU will not be reflected on another CPU due to a lack of > explicit memory barriers to guarantee multicopy atomicity. The atomic ops that > imply a memory barrier, guaranteeing multicopy atomicity, are the ones which > return a value (like atomic_xchg() and atomic_cmpxchg()); the others (like > atomic_read() and atomic_set()) do not. > > If swapin_nr_pages() never executes concurrently or in parallel, then this patch > is safe as-is and the use of *_ONCE() should be removed. Otherwise, the body of > swapin_nr_pages() should either be converted into an atomic_cmpxchg() loop, or a > spin lock should be used. > > Nacked-by: Sultan Alsawaf <sultan.alsawaf@canonical.com> Forgot to mention that the *_ONCE() macros do not imply memory barriers. They alone are not enough for multicopy atomicity, though they're great for deterring compiler mischief where the compiler might insert multiple reads/writes when only one is desired, though I don't think that usecase applies here... Sultan
On Wed, Dec 11, 2019 at 01:06:16PM -0800, Sultan Alsawaf wrote: > On Wed, Dec 11, 2019 at 01:00:02PM -0800, Sultan Alsawaf wrote: > > On Tue, Dec 03, 2019 at 11:58:59AM +0100, Andrea Righi wrote: > > > BugLink: https://bugs.launchpad.net/bugs/1831940 > > > > > > Apply a more aggressive swapin readahead policy to improve swapoff > > > performance. > > > > > > The idea is to start with no readahead (only read one page) and linearly > > > increment the amount of readahead pages each time swapin_readahead() is > > > called, up to the maximum cluster size (defined by vm.page-cluster), > > > then go back to one page to give the disk enough time to prefetch the > > > requested pages and avoid re-requesting them multiple times. > > > > > > Also increase the default vm.page-cluster size to 8 (that seems to work > > > better with this new heuristic). > > > > > > Signed-off-by: Andrea Righi <andrea.righi@canonical.com> > > > --- > > > mm/swap.c | 2 +- > > > mm/swap_state.c | 60 ++++++++----------------------------------------- > > > 2 files changed, 10 insertions(+), 52 deletions(-) > > > > > > diff --git a/mm/swap.c b/mm/swap.c > > > index abc82e6c14d1..5603bc987ef0 100644 > > > --- a/mm/swap.c > > > +++ b/mm/swap.c > > > @@ -1022,7 +1022,7 @@ void __init swap_setup(void) > > > if (megs < 16) > > > page_cluster = 2; > > > else > > > - page_cluster = 3; > > > + page_cluster = 8; > > > /* > > > * Right now other parts of the system means that we > > > * _really_ don't want to cluster much more > > > diff --git a/mm/swap_state.c b/mm/swap_state.c > > > index 6dac8c6ee6d9..a2246bcebc77 100644 > > > --- a/mm/swap_state.c > > > +++ b/mm/swap_state.c > > > @@ -472,62 +472,21 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, > > > return retpage; > > > } > > > > > > -static unsigned int __swapin_nr_pages(unsigned long prev_offset, > > > - unsigned long offset, > > > - int hits, > > > - int max_pages, > > > - int prev_win) > > > -{ > > > - unsigned int pages, last_ra; > > > - > > > - /* > > > - * This heuristic has been found to work well on both sequential and > > > - * random loads, swapping to hard disk or to SSD: please don't ask > > > - * what the "+ 2" means, it just happens to work well, that's all. > > > - */ > > > - pages = hits + 2; > > > - if (pages == 2) { > > > - /* > > > - * We can have no readahead hits to judge by: but must not get > > > - * stuck here forever, so check for an adjacent offset instead > > > - * (and don't even bother to check whether swap type is same). > > > - */ > > > - if (offset != prev_offset + 1 && offset != prev_offset - 1) > > > - pages = 1; > > > - } else { > > > - unsigned int roundup = 4; > > > - while (roundup < pages) > > > - roundup <<= 1; > > > - pages = roundup; > > > - } > > > - > > > - if (pages > max_pages) > > > - pages = max_pages; > > > - > > > - /* Don't shrink readahead too fast */ > > > - last_ra = prev_win / 2; > > > - if (pages < last_ra) > > > - pages = last_ra; > > > - > > > - return pages; > > > -} > > > - > > > static unsigned long swapin_nr_pages(unsigned long offset) > > > { > > > - static unsigned long prev_offset; > > > - unsigned int hits, pages, max_pages; > > > - static atomic_t last_readahead_pages; > > > + static unsigned int prev_pages; > > > + unsigned long pages, max_pages; > > > > > > max_pages = 1 << READ_ONCE(page_cluster); > > > if (max_pages <= 1) > > > return 1; > > > > > > - hits = atomic_xchg(&swapin_readahead_hits, 0); > > > - pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages, > > > - atomic_read(&last_readahead_pages)); > > > - if (!hits) > > > - prev_offset = offset; > > > - atomic_set(&last_readahead_pages, pages); > > > + pages = READ_ONCE(prev_pages) + 1; > > > + if (pages > max_pages) { > > > + WRITE_ONCE(prev_pages, 0); > > > + pages = max_pages; > > > + } else > > > + WRITE_ONCE(prev_pages, pages); > > > > > > return pages; > > > } > > > @@ -684,8 +643,7 @@ struct page *swap_readahead_detect(struct vm_fault *vmf, > > > pfn = PFN_DOWN(SWAP_RA_ADDR(swap_ra_info)); > > > prev_win = SWAP_RA_WIN(swap_ra_info); > > > hits = SWAP_RA_HITS(swap_ra_info); > > > - swap_ra->win = win = __swapin_nr_pages(pfn, fpfn, hits, > > > - max_win, prev_win); > > > + swap_ra->win = win = swapin_nr_pages(fpfn); > > > atomic_long_set(&vma->swap_readahead_info, > > > SWAP_RA_VAL(faddr, win, 0)); > > > > > > -- > > > 2.17.1 > > > > > > > > > -- > > > kernel-team mailing list > > > kernel-team@lists.ubuntu.com > > > https://lists.ubuntu.com/mailman/listinfo/kernel-team > > > > Something is not quite right in swapin_nr_pages(). The use of *_ONCE() suggests > > that swapin_nr_pages() can be executed concurrently or in parallel on different > > CPUs, in which cases there are major synchronization issues. > > > > In the case of this function running concurrently, reads and writes of > > `prev_pages` can be interleaved, which is probably not desired. In the case of > > this function running in parallel on different CPUs, the updated value in > > `prev_pages` on one CPU will not be reflected on another CPU due to a lack of > > explicit memory barriers to guarantee multicopy atomicity. The atomic ops that > > imply a memory barrier, guaranteeing multicopy atomicity, are the ones which > > return a value (like atomic_xchg() and atomic_cmpxchg()); the others (like > > atomic_read() and atomic_set()) do not. > > > > If swapin_nr_pages() never executes concurrently or in parallel, then this patch > > is safe as-is and the use of *_ONCE() should be removed. Otherwise, the body of > > swapin_nr_pages() should either be converted into an atomic_cmpxchg() loop, or a > > spin lock should be used. > > > > Nacked-by: Sultan Alsawaf <sultan.alsawaf@canonical.com> > > Forgot to mention that the *_ONCE() macros do not imply memory barriers. They > alone are not enough for multicopy atomicity, though they're great for deterring > compiler mischief where the compiler might insert multiple reads/writes when > only one is desired, though I don't think that usecase applies here... As already mentioned on IRC (but also reporting here for completeness) we don't really need to be extra precise here, at the end it's just an heuristic to determine the size of the pages to prefetch and the idea is simply to have a more aggressive prefetching. The idea of the *_ONCE() was to make sure the individual reads/writes of prev_pages were consistent, but I don't expect surprises if we just drop them. So, I'll repeat the tests removing those and if everything is working as expected I'll send a v2. Thanks for the review! -Andrea
diff --git a/mm/swap.c b/mm/swap.c index abc82e6c14d1..5603bc987ef0 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -1022,7 +1022,7 @@ void __init swap_setup(void) if (megs < 16) page_cluster = 2; else - page_cluster = 3; + page_cluster = 8; /* * Right now other parts of the system means that we * _really_ don't want to cluster much more diff --git a/mm/swap_state.c b/mm/swap_state.c index 6dac8c6ee6d9..a2246bcebc77 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -472,62 +472,21 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, return retpage; } -static unsigned int __swapin_nr_pages(unsigned long prev_offset, - unsigned long offset, - int hits, - int max_pages, - int prev_win) -{ - unsigned int pages, last_ra; - - /* - * This heuristic has been found to work well on both sequential and - * random loads, swapping to hard disk or to SSD: please don't ask - * what the "+ 2" means, it just happens to work well, that's all. - */ - pages = hits + 2; - if (pages == 2) { - /* - * We can have no readahead hits to judge by: but must not get - * stuck here forever, so check for an adjacent offset instead - * (and don't even bother to check whether swap type is same). - */ - if (offset != prev_offset + 1 && offset != prev_offset - 1) - pages = 1; - } else { - unsigned int roundup = 4; - while (roundup < pages) - roundup <<= 1; - pages = roundup; - } - - if (pages > max_pages) - pages = max_pages; - - /* Don't shrink readahead too fast */ - last_ra = prev_win / 2; - if (pages < last_ra) - pages = last_ra; - - return pages; -} - static unsigned long swapin_nr_pages(unsigned long offset) { - static unsigned long prev_offset; - unsigned int hits, pages, max_pages; - static atomic_t last_readahead_pages; + static unsigned int prev_pages; + unsigned long pages, max_pages; max_pages = 1 << READ_ONCE(page_cluster); if (max_pages <= 1) return 1; - hits = atomic_xchg(&swapin_readahead_hits, 0); - pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages, - atomic_read(&last_readahead_pages)); - if (!hits) - prev_offset = offset; - atomic_set(&last_readahead_pages, pages); + pages = READ_ONCE(prev_pages) + 1; + if (pages > max_pages) { + WRITE_ONCE(prev_pages, 0); + pages = max_pages; + } else + WRITE_ONCE(prev_pages, pages); return pages; } @@ -684,8 +643,7 @@ struct page *swap_readahead_detect(struct vm_fault *vmf, pfn = PFN_DOWN(SWAP_RA_ADDR(swap_ra_info)); prev_win = SWAP_RA_WIN(swap_ra_info); hits = SWAP_RA_HITS(swap_ra_info); - swap_ra->win = win = __swapin_nr_pages(pfn, fpfn, hits, - max_win, prev_win); + swap_ra->win = win = swapin_nr_pages(fpfn); atomic_long_set(&vma->swap_readahead_info, SWAP_RA_VAL(faddr, win, 0));
BugLink: https://bugs.launchpad.net/bugs/1831940 Apply a more aggressive swapin readahead policy to improve swapoff performance. The idea is to start with no readahead (only read one page) and linearly increment the amount of readahead pages each time swapin_readahead() is called, up to the maximum cluster size (defined by vm.page-cluster), then go back to one page to give the disk enough time to prefetch the requested pages and avoid re-requesting them multiple times. Also increase the default vm.page-cluster size to 8 (that seems to work better with this new heuristic). Signed-off-by: Andrea Righi <andrea.righi@canonical.com> --- mm/swap.c | 2 +- mm/swap_state.c | 60 ++++++++----------------------------------------- 2 files changed, 10 insertions(+), 52 deletions(-)