diff mbox

page eviction from the buddy cache

Message ID 7398CEE9-AF68-4A2A-82E4-940FADF81F97@gmail.com
State New, archived
Headers show

Commit Message

Alexey Lyahkov April 25, 2013, 6:37 p.m. UTC
Mel,


On Apr 25, 2013, at 17:30, Mel Gorman wrote:

> On Wed, Apr 24, 2013 at 10:26:50AM -0400, Theodore Ts'o wrote:
>> On Tue, Apr 23, 2013 at 03:00:08PM -0700, Andrew Morton wrote:
>>> That should fix things for now.  Although it might be better to just do
>>> 
>>> 	mark_page_accessed(page);	/* to SetPageReferenced */
>>> 	lru_add_drain();		/* to SetPageLRU */
>>> 
>>> Because a) this was too early to decide that the page is
>>> super-important and b) the second touch of this page should have a
>>> mark_page_accessed() in it already.
>> 
>> The question is do we really want to put lru_add_drain() into the ext4
>> file system code?  That seems to pushing some fairly mm-specific
>> knowledge into file system code.  I'll do this if I have to do, but
>> wouldn't be better if this was pushed into mark_page_accessed(), or
>> some other new API was exported by the mm subsystem?
>> 
> 
> I don't think we want to push lru_add_drain() into the ext4 code. It's
> too specific of knowledge just to work around pagevecs. Before we rework
> how pagevecs select what LRU to place a page, can we make sure that fixing
> that will fix the problem?
> 
what is "that"? puting lru_add_drain() in ext4 core? sure that is fixes problem with many small reads during large write.
originally i have put shake_page() in ext4 code, but that have call lru_add_drain_all() so to exaggerated.



additional i_state = I_NEW need to prevent kill page cache from sysctl -w vm.drop_caches=3

> Andrew, can you try the following patch please? Also, is there any chance
> you can describe in more detail what the workload does?
lustre OSS node + IOR with file size twice more then OSS memory.

> If it fails to boot,
> remove the second that calls lru_add_drain_all() and try again.
well, i will try.

> 
> The patch looks deceptively simple, a downside from is is that workloads that
> call mark_page_accessed() frequently will contend more on the zone->lru_lock
> than it did previously. Moving lru_add_drain() to the ext4 could would
> suffer the same contention problem.
NO, isn't. we have call lru_add_drain() in new page allocation case, but mark_page_accessed called without differences - is page in page cache already or it's new allocated - so we have very small zone->lru_lock contention.


> 
> Thanks.
> 
> ---8<---
> mm: pagevec: Move inactive pages to active lists even if on a pagevec
> 
> If a page is on a pagevec aimed at the inactive list then two subsequent
> calls to mark_page_acessed() will still not move it to the active list.
> This can cause a page to be reclaimed sooner than is expected. This
> patch detects if an inactive page is not on the LRU and drains the
> pagevec before promoting it.
> 
> Not-signed-off
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index 8a529a0..eac64fe 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -437,7 +437,18 @@ void activate_page(struct page *page)
> void mark_page_accessed(struct page *page)
> {
> 	if (!PageActive(page) && !PageUnevictable(page) &&
> -			PageReferenced(page) && PageLRU(page)) {
> +			PageReferenced(page)) {
> +		/* Page could be in pagevec */
> +		if (!PageLRU(page))
> +			lru_add_drain();
> +
> +		/*
> +		 * Weeeee, using in_atomic() like this is a hand-grenade.
> +		 * Patch is for debugging purposes only, do not merge this.
> +		 */
> +		if (!PageLRU(page) && !in_atomic())
> +			lru_add_drain_all();
> +
> 		activate_page(page);
> 		ClearPageReferenced(page);
> 	} else if (!PageReferenced(page)) {

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Mel Gorman April 25, 2013, 10:40 p.m. UTC | #1
On Thu, Apr 25, 2013 at 09:37:07PM +0300, Alexey Lyahkov wrote:
> Mel,
> 
> 
> On Apr 25, 2013, at 17:30, Mel Gorman wrote:
> 
> > On Wed, Apr 24, 2013 at 10:26:50AM -0400, Theodore Ts'o wrote:
> >> On Tue, Apr 23, 2013 at 03:00:08PM -0700, Andrew Morton wrote:
> >>> That should fix things for now.  Although it might be better to just do
> >>> 
> >>> 	mark_page_accessed(page);	/* to SetPageReferenced */
> >>> 	lru_add_drain();		/* to SetPageLRU */
> >>> 
> >>> Because a) this was too early to decide that the page is
> >>> super-important and b) the second touch of this page should have a
> >>> mark_page_accessed() in it already.
> >> 
> >> The question is do we really want to put lru_add_drain() into the ext4
> >> file system code?  That seems to pushing some fairly mm-specific
> >> knowledge into file system code.  I'll do this if I have to do, but
> >> wouldn't be better if this was pushed into mark_page_accessed(), or
> >> some other new API was exported by the mm subsystem?
> >> 
> > 
> > I don't think we want to push lru_add_drain() into the ext4 code. It's
> > too specific of knowledge just to work around pagevecs. Before we rework
> > how pagevecs select what LRU to place a page, can we make sure that fixing
> > that will fix the problem?
> > 
> what is "that"? puting lru_add_drain() in ext4 core? sure that is fixes problem with many small reads during large write.
> originally i have put shake_page() in ext4 code, but that have call lru_add_drain_all() so to exaggerated.
> 

No, I would prefer if this was not fixed within ext4. I need confirmation
that fixing mark_page_accessed() addresses the performance problem you
encounter. The two-line check for PageLRU() followed by a lru_add_drain()
is meant to check that. That is still not my preferred fix because even
if you do not encounter higher LRU contention, other workloads would be
at risk.  The likely fix will involve converting pagevecs to using a single
list and then selecting what LRU to put a page on at drain time but I
want to know that it's worthwhile.

Using shake_page() in ext4 is certainly overkill.

> > Andrew, can you try the following patch please? Also, is there any chance
> > you can describe in more detail what the workload does?
>
> lustre OSS node + IOR with file size twice more then OSS memory.
> 

Ok, no way I'll be reproducing that workload. Thanks.
Alexey Lyahkov April 26, 2013, 6:03 a.m. UTC | #2
On Apr 26, 2013, at 01:40, Mel Gorman wrote:
> No, I would prefer if this was not fixed within ext4. I need confirmation
> that fixing mark_page_accessed() addresses the performance problem you
> encounter. The two-line check for PageLRU() followed by a lru_add_drain()
> is meant to check that. That is still not my preferred fix because even
> if you do not encounter higher LRU contention, other workloads would be
> at risk.  The likely fix will involve converting pagevecs to using a single
> list and then selecting what LRU to put a page on at drain time but I
> want to know that it's worthwhile.
> 
> Using shake_page() in ext4 is certainly overkill.
agree, but it's was my prof of concept patch :) just to verify founded

> 
>>> Andrew, can you try the following patch please? Also, is there any chance
>>> you can describe in more detail what the workload does?
>> 
>> lustre OSS node + IOR with file size twice more then OSS memory.
>> 
> 
> Ok, no way I'll be reproducing that workload. Thanks.
> 
I think you should be try several processes with DIO (so don't put any pages in lru_pagevec as that is heap), each have a filesize twice or more of available memory.
Main idea you should be have a read a new pages in budy cache (to allocate) and have large memory allocation in same time.
DIO chunk should be enough to start streaming allocation.

also you may use attached jprobe module to hit an BUG() if buddy page removed from a memory by shrinker.
diff mbox

Patch

Index: linux-stage/fs/ext4/mballoc.c
===================================================================
--- linux-stage.orig/fs/ext4/mballoc.c	2013-03-19 10:55:52.000000000 +0200
+++ linux-stage/fs/ext4/mballoc.c	2013-03-19 10:59:02.000000000 +0200
@@ -900,8 +900,11 @@  static int ext4_mb_init_cache(struct pag
 			incore = data;
 		}
 	}
-	if (likely(err == 0))
+	if (likely(err == 0)) {
 		SetPageUptodate(page);
+		/* make sure it's in active list */
+		mark_page_accessed(page);
+	}
 
 out:
 	if (bh) {
@@ -957,6 +960,8 @@  int ext4_mb_init_group(struct super_bloc
 	page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
 	if (page) {
 		BUG_ON(page->mapping != inode->i_mapping);
+		/* move to lru - should be lru_add_drain() */
+		shake_page(page, 0);
 		ret = ext4_mb_init_cache(page, NULL);
 		if (ret) {
 			unlock_page(page);
@@ -986,6 +991,8 @@  int ext4_mb_init_group(struct super_bloc
 		unlock_page(page);
 	} else if (page) {
 		BUG_ON(page->mapping != inode->i_mapping);
+		/* move to lru - should be lru_add_drain() */
+		shake_page(page, 0);
 		ret = ext4_mb_init_cache(page, bitmap);
 		if (ret) {
 			unlock_page(page);
@@ -1087,6 +1094,7 @@  repeat_load_buddy:
 		if (page) {
 			BUG_ON(page->mapping != inode->i_mapping);
 			if (!PageUptodate(page)) {
+				shake_page(page, 0);
 				ret = ext4_mb_init_cache(page, NULL);
 				if (ret) {
 					unlock_page(page);
@@ -1118,6 +1126,7 @@  repeat_load_buddy:
 		if (page) {
 			BUG_ON(page->mapping != inode->i_mapping);
 			if (!PageUptodate(page)) {
+				shake_page(page, 0);
 				ret = ext4_mb_init_cache(page, e4b->bd_bitmap);
 				if (ret) {
 					unlock_page(page);
@@ -2500,6 +2509,8 @@  static int ext4_mb_init_backend(struct s
 	 * not in the inode hash, so it should never be found by iget(), but
 	 * this will avoid confusion if it ever shows up during debugging. */
 	sbi->s_buddy_cache->i_ino = EXT4_BAD_INO;
+	sbi->s_buddy_cache->i_state = I_NEW;
+//	mapping_set_unevictable(sbi->s_buddy_cache->i_mapping);
 	EXT4_I(sbi->s_buddy_cache)->i_disksize = 0;
 	for (i = 0; i < ngroups; i++) {
 		desc = ext4_get_group_desc(sb, i, NULL);