[06/10] hugetlbfs: Convert remove_inode_hugepages() to use filemap_get_folios()

Message ID 20220605193854.2371230-7-willy@infradead.org
State Not Applicable
Series Convert to filemap_get_folios()

Commit Message

Matthew Wilcox June 5, 2022, 7:38 p.m. UTC
Use folios throughout this function.  That removes the last caller of
huge_pagevec_release(), so delete that too.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 fs/hugetlbfs/inode.c | 44 ++++++++++++++------------------------------
 1 file changed, 14 insertions(+), 30 deletions(-)
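
In outline, the conversion swaps the pagevec lookup loop for its
folio_batch equivalent. A minimal sketch of the two loop shapes, with
the per-folio locking and reservation handling elided (the full diff
is at the bottom of this page):

	/* Before: pagevec-based lookup. */
	pagevec_init(&pvec);
	next = start;
	while (next < end) {
		if (!pagevec_lookup_range(&pvec, mapping, &next, end - 1))
			break;
		for (i = 0; i < pagevec_count(&pvec); ++i)
			remove_huge_page(pvec.pages[i]);
		huge_pagevec_release(&pvec);
		cond_resched();
	}

	/* After: folio_batch-based lookup; a plain folio_batch_release()
	 * replaces the hugetlbfs-private huge_pagevec_release(). */
	folio_batch_init(&fbatch);
	next = start;
	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
		for (i = 0; i < folio_batch_count(&fbatch); ++i)
			remove_huge_page(&fbatch.folios[i]->page);
		folio_batch_release(&fbatch);
		cond_resched();
	}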

Comments

Christoph Hellwig June 8, 2022, 8:04 a.m. UTC | #1
Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Sumanth Korikkar June 10, 2022, 3:52 p.m. UTC | #2
Hi,

The kernel crashes with the following backtrace on linux-next:

[  203.304451] kernel BUG at fs/inode.c:612!
[  203.304466] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[  203.305215] CPU: 0 PID: 868 Comm: alloc-instantia Not tainted 5.19.0-rc1-next-20220609 #256
[  203.305563] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-6.fc35 04/01/2014
[  203.305922] RIP: 0010:clear_inode+0x6e/0x80
[  203.306139] Code: 00 a8 20 74 29 a8 40 75 27 48 8b 93 18 01 00 00 48 8d 83 18 01 00 00 48 39 c2 75 16 48 c7 83 98 00 00 00 60 00 00 00 5b 5d c3 <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00
[  203.306827] RSP: 0018:ffffa49dc07cbde8 EFLAGS: 00010002
[  203.307074] RAX: 0000000000000000 RBX: ffff8bf4cecc4010 RCX: 0000000000069600
[  203.307380] RDX: 0000000000000001 RSI: ffffffff929b5b2b RDI: 0000000000000000
[  203.307715] RBP: ffff8bf4cecc4180 R08: 000003fffffffffe R09: ffffffffffffffc0
[  203.307988] R10: ffff8bf4ca515ec8 R11: ffffa49dc07cbc68 R12: ffff8bf4cecc4118
[  203.308256] R13: ffff8bf4cf029a80 R14: ffff8bf4cb2ce900 R15: ffff8bf4c79b8848
[  203.308591] FS:  0000000000000000(0000) GS:ffff8bf533000000(0000) knlGS:0000000000000000
[  203.309033] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  203.309327] CR2: 00007fadbf5d3838 CR3: 000000016520c000 CR4: 00000000000006f0
[  203.309661] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  203.309997] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  203.310330] Call Trace:
[  203.310534]  <TASK>
[  203.310733]  evict+0xc3/0x1c0
[  203.310956]  __dentry_kill+0xd6/0x170
[  203.311196]  dput+0x144/0x2e0
[  203.311416]  __fput+0xdb/0x240
[  203.311634]  task_work_run+0x5c/0x90
[  203.311876]  do_exit+0x317/0xa80
[  203.312104]  do_group_exit+0x2d/0x90
[  203.312337]  __x64_sys_exit_group+0x14/0x20
[  203.312599]  do_syscall_64+0x3b/0x90
[  203.312816]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  203.313064] RIP: 0033:0x7fadbf4f2711
[  203.313275] Code: Unable to access opcode bytes at RIP 0x7fadbf4f26e7.
[  203.313559] RSP: 002b:00007fff6b0e0458 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[  203.313932] RAX: ffffffffffffffda RBX: 00007fadbf5cf9e0 RCX: 00007fadbf4f2711
[  203.314228] RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
[  203.314523] RBP: 0000000000000000 R08: ffffffffffffff80 R09: 0000000000000000
[  203.314821] R10: 00007fadbf3dffa8 R11: 0000000000000246 R12: 00007fadbf5cf9e0
[  203.315120] R13: 0000000000000000 R14: 00007fadbf5d4ee8 R15: 00007fadbf5d4f00
[  203.315431]  </TASK>
[  203.315606] Modules linked in: zram zsmalloc xfs libcrc32c
[  203.315875] ---[ end trace 0000000000000000 ]---
[  203.315876] RIP: 0010:clear_inode+0x6e/0x80
[  203.315878] Code: 00 a8 20 74 29 a8 40 75 27 48 8b 93 18 01 00 00 48 8d 83 18 01 00 00 48 39 c2 75 16 48 c7 83 98 00 00 00 60 00 00 00 5b 5d c3 <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00
[  203.315879] RSP: 0018:ffffa49dc07cbde8 EFLAGS: 00010002
[  203.315880] RAX: 0000000000000000 RBX: ffff8bf4cecc4010 RCX: 0000000000069600
[  203.315881] RDX: 0000000000000001 RSI: ffffffff929b5b2b RDI: 0000000000000000
[  203.315881] RBP: ffff8bf4cecc4180 R08: 000003fffffffffe R09: ffffffffffffffc0
[  203.315882] R10: ffff8bf4ca515ec8 R11: ffffa49dc07cbc68 R12: ffff8bf4cecc4118
[  203.315883] R13: ffff8bf4cf029a80 R14: ffff8bf4cb2ce900 R15: ffff8bf4c79b8848
[  203.315884] FS:  0000000000000000(0000) GS:ffff8bf533000000(0000) knlGS:0000000000000000
[  203.315886] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  203.315887] CR2: 00007fadbf5d3838 CR3: 000000016520c000 CR4: 00000000000006f0
[  203.315887] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  203.315888] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  203.315889] note: alloc-instantia[868] exited with preempt_count 1
[  203.315890] Fixing recursive fault but reboot is needed!
[  203.315892] BUG: scheduling while atomic: alloc-instantia/868/0x00000000
[  203.315893] Modules linked in: zram zsmalloc xfs libcrc32c
[  203.315894] Preemption disabled at:
[  203.315895] [<0000000000000000>] 0x0
[  203.315896] CPU: 0 PID: 868 Comm: alloc-instantia Tainted: G      D           5.19.0-rc1-next-20220609 #256
[  203.315898] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-6.fc35 04/01/2014
[  203.315898] Call Trace:
[  203.315900]  <TASK>
[  203.315901]  dump_stack_lvl+0x34/0x44
[  203.315905]  __schedule_bug.cold+0x7d/0x8b
[  203.315907]  __schedule+0x624/0x700
[  203.315908]  ? _printk+0x58/0x6f
[  203.315911]  do_task_dead+0x3f/0x50
[  203.315913]  make_task_dead.cold+0x51/0xab
[  203.315914]  rewind_stack_and_make_dead+0x17/0x17
[  203.315917] RIP: 0033:0x7fadbf4f2711
[  203.315918] Code: Unable to access opcode bytes at RIP 0x7fadbf4f26e7.
[  203.315918] RSP: 002b:00007fff6b0e0458 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[  203.315919] RAX: ffffffffffffffda RBX: 00007fadbf5cf9e0 RCX: 00007fadbf4f2711
[  203.315920] RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
[  203.315921] RBP: 0000000000000000 R08: ffffffffffffff80 R09: 0000000000000000
[  203.315921] R10: 00007fadbf3dffa8 R11: 0000000000000246 R12: 00007fadbf5cf9e0
[  203.315922] R13: 0000000000000000 R14: 00007fadbf5d4ee8 R15: 00007fadbf5d4f00
[  203.315924]  </TASK>


* Bisected the crash to this commit.

To reproduce:
* clone libhugetlbfs:
* Execute, PATH=$PATH:"obj64/" LD_LIBRARY_PATH=../obj64/ alloc-instantiate-race shared
 
Crashes on both s390 and x86. 
 
Thanks

--
Sumanth
Gerald Schaefer June 10, 2022, 6:35 p.m. UTC | #3
On Fri, 10 Jun 2022 17:52:05 +0200
Sumanth Korikkar <sumanthk@linux.ibm.com> wrote:

[...]
> 
> * Bisected the crash to this commit.
> 
> To reproduce:
> * clone libhugetlbfs:
> * Execute, PATH=$PATH:"obj64/" LD_LIBRARY_PATH=../obj64/ alloc-instantiate-race shared
>  
> Crashes on both s390 and x86. 

FWIW, I'm not really able to understand the code changes, so I added
some printks to track the state of inode->i_data.nrpages during
remove_inode_hugepages().

Before this commit, we enter with nrpages = 99, and leave with nrpages = 0.
With this commit we enter with nrpages = 99, and leave with nrpages = 84
(i.e. 99 - PAGEVEC_SIZE), resulting in the BUG later in fs/inode.c:612.

The difference seems to be that with this commit, the outer
while(filemap_get_folios) loop is only traversed once, while before
the corresponding while(pagevec_lookup_range) loop was repeated until
nrpages reached 0 (in steps of 15 == PAGEVEC_SIZE for the inner loop).

Both before and after the commit, the pagevec_count / folio_batch_count
for the inner loop starts at 15, but before, pagevec_lookup_range()
also advanced &next in steps of 15, whereas now filemap_get_folios()
moves &next from 0 to 270 in one step, while still only returning 15
as the folio_batch_count for the inner loop. I assume the next index
of 270 is then too big to find any other folios, so the loop stops
after the first iteration, even though only 15 pages have been
processed so far with remove_huge_page(&folio->page).

I guess it is either wrong to return 15 as folio_batch_count (although
it seems that would be the maximum possible value), or it is wrong to
advance &next by 270 instead of 15.

Hope that makes sense, and might be of help for debugging to someone
more familiar with this code.
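
To make the arithmetic above concrete: assuming s390 with 1 MB huge
pages (folio_nr_pages() == 256) and a batch size of 15, and noting
that hugetlb indexes its page cache in huge-page-size units so that
99 folios occupy indices 0..98, a toy model of the cursor update in
filemap_get_folios() reproduces the observed numbers:

	#include <stdio.h>

	/* Toy model of the filemap_get_folios() cursor update; not kernel code. */
	int main(void)
	{
		unsigned long batch = 15;	/* folio batch capacity (PAGEVEC_SIZE) */
		unsigned long nr = 256;		/* folio_nr_pages() of a 1 MB folio on s390 */
		unsigned long last = batch - 1;	/* index of the folio that fills the batch */

		/* Buggy advance: *start = folio->index + folio_nr_pages(folio) */
		printf("buggy next = %lu\n", last + nr);	/* 270 */

		/* hugetlb folios have consecutive indices, so a step of 1
		 * would land on the next folio instead: */
		printf("fixed next = %lu\n", last + 1);		/* 15 */
		return 0;
	}

The buggy lookup then resumes at index 270, finds nothing in 15..98,
and the outer loop ends after a single batch, leaving nrpages at
99 - 15 = 84, exactly as observed.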
Matthew Wilcox June 10, 2022, 9:17 p.m. UTC | #4
On Fri, Jun 10, 2022 at 05:52:05PM +0200, Sumanth Korikkar wrote:
> To reproduce:
> * clone libhugetlbfs:
> * Execute, PATH=$PATH:"obj64/" LD_LIBRARY_PATH=../obj64/ alloc-instantiate-race shared

... it's a lot harder to set up hugetlb than that ...

anyway, i figured it out without being able to run the reproducer.

Can you try this?

diff --git a/mm/filemap.c b/mm/filemap.c
index a30587f2e598..8ef861297ffb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2160,7 +2160,11 @@ unsigned filemap_get_folios(struct address_space *mapping, pgoff_t *start,
 		if (xa_is_value(folio))
 			continue;
 		if (!folio_batch_add(fbatch, folio)) {
-			*start = folio->index + folio_nr_pages(folio);
+			unsigned long nr = folio_nr_pages(folio);
+
+			if (folio_test_hugetlb(folio))
+				nr = 1;
+			*start = folio->index + nr;
 			goto out;
 		}
 	}
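
(For context: hugetlb maintains its page cache in huge-page-size
units, so consecutive hugetlb folios have consecutive indices.
Advancing *start by folio_nr_pages() therefore skips past folios that
are still present, while a step of 1 lands on the next one.)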
Mike Kravetz June 10, 2022, 9:56 p.m. UTC | #5
On 6/10/22 14:17, Matthew Wilcox wrote:
> On Fri, Jun 10, 2022 at 05:52:05PM +0200, Sumanth Korikkar wrote:
>> To reproduce:
>> * clone libhugetlbfs:
>> * Execute, PATH=$PATH:"obj64/" LD_LIBRARY_PATH=../obj64/ alloc-instantiate-race shared
> 
> ... it's a lot harder to set up hugetlb than that ...
> 
> anyway, i figured it out without being able to run the reproducer.
> 
> Can you try this?

I can confirm that libhugetlbfs tests do not trigger the BUG with the
below change.
Sumanth Korikkar June 13, 2022, 6:56 a.m. UTC | #6
On Fri, Jun 10, 2022 at 10:17:36PM +0100, Matthew Wilcox wrote:
> On Fri, Jun 10, 2022 at 05:52:05PM +0200, Sumanth Korikkar wrote:
> > To reproduce:
> > * clone libhugetlbfs:
> > * Execute, PATH=$PATH:"obj64/" LD_LIBRARY_PATH=../obj64/ alloc-instantiate-race shared
> 
> ... it's a lot harder to set up hugetlb than that ...
> 
> anyway, i figured it out without being able to run the reproducer.
> 
> Can you try this?
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a30587f2e598..8ef861297ffb 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2160,7 +2160,11 @@ unsigned filemap_get_folios(struct address_space *mapping, pgoff_t *start,
>  		if (xa_is_value(folio))
>  			continue;
>  		if (!folio_batch_add(fbatch, folio)) {
> -			*start = folio->index + folio_nr_pages(folio);
> +			unsigned long nr = folio_nr_pages(folio);
> +
> +			if (folio_test_hugetlb(folio))
> +				nr = 1;
> +			*start = folio->index + nr;
>  			goto out;
>  		}
>  	}

Yes, with the patch, the above test works fine.

--
Thanks,
Sumanth

Patch

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ae2524480f23..14d33f725e05 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -108,16 +108,6 @@  static inline void hugetlb_drop_vma_policy(struct vm_area_struct *vma)
 }
 #endif
 
-static void huge_pagevec_release(struct pagevec *pvec)
-{
-	int i;
-
-	for (i = 0; i < pagevec_count(pvec); ++i)
-		put_page(pvec->pages[i]);
-
-	pagevec_reinit(pvec);
-}
-
 /*
  * Mask used when checking the page offset value passed in via system
  * calls.  This value will be converted to a loff_t which is signed.
@@ -480,25 +470,19 @@  static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 	struct address_space *mapping = &inode->i_data;
 	const pgoff_t start = lstart >> huge_page_shift(h);
 	const pgoff_t end = lend >> huge_page_shift(h);
-	struct pagevec pvec;
+	struct folio_batch fbatch;
 	pgoff_t next, index;
 	int i, freed = 0;
 	bool truncate_op = (lend == LLONG_MAX);
 
-	pagevec_init(&pvec);
+	folio_batch_init(&fbatch);
 	next = start;
-	while (next < end) {
-		/*
-		 * When no more pages are found, we are done.
-		 */
-		if (!pagevec_lookup_range(&pvec, mapping, &next, end - 1))
-			break;
-
-		for (i = 0; i < pagevec_count(&pvec); ++i) {
-			struct page *page = pvec.pages[i];
+	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
+		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
+			struct folio *folio = fbatch.folios[i];
 			u32 hash = 0;
 
-			index = page->index;
+			index = folio->index;
 			if (!truncate_op) {
 				/*
 				 * Only need to hold the fault mutex in the
@@ -511,15 +495,15 @@  static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			}
 
 			/*
-			 * If page is mapped, it was faulted in after being
+			 * If folio is mapped, it was faulted in after being
 			 * unmapped in caller.  Unmap (again) now after taking
 			 * the fault mutex.  The mutex will prevent faults
-			 * until we finish removing the page.
+			 * until we finish removing the folio.
 			 *
 			 * This race can only happen in the hole punch case.
 			 * Getting here in a truncate operation is a bug.
 			 */
-			if (unlikely(page_mapped(page))) {
+			if (unlikely(folio_mapped(folio))) {
 				BUG_ON(truncate_op);
 
 				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
@@ -532,7 +516,7 @@  static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 				i_mmap_unlock_write(mapping);
 			}
 
-			lock_page(page);
+			folio_lock(folio);
 			/*
 			 * We must free the huge page and remove from page
 			 * cache (remove_huge_page) BEFORE removing the
@@ -542,8 +526,8 @@  static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			 * the subpool and global reserve usage count can need
 			 * to be adjusted.
 			 */
-			VM_BUG_ON(HPageRestoreReserve(page));
-			remove_huge_page(page);
+			VM_BUG_ON(HPageRestoreReserve(&folio->page));
+			remove_huge_page(&folio->page);
 			freed++;
 			if (!truncate_op) {
 				if (unlikely(hugetlb_unreserve_pages(inode,
@@ -551,11 +535,11 @@  static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 					hugetlb_fix_reserve_counts(inode);
 			}
 
-			unlock_page(page);
+			folio_unlock(folio);
 			if (!truncate_op)
 				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
-		huge_pagevec_release(&pvec);
+		folio_batch_release(&fbatch);
 		cond_resched();
 	}