diff mbox series

[v2,01/69] mm/hugetlb: Fix boot panic with CONFIG_DEBUG_VM and HVO bootmem pages

Message ID 20260513130542.35604-2-songmuchun@bytedance.com (mailing list archive)
State Handled Elsewhere
Headers show
Series mm: Generalize HVO for HugeTLB and device DAX | expand

Commit Message

Muchun Song May 13, 2026, 1:04 p.m. UTC
Commit 622026e87c40 ("mm/hugetlb: remove fake head pages") switched
HVO to reuse per-zone shared tail pages from zone->vmemmap_tails[].

Those shared tail pages were initialized in hugetlb_vmemmap_init(), but
bootmem HugeTLB folios are prepared earlier from gather_bootmem_prealloc().
With hugetlb_free_vmemmap=on, prep_and_add_bootmem_folios() can access
pageblock flags on bootmem HugeTLB pages whose mirrored tail struct pages
already point to the shared tail page. On CONFIG_DEBUG_VM kernels,
get_pfnblock_bitmap_bitidx() then dereferences the still-uninitialized
shared tail page and can panic during boot.

Initialize zone->vmemmap_tails[] from gather_bootmem_prealloc(), before
bootmem HugeTLB folios are processed, and drop the later initialization
from hugetlb_vmemmap_init().

This bug only affects CONFIG_DEBUG_VM kernels, where the relevant
assertion is evaluated.

Fixes: 622026e87c40 ("mm/hugetlb: remove fake head pages")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c         | 19 +++++++++++++++++++
 mm/hugetlb_vmemmap.c | 17 -----------------
 2 files changed, 19 insertions(+), 17 deletions(-)

Comments

Oscar Salvador May 14, 2026, 7:51 a.m. UTC | #1
On Wed, May 13, 2026 at 09:04:29PM +0800, Muchun Song wrote:
> Commit 622026e87c40 ("mm/hugetlb: remove fake head pages") switched
> HVO to reuse per-zone shared tail pages from zone->vmemmap_tails[].
> 
> Those shared tail pages were initialized in hugetlb_vmemmap_init(), but
> bootmem HugeTLB folios are prepared earlier from gather_bootmem_prealloc().
> With hugetlb_free_vmemmap=on, prep_and_add_bootmem_folios() can access
> pageblock flags on bootmem HugeTLB pages whose mirrored tail struct pages
> already point to the shared tail page. On CONFIG_DEBUG_VM kernels,
> get_pfnblock_bitmap_bitidx() then dereferences the still-uninitialized
> shared tail page and can panic during boot.
> 
> Initialize zone->vmemmap_tails[] from gather_bootmem_prealloc(), before
> bootmem HugeTLB folios are processed, and drop the later initialization
> from hugetlb_vmemmap_init().
> 
> This bug only affects CONFIG_DEBUG_VM kernels, where the relevant
> assertion is evaluated.
> 
> Fixes: 622026e87c40 ("mm/hugetlb: remove fake head pages")
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

For the correctness of the change

Acked-by: Oscar Salvador <osalvador@suse.de>

but I have a couple of comments:

> ---
>  mm/hugetlb.c         | 19 +++++++++++++++++++
>  mm/hugetlb_vmemmap.c | 17 -----------------
>  2 files changed, 19 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 31b34ca0f402..d22683ab30a1 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3382,6 +3382,25 @@ static void __init gather_bootmem_prealloc(void)
>  		.max_threads	= num_node_state(N_MEMORY),
>  		.numa_aware	= true,
>  	};
> +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> +	struct zone *zone;
> +
> +	for_each_zone(zone) {
> +		for (int i = 0; i < NR_VMEMMAP_TAILS; i++) {
> +			struct page *tail, *p;
> +			unsigned int order;
> +
> +			tail = zone->vmemmap_tails[i];
> +			if (!tail)
> +				continue;
> +
> +			order = i + VMEMMAP_TAIL_MIN_ORDER;
> +			p = page_to_virt(tail);
> +			for (int j = 0; j < PAGE_SIZE / sizeof(struct page); j++)
> +				init_compound_tail(p + j, NULL, order, zone);
> +		}
> +	}

This deserves a comment why do we need to initialize those pages here, no need for
a fat one but a hint, because everybody else looking at this will wonder why do not
have it in hugetlb_vmemmap_init(), as the name suggests.


> +#endif
>  
>  	padata_do_multithreaded(&job);
>  }
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 4a077d231d3a..62e61af18c9a 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -870,27 +870,10 @@ static const struct ctl_table hugetlb_vmemmap_sysctls[] = {
>  static int __init hugetlb_vmemmap_init(void)

At this point, hugetlb_vmemmap_init only initialites the sysctls for
each hstate, should the name reflect that better?

Also, vmemmap_get_tail() still thinks that tail pages are initialized
here from this comment:

 "/*
   * Only allocate the page, but do not initialize it.
   *
   * Any initialization done here will be overwritten by memmap_init().
   *
   * hugetlb_vmemmap_init() will take care of initialization after
   * memmap_init().
   */"

so we might want to adjust that as well, and I am not sure if we have
more comments in the tree referencing hugetlb_vmemmap_init() as the init
point for those pages that need correction.

 
Having said that, and this is just a question, we cannot really make
gather_bootmem_prealloc() also do the initialization of the sysfs right?
I see that hugetlb sysfs gets initialized from hugetlb_init() a few
calls after, so.. meh.

Anyway, maybe just convert hugetlb_vmemmap_init to hugetlb_vmemmap_sysfs_create
(pick a better name), so it does not look like having two entry points
for vmemmap init stuff.
Muchun Song May 14, 2026, 8:13 a.m. UTC | #2
> On May 14, 2026, at 15:51, Oscar Salvador <osalvador@suse.de> wrote:
> 
> On Wed, May 13, 2026 at 09:04:29PM +0800, Muchun Song wrote:
>> Commit 622026e87c40 ("mm/hugetlb: remove fake head pages") switched
>> HVO to reuse per-zone shared tail pages from zone->vmemmap_tails[].
>> 
>> Those shared tail pages were initialized in hugetlb_vmemmap_init(), but
>> bootmem HugeTLB folios are prepared earlier from gather_bootmem_prealloc().
>> With hugetlb_free_vmemmap=on, prep_and_add_bootmem_folios() can access
>> pageblock flags on bootmem HugeTLB pages whose mirrored tail struct pages
>> already point to the shared tail page. On CONFIG_DEBUG_VM kernels,
>> get_pfnblock_bitmap_bitidx() then dereferences the still-uninitialized
>> shared tail page and can panic during boot.
>> 
>> Initialize zone->vmemmap_tails[] from gather_bootmem_prealloc(), before
>> bootmem HugeTLB folios are processed, and drop the later initialization
>> from hugetlb_vmemmap_init().
>> 
>> This bug only affects CONFIG_DEBUG_VM kernels, where the relevant
>> assertion is evaluated.
>> 
>> Fixes: 622026e87c40 ("mm/hugetlb: remove fake head pages")
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> 
> For the correctness of the change
> 
> Acked-by: Oscar Salvador <osalvador@suse.de>

Thanks.

> 
> but I have a couple of comments:
> 
>> ---
>> mm/hugetlb.c         | 19 +++++++++++++++++++
>> mm/hugetlb_vmemmap.c | 17 -----------------
>> 2 files changed, 19 insertions(+), 17 deletions(-)
>> 
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 31b34ca0f402..d22683ab30a1 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3382,6 +3382,25 @@ static void __init gather_bootmem_prealloc(void)
>> .max_threads = num_node_state(N_MEMORY),
>> .numa_aware = true,
>> };
>> +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
>> + 	struct zone *zone;
>> +
>> + 	for_each_zone(zone) {
>> + 		for (int i = 0; i < NR_VMEMMAP_TAILS; i++) {
>> + 			struct page *tail, *p;
>> + 			unsigned int order;
>> +
>> + 			tail = zone->vmemmap_tails[i];
>> + 			if (!tail)
>> + 				continue;
>> +
>> + 			order = i + VMEMMAP_TAIL_MIN_ORDER;
>> + 			p = page_to_virt(tail);
>> + 			for (int j = 0; j < PAGE_SIZE / sizeof(struct page); j++)
>> + 				init_compound_tail(p + j, NULL, order, zone);
>> + 		}
>> + 	}
> 
> This deserves a comment why do we need to initialize those pages here, no need for
> a fat one but a hint, because everybody else looking at this will wonder why do not
> have it in hugetlb_vmemmap_init(), as the name suggests.

Make sense.

> 
> 
>> +#endif
>> 
>> 	padata_do_multithreaded(&job);
>> }
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index 4a077d231d3a..62e61af18c9a 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -870,27 +870,10 @@ static const struct ctl_table hugetlb_vmemmap_sysctls[] = {
>> static int __init hugetlb_vmemmap_init(void)
> 
> At this point, hugetlb_vmemmap_init only initialites the sysctls for
> each hstate, should the name reflect that better?

A good point. I can add a new cleanup patch.

> 
> Also, vmemmap_get_tail() still thinks that tail pages are initialized
> here from this comment:

Good catch. I'll fix it.

> 
> "/*
>   * Only allocate the page, but do not initialize it.
>   *
>   * Any initialization done here will be overwritten by memmap_init().
>   *
>   * hugetlb_vmemmap_init() will take care of initialization after
>   * memmap_init().
>   */"
> 
> so we might want to adjust that as well, and I am not sure if we have
> more comments in the tree referencing hugetlb_vmemmap_init() as the init
> point for those pages that need correction.

Only this one.

> 
> 
> Having said that, and this is just a question, we cannot really make
> gather_bootmem_prealloc() also do the initialization of the sysfs right?
> I see that hugetlb sysfs gets initialized from hugetlb_init() a few
> calls after, so.. meh.

hugetlb_vmemmap_init() is introduced by me. I want to hide some symbols
from mm/hugetlb_vmemmap.c at that time. So I used late_initcall() for that.

> 
> Anyway, maybe just convert hugetlb_vmemmap_init to hugetlb_vmemmap_sysfs_create
> (pick a better name), so it does not look like having two entry points
> for vmemmap init stuff.

This bug fix is a temporary change, the whole zone initialization will
removed later in patch 30. Then, there will be only one place for vmemmap init :)

I’m curious about your thoughts on the practice of minimizing external
symbols. Personally, I’d prefer to move these initializations into
their respective files:

    - hugetlb_sysfs_init();
    - hugetlb_cgroup_file_init();
    - hugetlb_sysctl_init();

Of course, that’s just my personal preference.

Muchun,
Thanks.


> 
> -- 
> Oscar Salvador
> SUSE Labs
diff mbox series

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 31b34ca0f402..d22683ab30a1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3382,6 +3382,25 @@  static void __init gather_bootmem_prealloc(void)
 		.max_threads	= num_node_state(N_MEMORY),
 		.numa_aware	= true,
 	};
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+	struct zone *zone;
+
+	for_each_zone(zone) {
+		for (int i = 0; i < NR_VMEMMAP_TAILS; i++) {
+			struct page *tail, *p;
+			unsigned int order;
+
+			tail = zone->vmemmap_tails[i];
+			if (!tail)
+				continue;
+
+			order = i + VMEMMAP_TAIL_MIN_ORDER;
+			p = page_to_virt(tail);
+			for (int j = 0; j < PAGE_SIZE / sizeof(struct page); j++)
+				init_compound_tail(p + j, NULL, order, zone);
+		}
+	}
+#endif
 
 	padata_do_multithreaded(&job);
 }
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 4a077d231d3a..62e61af18c9a 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -870,27 +870,10 @@  static const struct ctl_table hugetlb_vmemmap_sysctls[] = {
 static int __init hugetlb_vmemmap_init(void)
 {
 	const struct hstate *h;
-	struct zone *zone;
 
 	/* HUGETLB_VMEMMAP_RESERVE_SIZE should cover all used struct pages */
 	BUILD_BUG_ON(__NR_USED_SUBPAGE > HUGETLB_VMEMMAP_RESERVE_PAGES);
 
-	for_each_zone(zone) {
-		for (int i = 0; i < NR_VMEMMAP_TAILS; i++) {
-			struct page *tail, *p;
-			unsigned int order;
-
-			tail = zone->vmemmap_tails[i];
-			if (!tail)
-				continue;
-
-			order = i + VMEMMAP_TAIL_MIN_ORDER;
-			p = page_to_virt(tail);
-			for (int j = 0; j < PAGE_SIZE / sizeof(struct page); j++)
-				init_compound_tail(p + j, NULL, order, zone);
-		}
-	}
-
 	for_each_hstate(h) {
 		if (hugetlb_vmemmap_optimizable(h)) {
 			register_sysctl_init("vm", hugetlb_vmemmap_sysctls);