| Message ID | 20260513130542.35604-2-songmuchun@bytedance.com (mailing list archive) |
|---|---|
| State | Handled Elsewhere |
| Headers | show |
| Series | mm: Generalize HVO for HugeTLB and device DAX | expand |
On Wed, May 13, 2026 at 09:04:29PM +0800, Muchun Song wrote: > Commit 622026e87c40 ("mm/hugetlb: remove fake head pages") switched > HVO to reuse per-zone shared tail pages from zone->vmemmap_tails[]. > > Those shared tail pages were initialized in hugetlb_vmemmap_init(), but > bootmem HugeTLB folios are prepared earlier from gather_bootmem_prealloc(). > With hugetlb_free_vmemmap=on, prep_and_add_bootmem_folios() can access > pageblock flags on bootmem HugeTLB pages whose mirrored tail struct pages > already point to the shared tail page. On CONFIG_DEBUG_VM kernels, > get_pfnblock_bitmap_bitidx() then dereferences the still-uninitialized > shared tail page and can panic during boot. > > Initialize zone->vmemmap_tails[] from gather_bootmem_prealloc(), before > bootmem HugeTLB folios are processed, and drop the later initialization > from hugetlb_vmemmap_init(). > > This bug only affects CONFIG_DEBUG_VM kernels, where the relevant > assertion is evaluated. > > Fixes: 622026e87c40 ("mm/hugetlb: remove fake head pages") > Signed-off-by: Muchun Song <songmuchun@bytedance.com> For the correctness of the change Acked-by: Oscar Salvador <osalvador@suse.de> but I have a couple of comments: > --- > mm/hugetlb.c | 19 +++++++++++++++++++ > mm/hugetlb_vmemmap.c | 17 ----------------- > 2 files changed, 19 insertions(+), 17 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 31b34ca0f402..d22683ab30a1 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -3382,6 +3382,25 @@ static void __init gather_bootmem_prealloc(void) > .max_threads = num_node_state(N_MEMORY), > .numa_aware = true, > }; > +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP > + struct zone *zone; > + > + for_each_zone(zone) { > + for (int i = 0; i < NR_VMEMMAP_TAILS; i++) { > + struct page *tail, *p; > + unsigned int order; > + > + tail = zone->vmemmap_tails[i]; > + if (!tail) > + continue; > + > + order = i + VMEMMAP_TAIL_MIN_ORDER; > + p = page_to_virt(tail); > + for (int j = 0; j < PAGE_SIZE / sizeof(struct page); j++) > + init_compound_tail(p + j, NULL, order, zone); > + } > + } This deserves a comment why do we need to initialize those pages here, no need for a fat one but a hint, because everybody else looking at this will wonder why do not have it in hugetlb_vmemmap_init(), as the name suggests. > +#endif > > padata_do_multithreaded(&job); > } > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c > index 4a077d231d3a..62e61af18c9a 100644 > --- a/mm/hugetlb_vmemmap.c > +++ b/mm/hugetlb_vmemmap.c > @@ -870,27 +870,10 @@ static const struct ctl_table hugetlb_vmemmap_sysctls[] = { > static int __init hugetlb_vmemmap_init(void) At this point, hugetlb_vmemmap_init only initialites the sysctls for each hstate, should the name reflect that better? Also, vmemmap_get_tail() still thinks that tail pages are initialized here from this comment: "/* * Only allocate the page, but do not initialize it. * * Any initialization done here will be overwritten by memmap_init(). * * hugetlb_vmemmap_init() will take care of initialization after * memmap_init(). */" so we might want to adjust that as well, and I am not sure if we have more comments in the tree referencing hugetlb_vmemmap_init() as the init point for those pages that need correction. Having said that, and this is just a question, we cannot really make gather_bootmem_prealloc() also do the initialization of the sysfs right? I see that hugetlb sysfs gets initialized from hugetlb_init() a few calls after, so.. meh. Anyway, maybe just convert hugetlb_vmemmap_init to hugetlb_vmemmap_sysfs_create (pick a better name), so it does not look like having two entry points for vmemmap init stuff.
> On May 14, 2026, at 15:51, Oscar Salvador <osalvador@suse.de> wrote: > > On Wed, May 13, 2026 at 09:04:29PM +0800, Muchun Song wrote: >> Commit 622026e87c40 ("mm/hugetlb: remove fake head pages") switched >> HVO to reuse per-zone shared tail pages from zone->vmemmap_tails[]. >> >> Those shared tail pages were initialized in hugetlb_vmemmap_init(), but >> bootmem HugeTLB folios are prepared earlier from gather_bootmem_prealloc(). >> With hugetlb_free_vmemmap=on, prep_and_add_bootmem_folios() can access >> pageblock flags on bootmem HugeTLB pages whose mirrored tail struct pages >> already point to the shared tail page. On CONFIG_DEBUG_VM kernels, >> get_pfnblock_bitmap_bitidx() then dereferences the still-uninitialized >> shared tail page and can panic during boot. >> >> Initialize zone->vmemmap_tails[] from gather_bootmem_prealloc(), before >> bootmem HugeTLB folios are processed, and drop the later initialization >> from hugetlb_vmemmap_init(). >> >> This bug only affects CONFIG_DEBUG_VM kernels, where the relevant >> assertion is evaluated. >> >> Fixes: 622026e87c40 ("mm/hugetlb: remove fake head pages") >> Signed-off-by: Muchun Song <songmuchun@bytedance.com> > > For the correctness of the change > > Acked-by: Oscar Salvador <osalvador@suse.de> Thanks. > > but I have a couple of comments: > >> --- >> mm/hugetlb.c | 19 +++++++++++++++++++ >> mm/hugetlb_vmemmap.c | 17 ----------------- >> 2 files changed, 19 insertions(+), 17 deletions(-) >> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >> index 31b34ca0f402..d22683ab30a1 100644 >> --- a/mm/hugetlb.c >> +++ b/mm/hugetlb.c >> @@ -3382,6 +3382,25 @@ static void __init gather_bootmem_prealloc(void) >> .max_threads = num_node_state(N_MEMORY), >> .numa_aware = true, >> }; >> +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP >> + struct zone *zone; >> + >> + for_each_zone(zone) { >> + for (int i = 0; i < NR_VMEMMAP_TAILS; i++) { >> + struct page *tail, *p; >> + unsigned int order; >> + >> + tail = zone->vmemmap_tails[i]; >> + if (!tail) >> + continue; >> + >> + order = i + VMEMMAP_TAIL_MIN_ORDER; >> + p = page_to_virt(tail); >> + for (int j = 0; j < PAGE_SIZE / sizeof(struct page); j++) >> + init_compound_tail(p + j, NULL, order, zone); >> + } >> + } > > This deserves a comment why do we need to initialize those pages here, no need for > a fat one but a hint, because everybody else looking at this will wonder why do not > have it in hugetlb_vmemmap_init(), as the name suggests. Make sense. > > >> +#endif >> >> padata_do_multithreaded(&job); >> } >> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c >> index 4a077d231d3a..62e61af18c9a 100644 >> --- a/mm/hugetlb_vmemmap.c >> +++ b/mm/hugetlb_vmemmap.c >> @@ -870,27 +870,10 @@ static const struct ctl_table hugetlb_vmemmap_sysctls[] = { >> static int __init hugetlb_vmemmap_init(void) > > At this point, hugetlb_vmemmap_init only initialites the sysctls for > each hstate, should the name reflect that better? A good point. I can add a new cleanup patch. > > Also, vmemmap_get_tail() still thinks that tail pages are initialized > here from this comment: Good catch. I'll fix it. > > "/* > * Only allocate the page, but do not initialize it. > * > * Any initialization done here will be overwritten by memmap_init(). > * > * hugetlb_vmemmap_init() will take care of initialization after > * memmap_init(). > */" > > so we might want to adjust that as well, and I am not sure if we have > more comments in the tree referencing hugetlb_vmemmap_init() as the init > point for those pages that need correction. Only this one. > > > Having said that, and this is just a question, we cannot really make > gather_bootmem_prealloc() also do the initialization of the sysfs right? > I see that hugetlb sysfs gets initialized from hugetlb_init() a few > calls after, so.. meh. hugetlb_vmemmap_init() is introduced by me. I want to hide some symbols from mm/hugetlb_vmemmap.c at that time. So I used late_initcall() for that. > > Anyway, maybe just convert hugetlb_vmemmap_init to hugetlb_vmemmap_sysfs_create > (pick a better name), so it does not look like having two entry points > for vmemmap init stuff. This bug fix is a temporary change, the whole zone initialization will removed later in patch 30. Then, there will be only one place for vmemmap init :) I’m curious about your thoughts on the practice of minimizing external symbols. Personally, I’d prefer to move these initializations into their respective files: - hugetlb_sysfs_init(); - hugetlb_cgroup_file_init(); - hugetlb_sysctl_init(); Of course, that’s just my personal preference. Muchun, Thanks. > > -- > Oscar Salvador > SUSE Labs
diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 31b34ca0f402..d22683ab30a1 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3382,6 +3382,25 @@ static void __init gather_bootmem_prealloc(void) .max_threads = num_node_state(N_MEMORY), .numa_aware = true, }; +#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP + struct zone *zone; + + for_each_zone(zone) { + for (int i = 0; i < NR_VMEMMAP_TAILS; i++) { + struct page *tail, *p; + unsigned int order; + + tail = zone->vmemmap_tails[i]; + if (!tail) + continue; + + order = i + VMEMMAP_TAIL_MIN_ORDER; + p = page_to_virt(tail); + for (int j = 0; j < PAGE_SIZE / sizeof(struct page); j++) + init_compound_tail(p + j, NULL, order, zone); + } + } +#endif padata_do_multithreaded(&job); } diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c index 4a077d231d3a..62e61af18c9a 100644 --- a/mm/hugetlb_vmemmap.c +++ b/mm/hugetlb_vmemmap.c @@ -870,27 +870,10 @@ static const struct ctl_table hugetlb_vmemmap_sysctls[] = { static int __init hugetlb_vmemmap_init(void) { const struct hstate *h; - struct zone *zone; /* HUGETLB_VMEMMAP_RESERVE_SIZE should cover all used struct pages */ BUILD_BUG_ON(__NR_USED_SUBPAGE > HUGETLB_VMEMMAP_RESERVE_PAGES); - for_each_zone(zone) { - for (int i = 0; i < NR_VMEMMAP_TAILS; i++) { - struct page *tail, *p; - unsigned int order; - - tail = zone->vmemmap_tails[i]; - if (!tail) - continue; - - order = i + VMEMMAP_TAIL_MIN_ORDER; - p = page_to_virt(tail); - for (int j = 0; j < PAGE_SIZE / sizeof(struct page); j++) - init_compound_tail(p + j, NULL, order, zone); - } - } - for_each_hstate(h) { if (hugetlb_vmemmap_optimizable(h)) { register_sysctl_init("vm", hugetlb_vmemmap_sysctls);
Commit 622026e87c40 ("mm/hugetlb: remove fake head pages") switched HVO to reuse per-zone shared tail pages from zone->vmemmap_tails[]. Those shared tail pages were initialized in hugetlb_vmemmap_init(), but bootmem HugeTLB folios are prepared earlier from gather_bootmem_prealloc(). With hugetlb_free_vmemmap=on, prep_and_add_bootmem_folios() can access pageblock flags on bootmem HugeTLB pages whose mirrored tail struct pages already point to the shared tail page. On CONFIG_DEBUG_VM kernels, get_pfnblock_bitmap_bitidx() then dereferences the still-uninitialized shared tail page and can panic during boot. Initialize zone->vmemmap_tails[] from gather_bootmem_prealloc(), before bootmem HugeTLB folios are processed, and drop the later initialization from hugetlb_vmemmap_init(). This bug only affects CONFIG_DEBUG_VM kernels, where the relevant assertion is evaluated. Fixes: 622026e87c40 ("mm/hugetlb: remove fake head pages") Signed-off-by: Muchun Song <songmuchun@bytedance.com> --- mm/hugetlb.c | 19 +++++++++++++++++++ mm/hugetlb_vmemmap.c | 17 ----------------- 2 files changed, 19 insertions(+), 17 deletions(-)