
[v9,08/12] mm: zero reserved and unavailable struct pages

Message ID: 20170920201714.19817-9-pasha.tatashin@oracle.com (mailing list archive)
State: Not Applicable
Series: complete deferred page initialization

Commit Message

Pavel Tatashin Sept. 20, 2017, 8:17 p.m. UTC
Some memory is reserved but unavailable: not present in memblock.memory
(because not backed by physical pages), but present in memblock.reserved.
Such memory has backing struct pages, but they are not initialized by going
through __init_single_page().

In some cases these struct pages are accessed even if they do not contain
any data. One example is that page_to_pfn() might access page->flags if
this is where the section information is stored (CONFIG_SPARSEMEM,
SECTION_IN_PAGE_FLAGS).
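
For illustration, with CONFIG_SPARSEMEM and SECTION_IN_PAGE_FLAGS the
section number lives in the upper bits of page->flags, so reading it
back from a never-initialized struct page yields garbage (simplified
from include/linux/mm.h):

	static inline unsigned long page_to_section(const struct page *page)
	{
		return (page->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK;
	}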

Since struct pages are zeroed in __init_single_page(), and not at
allocation time, we must zero such struct pages explicitly.

This patch adds a new memblock iterator:
	for_each_resv_unavail_range(i, p_start, p_end)

which iterates through the (reserved && !memory) ranges; we zero their
struct pages explicitly by calling mm_zero_struct_page().
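
mm_zero_struct_page() is introduced earlier in this series; as a sketch
(ignoring any architecture-specific optimization a given config may
provide), it amounts to clearing the whole struct page:

	/* sketch: equivalent to zeroing the entire struct page */
	#define mm_zero_struct_page(pp)	((void)memset((pp), 0, sizeof(struct page)))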

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
---
 include/linux/memblock.h | 16 ++++++++++++++++
 include/linux/mm.h       |  6 ++++++
 mm/page_alloc.c          | 30 ++++++++++++++++++++++++++++++
 3 files changed, 52 insertions(+)

Comments

Michal Hocko Oct. 3, 2017, 1:18 p.m. UTC | #1
On Wed 20-09-17 16:17:10, Pavel Tatashin wrote:
> Some memory is reserved but unavailable: not present in memblock.memory
> (because not backed by physical pages), but present in memblock.reserved.
> Such memory has backing struct pages, but they are not initialized by going
> through __init_single_page().

Could you be more specific about where such memory is reserved?
Pavel Tatashin Oct. 3, 2017, 3:29 p.m. UTC | #2
On 10/03/2017 09:18 AM, Michal Hocko wrote:
> On Wed 20-09-17 16:17:10, Pavel Tatashin wrote:
>> Some memory is reserved but unavailable: not present in memblock.memory
>> (because not backed by physical pages), but present in memblock.reserved.
>> Such memory has backing struct pages, but they are not initialized by going
>> through __init_single_page().
> 
> Could you be more specific about where such memory is reserved?
> 

I know of one example: trim_low_memory_range() unconditionally reserves
from pfn 0, but e820__memblock_setup() might provide the existing memory
from pfn 1 (e.g. KVM).
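
For reference, the reservation in question looks roughly like this
(from arch/x86/kernel/setup.c at the time):

	static void __init trim_low_memory_range(void)
	{
		/* unconditionally reserve [0, reserve_low), starting at pfn 0 */
		memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
	}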

But, there could be more based on this comment from linux/page-flags.h:

  19  * PG_reserved is set for special pages, which can never be swapped out. Some
  20  * of them might not even exist (eg empty_bad_page)...

Pasha
Michal Hocko Oct. 4, 2017, 8:56 a.m. UTC | #3
On Tue 03-10-17 11:29:16, Pasha Tatashin wrote:
> On 10/03/2017 09:18 AM, Michal Hocko wrote:
> > On Wed 20-09-17 16:17:10, Pavel Tatashin wrote:
> > > Some memory is reserved but unavailable: not present in memblock.memory
> > > (because not backed by physical pages), but present in memblock.reserved.
> > > Such memory has backing struct pages, but they are not initialized by going
> > > through __init_single_page().
> > 
> > Could you be more specific about where such memory is reserved?
> > 
> 
> I know of one example: trim_low_memory_range() unconditionally reserves from
> pfn 0, but e820__memblock_setup() might provide the existing memory from pfn
> 1 (e.g. KVM).

Then just initialize struct pages for that mapping right there where a
special API is used.

> But, there could be more based on this comment from linux/page-flags.h:
> 
>  19  * PG_reserved is set for special pages, which can never be swapped out. Some
>  20  * of them might not even exist (eg empty_bad_page)...

I have no idea what empty_bad_page is but a quick grep shows that this is
never used. I might be wrong here but if somebody is reserving memory
in a special way then we should handle the initialization right there.
E.g. create an API for special memblock reservations.
Pavel Tatashin Oct. 4, 2017, 12:40 p.m. UTC | #4
>>> Could you be more specific about where such memory is reserved?
>>>
>>
>> I know of one example: trim_low_memory_range() unconditionally reserves from
>> pfn 0, but e820__memblock_setup() might provide the existing memory from pfn
>> 1 (e.g. KVM).
> 
> Then just initialize struct pages for that mapping right there where a
> special API is used.
> 
>> But, there could be more based on this comment from linux/page-flags.h:
>>
>>   19  * PG_reserved is set for special pages, which can never be swapped out. Some
>>   20  * of them might not even exist (eg empty_bad_page)...
> 
> I have no idea what empty_bad_page is but a quick grep shows that this is
> never used. I might be wrong here but if somebody is reserving memory
> in a special way then we should handle the initialization right there.
> E.g. create an API for special memblock reservations.
> 

Hi Michal,

The reservations happen before struct pages are allocated and mapped. 
So, it is not always possible to do it at call sites.

Previously, I have solved this problem like this:

https://patchwork.kernel.org/patch/9886163

But, I was not too happy with that approach, so I replaced it with the
current approach as it is more generic and solves similar issues if
they happen in other places. Also, the comment in page-flags got me
scared that there are probably other places, perhaps on other
architectures, that can have a similar issue.

In addition, I did not like my solution: I was simply shrinking the low
reservation from [0, reserve_low) to [min_pfn, reserve_low). But if
min_pfn > reserve_low, can we skip the low reservation entirely? I was
not sure.

The current approach notifies us if there are such pages, and we can
fix/remove them in the future without crashing the kernel in the meantime.

Pasha
Michal Hocko Oct. 4, 2017, 12:57 p.m. UTC | #5
On Wed 04-10-17 08:40:11, Pasha Tatashin wrote:
> > > > Could you be more specific about where such memory is reserved?
> > > > 
> > > 
> > > I know of one example: trim_low_memory_range() unconditionally reserves from
> > > pfn 0, but e820__memblock_setup() might provide the existing memory from pfn
> > > 1 (e.g. KVM).
> > 
> > Then just initialize struct pages for that mapping right there where a
> > special API is used.
> > 
> > > But, there could be more based on this comment from linux/page-flags.h:
> > > 
> > >   19  * PG_reserved is set for special pages, which can never be swapped out. Some
> > >   20  * of them might not even exist (eg empty_bad_page)...
> > 
> > I have no idea what empty_bad_page is but a quick grep shows that this is
> > never used. I might be wrong here but if somebody is reserving memory
> > in a special way then we should handle the initialization right there.
> > E.g. create an API for special memblock reservations.
> > 
> 
> Hi Michal,
> 
> The reservations happen before struct pages are allocated and mapped. So, it
> is not always possible to do it at call sites.

OK, I didn't realize that.
 
> Previously, I have solved this problem like this:
> 
> https://patchwork.kernel.org/patch/9886163
> 
> But, I was not too happy with that approach, so I replaced it with the
> current approach as it is more generic and solves similar issues if they
> happen in other places. Also, the comment in page-flags got me scared that
> there are probably other places, perhaps on other architectures, that can
> have a similar issue.

I believe the comment is just stale. I have looked into empty_bad_page
and it is just a relic. I plan to post a patch soon.

> In addition, I did not like my solution: I was simply shrinking the low
> reservation from [0, reserve_low) to [min_pfn, reserve_low). But if
> min_pfn > reserve_low, can we skip the low reservation entirely? I was
> not sure.
> 
> The current approach notifies us if there are such pages, and we can
> fix/remove them in the future without crashing the kernel in the meantime.

I am not really familiar with the trim_low_memory_range code path. I am
not even sure we have to care about it because nobody should be walking
pfns outside of any zone. I am worried that this patch adds code which
is not really used and it will just stay that way forever because
nobody will dare to change it as it is too obscure and not explained
very well. trim_low_memory_range is a good example of this. Why do we
even reserve this range from the memory block allocator? The memory
shouldn't be backed by any real memory and thus not in the allocator in
the first place, no?
Pavel Tatashin Oct. 4, 2017, 1:28 p.m. UTC | #6
> I am not really familiar with the trim_low_memory_range code path. I am
> not even sure we have to care about it because nobody should be walking
> pfns outside of any zone.

According to the commit comments, the first 4K belongs to the BIOS, so I
think the memory exists but the BIOS may or may not report it to Linux.
So, reserve it to make sure we never touch it.

> I am worried that this patch adds code which
> is not really used and it will just stay that way forever because
> nobody will dare to change it as it is too obscure and not explained
> very well.

I could explain my code better. Perhaps add more comments, and explain
when it can be removed?

> trim_low_memory_range is a good example of this. Why do we
> even reserve this range from the memory block allocator? The memory
> shouldn't be backed by any real memory and thus not in the allocator in
> the first place, no?
> 

Since it is not enforced in memblock that everything in the reserved list
must be part of the memory list, we can have such ranges, and we need to
make sure the kernel does not panic. Otherwise, it is very hard to detect
such bugs.
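
For instance, nothing prevents a sketch like the following from
succeeding, even though the range was never registered in
memblock.memory via memblock_add():

	/* reserve a range that is absent from memblock.memory */
	memblock_reserve(0, PAGE_SIZE);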
Michal Hocko Oct. 4, 2017, 2:04 p.m. UTC | #7
On Wed 04-10-17 09:28:55, Pasha Tatashin wrote:
> 
> > I am not really familiar with the trim_low_memory_range code path. I am
> > not even sure we have to care about it because nobody should be walking
> > pfns outside of any zone.
> 
> According to the commit comments, the first 4K belongs to the BIOS, so I
> think the memory exists but the BIOS may or may not report it to Linux.
> So, reserve it to make sure we never touch it.

Yes and that memory should be outside of any zones, no?

> > I am worried that this patch adds code which
> > is not really used and it will just stay that way forever because
> > nobody will dare to change it as it is too obscure and not explained
> > very well.
> 
> I could explain my code better. Perhaps add more comments, and explain
> when it can be removed?

More explanation would definitely be helpful.

> > trim_low_memory_range is a good example of this. Why do we
> > even reserve this range from the memory block allocator? The memory
> > shouldn't be backed by any real memory and thus not in the allocator in
> > the first place, no?
> > 
> 
> Since it is not enforced in memblock that everything in the reserved list
> must be part of the memory list, we can have such ranges, and we need to
> make sure the kernel does not panic. Otherwise, it is very hard to detect
> such bugs.

So, should we report such a memblock reservation API (ab)use to the log?
Are you actually sure that trim_low_memory_range is doing a sane and
really needed thing? In other words, do we have a zone which contains
these pfns that are not backed by any memory?
Pavel Tatashin Oct. 4, 2017, 3:08 p.m. UTC | #8
On 10/04/2017 10:04 AM, Michal Hocko wrote:
> On Wed 04-10-17 09:28:55, Pasha Tatashin wrote:
>>
>>> I am not really familiar with the trim_low_memory_range code path. I am
>>> not even sure we have to care about it because nobody should be walking
>>> pfns outside of any zone.
>>
>> According to the commit comments, the first 4K belongs to the BIOS, so I
>> think the memory exists but the BIOS may or may not report it to Linux.
>> So, reserve it to make sure we never touch it.
> 
> Yes and that memory should be outside of any zones, no?

I am not totally sure; I think some x86 expert could help us here. But,
in either case, this issue can be fixed separately from the rest of the
series.

> 
>>> I am worried that this patch adds code which
>>> is not really used and it will just stay that way forever because
>>> nobody will dare to change it as it is too obscure and not explained
>>> very well.
>>
>> I could explain my code better. Perhaps add more comments, and explain
>> when it can be removed?
> 
> More explanation would definitely be helpful.
> 
>>> trim_low_memory_range is a good example of this. Why do we
>>> even reserve this range from the memory block allocator? The memory
>>> shouldn't be backed by any real memory and thus not in the allocator in
>>> the first place, no?
>>>
>>
>> Since it is not enforced in memblock that everything in the reserved list
>> must be part of the memory list, we can have such ranges, and we need to
>> make sure the kernel does not panic. Otherwise, it is very hard to detect
>> such bugs.
> 
> So, should we report such a memblock reservation API (ab)use to the log?
> Are you actually sure that trim_low_memory_range is doing a sane and
> really needed thing? In other words, do we have a zone which contains
> these pfns that are not backed by any memory?
> 

And, this patch reports it already:

+	pr_info("Reserved but unavailable: %lld pages", pgcnt);

I could add a comment above this print call, explaining that such memory
is probably bogus and must be studied/fixed. Also, add that this code can
be removed once memblock is changed to allow reserving only memory that
is backed by physical memory, i.e. in the "memory" list.
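
Something along these lines, perhaps:

	/*
	 * Such memory is likely bogus: it is reserved, but not backed by
	 * anything in the "memory" list. This code can be removed once
	 * memblock allows reserving only memory that is backed by physical
	 * memory.
	 */
	pr_info("Reserved but unavailable: %lld pages", pgcnt);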

Pasha

Patch

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index bae11c7e7bf3..bdd4268f9323 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -237,6 +237,22 @@  unsigned long memblock_next_valid_pfn(unsigned long pfn, unsigned long max_pfn);
 	for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved,	\
 			       nid, flags, p_start, p_end, p_nid)
 
+/**
+ * for_each_resv_unavail_range - iterate through reserved and unavailable memory
+ * @i: u64 used as loop variable
+ * @flags: pick from blocks based on memory attributes
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over unavailable but reserved (reserved && !memory) areas of memblock.
+ * Available as soon as memblock is initialized.
+ * Note: because this memory does not belong to any physical node, flags and
+ * nid arguments do not make sense and are thus not exported as arguments.
+ */
+#define for_each_resv_unavail_range(i, p_start, p_end)			\
+	for_each_mem_range(i, &memblock.reserved, &memblock.memory,	\
+			   NUMA_NO_NODE, MEMBLOCK_NONE, p_start, p_end, NULL)
+
 static inline void memblock_set_region_flags(struct memblock_region *r,
 					     unsigned long flags)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 50b74d628243..a7bba4ce79ba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2010,6 +2010,12 @@  extern int __meminit __early_pfn_to_nid(unsigned long pfn,
 					struct mminit_pfnnid_cache *state);
 #endif
 
+#ifdef CONFIG_HAVE_MEMBLOCK
+void zero_resv_unavail(void);
+#else
+static inline void zero_resv_unavail(void) {}
+#endif
+
 extern void set_dma_reserve(unsigned long new_dma_reserve);
 extern void memmap_init_zone(unsigned long, int, unsigned long,
 				unsigned long, enum memmap_context);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4b630ee91430..1d38d391dffd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6202,6 +6202,34 @@  void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 	free_area_init_core(pgdat);
 }
 
+#ifdef CONFIG_HAVE_MEMBLOCK
+/*
+ * Only struct pages that are backed by physical memory are zeroed and
+ * initialized by going through __init_single_page(). But, there are some
+ * struct pages which are reserved in memblock allocator and their fields
+ * may be accessed (for example page_to_pfn() on some configuration accesses
+ * flags). We must explicitly zero those struct pages.
+ */
+void __paginginit zero_resv_unavail(void)
+{
+	phys_addr_t start, end;
+	unsigned long pfn;
+	u64 i, pgcnt;
+
+	/* Loop through ranges that are reserved, but do not have reported
+	 * physical memory backing.
+	 */
+	pgcnt = 0;
+	for_each_resv_unavail_range(i, &start, &end) {
+		for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
+			mm_zero_struct_page(pfn_to_page(pfn));
+			pgcnt++;
+		}
+	}
+	pr_info("Reserved but unavailable: %lld pages", pgcnt);
+}
+#endif /* CONFIG_HAVE_MEMBLOCK */
+
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 
 #if MAX_NUMNODES > 1
@@ -6625,6 +6653,7 @@  void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 			node_set_state(nid, N_MEMORY);
 		check_for_memory(pgdat, nid);
 	}
+	zero_resv_unavail();
 }
 
 static int __init cmdline_parse_core(char *p, unsigned long *core)
@@ -6788,6 +6817,7 @@  void __init free_area_init(unsigned long *zones_size)
 {
 	free_area_init_node(0, zones_size,
 			__pa(PAGE_OFFSET) >> PAGE_SHIFT, NULL);
+	zero_resv_unavail();
 }
 
 static int page_alloc_cpu_dead(unsigned int cpu)