[RFC,7/7] mm: better document PG_reserved

Message ID 20181205122851.5891-8-david@redhat.com
State New
Headers show
Series
  • mm: PG_reserved cleanups and documentation
Related show

Checks

Context Check Description
snowpatch_ozlabs/checkpatch success total: 0 errors, 0 warnings, 0 checks, 24 lines checked
snowpatch_ozlabs/build-pmac32 success build succeded & removed 0 sparse warning(s)
snowpatch_ozlabs/build-ppc64e success build succeded & removed 0 sparse warning(s)
snowpatch_ozlabs/build-ppc64be success build succeded & removed 0 sparse warning(s)
snowpatch_ozlabs/build-ppc64le success build succeded & removed 0 sparse warning(s)
snowpatch_ozlabs/apply_patch success next/apply_patch Successfully applied

Commit Message

David Hildenbrand Dec. 5, 2018, 12:28 p.m.
The usage of PG_reserved and how PG_reserved pages are to be treated is
burried deep down in different parts of the kernel. Let's shine some light
onto these details by documenting (most?) current users and expected
behavior.

I don't see a reason why we have to document "Some of them might not even
exist". If there is a user, we should document it. E.g. for balloon
drivers we now use PG_offline to indicate that a page might currently
not be backed by memory in the hypervisor. And that is independent from
PG_reserved.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Miles Chen <miles.chen@mediatek.com>
Cc: yi.z.zhang@linux.intel.com
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/page-flags.h | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

Comments

Michal Hocko Dec. 5, 2018, 12:59 p.m. | #1
On Wed 05-12-18 13:28:51, David Hildenbrand wrote:
> The usage of PG_reserved and how PG_reserved pages are to be treated is
> burried deep down in different parts of the kernel. Let's shine some light
> onto these details by documenting (most?) current users and expected
> behavior.
> 
> I don't see a reason why we have to document "Some of them might not even
> exist". If there is a user, we should document it. E.g. for balloon
> drivers we now use PG_offline to indicate that a page might currently
> not be backed by memory in the hypervisor. And that is independent from
> PG_reserved.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Stephen Rothwell <sfr@canb.auug.org.au>
> Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
> Cc: Miles Chen <miles.chen@mediatek.com>
> Cc: yi.z.zhang@linux.intel.com
> Cc: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

This looks like an improvement. The essential part is that PG_reserved
page belongs to its user and no generic code should touch it. The rest
is a description of current users which I haven't checked due to to lack
of time but yeah, I like the updated wording because I have seen
multiple people confused from the swapped out part which is not true for
many many years. I have tried to dig out when it was actually the case
but failed.

So I cannot give my Ack because I didn't really do a real review but I
like this FWIW.

> ---
>  include/linux/page-flags.h | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 68b8495e2fbc..112526f5ba61 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -17,8 +17,22 @@
>  /*
>   * Various page->flags bits:
>   *
> - * PG_reserved is set for special pages, which can never be swapped out. Some
> - * of them might not even exist...
> + * PG_reserved is set for special pages. The "struct page" of such a page
> + * should in general not be touched (e.g. set dirty) except by their owner.
> + * Pages marked as PG_reserved include:
> + * - Kernel image (including vDSO) and similar (e.g. BIOS, initrd)
> + * - Pages allocated early during boot (bootmem, memblock)
> + * - Zero pages
> + * - Pages that have been associated with a zone but are not available for
> + *   the page allocator (e.g. excluded via online_page_callback())
> + * - Pages to exclude from the hibernation image (e.g. loaded kexec images)
> + * - MMIO pages (communicate with a device, special caching strategy needed)
> + * - MCA pages on ia64 (pages with memory errors)
> + * - Device memory (e.g. PMEM, DAX, HMM)
> + * Some architectures don't allow to ioremap pages that are not marked
> + * PG_reserved (as they might be in use by somebody else who does not respect
> + * the caching strategy). Consequently, PG_reserved for a page mapped into
> + * user space can indicate the zero page, the vDSO, MMIO pages or device memory.
>   *
>   * The PG_private bitflag is set on pagecache pages if they contain filesystem
>   * specific data (which is normally at page->private). It can be used by
> -- 
> 2.17.2
>
Matthew Wilcox Dec. 5, 2018, 2:35 p.m. | #2
On Wed, Dec 05, 2018 at 01:28:51PM +0100, David Hildenbrand wrote:
> I don't see a reason why we have to document "Some of them might not even
> exist". If there is a user, we should document it. E.g. for balloon
> drivers we now use PG_offline to indicate that a page might currently
> not be backed by memory in the hypervisor. And that is independent from
> PG_reserved.

I think you're confused by the meaning of "some of them might not even
exist".  What this means is that there might not be memory there; maybe
writes to that memory will be discarded, or maybe they'll cause a machine
check.  Maybe reads will return ~0, or 0, or cause a machine check.
We just don't know what's there, and we shouldn't try touching the memory.

> +++ b/include/linux/page-flags.h
> @@ -17,8 +17,22 @@
>  /*
>   * Various page->flags bits:
>   *
> - * PG_reserved is set for special pages, which can never be swapped out. Some
> - * of them might not even exist...
> + * PG_reserved is set for special pages. The "struct page" of such a page
> + * should in general not be touched (e.g. set dirty) except by their owner.
> + * Pages marked as PG_reserved include:
> + * - Kernel image (including vDSO) and similar (e.g. BIOS, initrd)
> + * - Pages allocated early during boot (bootmem, memblock)
> + * - Zero pages
> + * - Pages that have been associated with a zone but are not available for
> + *   the page allocator (e.g. excluded via online_page_callback())
> + * - Pages to exclude from the hibernation image (e.g. loaded kexec images)
> + * - MMIO pages (communicate with a device, special caching strategy needed)
> + * - MCA pages on ia64 (pages with memory errors)
> + * - Device memory (e.g. PMEM, DAX, HMM)
> + * Some architectures don't allow to ioremap pages that are not marked
> + * PG_reserved (as they might be in use by somebody else who does not respect
> + * the caching strategy). Consequently, PG_reserved for a page mapped into
> + * user space can indicate the zero page, the vDSO, MMIO pages or device memory.

So maybe just add one more option to the list.
David Hildenbrand Dec. 5, 2018, 3:05 p.m. | #3
On 05.12.18 15:35, Matthew Wilcox wrote:
> On Wed, Dec 05, 2018 at 01:28:51PM +0100, David Hildenbrand wrote:
>> I don't see a reason why we have to document "Some of them might not even
>> exist". If there is a user, we should document it. E.g. for balloon
>> drivers we now use PG_offline to indicate that a page might currently
>> not be backed by memory in the hypervisor. And that is independent from
>> PG_reserved.
> 
> I think you're confused by the meaning of "some of them might not even
> exist".  What this means is that there might not be memory there; maybe
> writes to that memory will be discarded, or maybe they'll cause a machine
> check.  Maybe reads will return ~0, or 0, or cause a machine check.
> We just don't know what's there, and we shouldn't try touching the memory.

If there are users, let's document it. And I need more details for that :)

1. machine check: if there is a HW error, we set PG_hwpoison (except
ia64 MCA, see the list)

2. Writes to that memory will be discarded

Who is the user of that? When will we have such pages right now?

3. Reads will return ~0, / 0?

I think this is a special case of e.g. x86? But where do we have that,
are there any user?


In summary: When can we have memory sections that are online but pages
reserved and not accessible? (one example is ballooning I mention here)

(I classify this as dangerous as dump tools will happily dump
PG_reserved pages (unless PG_hwpoison/PG_offline) and that's the right
thing to do).

I want to avoid documenting things that are not actually getting used.

> 
>> +++ b/include/linux/page-flags.h
>> @@ -17,8 +17,22 @@
>>  /*
>>   * Various page->flags bits:
>>   *
>> - * PG_reserved is set for special pages, which can never be swapped out. Some
>> - * of them might not even exist...
>> + * PG_reserved is set for special pages. The "struct page" of such a page
>> + * should in general not be touched (e.g. set dirty) except by their owner.
>> + * Pages marked as PG_reserved include:
>> + * - Kernel image (including vDSO) and similar (e.g. BIOS, initrd)
>> + * - Pages allocated early during boot (bootmem, memblock)
>> + * - Zero pages
>> + * - Pages that have been associated with a zone but are not available for
>> + *   the page allocator (e.g. excluded via online_page_callback())
>> + * - Pages to exclude from the hibernation image (e.g. loaded kexec images)
>> + * - MMIO pages (communicate with a device, special caching strategy needed)
>> + * - MCA pages on ia64 (pages with memory errors)
>> + * - Device memory (e.g. PMEM, DAX, HMM)
>> + * Some architectures don't allow to ioremap pages that are not marked
>> + * PG_reserved (as they might be in use by somebody else who does not respect
>> + * the caching strategy). Consequently, PG_reserved for a page mapped into
>> + * user space can indicate the zero page, the vDSO, MMIO pages or device memory.
> 
> So maybe just add one more option to the list.
>
Matthew Wilcox Dec. 5, 2018, 5:32 p.m. | #4
On Wed, Dec 05, 2018 at 04:05:12PM +0100, David Hildenbrand wrote:
> On 05.12.18 15:35, Matthew Wilcox wrote:
> > On Wed, Dec 05, 2018 at 01:28:51PM +0100, David Hildenbrand wrote:
> >> I don't see a reason why we have to document "Some of them might not even
> >> exist". If there is a user, we should document it. E.g. for balloon
> >> drivers we now use PG_offline to indicate that a page might currently
> >> not be backed by memory in the hypervisor. And that is independent from
> >> PG_reserved.
> > 
> > I think you're confused by the meaning of "some of them might not even
> > exist".  What this means is that there might not be memory there; maybe
> > writes to that memory will be discarded, or maybe they'll cause a machine
> > check.  Maybe reads will return ~0, or 0, or cause a machine check.
> > We just don't know what's there, and we shouldn't try touching the memory.
> 
> If there are users, let's document it. And I need more details for that :)
> 
> 1. machine check: if there is a HW error, we set PG_hwpoison (except
> ia64 MCA, see the list)
> 
> 2. Writes to that memory will be discarded
> 
> Who is the user of that? When will we have such pages right now?
> 
> 3. Reads will return ~0, / 0?
> 
> I think this is a special case of e.g. x86? But where do we have that,
> are there any user?

When there are gaps in the physical memory.  As in, if you put that
physical address on the bus (or in a packet), no device will respond
to it.  Look:

00000000-00000fff : Reserved
00001000-00057fff : System RAM
00058000-00058fff : Reserved
00059000-0009dfff : System RAM
0009e000-000fffff : Reserved

Those examples I gave are examples of how various different architectures
respond to "no device responded to this memory access".
David Hildenbrand Dec. 5, 2018, 6:13 p.m. | #5
On 05.12.18 18:32, Matthew Wilcox wrote:
> On Wed, Dec 05, 2018 at 04:05:12PM +0100, David Hildenbrand wrote:
>> On 05.12.18 15:35, Matthew Wilcox wrote:
>>> On Wed, Dec 05, 2018 at 01:28:51PM +0100, David Hildenbrand wrote:
>>>> I don't see a reason why we have to document "Some of them might not even
>>>> exist". If there is a user, we should document it. E.g. for balloon
>>>> drivers we now use PG_offline to indicate that a page might currently
>>>> not be backed by memory in the hypervisor. And that is independent from
>>>> PG_reserved.
>>>
>>> I think you're confused by the meaning of "some of them might not even
>>> exist".  What this means is that there might not be memory there; maybe
>>> writes to that memory will be discarded, or maybe they'll cause a machine
>>> check.  Maybe reads will return ~0, or 0, or cause a machine check.
>>> We just don't know what's there, and we shouldn't try touching the memory.
>>
>> If there are users, let's document it. And I need more details for that :)
>>
>> 1. machine check: if there is a HW error, we set PG_hwpoison (except
>> ia64 MCA, see the list)
>>
>> 2. Writes to that memory will be discarded
>>
>> Who is the user of that? When will we have such pages right now?
>>
>> 3. Reads will return ~0, / 0?
>>
>> I think this is a special case of e.g. x86? But where do we have that,
>> are there any user?
> 
> When there are gaps in the physical memory.  As in, if you put that
> physical address on the bus (or in a packet), no device will respond
> to it.  Look:
> 
> 00000000-00000fff : Reserved
> 00001000-00057fff : System RAM
> 00058000-00058fff : Reserved
> 00059000-0009dfff : System RAM
> 0009e000-000fffff : Reserved
> 
> Those examples I gave are examples of how various different architectures
> respond to "no device responded to this memory access".
> 

Okay, so for this memory we will have
a) vmmaps
b) Memory block devices
c) A sections that is online

So essentially "Gaps in physical memory" which is part of a online section.

This might be a candidate for PG_offline as well.

Thanks for the info, I'll try to find out how such things are handled.
In general I assume this memory has to be readable, because otherwise
kdump and friends would crash already when trying to dump?
David Hildenbrand Dec. 6, 2018, 10:46 a.m. | #6
On 05.12.18 19:13, David Hildenbrand wrote:
> On 05.12.18 18:32, Matthew Wilcox wrote:
>> On Wed, Dec 05, 2018 at 04:05:12PM +0100, David Hildenbrand wrote:
>>> On 05.12.18 15:35, Matthew Wilcox wrote:
>>>> On Wed, Dec 05, 2018 at 01:28:51PM +0100, David Hildenbrand wrote:
>>>>> I don't see a reason why we have to document "Some of them might not even
>>>>> exist". If there is a user, we should document it. E.g. for balloon
>>>>> drivers we now use PG_offline to indicate that a page might currently
>>>>> not be backed by memory in the hypervisor. And that is independent from
>>>>> PG_reserved.
>>>>
>>>> I think you're confused by the meaning of "some of them might not even
>>>> exist".  What this means is that there might not be memory there; maybe
>>>> writes to that memory will be discarded, or maybe they'll cause a machine
>>>> check.  Maybe reads will return ~0, or 0, or cause a machine check.
>>>> We just don't know what's there, and we shouldn't try touching the memory.
>>>
>>> If there are users, let's document it. And I need more details for that :)
>>>
>>> 1. machine check: if there is a HW error, we set PG_hwpoison (except
>>> ia64 MCA, see the list)
>>>
>>> 2. Writes to that memory will be discarded
>>>
>>> Who is the user of that? When will we have such pages right now?
>>>
>>> 3. Reads will return ~0, / 0?
>>>
>>> I think this is a special case of e.g. x86? But where do we have that,
>>> are there any user?
>>
>> When there are gaps in the physical memory.  As in, if you put that
>> physical address on the bus (or in a packet), no device will respond
>> to it.  Look:
>>
>> 00000000-00000fff : Reserved
>> 00001000-00057fff : System RAM
>> 00058000-00058fff : Reserved
>> 00059000-0009dfff : System RAM
>> 0009e000-000fffff : Reserved
>>
>> Those examples I gave are examples of how various different architectures
>> respond to "no device responded to this memory access".
>>
> 
> Okay, so for this memory we will have
> a) vmmaps
> b) Memory block devices
> c) A sections that is online
> 
> So essentially "Gaps in physical memory" which is part of a online section.
> 
> This might be a candidate for PG_offline as well.
> 
> Thanks for the info, I'll try to find out how such things are handled.
> In general I assume this memory has to be readable, because otherwise
> kdump and friends would crash already when trying to dump?
> 

So I finally understood how physical memory holes in online sections are
handled when dumping. They won't be dumped because the list of dumpable
chunks (contained in /proc/kcore and after a crash /proc/vmcore) is
built using walk_system_ram_range(). So anything not listed as RAM will
be ignored.

I will update the documentation, describing that if we have an online
section that is not completely IORESOURCE_SYSTEM_RAM, that the physical
memory gaps will also be set to PG_reserved.

Trying to touch this memory is indeed dangerous, luckily dumping is
properly handled.

I'll think about if marking these ranges as PG_offline might make sense
(and if it can be easily added). Then we directly know when seeing that
page type that we should not touch it. Ever.

That hint was really helpful :)

Patch

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 68b8495e2fbc..112526f5ba61 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -17,8 +17,22 @@ 
 /*
  * Various page->flags bits:
  *
- * PG_reserved is set for special pages, which can never be swapped out. Some
- * of them might not even exist...
+ * PG_reserved is set for special pages. The "struct page" of such a page
+ * should in general not be touched (e.g. set dirty) except by their owner.
+ * Pages marked as PG_reserved include:
+ * - Kernel image (including vDSO) and similar (e.g. BIOS, initrd)
+ * - Pages allocated early during boot (bootmem, memblock)
+ * - Zero pages
+ * - Pages that have been associated with a zone but are not available for
+ *   the page allocator (e.g. excluded via online_page_callback())
+ * - Pages to exclude from the hibernation image (e.g. loaded kexec images)
+ * - MMIO pages (communicate with a device, special caching strategy needed)
+ * - MCA pages on ia64 (pages with memory errors)
+ * - Device memory (e.g. PMEM, DAX, HMM)
+ * Some architectures don't allow to ioremap pages that are not marked
+ * PG_reserved (as they might be in use by somebody else who does not respect
+ * the caching strategy). Consequently, PG_reserved for a page mapped into
+ * user space can indicate the zero page, the vDSO, MMIO pages or device memory.
  *
  * The PG_private bitflag is set on pagecache pages if they contain filesystem
  * specific data (which is normally at page->private). It can be used by