[0/4] arch, mm: improve robustness of direct map manipulation

Message ID 20201025101555.3057-1-rppt@kernel.org (mailing list archive)

Mike Rapoport Oct. 25, 2020, 10:15 a.m. UTC
From: Mike Rapoport <rppt@linux.ibm.com>

Hi,

During recent discussion about KVM protected memory, David raised a concern
about usage of __kernel_map_pages() outside of DEBUG_PAGEALLOC scope [1].

Indeed, for architectures that define CONFIG_ARCH_HAS_SET_DIRECT_MAP it is
possible that __kernel_map_pages() would fail, but since this function is
void, the failure will go unnoticed.

Moreover, there's a lack of consistency in __kernel_map_pages() semantics
across architectures: some guard this function with
#ifdef DEBUG_PAGEALLOC, some refuse to update the direct map if page
allocation debugging is disabled at run time, and some allow modifying the
direct map regardless of DEBUG_PAGEALLOC settings.
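
For illustration, the divergent patterns look roughly like this
(simplified sketches, not verbatim from any architecture;
change_page_attrs() stands in for each arch's attribute-changing
helper):

	/* pattern 1: only defined under a compile-time guard */
	#ifdef CONFIG_DEBUG_PAGEALLOC
	void __kernel_map_pages(struct page *page, int numpages, int enable);
	#endif

	/* pattern 2: bails out when page allocation debugging is off */
	void __kernel_map_pages(struct page *page, int numpages, int enable)
	{
		if (!debug_pagealloc_enabled())
			return;
		change_page_attrs(page, numpages, enable);
	}

	/* pattern 3: updates the direct map unconditionally */
	void __kernel_map_pages(struct page *page, int numpages, int enable)
	{
		change_page_attrs(page, numpages, enable);
	}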

This set straightens this out by restoring the dependency of
__kernel_map_pages() on DEBUG_PAGEALLOC and updating the call sites
accordingly.

[1] https://lore.kernel.org/lkml/2759b4bf-e1e3-d006-7d86-78a40348269d@redhat.com

Mike Rapoport (4):
  mm: introduce debug_pagealloc_map_pages() helper
  PM: hibernate: improve robustness of mapping pages in the direct map
  arch, mm: restore dependency of __kernel_map_pages() on DEBUG_PAGEALLOC
  arch, mm: make kernel_page_present() always available

 arch/Kconfig                        |  3 +++
 arch/arm64/Kconfig                  |  4 +---
 arch/arm64/include/asm/cacheflush.h |  1 +
 arch/arm64/mm/pageattr.c            |  6 +++--
 arch/powerpc/Kconfig                |  5 +----
 arch/riscv/Kconfig                  |  4 +---
 arch/riscv/include/asm/pgtable.h    |  2 --
 arch/riscv/include/asm/set_memory.h |  1 +
 arch/riscv/mm/pageattr.c            | 31 +++++++++++++++++++++++++
 arch/s390/Kconfig                   |  4 +---
 arch/sparc/Kconfig                  |  4 +---
 arch/x86/Kconfig                    |  4 +---
 arch/x86/include/asm/set_memory.h   |  1 +
 arch/x86/mm/pat/set_memory.c        |  4 ++--
 include/linux/mm.h                  | 35 +++++++++++++----------------
 include/linux/set_memory.h          |  5 +++++
 kernel/power/snapshot.c             | 24 ++++++++++++++++++--
 mm/memory_hotplug.c                 |  3 +--
 mm/page_alloc.c                     |  6 ++---
 mm/slab.c                           |  8 +++----
 20 files changed, 97 insertions(+), 58 deletions(-)

Comments

Edgecombe, Rick P Oct. 26, 2020, 1:13 a.m. UTC | #1
On Sun, 2020-10-25 at 12:15 +0200, Mike Rapoport wrote:
> Indeed, for architectures that define CONFIG_ARCH_HAS_SET_DIRECT_MAP
> it is
> possible that __kernel_map_pages() would fail, but since this
> function is
> void, the failure will go unnoticed.

Could you elaborate on how this could happen? Do you mean during
runtime today or if something new was introduced?
Mike Rapoport Oct. 26, 2020, 9:05 a.m. UTC | #2
On Mon, Oct 26, 2020 at 01:13:52AM +0000, Edgecombe, Rick P wrote:
> On Sun, 2020-10-25 at 12:15 +0200, Mike Rapoport wrote:
> > Indeed, for architectures that define CONFIG_ARCH_HAS_SET_DIRECT_MAP
> > it is
> > possible that __kernel_map_pages() would fail, but since this
> > function is
> > void, the failure will go unnoticed.
> 
> Could you elaborate on how this could happen? Do you mean during
> runtime today or if something new was introduced?

A failure in __kernel_map_pages() may happen today. For instance, on x86,
if the kernel is built with DEBUG_PAGEALLOC,

	__kernel_map_pages(page, 1, 0);

will need to split, say, a 2M page, and during the split an allocation of
a page table could fail.
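
To illustrate, the failing split boils down to something like this
(a simplified sketch, not the actual cpa code; the function and helper
names here are made up):

	static int split_large_page_sketch(pmd_t *pmd, unsigned long addr)
	{
		/* the 512 new 4K PTEs need a page-table page of their own */
		pte_t *pte = (pte_t *)get_zeroed_page(GFP_ATOMIC);

		if (!pte)
			return -ENOMEM;	/* a void caller silently drops this */

		fill_ptes_from_pmd(pte, pmd, addr);	/* inherit protections */
		set_pmd(pmd, __pmd(__pa(pte) | _PAGE_TABLE));
		return 0;
	}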

Currently, the only user of __kernel_map_pages() outside DEBUG_PAGEALLOC
is hibernation, but I think it would be safer to entirely prevent usage
of __kernel_map_pages() when DEBUG_PAGEALLOC=n.
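
Roughly, the helper patch 1 adds looks like this (a sketch; the exact
implementation in the patch may differ):

	/* include/linux/mm.h */
	#ifdef CONFIG_DEBUG_PAGEALLOC
	extern void __kernel_map_pages(struct page *page, int numpages,
				       int enable);

	static inline void debug_pagealloc_map_pages(struct page *page,
						     int numpages, int enable)
	{
		if (debug_pagealloc_enabled_static())
			__kernel_map_pages(page, numpages, enable);
	}
	#else
	static inline void debug_pagealloc_map_pages(struct page *page,
						     int numpages, int enable) {}
	#endif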
Edgecombe, Rick P Oct. 26, 2020, 6:05 p.m. UTC | #3
On Mon, 2020-10-26 at 11:05 +0200, Mike Rapoport wrote:
> On Mon, Oct 26, 2020 at 01:13:52AM +0000, Edgecombe, Rick P wrote:
> > On Sun, 2020-10-25 at 12:15 +0200, Mike Rapoport wrote:
> > > Indeed, for architectures that define CONFIG_ARCH_HAS_SET_DIRECT_MAP
> > > it is possible that __kernel_map_pages() would fail, but since this
> > > function is void, the failure will go unnoticed.
> > 
> > Could you elaborate on how this could happen? Do you mean during
> > runtime today or if something new was introduced?
> 
> A failure in __kernel_map_pages() may happen today. For instance, on
> x86, if the kernel is built with DEBUG_PAGEALLOC,
> 
>         __kernel_map_pages(page, 1, 0);
> 
> will need to split, say, a 2M page, and during the split an allocation
> of a page table could fail.

On x86 at least, DEBUG_PAGEALLOC expects to never have to break a page
on the direct map and even disables locking in cpa because it assumes
this. If this is happening somehow anyway then we should probably fix
that. Even if it's a debug feature, it will not be as useful if it is
causing its own crashes.

I'm still wondering if there is something I'm missing here. It seems
like you are saying there is a bug in some arch's, so let's add a WARN
in cross-arch code to log it as it crashes. A warn and making things
clearer seem like good ideas, but if there is a bug we should fix it.
The code around the callers still functionally assume re-mapping can't
fail.

> Currently, the only user of __kernel_map_pages() outside
> DEBUG_PAGEALLOC
> is hibernation, but I think it would be safer to entirely prevent
> usage
> of __kernel_map_pages() when DEBUG_PAGEALLOC=n.

I totally agree it's error prone FWIW. On x86, my mental model of how
it is supposed to work is: If a page is 4k and NP it cannot fail to be
remapped. set_direct_map_invalid_noflush() should result in 4k NP
pages, and DEBUG_PAGEALLOC should result in all 4k pages on the direct
map. Are you seeing this violated or do I have wrong assumptions?

Beyond whatever you are seeing, for the latter case of new things
getting introduced to an interface with hidden dependencies... Another
edge case could be a new caller to set_memory_np() could result in
large NP pages. None of the callers today should cause this AFAICT, but
it's not great to rely on the callers to know these details.
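
For instance (hypothetical caller, illustrative constants), over a
2M-aligned, 2M-sized region cpa could legitimately clear _PAGE_PRESENT
on the PMD itself instead of splitting:

	/* 512 x 4K == 2M: may leave a single large NP page behind */
	set_memory_np(addr_2m_aligned, PTRS_PER_PTE);

A later 4K re-map inside that range would then have to split the large
page first, and the split can fail.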
Mike Rapoport Oct. 27, 2020, 8:38 a.m. UTC | #4
On Mon, Oct 26, 2020 at 06:05:30PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2020-10-26 at 11:05 +0200, Mike Rapoport wrote:
> > [...]
> > 
> > A failure in __kernel_map_pages() may happen today. For instance, on
> > x86, if the kernel is built with DEBUG_PAGEALLOC,
> > 
> >         __kernel_map_pages(page, 1, 0);
> > 
> > will need to split, say, a 2M page, and during the split an allocation
> > of a page table could fail.
> 
> On x86 at least, DEBUG_PAGEALLOC expects to never have to break a page
> on the direct map and even disables locking in cpa because it assumes
> this. If this is happening somehow anyway then we should probably fix
> that. Even if it's a debug feature, it will not be as useful if it is
> causing its own crashes.
> 
> I'm still wondering if there is something I'm missing here. It seems
> like you are saying there is a bug in some arch's, so let's add a WARN
> in cross-arch code to log it as it crashes. A warn and making things
> clearer seem like good ideas, but if there is a bug we should fix it.
> The code around the callers still functionally assume re-mapping can't
> fail.

Oh, I meant an x86 kernel *without* DEBUG_PAGEALLOC, and indeed the call
that unmaps the pages again in safe_copy_page() will just reset a 4K page
to NP, because whatever made it NP in the first place already did the
split.

Still, on arm64 with DEBUG_PAGEALLOC=n there is a possibility of a race
between the map/unmap dance in __vunmap() and safe_copy_page() that may
cause access to unmapped memory:

__vunmap()
    vm_remove_mappings()
        set_direct_map_invalid()
					safe_copy_page()
					    __kernel_map_pages()
					    	return
					    do_copy_page() -> fault

This is a theoretical bug, but it is still not nice :)
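
For reference, the hibernation side of this race looks roughly like this
(simplified from kernel/power/snapshot.c):

	/*
	 * Pre-series code. With DEBUG_PAGEALLOC=n both helpers below
	 * degenerate: kernel_page_present() returns true and
	 * kernel_map_pages() is a no-op, so an unmapped page gets
	 * copied directly and faults.
	 */
	static void safe_copy_page(void *dst, struct page *s_page)
	{
		if (kernel_page_present(s_page)) {
			do_copy_page(dst, page_address(s_page));
		} else {
			kernel_map_pages(s_page, 1, 1);
			do_copy_page(dst, page_address(s_page));
			kernel_map_pages(s_page, 1, 0);
		}
	}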

> > Currently, the only user of __kernel_map_pages() outside
> > DEBUG_PAGEALLOC
> > is hibernation, but I think it would be safer to entirely prevent
> > usage
> > of __kernel_map_pages() when DEBUG_PAGEALLOC=n.
> 
> I totally agree it's error prone FWIW. On x86, my mental model of how
> it is supposed to work is: If a page is 4k and NP it cannot fail to be
> remapped. set_direct_map_invalid_noflush() should result in 4k NP
> pages, and DEBUG_PAGEALLOC should result in all 4k pages on the direct
> map. Are you seeing this violated or do I have wrong assumptions?

You are right, there is a set of assumptions about the remapping of the
direct map pages that make it all work, at least on x86.
But this is very subtle and it's not easy to wrap one's head around
this.

That's why putting __kernel_map_pages() out of "common" use and
keeping it only for DEBUG_PAGEALLOC would make things clearer.

> Beyond whatever you are seeing, for the latter case of new things
> getting introduced to an interface with hidden dependencies... Another
> edge case could be a new caller to set_memory_np() could result in
> large NP pages. None of the callers today should cause this AFAICT, but
> it's not great to rely on the callers to know these details.
 
A caller of set_memory_*() or set_direct_map_*() should expect a failure
and be ready for that. So adding a WARN to safe_copy_page() is the first
step in that direction :)
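
Concretely, the direction is to make hibernation use the set_direct_map
API and warn on failure, along these lines (a sketch of patch 2; the
details in the patch may differ):

	static void hibernate_map_page(struct page *page, int enable)
	{
		if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
			unsigned long addr = (unsigned long)page_address(page);
			int ret;

			if (enable)
				ret = set_direct_map_default_noflush(page);
			else
				ret = set_direct_map_invalid_noflush(page);

			/* the failure is no longer silent */
			if (WARN_ON(ret))
				return;

			flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
		} else {
			debug_pagealloc_map_pages(page, 1, enable);
		}
	}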
David Hildenbrand Oct. 27, 2020, 8:46 a.m. UTC | #5
On 27.10.20 09:38, Mike Rapoport wrote:
> [...]
> 
> A caller of set_memory_*() or set_direct_map_*() should expect a failure
> and be ready for that. So adding a WARN to safe_copy_page() is the first
> step in that direction :)
> 

I am probably missing something important, but why are we 
saving/restoring the content of pages that were explicitly removed from 
the identity mapping such that nobody will access them?

Pages that are not allocated should contain garbage or be zero 
(init_on_free). That should be easy to handle without ever reading the 
page content.

The other user seems to be vm_remove_mappings(), where we only 
*temporarily* remove the mapping - while hibernating, that code 
shouldn't be active anymore I guess - or we could protect it from happening.

As I expressed in another mail, secretmem pages should rather not be 
saved when hibernating - hibernation should rather be disabled.

What am I missing?
Mike Rapoport Oct. 27, 2020, 9:47 a.m. UTC | #6
On Tue, Oct 27, 2020 at 09:46:35AM +0100, David Hildenbrand wrote:
> [...]
> 
> I am probably missing something important, but why are we saving/restoring
> the content of pages that were explicitly removed from the identity mapping
> such that nobody will access them?
> 
> Pages that are not allocated should contain garbage or be zero
> (init_on_free). That should be easy to handle without ever reading the page
> content.

I'm not familiar enough with hibernation to say anything smart here, but
the help text of DEBUG_PAGEALLOC in Kconfig says:

	... this option cannot be enabled in combination with
	hibernation as that would result in incorrect warnings of memory
	corruption after a resume because free pages are not saved to
	the suspend image.

Probably you are right and free pages need to be handled differently,
but it does not seem the case now.

> The other user seems to be vm_remove_mappings(), where we only *temporarily*
> remove the mapping - while hibernating, that code shouldn't be active
> anymore I guess - or we could protect it from happening.

Hmm, I _think_ vm_remove_mappings() shouldn't be active while
hibernating, but I'm not 100% sure.

> As I expressed in another mail, secretmem pages should rather not be
> saved when hibernating - hibernation should rather be disabled.

Agree.

> What am I missing?

I think I miscommunicated the purpose of this set, which was to hide
__kernel_map_pages() under DEBUG_PAGEALLOC and make hibernation use
set_direct_map_*() explicitly, without a major rework of free pages
handling during hibernation.

Does it help?

David Hildenbrand Oct. 27, 2020, 10:34 a.m. UTC | #7
On 27.10.20 10:47, Mike Rapoport wrote:
> On Tue, Oct 27, 2020 at 09:46:35AM +0100, David Hildenbrand wrote:
>> [...]
>>
>> What am I missing?
> 
> I think I miscommunicated the purpose of this set, which was to hide
> __kernel_map_pages() under DEBUG_PAGEALLOC and make hibernation use
> set_direct_map_*() explicitly, without a major rework of free pages
> handling during hibernation.
> 
> Does it help?
> 

Heh, as always, once you touch questionable code, people will beg for 
proper cleanups instead :)
Mike Rapoport Oct. 28, 2020, 11:09 a.m. UTC | #8
On Tue, Oct 27, 2020 at 09:46:35AM +0100, David Hildenbrand wrote:
> [...]
> 
> I am probably missing something important, but why are we saving/restoring
> the content of pages that were explicitly removed from the identity mapping
> such that nobody will access them?

Actually, we should not be saving/restoring free pages during
hibernation as there are several calls to mark_free_pages() that should
exclude the free pages from the snapshot. I've tried to find why the fix
that maps/unmaps a page to save it was required in the first place, but
I could not find bug reports.

The closest I've got is an email from Rafael that asked to update
"hibernate: handle DEBUG_PAGEALLOC" patch:

https://lore.kernel.org/linux-pm/200802200133.44098.rjw@sisk.pl/

Could it be that safe_copy_page() tries to workaround a non-existent
problem?
David Hildenbrand Oct. 28, 2020, 11:17 a.m. UTC | #9
On 28.10.20 12:09, Mike Rapoport wrote:
> On Tue, Oct 27, 2020 at 09:46:35AM +0100, David Hildenbrand wrote:
>> [...]
>>
>> I am probably missing something important, but why are we saving/restoring
>> the content of pages that were explicitly removed from the identity mapping
>> such that nobody will access them?
> 
> Actually, we should not be saving/restoring free pages during
> hibernation as there are several calls to mark_free_pages() that should
> exclude the free pages from the snapshot. I've tried to find why the fix
> that maps/unmaps a page to save it was required in the first place, but
> I could not find bug reports.
> 
> The closest I've got is an email from Rafael that asked to update
> "hibernate: handle DEBUG_PAGEALLOC" patch:
> 
> https://lore.kernel.org/linux-pm/200802200133.44098.rjw@sisk.pl/
> 
> Could it be that safe_copy_page() tries to workaround a non-existent
> problem?
> 

Could be! Also see

https://lkml.kernel.org/r/38de5bb0-5559-d069-0ce0-daec66ef2746@suse.cz

which restores free page content based on kernel parameters, not 
based on the original content.
Will Deacon Oct. 28, 2020, 11:20 a.m. UTC | #10
On Tue, Oct 27, 2020 at 10:38:16AM +0200, Mike Rapoport wrote:
> [...]
> 
> Oh, I meant an x86 kernel *without* DEBUG_PAGEALLOC, and indeed the call
> that unmaps the pages again in safe_copy_page() will just reset a 4K
> page to NP, because whatever made it NP in the first place already did
> the split.
> 
> Still, on arm64 with DEBUG_PAGEALLOC=n there is a possibility of a race
> between the map/unmap dance in __vunmap() and safe_copy_page() that may
> cause access to unmapped memory:
> 
> __vunmap()
>     vm_remove_mappings()
>         set_direct_map_invalid()
> 					safe_copy_page()
> 					    __kernel_map_pages()
> 					    	return
> 					    do_copy_page() -> fault
> 
> This is a theoretical bug, but it is still not nice :)

Just to clarify: this patch series fixes this problem, right?

Will
Mike Rapoport Oct. 28, 2020, 11:30 a.m. UTC | #11
On Wed, Oct 28, 2020 at 11:20:12AM +0000, Will Deacon wrote:
> [...]
> 
> Just to clarify: this patch series fixes this problem, right?

Yes.

Mike Rapoport Oct. 28, 2020, 12:22 p.m. UTC | #12
On Wed, Oct 28, 2020 at 12:17:35PM +0100, David Hildenbrand wrote:
> On 28.10.20 12:09, Mike Rapoport wrote:
> > [...]
> > 
> > Could it be that safe_copy_page() tries to workaround a non-existent
> > problem?
> > 
> 
> Could be! Also see
> 
> https://lkml.kernel.org/r/38de5bb0-5559-d069-0ce0-daec66ef2746@suse.cz
> 
> which restores free page content based on kernel parameters, not based
> on the original content.

Ah, after looking at it now, I've run a kernel with DEBUG_PAGEALLOC=y and
CONFIG_INIT_ON_FREE_DEFAULT_ON=y, and the restore crashed nicely.

[   27.210093] PM: Image successfully loaded
[   27.226709] Disabling non-boot CPUs ...                                      
[   27.231208] smpboot: CPU 1 is now offline                                    
[   27.363926] kvm-clock: cpu 0, msr 5c889001, primary cpu clock, resume        
[   27.363995] BUG: unable to handle page fault for address: ffff9f7a40108000   
[   27.367996] #PF: supervisor write access in kernel mode                      
[   27.369558] #PF: error_code(0x0002) - not-present page                       
[   27.371098] PGD 5ca01067 P4D 5ca01067 PUD 5ca02067 PMD 5ca03067 PTE 800fffffffef7060
[   27.373421] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC PTI                          
[   27.374905] CPU: 0 PID: 1200 Comm: bash Not tainted 5.10.0-rc1 #5            
[   27.376700] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14
.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014                                 
[   27.379879] RIP: 0010:clear_page_rep+0x7/0x10          
[   27.381218] Code: e8 be 88 75 00 44 89 e2 48 89 ee 48 89 df e8 60 ff ff ff c6 03 00 5b 5d 41 5c c3 cc cc cc cc cc cc cc cc b9 00 02 00 00 31 c0 <f3> 48 ab c3 0f 1f 44 00 00 31 c0 b9 40 00 00 00 66 0f 1f 84 00 00
[   27.386457] RSP: 0018:ffffb6838046be08 EFLAGS: 00010046                      
[   27.388011] RAX: 0000000000000000 RBX: ffff9f7a487c0ec0 RCX: 0000000000000200
[   27.390082] RDX: ffff9f7a4c788000 RSI: 0000000000000000 RDI: ffff9f7a40108000
[   27.392138] RBP: ffffffff8629c860 R08: 0000000000000000 R09: 0000000000000007
[   27.394205] R10: 0000000000000004 R11: ffffb6838046bbf8 R12: 0000000000000000
[   27.396271] R13: ffff9f7a419a62a0 R14: 0000000000000005 R15: ffff9f7a484f4da0
[   27.398334] FS:  00007fe0c3f6a700(0000) GS:ffff9f7abf800000(0000) knlGS:0000000000000000                                                                     
[   27.400717] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033                
[   27.402432] CR2: ffff9f7a40108000 CR3: 000000000859a001 CR4: 0000000000060ef0
[   27.404485] Call Trace:                                                      
[   27.405326]  clear_free_pages+0xf5/0x150                                     
[   27.406568]  hibernation_snapshot+0x390/0x3d0                                
[   27.407908]  hibernate+0xdb/0x240                                            
[   27.408978]  state_store+0xd7/0xe0                                           
[   27.410078]  kernfs_fop_write+0x10e/0x1a0                                    
[   27.411333]  vfs_write+0xbb/0x210                                            
[   27.412423]  ksys_write+0x9c/0xd0                      
[   27.413488]  do_syscall_64+0x33/0x40                                         
[   27.414636]  entry_SYSCALL_64_after_hwframe+0x44/0xa9                        
[   27.416150] RIP: 0033:0x7fe0c364e380                                         
 66 0f 1f 44 00 00 83 3d c9 23 2d 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 fe dd 01 00 48 89 04 24
[   27.422500] RSP: 002b:00007ffeb64bd0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   27.424724] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fe0c364e380
[   27.426761] RDX: 0000000000000005 RSI: 0000000001eb6408 RDI: 0000000000000001
[   27.428791] RBP: 0000000001eb6408 R08: 00007fe0c391d780 R09: 00007fe0c3f6a700
[   27.430863] R10: 0000000000000004 R11: 0000000000000246 R12: 0000000000000005
[   27.432920] R13: 0000000000000001 R14: 00007fe0c391c620 R15: 0000000000000000
[   27.434989] Modules linked in:
[   27.436004] CR2: ffff9f7a40108000
[   27.437075] ---[ end trace 424c466bcd2bfcad ]---
Edgecombe, Rick P Oct. 28, 2020, 6:31 p.m. UTC | #13
On Wed, 2020-10-28 at 13:09 +0200, Mike Rapoport wrote:
> [...]
> 
> Actually, we should not be saving/restoring free pages during
> hibernation as there are several calls to mark_free_pages() that should
> exclude the free pages from the snapshot. I've tried to find why the fix
> that maps/unmaps a page to save it was required in the first place, but
> I could not find bug reports.
> 
> The closest I've got is an email from Rafael that asked to update
> "hibernate: handle DEBUG_PAGEALLOC" patch:
> 
> https://lore.kernel.org/linux-pm/200802200133.44098.rjw@sisk.pl/
> 
> Could it be that safe_copy_page() tries to workaround a non-existent
> problem?

It looks like inside page_alloc.c it unmaps the page before it actually
frees it, so to hibernate it could look like the page is still
allocated even though it's unmapped? Maybe that small window is what it
cared about initially.

There is also now the vmalloc case, which I am actually working on
expanding. So I think the re-mapping logic is needed.
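
(Roughly, the window in question -- a simplified sketch of the free
path; the placement of the unmap call is approximate:)

	static bool free_pages_prepare(struct page *page, unsigned int order)
	{
		/* ... sanity checks, poisoning, etc. ... */

		/*
		 * The page is unmapped here but only lands on the free list
		 * afterwards; mark_free_pages() skips pages that are already
		 * on the free lists, so in this window the page looks
		 * allocated to hibernate while being unmapped.
		 */
		kernel_map_pages(page, 1 << order, 0);
		return true;
	}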
Edgecombe, Rick P Oct. 28, 2020, 9:03 p.m. UTC | #14
On Wed, 2020-10-28 at 13:30 +0200, Mike Rapoport wrote:
> On Wed, Oct 28, 2020 at 11:20:12AM +0000, Will Deacon wrote:
> > [...]
> > 
> > Just to clarify: this patch series fixes this problem, right?
> 
> Yes.
> 

Well, now I'm confused again.

As David pointed out, __vunmap() should not be executing simultaneously
with the hibernate operation because hibernate can't snapshot while
data it needs to save is still updating. If a thread was paused when a
page was in an "invalid" state, it should be remapped by hibernate
before the copy.

To level set, before reading this mail, my takeaways from the
discussions on potential hibernate/debug page alloc problems were:

Potential RISC-V issue:
Doesn't have hibernate support

Potential ARM issue:
The logic around when its cpa determines pages might be unmapped looks
correct for current callers.

Potential x86 page break issue:
Seems to be ok for now, but a new set_memory_np() caller could violate
assumptions in hibernate.

Non-obvious thorny logic: 
General agreement it would be good to separate dependencies.

Behavior of V1 of this patchset:
No functional change other than addition of a warn in hibernate.


So "does this fix the problem", "yes" leaves me a bit confused... Not
saying there couldn't be any problems, especially due to the thorniness
and cross arch stride, but what is it exactly and how does this series
fix it?
Mike Rapoport Oct. 29, 2020, 8:12 a.m. UTC | #15
On Wed, Oct 28, 2020 at 09:03:31PM +0000, Edgecombe, Rick P wrote:

> > On Wed, Oct 28, 2020 at 11:20:12AM +0000, Will Deacon wrote:
> > > On Tue, Oct 27, 2020 at 10:38:16AM +0200, Mike Rapoport wrote:
> > > > [...]
> > > > 
> > > > This is a theoretical bug, but it is still not nice :)
> > > 
> > > Just to clarify: this patch series fixes this problem, right?
> > 
> > Yes.
> > 
> 
> Well, now I'm confused again.
> 
> As David pointed, __vunmap() should not be executing simultaneously
> with the hibernate operation because hibernate can't snapshot while
> data it needs to save is still updating. If a thread was paused when a
> page was in an "invalid" state, it should be remapped by hibernate
> before the copy.
> 
> To level set, before reading this mail, my takeaways from the
> discussions on potential hibernate/debug page alloc problems were:
> 
> Potential RISC-V issue:
> Doesn't have hibernate support
> 
> Potential ARM issue:
> The logic around when it's cpa determines pages might be unmapped looks
> correct for current callers.
> 
> Potential x86 page break issue:
> Seems to be ok for now, but a new set_memory_np() caller could violate
> assumptions in hibernate.
> 
> Non-obvious thorny logic: 
> General agreement it would be good to separate dependencies.
> 
> Behavior of V1 of this patchset:
> No functional change other than addition of a warn in hibernate.

There is a change that adds explicit use of set_direct_map() to
hibernate. Currently, in case of arm64 with DEBUG_PAGEALLOC=n, if a
thread was paused when a page was in an "invalid" state, hibernate will
access unmapped data because __kernel_map_pages() will bail out.
After the change, set_direct_map_default_noflush() will be used and the
page will get mapped before the copy.

> So "does this fix the problem", "yes" leaves me a bit confused... Not
> saying there couldn't be any problems, especially due to the thorniness
> and cross arch stride, but what is it exactly and how does this series
> fix it?

This series' goal was primarily to separate dependencies and make it
clearer what DEBUG_PAGEALLOC and SET_DIRECT_MAP are. As it turned
out, there is also some lack of consistency between architectures that
implement either of them, so I tried to improve this as well.

Honestly, I don't know if a thread can be paused at the time __vunmap()
has left invalid pages, but it could; there is an issue on arm64 with
DEBUG_PAGEALLOC=n, and this set fixes it.

__vunmap()
    vm_remove_mappings()
        set_direct_map_invalid()
	/* thread is frozen */
					safe_copy_page()
					    __kernel_map_pages()
						if (!debug_pagealloc())
					    	    return
					    do_copy_page() -> fault
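
After this set, the same schedule becomes (a sketch of the flow with
patch 2 applied; helper naming approximate):

__vunmap()
    vm_remove_mappings()
        set_direct_map_invalid()
	/* thread is frozen */
					safe_copy_page()
					    hibernate_map_page(page, 1)
						set_direct_map_default_noflush()
					    do_copy_page()	/* no fault */
					    hibernate_map_page(page, 0)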
David Hildenbrand Oct. 29, 2020, 8:15 a.m. UTC | #16
On 25.10.20 11:15, Mike Rapoport wrote:
> [...]
> 
> This set straightens this out by restoring the dependency of
> __kernel_map_pages() on DEBUG_PAGEALLOC and updating the call sites
> accordingly.
> 

So, I was primarily wondering if we really have to touch direct mappings 
in hibernation code, or if we can avoid doing that. I was wondering if 
we cannot simply do something like kmap() when trying to access a 
!mapped page. Similar to reading old-os memory after kexec when in 
kdump. Just a thought.
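
(A sketch of that idea -- copying through a temporary mapping instead of
editing the direct map; illustrative only, not code from this series:)

	static void copy_unmapped_page(void *dst, struct page *s_page)
	{
		/* map the page at a temporary kernel virtual address */
		void *src = vmap(&s_page, 1, VM_MAP, PAGE_KERNEL_RO);

		if (!WARN_ON(!src)) {
			copy_page(dst, src);
			vunmap(src);
		}
	}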
Edgecombe, Rick P Oct. 29, 2020, 11:19 p.m. UTC | #17
On Thu, 2020-10-29 at 10:12 +0200, Mike Rapoport wrote:
> This series' goal was primarily to separate dependencies and make it
> clearer what DEBUG_PAGEALLOC and SET_DIRECT_MAP are. As it turned
> out, there is also some lack of consistency between architectures that
> implement either of them, so I tried to improve this as well.
> 
> Honestly, I don't know if a thread can be paused at the time __vunmap()
> has left invalid pages, but it could; there is an issue on arm64 with
> DEBUG_PAGEALLOC=n, and this set fixes it.

Ah, ok. So from this and the other thread, this is about the logic in
arm's cpa for when it will try the un/map operations. I think the logic
actually works currently. And this series introduces a problem on ARM
similar to the one you are saying preexists. I put the details in the
other thread.