
powerpc/powernv: Increase memory block size to 1GB on radix

Message ID 20170907050551.4632-1-anton@ozlabs.org (mailing list archive)
State Accepted
Commit 53ecde0b9126ff140abe3aefd7f0ec64d6fa36b0
Series powerpc/powernv: Increase memory block size to 1GB on radix

Commit Message

Anton Blanchard Sept. 7, 2017, 5:05 a.m. UTC
From: Anton Blanchard <anton@samba.org>

Memory hot unplug on PowerNV radix hosts is broken. Our memory block
size is 256MB but since we map the linear region with very large pages,
each pte we tear down maps 1GB.

A hot unplug of one 256MB memory block results in 768MB of memory
getting unintentionally unmapped. At this point we are likely to oops.

Fix this by increasing our memory block size to 1GB on PowerNV radix
hosts.

Signed-off-by: Anton Blanchard <anton@samba.org>
---
 arch/powerpc/platforms/powernv/setup.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
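
The 768MB figure in the commit message follows from simple arithmetic, sketched below in C. This is only an illustration, not kernel code: the helper name collateral_unmapped() and the SZ_* macros are made up for the example, and it assumes the whole pte covering the removed block is torn down.

```c
#include <assert.h>

#define SZ_256M (256UL * 1024 * 1024)
#define SZ_1G   (1UL * 1024 * 1024 * 1024)

/*
 * Hypothetical helper: bytes that lose their mapping in addition to the
 * block being removed, assuming every pte that covers the block is torn
 * down as a whole.
 */
static unsigned long collateral_unmapped(unsigned long block_size,
					 unsigned long pte_size)
{
	unsigned long torn_down;

	assert(block_size > 0 && pte_size > 0);

	/* Round the block up to a whole number of ptes. */
	torn_down = ((block_size + pte_size - 1) / pte_size) * pte_size;

	return torn_down - block_size;
}
```

With the sizes from the commit message, collateral_unmapped(SZ_256M, SZ_1G) gives 768MB, while a 1GB block on a 1GB pte gives 0, which is the case the patch moves to.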

Comments

Aneesh Kumar K.V Sept. 7, 2017, 5:09 a.m. UTC | #1
On 09/07/2017 10:35 AM, Anton Blanchard wrote:
> From: Anton Blanchard <anton@samba.org>
> 
> Memory hot unplug on PowerNV radix hosts is broken. Our memory block
> size is 256MB but since we map the linear region with very large pages,
> each pte we tear down maps 1GB.
> 
> A hot unplug of one 256MB memory block results in 768MB of memory
> getting unintentionally unmapped. At this point we are likely to oops.
> 
> Fix this by increasing our memory block size to 1GB on PowerNV radix
> hosts.
> 
> Signed-off-by: Anton Blanchard <anton@samba.org>
> ---
>   arch/powerpc/platforms/powernv/setup.c | 10 +++++++++-
>   1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
> index 897aa1400eb8..bbb73aa0eb8f 100644
> --- a/arch/powerpc/platforms/powernv/setup.c
> +++ b/arch/powerpc/platforms/powernv/setup.c
> @@ -272,7 +272,15 @@ static void pnv_kexec_cpu_down(int crash_shutdown, int secondary)
>   #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
>   static unsigned long pnv_memory_block_size(void)
>   {
> -	return 256UL * 1024 * 1024;
> +	/*
> +	 * We map the kernel linear region with 1GB large pages on radix. For
> +	 * memory hot unplug to work our memory block size must be at least
> +	 * this size.
> +	 */
> +	if (radix_enabled())
> +		return 1UL * 1024 * 1024 * 1024;
> +	else
> +		return 256UL * 1024 * 1024;
>   }
>   #endif
> 

There is a similar issue being worked on w.r.t pseries.

https://lkml.kernel.org/r/1502357028-27465-1-git-send-email-bharata@linux.vnet.ibm.com

The question is: should we map these regions at all? That is, we need to
tell the kernel which memory regions we would like to hot unplug later,
so that it avoids doing kernel allocations from them. If we do that, then
we can possibly map them with 2M pages?

-aneesh
Anton Blanchard Sept. 7, 2017, 5:17 a.m. UTC | #2
Hi,

> There is a similar issue being worked on w.r.t pseries.
> 
> https://lkml.kernel.org/r/1502357028-27465-1-git-send-email-bharata@linux.vnet.ibm.com
> 
> The question is should we map these regions ? ie, we need to tell the 
> kernel memory region that we would like to hot unplug later so that
> we avoid doing kernel allocations from that. If we do that, then we
> can possibly map them via 2M size ?

But all of memory on PowerNV should be able to be hot unplugged, so
there are two options as I see it - either increase the memory block
size, or map everything with 2MB pages. 

Anton
Benjamin Herrenschmidt Sept. 7, 2017, 7:21 a.m. UTC | #3
On Thu, 2017-09-07 at 15:17 +1000, Anton Blanchard wrote:
> Hi,
> 
> > There is a similar issue being worked on w.r.t pseries.
> > 
> > https://lkml.kernel.org/r/1502357028-27465-1-git-send-email-bharata@linux.vnet.ibm.com
> > 
> > The question is should we map these regions ? ie, we need to tell the 
> > kernel memory region that we would like to hot unplug later so that
> > we avoid doing kernel allocations from that. If we do that, then we
> > can possibly map them via 2M size ?
> 
> But all of memory on PowerNV should be able to be hot unplugged, so
> there are two options as I see it - either increase the memory block
> size, or map everything with 2MB pages. 

Or be smarter and map with 1G when blocks of 1G are available and break
down to 2M where necessary, it shouldn't be too hard.

Cheers,
Ben.
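
A rough userspace sketch of the mixed-size idea above, in C: use a 1G mapping wherever an aligned 1G range is available and fall back to 2M otherwise. The names pick_mapping_size() and count_mappings() are hypothetical and are not taken from the kernel, and page sizes below 2M are not modelled.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_2M (2ULL << 20)
#define SZ_1G (1ULL << 30)

/*
 * Hypothetical helper: largest page size usable at addr without running
 * past end. Returns 0 if neither 1G nor 2M fits (not modelled here).
 */
static uint64_t pick_mapping_size(uint64_t addr, uint64_t end)
{
	if ((addr % SZ_1G) == 0 && end - addr >= SZ_1G)
		return SZ_1G;
	if ((addr % SZ_2M) == 0 && end - addr >= SZ_2M)
		return SZ_2M;
	return 0;
}

/* Count the 1G and 2M mappings needed to cover [start, end). */
static int count_mappings(uint64_t start, uint64_t end,
			  uint64_t *nr_1g, uint64_t *nr_2m)
{
	uint64_t addr = start;

	*nr_1g = 0;
	*nr_2m = 0;

	while (addr < end) {
		uint64_t size = pick_mapping_size(addr, end);

		if (size == 0)
			return -1;	/* range is not 2M aligned */
		if (size == SZ_1G)
			(*nr_1g)++;
		else
			(*nr_2m)++;
		addr += size;
	}
	return 0;
}
```

For a range from 256MB to 2GB + 256MB, this yields one 1G mapping (for 1GB to 2GB) and 512 2M mappings (384 below the 1GB boundary, 128 above the 2GB boundary).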
Reza Arbab Sept. 7, 2017, 3:59 p.m. UTC | #4
On Thu, Sep 07, 2017 at 05:17:41AM +0000, Anton Blanchard wrote:
>But all of memory on PowerNV should be able to be hot unplugged, so
>there are two options as I see it - either increase the memory block
>size, or map everything with 2MB pages.

I may be misunderstanding this, but what if we did something like x86 
does? When trying to unplug a region smaller than the mapping, they fill 
that part of the pagetable with 0xFD instead of freeing the whole thing.  
Once the whole thing is 0xFD, free it.

See arch/x86/mm/init_64.c:remove_{pte,pmd,pud}_table()

---%<---
	memset((void *)addr, PAGE_INUSE, next - addr);

	page_addr = page_address(pte_page(*pte));
	if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
		...
		pte_clear(&init_mm, addr, pte);
		...
	}
---%<---
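
The fill-then-check pattern in the x86 excerpt can be modelled in userspace roughly as follows. This is a sketch under assumptions: first_mismatch() is a simplified stand-in for the kernel's memchr_inv(), mark_removed() and SIM_PAGE_SIZE are names invented for the example, and the byte buffer merely stands in for the page being tracked.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_PAGE_SIZE 4096	/* assumed page size for the simulation */
#define PAGE_INUSE    0xFD	/* marker byte, as in the x86 code */

/*
 * Simplified stand-in for memchr_inv(): return a pointer to the first
 * byte that differs from c, or NULL if all n bytes equal c.
 */
static const void *first_mismatch(const void *buf, int c, size_t n)
{
	const unsigned char *p = buf;
	size_t i;

	for (i = 0; i < n; i++)
		if (p[i] != (unsigned char)c)
			return p + i;
	return NULL;
}

/*
 * Mark [off, off + len) of the page as removed. Returns 1 once every
 * byte of the page carries the marker, meaning the page could be freed.
 */
static int mark_removed(unsigned char *page, size_t off, size_t len)
{
	assert(off + len <= SIM_PAGE_SIZE);

	memset(page + off, PAGE_INUSE, len);
	return first_mismatch(page, PAGE_INUSE, SIM_PAGE_SIZE) == NULL;
}
```

Removing the first quarter of a zeroed page returns 0 (the page is still partly in use); removing the remaining three quarters returns 1.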
Anton Blanchard Sept. 8, 2017, 1:15 a.m. UTC | #5
Hi Reza,

> I may be misunderstanding this, but what if we did something like x86 
> does? When trying to unplug a region smaller than the mapping, they
> fill that part of the pagetable with 0xFD instead of freeing the
> whole thing. Once the whole thing is 0xFD, free it.
> 
> See arch/x86/mm/init_64.c:remove_{pte,pmd,pud}_table()
> 
> ---%<---
> 	memset((void *)addr, PAGE_INUSE, next - addr);
> 
> 	page_addr = page_address(pte_page(*pte));
> 	if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
> 		...
> 		pte_clear(&init_mm, addr, pte);
> 		...
> 	}
> ---%<---

But you only have 1GB ptes at this point, you'd need to start
instantiating a new level in the tree, and populate 2MB ptes.

That is what Ben is suggesting. I'm happy to go any way (fix hotplug
to handle this, or increase the memblock size on PowerNV to 1GB), I just
need a solution.

Anton
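
To make the "new level" point concrete, here is a small userspace model in C of splitting one 1GB leaf into 512 2MB entries and then clearing only the entries that cover an unplugged 256MB block. The struct leaf_2m type and the functions split_1g() and unmap_range() are hypothetical and greatly simplified; real page table code would also have to deal with attributes, TLB invalidation and locking.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_2M       (2ULL << 20)
#define SZ_1G       (1ULL << 30)
#define PTES_PER_1G (SZ_1G / SZ_2M)	/* 512 entries */

/* Hypothetical, simplified 2MB leaf entry. */
struct leaf_2m {
	uint64_t phys;
	int present;
};

/* Replace one 1GB mapping at base with 512 equivalent 2MB entries. */
static void split_1g(uint64_t base, struct leaf_2m *table)
{
	uint64_t i;

	for (i = 0; i < PTES_PER_1G; i++) {
		table[i].phys = base + i * SZ_2M;
		table[i].present = 1;
	}
}

/*
 * Clear the entries covering [start, start + len), which must be 2MB
 * aligned and lie inside the 1GB region. Returns how many entries are
 * still present afterwards.
 */
static unsigned int unmap_range(struct leaf_2m *table, uint64_t base,
				uint64_t start, uint64_t len)
{
	uint64_t first = (start - base) / SZ_2M;
	uint64_t last = (start + len - base) / SZ_2M;
	unsigned int remaining = 0;
	uint64_t i;

	assert(start >= base && start + len <= base + SZ_1G);
	assert(start % SZ_2M == 0 && len % SZ_2M == 0);

	for (i = first; i < last; i++)
		table[i].present = 0;

	for (i = 0; i < PTES_PER_1G; i++)
		remaining += table[i].present;

	return remaining;
}
```

Unplugging the second 256MB block of a 1GB region clears entries 128 to 255 and leaves 384 entries, that is the other 768MB, still mapped.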
Balbir Singh Sept. 8, 2017, 9:51 p.m. UTC | #6
On Thu, Sep 7, 2017 at 5:21 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Thu, 2017-09-07 at 15:17 +1000, Anton Blanchard wrote:
>> Hi,
>>
>> > There is a similar issue being worked on w.r.t pseries.
>> >
>> > https://lkml.kernel.org/r/1502357028-27465-1-git-send-email-bharata@linux.vnet.ibm.com
>> >
>> > The question is should we map these regions ? ie, we need to tell the
>> > kernel memory region that we would like to hot unplug later so that
>> > we avoid doing kernel allocations from that. If we do that, then we
>> > can possibly map them via 2M size ?
>>
>> But all of memory on PowerNV should be able to be hot unplugged, so

For this ideally we need movable mappings for the regions we intend
to hot-unplug - no? Otherwise, there is no guarantee that hot-unplug
will work

>> there are two options as I see it - either increase the memory block
>> size, or map everything with 2MB pages.
>
> Or be smarter and map with 1G when blocks of 1G are available and break
> down to 2M where necessary, it shouldn't be too hard.
>

strict_rwx patches added helpers to do this

Balbir Singh.
Michael Ellerman Sept. 9, 2017, 9:30 p.m. UTC | #7
We should do the 1G block size as a fix, and backport it, and then make the hot unplug code smarter.

cheers

On 8 September 2017 11:15:47 am AEST, Anton Blanchard <anton@ozlabs.org> wrote:
>Hi Reza,
>
>> I may be misunderstanding this, but what if we did something like x86
>> does? When trying to unplug a region smaller than the mapping, they
>> fill that part of the pagetable with 0xFD instead of freeing the
>> whole thing. Once the whole thing is 0xFD, free it.
>> 
>> See arch/x86/mm/init_64.c:remove_{pte,pmd,pud}_table()
>> 
>> ---%<---
>> 	memset((void *)addr, PAGE_INUSE, next - addr);
>> 
>> 	page_addr = page_address(pte_page(*pte));
>> 	if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
>> 		...
>> 		pte_clear(&init_mm, addr, pte);
>> 		...
>> 	}
>> ---%<---
>
>But you only have 1GB ptes at this point, you'd need to start
>instantiating a new level in the tree, and populate 2MB ptes.
>
>That is what Ben is suggesting. I'm happy to go any way (fix hotplug
>to handle this, or increase the memblock size on PowerNV to 1GB), I just
>need a solution.
>
>Anton
Michael Ellerman Oct. 6, 2017, 11:10 a.m. UTC | #8
On Thu, 2017-09-07 at 05:05:51 UTC, Anton Blanchard wrote:
> From: Anton Blanchard <anton@samba.org>
> 
> Memory hot unplug on PowerNV radix hosts is broken. Our memory block
> size is 256MB but since we map the linear region with very large pages,
> each pte we tear down maps 1GB.
> 
> A hot unplug of one 256MB memory block results in 768MB of memory
> getting unintentionally unmapped. At this point we are likely to oops.
> 
> Fix this by increasing our memory block size to 1GB on PowerNV radix
> hosts.
> 
> Signed-off-by: Anton Blanchard <anton@samba.org>

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/53ecde0b9126ff140abe3aefd7f0ec

cheers

Patch

diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index 897aa1400eb8..bbb73aa0eb8f 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -272,7 +272,15 @@  static void pnv_kexec_cpu_down(int crash_shutdown, int secondary)
 #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
 static unsigned long pnv_memory_block_size(void)
 {
-	return 256UL * 1024 * 1024;
+	/*
+	 * We map the kernel linear region with 1GB large pages on radix. For
+	 * memory hot unplug to work our memory block size must be at least
+	 * this size.
+	 */
+	if (radix_enabled())
+		return 1UL * 1024 * 1024 * 1024;
+	else
+		return 256UL * 1024 * 1024;
 }
 #endif