[FIX,v0] powerpc: Fix memory unplug failure on radix guest

Message ID 1502357028-27465-1-git-send-email-bharata@linux.vnet.ibm.com
State New

Commit Message

Bharata B Rao Aug. 10, 2017, 9:23 a.m. UTC
For a PowerKVM guest, it is possible to specify a DIMM device in
addition to the system RAM at boot time. When such a cold plugged DIMM
device is removed from a radix guest, we hit the following warning in the
guest kernel resulting in the eventual failure of memory unplug:

remove_pud_table: unaligned range
WARNING: CPU: 3 PID: 164 at arch/powerpc/mm/pgtable-radix.c:597 remove_pagetable+0x468/0xca0
Call Trace:
remove_pagetable+0x464/0xca0 (unreliable)
radix__remove_section_mapping+0x24/0x40
remove_section_mapping+0x28/0x60
arch_remove_memory+0xcc/0x120
remove_memory+0x1ac/0x270
dlpar_remove_lmb+0x1ac/0x210
dlpar_memory+0xbc4/0xeb0
pseries_hp_work_fn+0x1a4/0x230
process_one_work+0x1cc/0x660
worker_thread+0xac/0x6d0
kthread+0x16c/0x1b0
ret_from_kernel_thread+0x5c/0x74

The DIMM memory that is cold plugged gets merged to the same memblock
region as RAM and hence gets mapped at 1G alignment. However since the
removal is done for one LMB (lmb size 256MB) at a time, the address
of the LMB (which is 256MB aligned) would get flagged as unaligned
in remove_pud_table() resulting in the above failure.
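
The alignment mismatch can be illustrated with a small standalone sketch.
This is a simplified model and not the kernel code: is_aligned() and
can_remove_1g_leaf() are hypothetical helpers, and the assumption is that
a 1G leaf mapping can be torn down only when the range being removed
starts on a 1G boundary and covers whole multiples of 1G.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_256M (256ULL << 20)	/* LMB size used in the report above */
#define SZ_1G   (1ULL << 30)	/* size of one 1G leaf mapping */

/* Hypothetical helper: true if addr is a multiple of size (a power of two). */
static bool is_aligned(uint64_t addr, uint64_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return (addr & (size - 1)) == 0;
}

/*
 * Hypothetical model of the check that fails: a range can release a 1G
 * leaf only if it starts on a 1G boundary and spans whole 1G units.
 */
static bool can_remove_1g_leaf(uint64_t start, uint64_t len)
{
	return len != 0 && is_aligned(start, SZ_1G) && is_aligned(len, SZ_1G);
}
```

An LMB at 0x50000000 is 256MB aligned but not 1G aligned, and even an LMB
that starts on a 1G boundary covers only a quarter of the 1G mapping, so
in this model a single-LMB removal never satisfies the check.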

This problem is not seen for hot plugged memory because for the
hot plugged memory, the mappings are created separately for each
LMB and hence they all get aligned at 256MB.

To fix this problem for the cold plugged memory, let us mark the
cold plugged memblock region explicitly as HOTPLUGGED so that the
region doesn't get merged with RAM. All the memory that is discovered
via ibm,dynamic-reconfiguration-memory is marked so (1). Next, identify
such regions in radix_init_pgtable() and create separate mappings
within that region for each LMB so that they don't get aligned
at 1G like the RAM region (2).

(1) For PowerKVM guests, all boot time memory is represented via
memory@XXXX nodes and hot plugged/pluggable memory is represented via
the ibm,dynamic-reconfiguration-memory node. We are marking all
hotplugged memory that is in ASSIGNED state during boot as HOTPLUGGED.
With this, only cold plugged memory gets marked for PowerKVM, but we
need to check how this will affect PowerVM guests.

(2) To create separate mappings for every LMB in the hot plugged
region, we need the LMB size. I am currently using the
memory_block_size_bytes() API to get it. Since this is early init time
code, the machine type isn't probed yet and hence
memory_block_size_bytes() returns the default size of 16MB. Hence we end
up creating separate mappings at a much finer granularity than we
ideally could for a pseries machine.
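
To put a number on the granularity cost, here is a rough sketch that
counts how many separate mapping calls a region needs for a given block
size. mappings_needed() is a hypothetical helper, and the 4GB region size
used below is only an illustrative assumption.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of block_size-sized chunks needed to cover
 * region_size bytes, rounding up for a partial trailing chunk.
 */
static uint64_t mappings_needed(uint64_t region_size, uint64_t block_size)
{
	assert(block_size != 0);
	return region_size / block_size + (region_size % block_size != 0);
}
```

For a 4GB cold plugged region, mappings_needed(4ULL << 30, 16ULL << 20)
gives 256 calls with the 16MB default, while
mappings_needed(4ULL << 30, 256ULL << 20) gives 16 with the 256MB LMB size.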

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/prom.c      |  1 +
 arch/powerpc/mm/pgtable-radix.c | 17 ++++++++++++++---
 2 files changed, 15 insertions(+), 3 deletions(-)

Comments

Reza Arbab Aug. 10, 2017, 4:50 p.m. UTC | #1
On Thu, Aug 10, 2017 at 02:53:48PM +0530, Bharata B Rao wrote:
>diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
>index f830562..24ecf53 100644
>--- a/arch/powerpc/kernel/prom.c
>+++ b/arch/powerpc/kernel/prom.c
>@@ -524,6 +524,7 @@ static int __init early_init_dt_scan_drconf_memory(unsigned long node)
> 					size = 0x80000000ul - base;
> 			}
> 			memblock_add(base, size);
>+			memblock_mark_hotplug(base, size);
> 		} while (--rngs);
> 	}
> 	memblock_dump_all();

Doing this has the effect of putting all the affected memory into 
ZONE_MOVABLE. See find_zone_movable_pfns_for_nodes(). This means no 
kernel allocations can occur there. Is that okay?

>diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
>index 671a45d..180d25a 100644
>--- a/arch/powerpc/mm/pgtable-radix.c
>+++ b/arch/powerpc/mm/pgtable-radix.c
>@@ -12,6 +12,7 @@
> #include <linux/memblock.h>
> #include <linux/of_fdt.h>
> #include <linux/mm.h>
>+#include <linux/memory.h>
>
> #include <asm/pgtable.h>
> #include <asm/pgalloc.h>
>@@ -255,15 +256,25 @@ static void __init radix_init_pgtable(void)
> {
> 	unsigned long rts_field;
> 	struct memblock_region *reg;
>+	phys_addr_t addr;
>+	u64 lmb_size = memory_block_size_bytes();
>
> 	/* We don't support slb for radix */
> 	mmu_slb_size = 0;
> 	/*
> 	 * Create the linear mapping, using standard page size for now
> 	 */
>-	for_each_memblock(memory, reg)
>-		WARN_ON(create_physical_mapping(reg->base,
>-						reg->base + reg->size));
>+	for_each_memblock(memory, reg) {
>+		if (memblock_is_hotpluggable(reg)) {
>+			for (addr = reg->base; addr < (reg->base + reg->size);
>+				addr += lmb_size)
>+				WARN_ON(create_physical_mapping(addr,
>+					addr + lmb_size));
>+		} else {
>+			WARN_ON(create_physical_mapping(reg->base,
>+							reg->base + reg->size));
>+		}
>+	}
>
> 	/* Find out how many PID bits are supported */
> 	if (cpu_has_feature(CPU_FTR_HVMODE)) {
>-- 
>2.7.4
>
Reza Arbab Aug. 10, 2017, 8:38 p.m. UTC | #2
On Thu, Aug 10, 2017 at 11:50:19AM -0500, Reza Arbab wrote:
>On Thu, Aug 10, 2017 at 02:53:48PM +0530, Bharata B Rao wrote:
>>diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
>>index f830562..24ecf53 100644
>>--- a/arch/powerpc/kernel/prom.c
>>+++ b/arch/powerpc/kernel/prom.c
>>@@ -524,6 +524,7 @@ static int __init early_init_dt_scan_drconf_memory(unsigned long node)
>>					size = 0x80000000ul - base;
>>			}
>>			memblock_add(base, size);
>>+			memblock_mark_hotplug(base, size);
>>		} while (--rngs);
>>	}
>>	memblock_dump_all();
>
>Doing this has the effect of putting all the affected memory into 
>ZONE_MOVABLE. See find_zone_movable_pfns_for_nodes(). This means no 
>kernel allocations can occur there. Is that okay?

I should clarify. The behavior change I mention applies when 
movable_node_is_enabled().
Aneesh Kumar K.V Aug. 11, 2017, 8:37 a.m. UTC | #3
Reza Arbab <arbab@linux.vnet.ibm.com> writes:

> On Thu, Aug 10, 2017 at 02:53:48PM +0530, Bharata B Rao wrote:
>>diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
>>index f830562..24ecf53 100644
>>--- a/arch/powerpc/kernel/prom.c
>>+++ b/arch/powerpc/kernel/prom.c
>>@@ -524,6 +524,7 @@ static int __init early_init_dt_scan_drconf_memory(unsigned long node)
>> 					size = 0x80000000ul - base;
>> 			}
>> 			memblock_add(base, size);
>>+			memblock_mark_hotplug(base, size);
>> 		} while (--rngs);
>> 	}
>> 	memblock_dump_all();
>
> Doing this has the effect of putting all the affected memory into 
> ZONE_MOVABLE. See find_zone_movable_pfns_for_nodes(). This means no 
> kernel allocations can occur there. Is that okay?
>

So the thinking here is that any memory identified via ibm,dynamic-memory can
be hot removed later. Hence the need to add it in LMB-sized chunks, because our
hotplug framework removes it in LMB-sized chunks. If we want to support
hot unplug, then we will have to make sure kernel allocations don't
happen in that region, right?

With the above, I would consider not marking it hotplug to have been a bug
before?

-aneesh
Aneesh Kumar K.V Aug. 11, 2017, 8:42 a.m. UTC | #4
Bharata B Rao <bharata@linux.vnet.ibm.com> writes:

> For a PowerKVM guest, it is possible to specify a DIMM device in
> addition to the system RAM at boot time. When such a cold plugged DIMM
> device is removed from a radix guest, we hit the following warning in the
> guest kernel resulting in the eventual failure of memory unplug:
>
> remove_pud_table: unaligned range
> WARNING: CPU: 3 PID: 164 at arch/powerpc/mm/pgtable-radix.c:597 remove_pagetable+0x468/0xca0
> Call Trace:
> remove_pagetable+0x464/0xca0 (unreliable)
> radix__remove_section_mapping+0x24/0x40
> remove_section_mapping+0x28/0x60
> arch_remove_memory+0xcc/0x120
> remove_memory+0x1ac/0x270
> dlpar_remove_lmb+0x1ac/0x210
> dlpar_memory+0xbc4/0xeb0
> pseries_hp_work_fn+0x1a4/0x230
> process_one_work+0x1cc/0x660
> worker_thread+0xac/0x6d0
> kthread+0x16c/0x1b0
> ret_from_kernel_thread+0x5c/0x74
>
> The DIMM memory that is cold plugged gets merged to the same memblock
> region as RAM and hence gets mapped at 1G alignment. However since the
> removal is done for one LMB (lmb size 256MB) at a time, the address
> of the LMB (which is 256MB aligned) would get flagged as unaligned
> in remove_pud_table() resulting in the above failure.
>
> This problem is not seen for hot plugged memory because for the
> hot plugged memory, the mappings are created separately for each
> LMB and hence they all get aligned at 256MB.
>
> To fix this problem for the cold plugged memory, let us mark the
> cold plugged memblock region explicitly as HOTPLUGGED so that the
> region doesn't get merged with RAM. All the memory that is discovered
> via ibm,dynamic-memory-configuration is marked so(1). Next identify
> such regions in radix_init_pgtable() and create separate mappings
> within that region for each LMB so that they get don't get aligned
> like RAM region at 1G (2).
>
> (1) For PowerKVM guests, all boot time memory is represented via
> memory@XXXX nodes and hot plugged/pluggable memory is represented via
> ibm,dynamic-memory-reconfiguration property. We are marking all
> hotplugged memory that is in ASSIGNED state during boot as HOTPLUGGED.
> With this only cold plugged memory gets marked for PowerKVM but
> need to check how this will affect PowerVM guests.

Can you verify this on PowerVM too? I.e., in most cases we should not find
anything under ibm,dynamic-reconfiguration-memory?


-aneesh
Reza Arbab Aug. 11, 2017, 4:28 p.m. UTC | #5
On Fri, Aug 11, 2017 at 02:07:51PM +0530, Aneesh Kumar K.V wrote:
>Reza Arbab <arbab@linux.vnet.ibm.com> writes:
>
>> On Thu, Aug 10, 2017 at 02:53:48PM +0530, Bharata B Rao wrote:
>>>diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
>>>index f830562..24ecf53 100644
>>>--- a/arch/powerpc/kernel/prom.c
>>>+++ b/arch/powerpc/kernel/prom.c
>>>@@ -524,6 +524,7 @@ static int __init early_init_dt_scan_drconf_memory(unsigned long node)
>>> 					size = 0x80000000ul - base;
>>> 			}
>>> 			memblock_add(base, size);
>>>+			memblock_mark_hotplug(base, size);
>>> 		} while (--rngs);
>>> 	}
>>> 	memblock_dump_all();
>>
>> Doing this has the effect of putting all the affected memory into
>> ZONE_MOVABLE. See find_zone_movable_pfns_for_nodes(). This means no
>> kernel allocations can occur there. Is that okay?
>>
>
>So the thinking here is any memory identified via ibm,dynamic-memory can
>be hot removed later. Hence the need to add them lmb size, because our
>hotplug framework remove them in lmb size. If we want to support
>hotunplug, then we will have to make sure kernel allocation doesn't
>happen in that region right ?

Yes, the net result is that this memory can now be hotremoved. I just 
wanted to point out that the patch doesn't only change the granularity 
of addition--it also causes the memory to end up in a different zone 
(when using movable_node).

>With the above i would consider not marking it hotplug was a bug before
>?

Sure, that's reasonable.
Bharata B Rao Aug. 17, 2017, 9:58 a.m. UTC | #6
On Fri, Aug 11, 2017 at 02:12:04PM +0530, Aneesh Kumar K.V wrote:
> Bharata B Rao <bharata@linux.vnet.ibm.com> writes:
> 
> > For a PowerKVM guest, it is possible to specify a DIMM device in
> > addition to the system RAM at boot time. When such a cold plugged DIMM
> > device is removed from a radix guest, we hit the following warning in the
> > guest kernel resulting in the eventual failure of memory unplug:
> >
> > remove_pud_table: unaligned range
> > WARNING: CPU: 3 PID: 164 at arch/powerpc/mm/pgtable-radix.c:597 remove_pagetable+0x468/0xca0
> > Call Trace:
> > remove_pagetable+0x464/0xca0 (unreliable)
> > radix__remove_section_mapping+0x24/0x40
> > remove_section_mapping+0x28/0x60
> > arch_remove_memory+0xcc/0x120
> > remove_memory+0x1ac/0x270
> > dlpar_remove_lmb+0x1ac/0x210
> > dlpar_memory+0xbc4/0xeb0
> > pseries_hp_work_fn+0x1a4/0x230
> > process_one_work+0x1cc/0x660
> > worker_thread+0xac/0x6d0
> > kthread+0x16c/0x1b0
> > ret_from_kernel_thread+0x5c/0x74
> >
> > The DIMM memory that is cold plugged gets merged to the same memblock
> > region as RAM and hence gets mapped at 1G alignment. However since the
> > removal is done for one LMB (lmb size 256MB) at a time, the address
> > of the LMB (which is 256MB aligned) would get flagged as unaligned
> > in remove_pud_table() resulting in the above failure.
> >
> > This problem is not seen for hot plugged memory because for the
> > hot plugged memory, the mappings are created separately for each
> > LMB and hence they all get aligned at 256MB.
> >
> > To fix this problem for the cold plugged memory, let us mark the
> > cold plugged memblock region explicitly as HOTPLUGGED so that the
> > region doesn't get merged with RAM. All the memory that is discovered
> > via ibm,dynamic-memory-configuration is marked so(1). Next identify
> > such regions in radix_init_pgtable() and create separate mappings
> > within that region for each LMB so that they get don't get aligned
> > like RAM region at 1G (2).
> >
> > (1) For PowerKVM guests, all boot time memory is represented via
> > memory@XXXX nodes and hot plugged/pluggable memory is represented via
> > ibm,dynamic-memory-reconfiguration property. We are marking all
> > hotplugged memory that is in ASSIGNED state during boot as HOTPLUGGED.
> > With this only cold plugged memory gets marked for PowerKVM but
> > need to check how this will affect PowerVM guests.
> 
> Can you verify this on PowerVM too ? ie we should in most case not find
> anything under ibm,dynamic-memory-reconfiguration ?

Checked with a couple of PowerVM systems. It looks like, except for the RMA,
which is represented by memory@0, the rest of the memory comes under
ibm,dynamic-reconfiguration-memory. So the approach I have taken in this
fix wouldn't be optimal on PowerVM?

Regards,
Bharata.
Bharata B Rao Sept. 1, 2017, 6:53 a.m. UTC | #7
On Thu, Aug 10, 2017 at 02:53:48PM +0530, Bharata B Rao wrote:
> For a PowerKVM guest, it is possible to specify a DIMM device in
> addition to the system RAM at boot time. When such a cold plugged DIMM
> device is removed from a radix guest, we hit the following warning in the
> guest kernel resulting in the eventual failure of memory unplug:
> 
> remove_pud_table: unaligned range
> WARNING: CPU: 3 PID: 164 at arch/powerpc/mm/pgtable-radix.c:597 remove_pagetable+0x468/0xca0
> Call Trace:
> remove_pagetable+0x464/0xca0 (unreliable)
> radix__remove_section_mapping+0x24/0x40
> remove_section_mapping+0x28/0x60
> arch_remove_memory+0xcc/0x120
> remove_memory+0x1ac/0x270
> dlpar_remove_lmb+0x1ac/0x210
> dlpar_memory+0xbc4/0xeb0
> pseries_hp_work_fn+0x1a4/0x230
> process_one_work+0x1cc/0x660
> worker_thread+0xac/0x6d0
> kthread+0x16c/0x1b0
> ret_from_kernel_thread+0x5c/0x74
> 
> The DIMM memory that is cold plugged gets merged to the same memblock
> region as RAM and hence gets mapped at 1G alignment. However since the
> removal is done for one LMB (lmb size 256MB) at a time, the address
> of the LMB (which is 256MB aligned) would get flagged as unaligned
> in remove_pud_table() resulting in the above failure.
> 
> This problem is not seen for hot plugged memory because for the
> hot plugged memory, the mappings are created separately for each
> LMB and hence they all get aligned at 256MB.
> 
> To fix this problem for the cold plugged memory, let us mark the
> cold plugged memblock region explicitly as HOTPLUGGED so that the
> region doesn't get merged with RAM. All the memory that is discovered
> via ibm,dynamic-memory-configuration is marked so(1). Next identify
> such regions in radix_init_pgtable() and create separate mappings
> within that region for each LMB so that they get don't get aligned
> like RAM region at 1G (2).
> 
> (1) For PowerKVM guests, all boot time memory is represented via
> memory@XXXX nodes and hot plugged/pluggable memory is represented via
> ibm,dynamic-memory-reconfiguration property. We are marking all
> hotplugged memory that is in ASSIGNED state during boot as HOTPLUGGED.
> With this only cold plugged memory gets marked for PowerKVM but
> need to check how this will affect PowerVM guests.
> 
> (2) To create separate mappings for every LMB in the hot plugged
> region, we need lmb-size. I am currently using memory_block_size_bytes()
> API to get the lmb-size. Since this is early init time code, the
> machine type isn't probed yet and hence memory_block_size_bytes()
> would return the default LMB size as 16MB. Hence we end up creating
> separate mappings at much lower granularity than what we can ideally
> do for pseries machine.
> 
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kernel/prom.c      |  1 +
>  arch/powerpc/mm/pgtable-radix.c | 17 ++++++++++++++---
>  2 files changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index f830562..24ecf53 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -524,6 +524,7 @@ static int __init early_init_dt_scan_drconf_memory(unsigned long node)
>  					size = 0x80000000ul - base;
>  			}
>  			memblock_add(base, size);
> +			memblock_mark_hotplug(base, size);

One of the suggestions was to make the above conditional on radix so
that PowerVM doesn't get affected by this. However, the early_radix_enabled()
check isn't usable yet at this point; MMU_FTR_TYPE_RADIX gets set
only a bit later in early_init_devtree().

Regards,
Bharata.
Nathan Fontenot Sept. 1, 2017, 2:11 p.m. UTC | #8
On 09/01/2017 01:53 AM, Bharata B Rao wrote:
> On Thu, Aug 10, 2017 at 02:53:48PM +0530, Bharata B Rao wrote:
>> For a PowerKVM guest, it is possible to specify a DIMM device in
>> addition to the system RAM at boot time. When such a cold plugged DIMM
>> device is removed from a radix guest, we hit the following warning in the
>> guest kernel resulting in the eventual failure of memory unplug:
>>
>> remove_pud_table: unaligned range
>> WARNING: CPU: 3 PID: 164 at arch/powerpc/mm/pgtable-radix.c:597 remove_pagetable+0x468/0xca0
>> Call Trace:
>> remove_pagetable+0x464/0xca0 (unreliable)
>> radix__remove_section_mapping+0x24/0x40
>> remove_section_mapping+0x28/0x60
>> arch_remove_memory+0xcc/0x120
>> remove_memory+0x1ac/0x270
>> dlpar_remove_lmb+0x1ac/0x210
>> dlpar_memory+0xbc4/0xeb0
>> pseries_hp_work_fn+0x1a4/0x230
>> process_one_work+0x1cc/0x660
>> worker_thread+0xac/0x6d0
>> kthread+0x16c/0x1b0
>> ret_from_kernel_thread+0x5c/0x74
>>
>> The DIMM memory that is cold plugged gets merged to the same memblock
>> region as RAM and hence gets mapped at 1G alignment. However since the
>> removal is done for one LMB (lmb size 256MB) at a time, the address
>> of the LMB (which is 256MB aligned) would get flagged as unaligned
>> in remove_pud_table() resulting in the above failure.
>>
>> This problem is not seen for hot plugged memory because for the
>> hot plugged memory, the mappings are created separately for each
>> LMB and hence they all get aligned at 256MB.
>>
>> To fix this problem for the cold plugged memory, let us mark the
>> cold plugged memblock region explicitly as HOTPLUGGED so that the
>> region doesn't get merged with RAM. All the memory that is discovered
>> via ibm,dynamic-memory-configuration is marked so(1). Next identify
>> such regions in radix_init_pgtable() and create separate mappings
>> within that region for each LMB so that they get don't get aligned
>> like RAM region at 1G (2).
>>
>> (1) For PowerKVM guests, all boot time memory is represented via
>> memory@XXXX nodes and hot plugged/pluggable memory is represented via
>> ibm,dynamic-memory-reconfiguration property. We are marking all
>> hotplugged memory that is in ASSIGNED state during boot as HOTPLUGGED.
>> With this only cold plugged memory gets marked for PowerKVM but
>> need to check how this will affect PowerVM guests.
>>
>> (2) To create separate mappings for every LMB in the hot plugged
>> region, we need lmb-size. I am currently using memory_block_size_bytes()
>> API to get the lmb-size. Since this is early init time code, the
>> machine type isn't probed yet and hence memory_block_size_bytes()
>> would return the default LMB size as 16MB. Hence we end up creating
>> separate mappings at much lower granularity than what we can ideally
>> do for pseries machine.
>>
>> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/kernel/prom.c      |  1 +
>>  arch/powerpc/mm/pgtable-radix.c | 17 ++++++++++++++---
>>  2 files changed, 15 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
>> index f830562..24ecf53 100644
>> --- a/arch/powerpc/kernel/prom.c
>> +++ b/arch/powerpc/kernel/prom.c
>> @@ -524,6 +524,7 @@ static int __init early_init_dt_scan_drconf_memory(unsigned long node)
>>  					size = 0x80000000ul - base;
>>  			}
>>  			memblock_add(base, size);
>> +			memblock_mark_hotplug(base, size);
> 
> One of the suggestions was to make the above conditional to radix so
> that PowerVM doesn't get affected by this. However early_radix_enabled()
> check isn't usable yet at this point and MMU_FTR_TYPE_RADIX will get set
> only a bit later in early_init_devtree().

We do walk the dynamic reconfiguration memory again in the numa code, see
parse_drconf_memory() in numa.c. Would that be far enough along in boot to use
early_radix_enabled() and mark the memory hotplug at that point?

This may not be the ideal place to mark hotplug memory for radix but it may
be nicer than adding another walk of the device tree property somewhere else.

-Nathan

> 
> Regards,
> Bharata.
>
Bharata B Rao Sept. 5, 2017, 4:20 a.m. UTC | #9
On Fri, Sep 01, 2017 at 09:11:18AM -0500, Nathan Fontenot wrote:
> On 09/01/2017 01:53 AM, Bharata B Rao wrote:
> > On Thu, Aug 10, 2017 at 02:53:48PM +0530, Bharata B Rao wrote:
> >> For a PowerKVM guest, it is possible to specify a DIMM device in
> >> addition to the system RAM at boot time. When such a cold plugged DIMM
> >> device is removed from a radix guest, we hit the following warning in the
> >> guest kernel resulting in the eventual failure of memory unplug:
> >>
> >> remove_pud_table: unaligned range
> >> WARNING: CPU: 3 PID: 164 at arch/powerpc/mm/pgtable-radix.c:597 remove_pagetable+0x468/0xca0
> >> Call Trace:
> >> remove_pagetable+0x464/0xca0 (unreliable)
> >> radix__remove_section_mapping+0x24/0x40
> >> remove_section_mapping+0x28/0x60
> >> arch_remove_memory+0xcc/0x120
> >> remove_memory+0x1ac/0x270
> >> dlpar_remove_lmb+0x1ac/0x210
> >> dlpar_memory+0xbc4/0xeb0
> >> pseries_hp_work_fn+0x1a4/0x230
> >> process_one_work+0x1cc/0x660
> >> worker_thread+0xac/0x6d0
> >> kthread+0x16c/0x1b0
> >> ret_from_kernel_thread+0x5c/0x74
> >>
> >> The DIMM memory that is cold plugged gets merged to the same memblock
> >> region as RAM and hence gets mapped at 1G alignment. However since the
> >> removal is done for one LMB (lmb size 256MB) at a time, the address
> >> of the LMB (which is 256MB aligned) would get flagged as unaligned
> >> in remove_pud_table() resulting in the above failure.
> >>
> >> This problem is not seen for hot plugged memory because for the
> >> hot plugged memory, the mappings are created separately for each
> >> LMB and hence they all get aligned at 256MB.
> >>
> >> To fix this problem for the cold plugged memory, let us mark the
> >> cold plugged memblock region explicitly as HOTPLUGGED so that the
> >> region doesn't get merged with RAM. All the memory that is discovered
> >> via ibm,dynamic-memory-configuration is marked so(1). Next identify
> >> such regions in radix_init_pgtable() and create separate mappings
> >> within that region for each LMB so that they get don't get aligned
> >> like RAM region at 1G (2).
> >>
> >> (1) For PowerKVM guests, all boot time memory is represented via
> >> memory@XXXX nodes and hot plugged/pluggable memory is represented via
> >> ibm,dynamic-memory-reconfiguration property. We are marking all
> >> hotplugged memory that is in ASSIGNED state during boot as HOTPLUGGED.
> >> With this only cold plugged memory gets marked for PowerKVM but
> >> need to check how this will affect PowerVM guests.
> >>
> >> (2) To create separate mappings for every LMB in the hot plugged
> >> region, we need lmb-size. I am currently using memory_block_size_bytes()
> >> API to get the lmb-size. Since this is early init time code, the
> >> machine type isn't probed yet and hence memory_block_size_bytes()
> >> would return the default LMB size as 16MB. Hence we end up creating
> >> separate mappings at much lower granularity than what we can ideally
> >> do for pseries machine.
> >>
> >> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> >> ---
> >>  arch/powerpc/kernel/prom.c      |  1 +
> >>  arch/powerpc/mm/pgtable-radix.c | 17 ++++++++++++++---
> >>  2 files changed, 15 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> >> index f830562..24ecf53 100644
> >> --- a/arch/powerpc/kernel/prom.c
> >> +++ b/arch/powerpc/kernel/prom.c
> >> @@ -524,6 +524,7 @@ static int __init early_init_dt_scan_drconf_memory(unsigned long node)
> >>  					size = 0x80000000ul - base;
> >>  			}
> >>  			memblock_add(base, size);
> >> +			memblock_mark_hotplug(base, size);
> > 
> > One of the suggestions was to make the above conditional to radix so
> > that PowerVM doesn't get affected by this. However early_radix_enabled()
> > check isn't usable yet at this point and MMU_FTR_TYPE_RADIX will get set
> > only a bit later in early_init_devtree().
> 
> We do walk the dynamic reconfiguration memory again in the numa code, see
> parse_drconf_memory() in numa.c, would it far enough along in boot to use
> early_radix_enabled() and mark the memory hotplug at this point?

parse_drconf_memory() in numa.c runs after the radix page tables are set up.
Hence setting the hotplugged state from it will not help.

Regards,
Bharata.

Patch

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index f830562..24ecf53 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -524,6 +524,7 @@  static int __init early_init_dt_scan_drconf_memory(unsigned long node)
 					size = 0x80000000ul - base;
 			}
 			memblock_add(base, size);
+			memblock_mark_hotplug(base, size);
 		} while (--rngs);
 	}
 	memblock_dump_all();
diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index 671a45d..180d25a 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -12,6 +12,7 @@ 
 #include <linux/memblock.h>
 #include <linux/of_fdt.h>
 #include <linux/mm.h>
+#include <linux/memory.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -255,15 +256,25 @@  static void __init radix_init_pgtable(void)
 {
 	unsigned long rts_field;
 	struct memblock_region *reg;
+	phys_addr_t addr;
+	u64 lmb_size = memory_block_size_bytes();
 
 	/* We don't support slb for radix */
 	mmu_slb_size = 0;
 	/*
 	 * Create the linear mapping, using standard page size for now
 	 */
-	for_each_memblock(memory, reg)
-		WARN_ON(create_physical_mapping(reg->base,
-						reg->base + reg->size));
+	for_each_memblock(memory, reg) {
+		if (memblock_is_hotpluggable(reg)) {
+			for (addr = reg->base; addr < (reg->base + reg->size);
+				addr += lmb_size)
+				WARN_ON(create_physical_mapping(addr,
+					addr + lmb_size));
+		} else {
+			WARN_ON(create_physical_mapping(reg->base,
+							reg->base + reg->size));
+		}
+	}
 
 	/* Find out how many PID bits are supported */
 	if (cpu_has_feature(CPU_FTR_HVMODE)) {
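
The per-LMB loop in the radix_init_pgtable() hunk above always maps
[addr, addr + lmb_size), so if a hotpluggable region's size were not a
multiple of lmb_size the last call would run past the end of the region.
Below is a standalone sketch of the same split with the final chunk
clamped to the region end. It is not part of the posted patch: map_fn,
struct map_stats, record_mapping() and map_region_per_lmb() are all
hypothetical names, with the callback standing in for
create_physical_mapping().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for create_physical_mapping(start, end). */
typedef int (*map_fn)(uint64_t start, uint64_t end, void *ctx);

/* Hypothetical bookkeeping used by the sample callback below. */
struct map_stats {
	unsigned int count;	/* number of mapping calls made */
	uint64_t last_end;	/* end address of the most recent call */
};

/* Sample callback: records the call instead of touching page tables. */
static int record_mapping(uint64_t start, uint64_t end, void *ctx)
{
	struct map_stats *stats = ctx;

	(void)start;
	stats->count++;
	stats->last_end = end;
	return 0;
}

/*
 * Map [base, base + size) in lmb_size steps, clamping the final chunk so
 * it never extends past the region. Stops at the first failing call.
 * Assumes base + size does not overflow.
 */
static int map_region_per_lmb(uint64_t base, uint64_t size,
			      uint64_t lmb_size, map_fn map, void *ctx)
{
	uint64_t end = base + size;

	assert(lmb_size != 0);
	for (uint64_t addr = base; addr < end; addr += lmb_size) {
		uint64_t chunk_end = (end - addr < lmb_size) ? end
							     : addr + lmb_size;
		int rc = map(addr, chunk_end, ctx);

		if (rc)
			return rc;
	}
	return 0;
}
```

With a 640MB region at 1G and a 256MB step, this makes three calls and
the last one ends at 0x68000000, the region end, rather than at
0x70000000.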