powerpc: mm: Limit rma_size to 1TB when running without HV mode
diff mbox series

Message ID 20190710052018.14628-1-sjitindarsingh@gmail.com
State Accepted
Headers show
Series
  • powerpc: mm: Limit rma_size to 1TB when running without HV mode
Related show

Commit Message

Suraj Jitindar Singh July 10, 2019, 5:20 a.m. UTC
The virtual real mode addressing (VRMA) mechanism is used when a
partition is using HPT (Hash Page Table) translation and performs
real mode accesses (MSR[IR|DR] = 0) in non-hypervisor mode. In this
mode effective address bits 0:23 are treated as zero (i.e. the access
is aliased to 0) and the access is performed using an implicit 1TB SLB
entry.

The size of the RMA (Real Memory Area) is communicated to the guest as
the size of the first memory region in the device tree. And because of
the mechanism described above can be expected to not exceed 1TB. In the
event that the host erroneously represents the RMA as being larger than
1TB, guest accesses in real mode to memory addresses above 1TB will be
aliased down to below 1TB. This means that a memory access performed in
real mode may differ to one performed in virtual mode for the same memory
address, which would likely have unintended consequences.

To avoid this outcome have the guest explicitly limit the size of the
RMA to the current maximum, which is 1TB. This means that even if the
first memory block is larger than 1TB, only the first 1TB should be
accessed in real mode.

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
---
 arch/powerpc/mm/book3s64/hash_utils.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Satheesh Rajendran July 10, 2019, 7:32 a.m. UTC | #1
On Wed, Jul 10, 2019 at 03:20:18PM +1000, Suraj Jitindar Singh wrote:
> The virtual real mode addressing (VRMA) mechanism is used when a
> partition is using HPT (Hash Page Table) translation and performs
> real mode accesses (MSR[IR|DR] = 0) in non-hypervisor mode. In this
> mode effective address bits 0:23 are treated as zero (i.e. the access
> is aliased to 0) and the access is performed using an implicit 1TB SLB
> entry.
> 
> The size of the RMA (Real Memory Area) is communicated to the guest as
> the size of the first memory region in the device tree. And because of
> the mechanism described above can be expected to not exceed 1TB. In the
> event that the host erroneously represents the RMA as being larger than
> 1TB, guest accesses in real mode to memory addresses above 1TB will be
> aliased down to below 1TB. This means that a memory access performed in
> real mode may differ to one performed in virtual mode for the same memory
> address, which would likely have unintended consequences.
> 
> To avoid this outcome have the guest explicitly limit the size of the
> RMA to the current maximum, which is 1TB. This means that even if the
> first memory block is larger than 1TB, only the first 1TB should be
> accessed in real mode.
> 
> Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
> ---
>  arch/powerpc/mm/book3s64/hash_utils.c | 8 ++++++++
>  1 file changed, 8 insertions(+)

Hi,

Tested this patch and now Power8 compat guest boots fine with mem >1024G on 
Power9 host.

Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>

Host: P9; kernel: 5.2.0-00915-g5ad18b2e60b7

Before this patch:
Guest crashes..
[0.000000] BUG: Kernel NULL pointer dereference at 0x00000028
[0.000000] Faulting instruction address: 0xc00000000102caa0
[0.000000] Oops: Kernel access of bad area, sig: 11 [#1]
[0.000000] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[0.000000] Modules linked in:
[0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 5.2.0-03135-ge9a83bd23220 #24
[0.000000] NIP:  c00000000102caa0 LR: c00000000102ca84 CTR: 0000000000000000
[0.000000] REGS: c000000001603ba0 TRAP: 0380   Not tainted  (5.2.0-03135-ge9a83bd23220)
[0.000000] MSR:  8000000000001033 <SF,ME,IR,DR,RI,LE>  CR: 24000428  XER: 20000000
[0.000000] CFAR: c00000000102c1d8 IRQMASK: 1 
[0.000000] GPR00: c00000000102ca84 c000000001603e30 c000000001605300 0000010000000000 
[0.000000] GPR04: 0000000000000000 0000000000000000 c00000ffffff8000 c000000001863dc8 
[0.000000] GPR08: 0000000000002028 0000000000000000 c00000ffffff8000 0000000000000009 
[0.000000] GPR12: 0000000000000000 c0000000018f0000 000000007dc5fef0 00000000012e1220 
[0.000000] GPR16: 00000000012e10a0 fffffffffffffffd 000000007dc5fef0 000000000130fcc0 
[0.000000] GPR20: 0000000000000014 0000000001a80000 000000002fff0000 fffffffffffffffd 
[0.000000] GPR24: 0000000001d0000c c000000000000000 c000000001641ed8 c000000001641b78 
[0.000000] GPR28: 0000000000000000 0000000000000000 0000010000000000 0000000000000000 
[0.000000] NIP [c00000000102caa0] emergency_stack_init+0xb8/0x118
[0.000000] LR [c00000000102ca84] emergency_stack_init+0x9c/0x118
[0.000000] Call Trace:
[0.000000] [c000000001603e30] [c00000000102ca84] emergency_stack_init+0x9c/0x118 (unreliable)
[0.000000] [c000000001603e80] [c00000000102bd54] setup_arch+0x2fc/0x388
[0.000000] [c000000001603ef0] [c000000001023ccc] start_kernel+0xa4/0x660
[0.000000] [c000000001603f90] [c00000000000b774] start_here_common+0x1c/0x528
[0.000000] Instruction dump:
[0.000000] 7ffc07b4 7fc3f378 7bfd1f24 7f84e378 4bfff6e9 3f620004 3b7bc878 7f84e378 
[0.000000] 39434000 7fc3f378 e93b0000 7d29e82a <f9490028> 4bfff6c5 e93b0000 7f84e378 
[0.000000] random: get_random_bytes called from print_oops_end_marker+0x6c/0xa0 with crng_init=0
[0.000000] ---[ end trace 0000000000000000 ]---
[0.000000] 
[0.000000] Kernel panic - not syncing: Attempted to kill the idle task!

-------------------------
With this patch:
# virsh start --console p8
Domain p8 started
Connected to domain p8
..
..
Fedora 27 (Twenty Seven)
Kernel 5.2.0-03136-gf709b0494ad9 on an ppc64le (hvc0)

atest-guest login: 
# free -g
              total        used        free      shared  buff/cache   available
Mem:           1028       0        1027           0           0        1025
Swap:         0           0     

Regards,
-Satheesh.

> 
> diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
> index 28ced26f2a00..4d0e2cce9cd5 100644
> --- a/arch/powerpc/mm/book3s64/hash_utils.c
> +++ b/arch/powerpc/mm/book3s64/hash_utils.c
> @@ -1901,11 +1901,19 @@ void hash__setup_initial_memory_limit(phys_addr_t first_memblock_base,
>  	 *
>  	 * For guests on platforms before POWER9, we clamp the it limit to 1G
>  	 * to avoid some funky things such as RTAS bugs etc...
> +	 * On POWER9 we limit to 1TB in case the host erroneously told us that
> +	 * the RMA was >1TB. Effective address bits 0:23 are treated as zero
> +	 * (meaning the access is aliased to zero i.e. addr = addr % 1TB)
> +	 * for virtual real mode addressing and so it doesn't make sense to
> +	 * have an area larger than 1TB as it can't be addressed.
>  	 */
>  	if (!early_cpu_has_feature(CPU_FTR_HVMODE)) {
>  		ppc64_rma_size = first_memblock_size;
>  		if (!early_cpu_has_feature(CPU_FTR_ARCH_300))
>  			ppc64_rma_size = min_t(u64, ppc64_rma_size, 0x40000000);
> +		else
> +			ppc64_rma_size = min_t(u64, ppc64_rma_size,
> +					       1UL << SID_SHIFT_1T);
> 
>  		/* Finally limit subsequent allocations */
>  		memblock_set_current_limit(ppc64_rma_size);
> -- 
> 2.13.6
>
David Gibson July 10, 2019, 2:21 p.m. UTC | #2
On Wed, Jul 10, 2019 at 03:20:18PM +1000, Suraj Jitindar Singh wrote:
> The virtual real mode addressing (VRMA) mechanism is used when a
> partition is using HPT (Hash Page Table) translation and performs
> real mode accesses (MSR[IR|DR] = 0) in non-hypervisor mode. In this
> mode effective address bits 0:23 are treated as zero (i.e. the access
> is aliased to 0) and the access is performed using an implicit 1TB SLB
> entry.
> 
> The size of the RMA (Real Memory Area) is communicated to the guest as
> the size of the first memory region in the device tree. And because of
> the mechanism described above can be expected to not exceed 1TB. In the
> event that the host erroneously represents the RMA as being larger than
> 1TB, guest accesses in real mode to memory addresses above 1TB will be
> aliased down to below 1TB. This means that a memory access performed in
> real mode may differ to one performed in virtual mode for the same memory
> address, which would likely have unintended consequences.
> 
> To avoid this outcome have the guest explicitly limit the size of the
> RMA to the current maximum, which is 1TB. This means that even if the
> first memory block is larger than 1TB, only the first 1TB should be
> accessed in real mode.
> 
> Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Although I'd really like to also see some comments added in
allocate_paca_ptrs() explaining the constraints there.

Oh, also, basing this on the non-compat PVR is bogus, but it's still
better than what we had.

> ---
>  arch/powerpc/mm/book3s64/hash_utils.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
> index 28ced26f2a00..4d0e2cce9cd5 100644
> --- a/arch/powerpc/mm/book3s64/hash_utils.c
> +++ b/arch/powerpc/mm/book3s64/hash_utils.c
> @@ -1901,11 +1901,19 @@ void hash__setup_initial_memory_limit(phys_addr_t first_memblock_base,
>  	 *
>  	 * For guests on platforms before POWER9, we clamp the it limit to 1G
>  	 * to avoid some funky things such as RTAS bugs etc...
> +	 * On POWER9 we limit to 1TB in case the host erroneously told us that
> +	 * the RMA was >1TB. Effective address bits 0:23 are treated as zero
> +	 * (meaning the access is aliased to zero i.e. addr = addr % 1TB)
> +	 * for virtual real mode addressing and so it doesn't make sense to
> +	 * have an area larger than 1TB as it can't be addressed.
>  	 */
>  	if (!early_cpu_has_feature(CPU_FTR_HVMODE)) {
>  		ppc64_rma_size = first_memblock_size;
>  		if (!early_cpu_has_feature(CPU_FTR_ARCH_300))
>  			ppc64_rma_size = min_t(u64, ppc64_rma_size, 0x40000000);
> +		else
> +			ppc64_rma_size = min_t(u64, ppc64_rma_size,
> +					       1UL << SID_SHIFT_1T);
>  
>  		/* Finally limit subsequent allocations */
>  		memblock_set_current_limit(ppc64_rma_size);
Michael Ellerman July 12, 2019, 1:09 p.m. UTC | #3
Suraj Jitindar Singh <sjitindarsingh@gmail.com> writes:
> The virtual real mode addressing (VRMA) mechanism is used when a
> partition is using HPT (Hash Page Table) translation and performs
> real mode accesses (MSR[IR|DR] = 0) in non-hypervisor mode. In this
> mode effective address bits 0:23 are treated as zero (i.e. the access
> is aliased to 0) and the access is performed using an implicit 1TB SLB
> entry.
>
> The size of the RMA (Real Memory Area) is communicated to the guest as
> the size of the first memory region in the device tree. And because of
> the mechanism described above can be expected to not exceed 1TB. In the
> event that the host erroneously represents the RMA as being larger than
> 1TB, guest accesses in real mode to memory addresses above 1TB will be
> aliased down to below 1TB. This means that a memory access performed in
> real mode may differ to one performed in virtual mode for the same memory
> address, which would likely have unintended consequences.
>
> To avoid this outcome have the guest explicitly limit the size of the
> RMA to the current maximum, which is 1TB. This means that even if the
> first memory block is larger than 1TB, only the first 1TB should be
> accessed in real mode.
>
> Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>

I added:

Fixes: c3ab300ea555 ("powerpc: Add POWER9 cputable entry")
Cc: stable@vger.kernel.org # v4.6+


Which is not exactly correct, but probably good enough?

cheers

> diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
> index 28ced26f2a00..4d0e2cce9cd5 100644
> --- a/arch/powerpc/mm/book3s64/hash_utils.c
> +++ b/arch/powerpc/mm/book3s64/hash_utils.c
> @@ -1901,11 +1901,19 @@ void hash__setup_initial_memory_limit(phys_addr_t first_memblock_base,
>  	 *
>  	 * For guests on platforms before POWER9, we clamp the it limit to 1G
>  	 * to avoid some funky things such as RTAS bugs etc...
> +	 * On POWER9 we limit to 1TB in case the host erroneously told us that
> +	 * the RMA was >1TB. Effective address bits 0:23 are treated as zero
> +	 * (meaning the access is aliased to zero i.e. addr = addr % 1TB)
> +	 * for virtual real mode addressing and so it doesn't make sense to
> +	 * have an area larger than 1TB as it can't be addressed.
>  	 */
>  	if (!early_cpu_has_feature(CPU_FTR_HVMODE)) {
>  		ppc64_rma_size = first_memblock_size;
>  		if (!early_cpu_has_feature(CPU_FTR_ARCH_300))
>  			ppc64_rma_size = min_t(u64, ppc64_rma_size, 0x40000000);
> +		else
> +			ppc64_rma_size = min_t(u64, ppc64_rma_size,
> +					       1UL << SID_SHIFT_1T);
>  
>  		/* Finally limit subsequent allocations */
>  		memblock_set_current_limit(ppc64_rma_size);
> -- 
> 2.13.6
Suraj Jitindar Singh July 15, 2019, 1:58 a.m. UTC | #4
On Fri, 2019-07-12 at 23:09 +1000, Michael Ellerman wrote:
> Suraj Jitindar Singh <sjitindarsingh@gmail.com> writes:
> > The virtual real mode addressing (VRMA) mechanism is used when a
> > partition is using HPT (Hash Page Table) translation and performs
> > real mode accesses (MSR[IR|DR] = 0) in non-hypervisor mode. In this
> > mode effective address bits 0:23 are treated as zero (i.e. the
> > access
> > is aliased to 0) and the access is performed using an implicit 1TB
> > SLB
> > entry.
> > 
> > The size of the RMA (Real Memory Area) is communicated to the guest
> > as
> > the size of the first memory region in the device tree. And because
> > of
> > the mechanism described above can be expected to not exceed 1TB. In
> > the
> > event that the host erroneously represents the RMA as being larger
> > than
> > 1TB, guest accesses in real mode to memory addresses above 1TB will
> > be
> > aliased down to below 1TB. This means that a memory access
> > performed in
> > real mode may differ to one performed in virtual mode for the same
> > memory
> > address, which would likely have unintended consequences.
> > 
> > To avoid this outcome have the guest explicitly limit the size of
> > the
> > RMA to the current maximum, which is 1TB. This means that even if
> > the
> > first memory block is larger than 1TB, only the first 1TB should be
> > accessed in real mode.
> > 
> > Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
> 
> I added:
> 
> Fixes: c3ab300ea555 ("powerpc: Add POWER9 cputable entry")
> Cc: stable@vger.kernel.org # v4.6+
> 
> 
> Which is not exactly correct, but probably good enough?

I think we actually want:
Fixes: c610d65c0ad0 ("powerpc/pseries: lift RTAS limit for hash")

Which is what actually caused it to break and for the issue to present
itself.

> 
> cheers
> 
> > diff --git a/arch/powerpc/mm/book3s64/hash_utils.c
> > b/arch/powerpc/mm/book3s64/hash_utils.c
> > index 28ced26f2a00..4d0e2cce9cd5 100644
> > --- a/arch/powerpc/mm/book3s64/hash_utils.c
> > +++ b/arch/powerpc/mm/book3s64/hash_utils.c
> > @@ -1901,11 +1901,19 @@ void
> > hash__setup_initial_memory_limit(phys_addr_t first_memblock_base,
> >  	 *
> >  	 * For guests on platforms before POWER9, we clamp the it
> > limit to 1G
> >  	 * to avoid some funky things such as RTAS bugs etc...
> > +	 * On POWER9 we limit to 1TB in case the host erroneously
> > told us that
> > +	 * the RMA was >1TB. Effective address bits 0:23 are
> > treated as zero
> > +	 * (meaning the access is aliased to zero i.e. addr = addr
> > % 1TB)
> > +	 * for virtual real mode addressing and so it doesn't make
> > sense to
> > +	 * have an area larger than 1TB as it can't be addressed.
> >  	 */
> >  	if (!early_cpu_has_feature(CPU_FTR_HVMODE)) {
> >  		ppc64_rma_size = first_memblock_size;
> >  		if (!early_cpu_has_feature(CPU_FTR_ARCH_300))
> >  			ppc64_rma_size = min_t(u64,
> > ppc64_rma_size, 0x40000000);
> > +		else
> > +			ppc64_rma_size = min_t(u64,
> > ppc64_rma_size,
> > +					       1UL <<
> > SID_SHIFT_1T);
> >  
> >  		/* Finally limit subsequent allocations */
> >  		memblock_set_current_limit(ppc64_rma_size);
> > -- 
> > 2.13.6
Michael Ellerman July 15, 2019, 2:23 a.m. UTC | #5
Suraj Jitindar Singh <sjitindarsingh@gmail.com> writes:
> On Fri, 2019-07-12 at 23:09 +1000, Michael Ellerman wrote:
>> Suraj Jitindar Singh <sjitindarsingh@gmail.com> writes:
>> > The virtual real mode addressing (VRMA) mechanism is used when a
>> > partition is using HPT (Hash Page Table) translation and performs
>> > real mode accesses (MSR[IR|DR] = 0) in non-hypervisor mode. In this
>> > mode effective address bits 0:23 are treated as zero (i.e. the
>> > access
>> > is aliased to 0) and the access is performed using an implicit 1TB
>> > SLB
>> > entry.
>> > 
>> > The size of the RMA (Real Memory Area) is communicated to the guest
>> > as
>> > the size of the first memory region in the device tree. And because
>> > of
>> > the mechanism described above can be expected to not exceed 1TB. In
>> > the
>> > event that the host erroneously represents the RMA as being larger
>> > than
>> > 1TB, guest accesses in real mode to memory addresses above 1TB will
>> > be
>> > aliased down to below 1TB. This means that a memory access
>> > performed in
>> > real mode may differ to one performed in virtual mode for the same
>> > memory
>> > address, which would likely have unintended consequences.
>> > 
>> > To avoid this outcome have the guest explicitly limit the size of
>> > the
>> > RMA to the current maximum, which is 1TB. This means that even if
>> > the
>> > first memory block is larger than 1TB, only the first 1TB should be
>> > accessed in real mode.
>> > 
>> > Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
>> 
>> I added:
>> 
>> Fixes: c3ab300ea555 ("powerpc: Add POWER9 cputable entry")
>> Cc: stable@vger.kernel.org # v4.6+
>> 
>> 
>> Which is not exactly correct, but probably good enough?
>
> I think we actually want:
> Fixes: c610d65c0ad0 ("powerpc/pseries: lift RTAS limit for hash")
>
> Which is what actually caused it to break and for the issue to present
> itself.

Thanks, I used that instead.

cheers
Michael Ellerman July 18, 2019, 1:56 p.m. UTC | #6
On Wed, 2019-07-10 at 05:20:18 UTC, Suraj Jitindar Singh wrote:
> The virtual real mode addressing (VRMA) mechanism is used when a
> partition is using HPT (Hash Page Table) translation and performs
> real mode accesses (MSR[IR|DR] = 0) in non-hypervisor mode. In this
> mode effective address bits 0:23 are treated as zero (i.e. the access
> is aliased to 0) and the access is performed using an implicit 1TB SLB
> entry.
> 
> The size of the RMA (Real Memory Area) is communicated to the guest as
> the size of the first memory region in the device tree. And because of
> the mechanism described above can be expected to not exceed 1TB. In the
> event that the host erroneously represents the RMA as being larger than
> 1TB, guest accesses in real mode to memory addresses above 1TB will be
> aliased down to below 1TB. This means that a memory access performed in
> real mode may differ to one performed in virtual mode for the same memory
> address, which would likely have unintended consequences.
> 
> To avoid this outcome have the guest explicitly limit the size of the
> RMA to the current maximum, which is 1TB. This means that even if the
> first memory block is larger than 1TB, only the first 1TB should be
> accessed in real mode.
> 
> Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
> Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/da0ef93310e67ae6902efded60b6724dab27a5d1

cheers

Patch
diff mbox series

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index 28ced26f2a00..4d0e2cce9cd5 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1901,11 +1901,19 @@  void hash__setup_initial_memory_limit(phys_addr_t first_memblock_base,
 	 *
 	 * For guests on platforms before POWER9, we clamp the it limit to 1G
 	 * to avoid some funky things such as RTAS bugs etc...
+	 * On POWER9 we limit to 1TB in case the host erroneously told us that
+	 * the RMA was >1TB. Effective address bits 0:23 are treated as zero
+	 * (meaning the access is aliased to zero i.e. addr = addr % 1TB)
+	 * for virtual real mode addressing and so it doesn't make sense to
+	 * have an area larger than 1TB as it can't be addressed.
 	 */
 	if (!early_cpu_has_feature(CPU_FTR_HVMODE)) {
 		ppc64_rma_size = first_memblock_size;
 		if (!early_cpu_has_feature(CPU_FTR_ARCH_300))
 			ppc64_rma_size = min_t(u64, ppc64_rma_size, 0x40000000);
+		else
+			ppc64_rma_size = min_t(u64, ppc64_rma_size,
+					       1UL << SID_SHIFT_1T);
 
 		/* Finally limit subsequent allocations */
 		memblock_set_current_limit(ppc64_rma_size);