
powerpc/powernv/prd: Validate whether address to be mapped is part of system RAM

Message ID 20191002074856.15014-1-hegdevasant@linux.vnet.ibm.com (mailing list archive)
State Rejected
Series powerpc/powernv/prd: Validate whether address to be mapped is part of system RAM

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success Successfully applied on branch next (6edfc6487b474fe01857dc3f1a9cd701bb9b21c8)
snowpatch_ozlabs/build-ppc64le success Build succeeded
snowpatch_ozlabs/build-ppc64be success Build succeeded
snowpatch_ozlabs/build-ppc64e success Build succeeded
snowpatch_ozlabs/build-pmac32 success Build succeeded
snowpatch_ozlabs/checkpatch warning total: 0 errors, 1 warnings, 0 checks, 28 lines checked

Commit Message

Vasant Hegde Oct. 2, 2019, 7:48 a.m. UTC
Add a check to validate whether the requested page is part of system RAM
before mmap(), and error out if it is not part of system RAM.

cat /proc/iomem:
-----------------
00000000-27ffffffff : System RAM
2800000000-2fffffffff : namespace0.0
200000000000-2027ffffffff : System RAM
202800000000-202fffffffff : namespace1.0
6000000000000-6003fbfffffff : pciex@600c3c0000000
6004000000000-6007f7fffffff : pciex@600c3c0100000
....
....

Sample dmesg output with this fix:
----------------------------------
[  160.371911] opal-prd: mmap: Requested page is not part of system RAM (addr : 0x0000202ffcfc0000, size : 0x0000000000570000)
[  160.665366] opal-prd: mmap: Requested page is not part of system RAM (addr : 0x0000202ffcfc0000, size : 0x0000000000570000)
[  160.914627] opal-prd: mmap: Requested page is not part of system RAM (addr : 0x0000202ffcfc0000, size : 0x0000000000570000)
[  161.165253] opal-prd: mmap: Requested page is not part of system RAM (addr : 0x0000202ffcfc0000, size : 0x0000000000570000)
[  161.414604] opal-prd: mmap: Requested page is not part of system RAM (addr : 0x0000202ffcfc0000, size : 0x0000000000570000)

CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC: Jeremy Kerr <jk@ozlabs.org>
CC: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/opal-prd.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

Comments

Vaidyanathan Srinivasan Oct. 2, 2019, 8:48 a.m. UTC | #1
* Vasant Hegde <hegdevasant@linux.vnet.ibm.com> [2019-10-02 13:18:56]:

> Add a check to validate whether the requested page is part of system RAM
> before mmap(), and error out if it is not part of system RAM.
> 
> cat /proc/iomem:
> -----------------
> 00000000-27ffffffff : System RAM
> 2800000000-2fffffffff : namespace0.0
> 200000000000-2027ffffffff : System RAM
> 202800000000-202fffffffff : namespace1.0
> 6000000000000-6003fbfffffff : pciex@600c3c0000000
> 6004000000000-6007f7fffffff : pciex@600c3c0100000
> ....
> ....
> 
> Sample dmesg output with this fix:
> ----------------------------------
> [  160.371911] opal-prd: mmap: Requested page is not part of system RAM (addr : 0x0000202ffcfc0000, size : 0x0000000000570000)
> [  160.665366] opal-prd: mmap: Requested page is not part of system RAM (addr : 0x0000202ffcfc0000, size : 0x0000000000570000)
> [  160.914627] opal-prd: mmap: Requested page is not part of system RAM (addr : 0x0000202ffcfc0000, size : 0x0000000000570000)
> [  161.165253] opal-prd: mmap: Requested page is not part of system RAM (addr : 0x0000202ffcfc0000, size : 0x0000000000570000)
> [  161.414604] opal-prd: mmap: Requested page is not part of system RAM (addr : 0x0000202ffcfc0000, size : 0x0000000000570000)

Thanks Vasant for the patch.  HBRT pages landing in NVDIMM area caused
the opal-prd failure.  Good debug and root-cause.

Thanks to Aneesh for the tip to look into ATT bits of the PTE mapping.

> CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> CC: Jeremy Kerr <jk@ozlabs.org>
> CC: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>

> ---
>  arch/powerpc/platforms/powernv/opal-prd.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/opal-prd.c b/arch/powerpc/platforms/powernv/opal-prd.c
> index 45f4223a790f..0f88752302a2 100644
> --- a/arch/powerpc/platforms/powernv/opal-prd.c
> +++ b/arch/powerpc/platforms/powernv/opal-prd.c
> @@ -3,7 +3,7 @@
>   * OPAL Runtime Diagnostics interface driver
>   * Supported on POWERNV platform
>   *
> - * Copyright IBM Corporation 2015
> + * Copyright IBM Corporation 2015-2019
>   */
> 
>  #define pr_fmt(fmt) "opal-prd: " fmt
> @@ -47,6 +47,20 @@ static bool opal_prd_range_is_valid(uint64_t addr, uint64_t size)
>  	if (addr + size < addr)
>  		return false;
> 
> +	/*
> +	 * Check if the region is in system RAM and error out if the address
> +	 * belongs to special devices like NVDIMM. The phys_mem_access_prot()
> +	 * routine will change the ATT bits to non-cacheable if the page is not
> +	 * in RAM, causing HBRT to fail to fetch and execute from the mapped
> +	 * memory. Page permissions can be left at the default since all
> +	 * firmware components should be in system RAM only.
> +	 */
> +	if (!page_is_ram(addr >> PAGE_SHIFT)) {
> +		pr_warn("mmap: Requested page is not part of system RAM "
> +			"(addr : 0x%016llx, size : 0x%016llx)\n", addr, size);
> +		return false;
> +	}
> +
>  	parent = of_find_node_by_path("/reserved-memory");
>  	if (!parent)
>  		return false;
> -- 
> 2.21.0
>
Jeremy Kerr Oct. 3, 2019, 1:47 a.m. UTC | #2
Hi Vasant,

> Add a check to validate whether the requested page is part of system RAM
> before mmap(), and error out if it is not part of system RAM.

opal_prd_range_is_valid() will return false if the reserved memory range
does not have an ibm,prd-label property. If you're getting invalid
memory mapped through the PRD interface, that means the device tree is
incorrectly describing those ranges.

Or am I missing something?

Cheers,


Jeremy
Vasant Hegde Oct. 3, 2019, 4:51 a.m. UTC | #3
On 10/3/19 7:17 AM, Jeremy Kerr wrote:
> Hi Vasant,

Jeremy,

> 
>> Add a check to validate whether the requested page is part of system RAM
>> before mmap(), and error out if it is not part of system RAM.
> 
> opal_prd_range_is_valid() will return false if the reserved memory range
> does not have an ibm,prd-label property. If you're getting invalid
> memory mapped through the PRD interface, that means the device tree is
> incorrectly describing those ranges.

Correct. We will have the `ibm,prd-label` property; that's not the issue. The
issue here is that HBRT is loaded into NVDIMM memory.


Copy-pasting Vaidy's explanation from internal bugzilla here:

------------------
The root cause of the problem seems to be HBRT using NVDIMM areas/addresses for 
firmware operation.

Linux maps the required addresses for HBRT to read, write and execute. This all 
works fine for normal RAM addresses.  However, NVDIMM is not normal RAM; it is 
device memory which can be used as RAM or through other layers and subsystems.

The Linux kernel memory manager sets the page table attributes to 0b10 (non-idempotent 
I/O) instead of the normal-RAM value 0b00, since this is a special type of device 
memory initialized and used by a firmware device driver.  This prevented instruction 
execution from the mapped page.  Since instructions could not be fetched, the 
opal-prd application could not complete init and start.

------------------
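
For reference, the mmap path in the opal-prd driver looks roughly like this
(a paraphrased sketch for context, not a verbatim copy of the source):

static int opal_prd_mmap(struct file *file, struct vm_area_struct *vma)
{
	size_t addr, size;
	pgprot_t page_prot;

	addr = vma->vm_pgoff << PAGE_SHIFT;
	size = vma->vm_end - vma->vm_start;

	/* ensure we're mapping within one of the allowable ranges */
	if (!opal_prd_range_is_valid(addr, size))
		return -EINVAL;

	/* a !page_is_ram() pfn picks up a non-cacheable, guarded mapping
	 * here (ATT=0b10), which HBRT cannot execute from */
	page_prot = phys_mem_access_prot(file, vma->vm_pgoff,
					 size, vma->vm_page_prot);

	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, size,
			       page_prot);
}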

Hostboot should detect NVDIMM areas and avoid using those areas for any firmware 
purposes including HBRT. Hostboot will fix this issue.

In this patch we are adding an additional check to make sure mmap() fails 
gracefully and we log a proper error.  That way opal-prd will fail to start 
instead of looping forever.

-Vasant
Jeremy Kerr Oct. 3, 2019, 4:56 a.m. UTC | #4
Hi Vasant,

> Correct. We will have the `ibm,prd-label` property; that's not the issue.

It sure sounds like the issue - someone has represented a range that
should be mapped by HBRT, but isn't appropriate for mapping by HBRT.

> The issue here is that HBRT is loaded into NVDIMM memory.

OK. How about we just don't do that?

It sounds like we're just trying to work around an invalid
representation of the mappings.

Cheers,


Jeremy
Vasant Hegde Oct. 3, 2019, 5:31 a.m. UTC | #5
On 10/3/19 10:26 AM, Jeremy Kerr wrote:
> Hi Vasant,
> 
>> Correct. We will have the `ibm,prd-label` property; that's not the issue.
> 
> It sure sounds like the issue - someone has represented a range that
> should be mapped by HBRT, but isn't appropriate for mapping by HBRT.
> 
>> The issue here is that HBRT is loaded into NVDIMM memory.
> 
> OK. How about we just don't do that?

Yes. Hostboot will fix that. It will make sure that HBRT is loaded into regular 
memory.

> 
> It sounds like we're just trying to work around an invalid
> representation of the mappings.

It's not a workaround; it's an additional check, so that we don't mmap() if
HBRT is not in system RAM and we throw a proper error.  That way debugging
becomes easier.

-Vasant
Jeremy Kerr Oct. 3, 2019, 7:07 a.m. UTC | #6
Hi Vasant,

> > OK. How about we just don't do that?
> 
> Yes. Hostboot will fix that. It will make sure that HBRT is loaded
> into regular memory.

Super.

> > It sounds like we're just trying to work around an invalid
> > representation of the mappings.
> 
> It's not a workaround; it's an additional check.

The issue is that you've added a check for stuff that the kernel doesn't
(and shouldn't) know about, and assumed that the kernel knows better
than the device tree. It may be the correct thing to do in this case,
but we can't guarantee that it's always correct.

For example, what if there is a future HBRT range that is fine to go
into NVRAM? With this change, that's not possible.

Or, what if there's a range of address-space that isn't backed by system
RAM (say, some MMIO-mapped hardware) that we want to expose to a future
HBRT implementation? This change will block that.

The kernel doesn't know what is and is not valid for a HBRT mapping, so
it has no reason to override what's specified in the device tree. We've
designed this so that the kernel provides the mechanism for mapping
pages, and not the policy of which pages can be mapped.

Regards,


Jeremy
Vaidyanathan Srinivasan Oct. 3, 2019, 10:29 a.m. UTC | #7
* Jeremy Kerr <jk@ozlabs.org> [2019-10-03 15:07:24]:

> Hi Vasant,
> 
> > > OK. How about we just don't do that?
> > 
> > Yes. Hostboot will fix that. It will make sure that HBRT is loaded
> > into regular memory.
> 
> Super.
> 
> > > It sounds like we're just trying to work around an invalid
> > > representation of the mappings.
> > 
> > It's not a workaround; it's an additional check.
> 
> The issue is that you've added a check for stuff that the kernel doesn't
> (and shouldn't) know about, and assumed that the kernel knows better
> than the device tree. It may be the correct thing to do in this case,
> but we can't guarantee that it's always correct.

Good point on the policy ownership.

> For example, what if there is a future HBRT range that is fine to go
> into NVRAM? With this change, that's not possible.

The current topic is who owns setting up the ATT bits for that piece
of memory.  It is the kernel today.  The kernel decides to set this up as
normal memory or I/O memory and sets the bits in the page table entry.

> Or, what if there's a range of address-space that isn't backed by system
> RAM (say, some MMIO-mapped hardware) that we want to expose to a future
> HBRT implementation? This change will block that.
> 
> The kernel doesn't know what is and is not valid for a HBRT mapping, so
> it has no reason to override what's specified in the device tree. We've
> designed this so that the kernel provides the mechanism for mapping
> pages, and not the policy of which pages can be mapped.

The features altered are cache inhibition and guarding, which affect the
ability to fetch instructions.  If we allow HBRT to reside in I/O
memory, then we need to tell the kernel that it is OK to allow caching and
instruction execution in that region and change the ATT bits
accordingly.

This patch does not block a working function, but actually makes
debugging a failed case easier.  The failing scenario without this
check is such that HBRT cannot fetch from the region of memory and
loops in minor page faults doing nothing.  

As Vasant mentioned, the hostboot team will add code to relocate HBRT
to the right place.  Addressing your concern: if we end up allowing
HBRT in a non-system-RAM area, we need to add some more flags in the device
tree to instruct the driver to force-change the page protection bits
with page_prot = pgprot_cached(page_prot);.  It does not make sense to
keep the region cache-inhibited and just clear the Guard bit
(ATT=0b10, non-idempotent I/O), so we should force it to normal memory,
ATT=0b00.

In summary, this check does not block any working function today.  We
fail to execute HBRT code after we successfully mmap() the memory
anyway.

When we need to move firmware components to other types of memory, we
should do a future patch to indicate in the device tree that the region is
not system RAM, so that the kernel changes the PTE permissions and then sets
up the mmap().  Or HBRT really has a use for NVRAM, in which case we
explicitly need to indicate (via the device tree) the required attributes
for this mapping.
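
For readers following the 0b10/0b00 values above: these refer to the 2-bit
attribute (Att) field of the radix PTE.  A rough cheat sheet, for orientation
only (the mapping to the Linux pgprot helpers is approximate):

/*
 * Radix PTE Att field encodings (ISA 3.0):
 *
 *   0b00  normal, cacheable memory     - what HBRT needs in order to execute
 *   0b01  SAO (strong access ordering)
 *   0b10  non-idempotent I/O           - what the pgprot_noncached() default
 *                                        gives !page_is_ram() pages today
 *   0b11  tolerant I/O
 *
 * pgprot_cached() takes a mapping back to the normal-memory encoding, which
 * is the override discussed above.
 */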

--Vaidy
Jeremy Kerr Oct. 4, 2019, 3:27 a.m. UTC | #8
Hi Vaidy,

> The current topic is who owns setting up the ATT bits for that piece
> of memory.  It is the kernel today.  Kernel decides to set this up as
> normal memory or I/O memory and sets the bits in page table entry.
> 
>> Or, what if there's a range of address-space that isn't backed by system
>> RAM (say, some MMIO-mapped hardware) that we want to expose to a future
>> HBRT implementation? This change will block that.
>> 
>> The kernel doesn't know what is and is not valid for a HBRT mapping, so
>> it has no reason to override what's specified in the device tree. We've
>> designed this so that the kernel provides the mechanism for mapping
>> pages, and not the policy of which pages can be mapped.
> 
> The features altered are cache inhibition and guarding, which affect the
> ability to fetch instructions.  If we allow HBRT to reside in I/O
> memory, then we need to tell the kernel that it is OK to allow caching and
> instruction execution in that region and change the ATT bits
> accordingly.

But this isn't only about the HBRT range itself (i.e., the memory 
containing the HBRT binary). Everything that HBRT needs to map will come 
through this path. We may not need to fetch instructions from those ranges.

> This patch does not block a working function, but actually makes
> debugging a failed case easier.  The failing scenario without this
> check is such that HBRT cannot fetch from the region of memory and
> loops in minor page faults doing nothing.

Yep, that's not great, but the alternative means applying this kernel 
policy, which we can't guarantee is correct.

That is, unless the page protection bits mean that this won't work 
anyway, but we can probably fix that without a kernel policy, by 
applying the appropriate pgprot_t, perhaps.

> As Vasant mentioned, the hostboot team will add code to relocate HBRT
> to the right place.  Addressing your concern: if we end up allowing
> HBRT in a non-system-RAM area

Not just HBRT, but anything that HBRT maps too.

> we need to add some more flags in device
> tree to instruct the driver to force change the page protection bits
> as page_prot = pgprot_cached(page_prot);

Doesn't phys_mem_access_prot() handle that for us? Or do I have my 
_noncached/_cached logic inverted?

Cheers,


Jeremy
Vaidyanathan Srinivasan Oct. 5, 2019, 5:08 a.m. UTC | #9
* Jeremy Kerr <jk@ozlabs.org> [2019-10-04 11:27:46]:

> Hi Vaidy,
> 
> > The current topic is who owns setting up the ATT bits for that piece
> > of memory.  It is the kernel today.  Kernel decides to set this up as
> > normal memory or I/O memory and sets the bits in page table entry.
> > 
> > > Or, what if there's a range of address-space that isn't backed by system
> > > RAM (say, some MMIO-mapped hardware) that we want to expose to a future
> > > HBRT implementation? This change will block that.
> > > 
> > > The kernel doesn't know what is and is not valid for a HBRT mapping, so
> > > it has no reason to override what's specified in the device tree. We've
> > > designed this so that the kernel provides the mechanism for mapping
> > > pages, and not the policy of which pages can be mapped.
> > 
> > The features altered are cache inhibition and guarding, which affect the
> > ability to fetch instructions.  If we allow HBRT to reside in I/O
> > memory, then we need to tell the kernel that it is OK to allow caching and
> > instruction execution in that region and change the ATT bits
> > accordingly.
> 
> But this isn't only about the HBRT range itself (i.e., the memory containing
> the HBRT binary). Everything that HBRT needs to map will come through this
> path. We may not need to fetch instructions from those ranges.

Correct. Only HBRT code cannot be fetched; data can still be mapped and
used.  However, a cache-inhibited mapping is unnecessary unless explicitly
requested by firmware.

> > This patch does not block a working function, but actually makes
> > debugging a failed case easier.  The failing scenario without this
> > check is such that HBRT cannot fetch from the region of memory and
> > loops in minor page faults doing nothing.
> 
> Yep, that's not great, but the alternative means applying this kernel
> policy, which we can't guarantee is correct.
> 
> That is, unless the page protection bits mean that this won't work anyway,
> but we can probably fix that without a kernel policy, by applying the
> appropriate pgprot_t, perhaps.

Currently, if we allow the mapping, it won't work at all for code, but it
can work for data pages.  However, a cache-inhibited mapping will cause
side effects apart from poor performance.  We should not do this
unless firmware has a reason to request it.  We should pass
additional info from OPAL to the driver to set/override the permissions.

> > As Vasant mentioned, the hostboot team will add code to relocate HBRT
> > to the right place.  Addressing your concern: if we end up allowing
> > HBRT in a non-system-RAM area
> 
> Not just HBRT, but anything that HBRT maps too.

Correct.

> > we need to add some more flags in device
> > tree to instruct the driver to force change the page protection bits
> > as page_prot = pgprot_cached(page_prot);
> 
> Doesn't phys_mem_access_prot() handle that for us? Or do I have my
> _noncached/_cached logic inverted?

The kernel permission policy is implemented via phys_mem_access_prot().

pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
			      unsigned long size, pgprot_t vma_prot)
{
	if (ppc_md.phys_mem_access_prot)
		return ppc_md.phys_mem_access_prot(file, pfn, size, vma_prot);

	if (!page_is_ram(pfn))
		vma_prot = pgprot_noncached(vma_prot);

	return vma_prot;
}

The problem for the firmware (HBRT) mapping is precisely this policy, which
forces a different default, pgprot_noncached(), if the memory page is
not in system RAM.

The !page_is_ram(pfn) check is how the default changes; it is
what changes the pgprot for us now for NVRAM.  We can override this
permission in the opal-prd driver since we know the nature of the use
case, but I would expect OPAL to indicate that we should override, and
then proceed to do it in the driver.

If we really architect a usage model where firmware uses NVRAM, then
it should be in its own namespace and not just pick an arbitrary address.
Also, as per our opal-prd design, other than firmware memory, all
device access should come via a host-kernel driver and not be mapped
directly to the application.
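
To make that concrete, here is a purely hypothetical sketch of what such an
OPAL/device-tree indicated override could look like in the driver's mmap
path (the property name and the node plumbing are invented for illustration;
nothing like this exists today):

	page_prot = phys_mem_access_prot(file, vma->vm_pgoff,
					 size, vma->vm_page_prot);

	/* hypothetical: firmware marks the reserved-memory range as safe to
	 * map cacheable even though it is not system RAM */
	if (of_property_read_bool(node, "ibm,prd-mmap-cacheable"))
		page_prot = pgprot_cached(page_prot);

	rc = remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, size,
			     page_prot);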

--Vaidy

Patch

diff --git a/arch/powerpc/platforms/powernv/opal-prd.c b/arch/powerpc/platforms/powernv/opal-prd.c
index 45f4223a790f..0f88752302a2 100644
--- a/arch/powerpc/platforms/powernv/opal-prd.c
+++ b/arch/powerpc/platforms/powernv/opal-prd.c
@@ -3,7 +3,7 @@ 
  * OPAL Runtime Diagnostics interface driver
  * Supported on POWERNV platform
  *
- * Copyright IBM Corporation 2015
+ * Copyright IBM Corporation 2015-2019
  */
 
 #define pr_fmt(fmt) "opal-prd: " fmt
@@ -47,6 +47,20 @@  static bool opal_prd_range_is_valid(uint64_t addr, uint64_t size)
 	if (addr + size < addr)
 		return false;
 
+	/*
+	 * Check if the region is in system RAM and error out if the address
+	 * belongs to special devices like NVDIMM. The phys_mem_access_prot()
+	 * routine will change the ATT bits to non-cacheable if the page is not
+	 * in RAM, causing HBRT to fail to fetch and execute from the mapped
+	 * memory. Page permissions can be left at the default since all
+	 * firmware components should be in system RAM only.
+	 */
+	if (!page_is_ram(addr >> PAGE_SHIFT)) {
+		pr_warn("mmap: Requested page is not part of system RAM "
+			"(addr : 0x%016llx, size : 0x%016llx)\n", addr, size);
+		return false;
+	}
+
 	parent = of_find_node_by_path("/reserved-memory");
 	if (!parent)
 		return false;