[1/2] libnvdimm/altmap: Track namespace boundaries in altmap
diff mbox series

Message ID 20190910062826.10041-1-aneesh.kumar@linux.ibm.com
State Changes Requested
Headers show
Series
  • [1/2] libnvdimm/altmap: Track namespace boundaries in altmap
Related show

Checks

Context Check Description
snowpatch_ozlabs/checkpatch warning total: 0 errors, 1 warnings, 1 checks, 47 lines checked
snowpatch_ozlabs/apply_patch success Successfully applied on branch next (c317052c95bef1f977b023158e5aa929215f443d)

Commit Message

Aneesh Kumar K.V Sept. 10, 2019, 6:28 a.m. UTC
With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
area. Some architectures map the memmap area with large page size. On
architectures like ppc64, 16MB page for memap mapping can map 262144 pfns.
This maps a namespace size of 16G.

When populating memmap region with 16MB page from the device area,
make sure the allocated space is not used to map resources outside this
namespace. Such usage of device area will prevent a namespace destroy.

Add resource end pnf in altmap and use that to check if the memmap area
allocation can map pfn outside the namespace. On ppc64 in such case we fallback
to allocation from memory.

This fix kernel crash reported below:

[  132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133 devm_memremap_pages_release+0x2d8/0x2e0
[  133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000
[  133.464760] Faulting instruction address: 0xc00000000007580c
[  133.464766] Oops: Kernel access of bad area, sig: 11 [#1]
[  133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
.....
[  133.464901] NIP [c00000000007580c] vmemmap_free+0x2ac/0x3d0
[  133.464906] LR [c0000000000757f8] vmemmap_free+0x298/0x3d0
[  133.464910] Call Trace:
[  133.464914] [c000007cbfd0f7b0] [c0000000000757f8] vmemmap_free+0x298/0x3d0 (unreliable)
[  133.464921] [c000007cbfd0f8d0] [c000000000370a44] section_deactivate+0x1a4/0x240
[  133.464928] [c000007cbfd0f980] [c000000000386270] __remove_pages+0x3a0/0x590
[  133.464935] [c000007cbfd0fa50] [c000000000074158] arch_remove_memory+0x88/0x160
[  133.464942] [c000007cbfd0fae0] [c0000000003be8c0] devm_memremap_pages_release+0x150/0x2e0
[  133.464949] [c000007cbfd0fb70] [c000000000738ea0] devm_action_release+0x30/0x50
[  133.464955] [c000007cbfd0fb90] [c00000000073a5a4] release_nodes+0x344/0x400
[  133.464961] [c000007cbfd0fc40] [c00000000073378c] device_release_driver_internal+0x15c/0x250
[  133.464968] [c000007cbfd0fc80] [c00000000072fd14] unbind_store+0x104/0x110
[  133.464973] [c000007cbfd0fcd0] [c00000000072ee24] drv_attr_store+0x44/0x70
[  133.464981] [c000007cbfd0fcf0] [c0000000004a32bc] sysfs_kf_write+0x6c/0xa0
[  133.464987] [c000007cbfd0fd10] [c0000000004a1dfc] kernfs_fop_write+0x17c/0x250
[  133.464993] [c000007cbfd0fd60] [c0000000003c348c] __vfs_write+0x3c/0x70
[  133.464999] [c000007cbfd0fd80] [c0000000003c75d0] vfs_write+0xd0/0x250

Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/init_64.c | 17 ++++++++++++++++-
 drivers/nvdimm/pfn_devs.c |  2 ++
 include/linux/memremap.h  |  1 +
 3 files changed, 19 insertions(+), 1 deletion(-)

Comments

Pankaj Gupta Sept. 10, 2019, 7:40 a.m. UTC | #1
> 
> With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
> area. Some architectures map the memmap area with large page size. On
> architectures like ppc64, 16MB page for memap mapping can map 262144 pfns.
> This maps a namespace size of 16G.
> 
> When populating memmap region with 16MB page from the device area,
> make sure the allocated space is not used to map resources outside this
> namespace. Such usage of device area will prevent a namespace destroy.
> 
> Add resource end pnf in altmap and use that to check if the memmap area
> allocation can map pfn outside the namespace. On ppc64 in such case we
> fallback
> to allocation from memory.
> 
> This fix kernel crash reported below:
> 
> [  132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133
> devm_memremap_pages_release+0x2d8/0x2e0
> [  133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000
> [  133.464760] Faulting instruction address: 0xc00000000007580c
> [  133.464766] Oops: Kernel access of bad area, sig: 11 [#1]
> [  133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> .....
> [  133.464901] NIP [c00000000007580c] vmemmap_free+0x2ac/0x3d0
> [  133.464906] LR [c0000000000757f8] vmemmap_free+0x298/0x3d0
> [  133.464910] Call Trace:
> [  133.464914] [c000007cbfd0f7b0] [c0000000000757f8] vmemmap_free+0x298/0x3d0
> (unreliable)
> [  133.464921] [c000007cbfd0f8d0] [c000000000370a44]
> section_deactivate+0x1a4/0x240
> [  133.464928] [c000007cbfd0f980] [c000000000386270]
> __remove_pages+0x3a0/0x590
> [  133.464935] [c000007cbfd0fa50] [c000000000074158]
> arch_remove_memory+0x88/0x160
> [  133.464942] [c000007cbfd0fae0] [c0000000003be8c0]
> devm_memremap_pages_release+0x150/0x2e0
> [  133.464949] [c000007cbfd0fb70] [c000000000738ea0]
> devm_action_release+0x30/0x50
> [  133.464955] [c000007cbfd0fb90] [c00000000073a5a4]
> release_nodes+0x344/0x400
> [  133.464961] [c000007cbfd0fc40] [c00000000073378c]
> device_release_driver_internal+0x15c/0x250
> [  133.464968] [c000007cbfd0fc80] [c00000000072fd14] unbind_store+0x104/0x110
> [  133.464973] [c000007cbfd0fcd0] [c00000000072ee24] drv_attr_store+0x44/0x70
> [  133.464981] [c000007cbfd0fcf0] [c0000000004a32bc] sysfs_kf_write+0x6c/0xa0
> [  133.464987] [c000007cbfd0fd10] [c0000000004a1dfc]
> kernfs_fop_write+0x17c/0x250
> [  133.464993] [c000007cbfd0fd60] [c0000000003c348c] __vfs_write+0x3c/0x70
> [  133.464999] [c000007cbfd0fd80] [c0000000003c75d0] vfs_write+0xd0/0x250
> 
> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  arch/powerpc/mm/init_64.c | 17 ++++++++++++++++-
>  drivers/nvdimm/pfn_devs.c |  2 ++
>  include/linux/memremap.h  |  1 +
>  3 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index a44f6281ca3a..4e08246acd79 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -172,6 +172,21 @@ static __meminit void vmemmap_list_populate(unsigned
> long phys,
>  	vmemmap_list = vmem_back;
>  }
>  
> +static bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long
> start,
> +				unsigned long page_size)
> +{
> +	unsigned long nr_pfn = page_size / sizeof(struct page);
> +	unsigned long start_pfn = page_to_pfn((struct page *)start);
> +
> +	if ((start_pfn + nr_pfn) > altmap->end_pfn)
> +		return true;
> +
> +	if (start_pfn < altmap->base_pfn)
> +		return true;
> +
> +	return false;
> +}
> +
>  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int
>  node,
>  		struct vmem_altmap *altmap)
>  {
> @@ -194,7 +209,7 @@ int __meminit vmemmap_populate(unsigned long start,
> unsigned long end, int node,
>  		 * fail due to alignment issues when using 16MB hugepages, so
>  		 * fall back to system memory if the altmap allocation fail.
>  		 */
> -		if (altmap) {
> +		if (altmap && !altmap_cross_boundary(altmap, start, page_size)) {
>  			p = altmap_alloc_block_buf(page_size, altmap);
>  			if (!p)
>  				pr_debug("altmap block allocation failed, falling back to system
>  				memory");
> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> index 3e7b11cf1aae..a616d69c8224 100644
> --- a/drivers/nvdimm/pfn_devs.c
> +++ b/drivers/nvdimm/pfn_devs.c
> @@ -618,9 +618,11 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn,
> struct dev_pagemap *pgmap)
>  	struct nd_namespace_common *ndns = nd_pfn->ndns;
>  	struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
>  	resource_size_t base = nsio->res.start + start_pad;
> +	resource_size_t end = nsio->res.end - end_trunc;
>  	struct vmem_altmap __altmap = {
>  		.base_pfn = init_altmap_base(base),
>  		.reserve = init_altmap_reserve(base),
> +		.end_pfn = PHYS_PFN(end),
>  	};
>  
>  	memcpy(res, &nsio->res, sizeof(*res));
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index f8a5b2a19945..c70996fe48c8 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -17,6 +17,7 @@ struct device;
>   */
>  struct vmem_altmap {
>  	const unsigned long base_pfn;
> +	const unsigned long end_pfn;
>  	const unsigned long reserve;
>  	unsigned long free;
>  	unsigned long align;
> --
> 2.21.0

This patch looks good to me. It helps to prevent
namespace access across boundaries for altmap
hugepage allocation.

Reviewed-by: Pankaj Gupta <pagupta@redhat.com>

Thanks,
Pankaj
> 
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
>
Dan Williams Sept. 10, 2019, 8:10 a.m. UTC | #2
On Mon, Sep 9, 2019 at 11:29 PM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
> area. Some architectures map the memmap area with large page size. On
> architectures like ppc64, 16MB page for memap mapping can map 262144 pfns.
> This maps a namespace size of 16G.
>
> When populating memmap region with 16MB page from the device area,
> make sure the allocated space is not used to map resources outside this
> namespace. Such usage of device area will prevent a namespace destroy.
>
> Add resource end pnf in altmap and use that to check if the memmap area
> allocation can map pfn outside the namespace. On ppc64 in such case we fallback
> to allocation from memory.

Shouldn't this instead be comprehended by nd_pfn_init() to increase
the reservation size so that it fits in the alignment? It may not
always be possible to fall back to allocation from memory for
extremely large pmem devices. I.e. at 64GB of memmap per 1TB of pmem
there may not be enough DRAM to store the memmap.
Santosh Sivaraj Sept. 10, 2019, 8:29 a.m. UTC | #3
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
> area. Some architectures map the memmap area with large page size. On
> architectures like ppc64, 16MB page for memap mapping can map 262144 pfns.
> This maps a namespace size of 16G.
>
> When populating memmap region with 16MB page from the device area,
> make sure the allocated space is not used to map resources outside this
> namespace. Such usage of device area will prevent a namespace destroy.
>
> Add resource end pnf in altmap and use that to check if the memmap area
> allocation can map pfn outside the namespace. On ppc64 in such case we fallback
> to allocation from memory.
>
> This fix kernel crash reported below:
>
> [  132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133 devm_memremap_pages_release+0x2d8/0x2e0
> [  133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000
> [  133.464760] Faulting instruction address: 0xc00000000007580c
> [  133.464766] Oops: Kernel access of bad area, sig: 11 [#1]
> [  133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> .....
> [  133.464901] NIP [c00000000007580c] vmemmap_free+0x2ac/0x3d0
> [  133.464906] LR [c0000000000757f8] vmemmap_free+0x298/0x3d0
> [  133.464910] Call Trace:
> [  133.464914] [c000007cbfd0f7b0] [c0000000000757f8] vmemmap_free+0x298/0x3d0 (unreliable)
> [  133.464921] [c000007cbfd0f8d0] [c000000000370a44] section_deactivate+0x1a4/0x240
> [  133.464928] [c000007cbfd0f980] [c000000000386270] __remove_pages+0x3a0/0x590
> [  133.464935] [c000007cbfd0fa50] [c000000000074158] arch_remove_memory+0x88/0x160
> [  133.464942] [c000007cbfd0fae0] [c0000000003be8c0] devm_memremap_pages_release+0x150/0x2e0
> [  133.464949] [c000007cbfd0fb70] [c000000000738ea0] devm_action_release+0x30/0x50
> [  133.464955] [c000007cbfd0fb90] [c00000000073a5a4] release_nodes+0x344/0x400
> [  133.464961] [c000007cbfd0fc40] [c00000000073378c] device_release_driver_internal+0x15c/0x250
> [  133.464968] [c000007cbfd0fc80] [c00000000072fd14] unbind_store+0x104/0x110
> [  133.464973] [c000007cbfd0fcd0] [c00000000072ee24] drv_attr_store+0x44/0x70
> [  133.464981] [c000007cbfd0fcf0] [c0000000004a32bc] sysfs_kf_write+0x6c/0xa0
> [  133.464987] [c000007cbfd0fd10] [c0000000004a1dfc] kernfs_fop_write+0x17c/0x250
> [  133.464993] [c000007cbfd0fd60] [c0000000003c348c] __vfs_write+0x3c/0x70
> [  133.464999] [c000007cbfd0fd80] [c0000000003c75d0] vfs_write+0xd0/0x250
>
> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  arch/powerpc/mm/init_64.c | 17 ++++++++++++++++-
>  drivers/nvdimm/pfn_devs.c |  2 ++
>  include/linux/memremap.h  |  1 +
>  3 files changed, 19 insertions(+), 1 deletion(-)

Tested-by: Santosh Sivaraj <santosh@fossix.org>

>
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index a44f6281ca3a..4e08246acd79 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -172,6 +172,21 @@ static __meminit void vmemmap_list_populate(unsigned long phys,
>  	vmemmap_list = vmem_back;
>  }
>  
> +static bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
> +				unsigned long page_size)
> +{
> +	unsigned long nr_pfn = page_size / sizeof(struct page);
> +	unsigned long start_pfn = page_to_pfn((struct page *)start);
> +
> +	if ((start_pfn + nr_pfn) > altmap->end_pfn)
> +		return true;
> +
> +	if (start_pfn < altmap->base_pfn)
> +		return true;
> +
> +	return false;
> +}
> +
>  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>  		struct vmem_altmap *altmap)
>  {
> @@ -194,7 +209,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>  		 * fail due to alignment issues when using 16MB hugepages, so
>  		 * fall back to system memory if the altmap allocation fail.
>  		 */
> -		if (altmap) {
> +		if (altmap && !altmap_cross_boundary(altmap, start, page_size)) {
>  			p = altmap_alloc_block_buf(page_size, altmap);
>  			if (!p)
>  				pr_debug("altmap block allocation failed, falling back to system memory");
> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> index 3e7b11cf1aae..a616d69c8224 100644
> --- a/drivers/nvdimm/pfn_devs.c
> +++ b/drivers/nvdimm/pfn_devs.c
> @@ -618,9 +618,11 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
>  	struct nd_namespace_common *ndns = nd_pfn->ndns;
>  	struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
>  	resource_size_t base = nsio->res.start + start_pad;
> +	resource_size_t end = nsio->res.end - end_trunc;
>  	struct vmem_altmap __altmap = {
>  		.base_pfn = init_altmap_base(base),
>  		.reserve = init_altmap_reserve(base),
> +		.end_pfn = PHYS_PFN(end),
>  	};
>  
>  	memcpy(res, &nsio->res, sizeof(*res));
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index f8a5b2a19945..c70996fe48c8 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -17,6 +17,7 @@ struct device;
>   */
>  struct vmem_altmap {
>  	const unsigned long base_pfn;
> +	const unsigned long end_pfn;
>  	const unsigned long reserve;
>  	unsigned long free;
>  	unsigned long align;
> -- 
> 2.21.0
>
> _______________________________________________
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
Aneesh Kumar K.V Sept. 10, 2019, 8:30 a.m. UTC | #4
On 9/10/19 1:40 PM, Dan Williams wrote:
> On Mon, Sep 9, 2019 at 11:29 PM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
>> area. Some architectures map the memmap area with large page size. On
>> architectures like ppc64, 16MB page for memap mapping can map 262144 pfns.
>> This maps a namespace size of 16G.
>>
>> When populating memmap region with 16MB page from the device area,
>> make sure the allocated space is not used to map resources outside this
>> namespace. Such usage of device area will prevent a namespace destroy.
>>
>> Add resource end pnf in altmap and use that to check if the memmap area
>> allocation can map pfn outside the namespace. On ppc64 in such case we fallback
>> to allocation from memory.
> 
> Shouldn't this instead be comprehended by nd_pfn_init() to increase
> the reservation size so that it fits in the alignment? It may not
> always be possible to fall back to allocation from memory for
> extremely large pmem devices. I.e. at 64GB of memmap per 1TB of pmem
> there may not be enough DRAM to store the memmap.
> 

We do switch between DRAM and device for memmap allocation. ppc64 
vmemmap_populate  does

if (altmap && !altmap_cross_boundary(altmap, start, page_size)) {
	p = altmap_alloc_block_buf(page_size, altmap);
	if (!p)
		pr_debug("altmap block allocation failed, falling back to system memory");
	}
	if (!p)
		p = vmemmap_alloc_block_buf(page_size, node);
	

With that we should be using DRAM for the first and the last mapping, 
rest of the memmap should be backed by device.

-aneesh
Dan Williams Sept. 10, 2019, 9:08 a.m. UTC | #5
On Tue, Sep 10, 2019 at 1:31 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 9/10/19 1:40 PM, Dan Williams wrote:
> > On Mon, Sep 9, 2019 at 11:29 PM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:
> >>
> >> With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
> >> area. Some architectures map the memmap area with large page size. On
> >> architectures like ppc64, 16MB page for memap mapping can map 262144 pfns.
> >> This maps a namespace size of 16G.
> >>
> >> When populating memmap region with 16MB page from the device area,
> >> make sure the allocated space is not used to map resources outside this
> >> namespace. Such usage of device area will prevent a namespace destroy.
> >>
> >> Add resource end pnf in altmap and use that to check if the memmap area
> >> allocation can map pfn outside the namespace. On ppc64 in such case we fallback
> >> to allocation from memory.
> >
> > Shouldn't this instead be comprehended by nd_pfn_init() to increase
> > the reservation size so that it fits in the alignment? It may not
> > always be possible to fall back to allocation from memory for
> > extremely large pmem devices. I.e. at 64GB of memmap per 1TB of pmem
> > there may not be enough DRAM to store the memmap.
> >
>
> We do switch between DRAM and device for memmap allocation. ppc64
> vmemmap_populate  does
>
> if (altmap && !altmap_cross_boundary(altmap, start, page_size)) {
>         p = altmap_alloc_block_buf(page_size, altmap);
>         if (!p)
>                 pr_debug("altmap block allocation failed, falling back to system memory");
>         }
>         if (!p)
>                 p = vmemmap_alloc_block_buf(page_size, node);
>
>
> With that we should be using DRAM for the first and the last mapping,
> rest of the memmap should be backed by device.

Ah, ok, makes sense.
Johannes Thumshirn Sept. 11, 2019, 12:03 p.m. UTC | #6
Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Dan Williams Sept. 16, 2019, 5:58 p.m. UTC | #7
On Mon, Sep 9, 2019 at 11:29 PM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
> area. Some architectures map the memmap area with large page size. On
> architectures like ppc64, 16MB page for memap mapping can map 262144 pfns.
> This maps a namespace size of 16G.
>
> When populating memmap region with 16MB page from the device area,
> make sure the allocated space is not used to map resources outside this
> namespace. Such usage of device area will prevent a namespace destroy.
>
> Add resource end pnf in altmap and use that to check if the memmap area
> allocation can map pfn outside the namespace. On ppc64 in such case we fallback
> to allocation from memory.
>
> This fix kernel crash reported below:
>
> [  132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133 devm_memremap_pages_release+0x2d8/0x2e0
> [  133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000
> [  133.464760] Faulting instruction address: 0xc00000000007580c
> [  133.464766] Oops: Kernel access of bad area, sig: 11 [#1]
> [  133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> .....
> [  133.464901] NIP [c00000000007580c] vmemmap_free+0x2ac/0x3d0
> [  133.464906] LR [c0000000000757f8] vmemmap_free+0x298/0x3d0
> [  133.464910] Call Trace:
> [  133.464914] [c000007cbfd0f7b0] [c0000000000757f8] vmemmap_free+0x298/0x3d0 (unreliable)
> [  133.464921] [c000007cbfd0f8d0] [c000000000370a44] section_deactivate+0x1a4/0x240
> [  133.464928] [c000007cbfd0f980] [c000000000386270] __remove_pages+0x3a0/0x590
> [  133.464935] [c000007cbfd0fa50] [c000000000074158] arch_remove_memory+0x88/0x160
> [  133.464942] [c000007cbfd0fae0] [c0000000003be8c0] devm_memremap_pages_release+0x150/0x2e0
> [  133.464949] [c000007cbfd0fb70] [c000000000738ea0] devm_action_release+0x30/0x50
> [  133.464955] [c000007cbfd0fb90] [c00000000073a5a4] release_nodes+0x344/0x400
> [  133.464961] [c000007cbfd0fc40] [c00000000073378c] device_release_driver_internal+0x15c/0x250
> [  133.464968] [c000007cbfd0fc80] [c00000000072fd14] unbind_store+0x104/0x110
> [  133.464973] [c000007cbfd0fcd0] [c00000000072ee24] drv_attr_store+0x44/0x70
> [  133.464981] [c000007cbfd0fcf0] [c0000000004a32bc] sysfs_kf_write+0x6c/0xa0
> [  133.464987] [c000007cbfd0fd10] [c0000000004a1dfc] kernfs_fop_write+0x17c/0x250
> [  133.464993] [c000007cbfd0fd60] [c0000000003c348c] __vfs_write+0x3c/0x70
> [  133.464999] [c000007cbfd0fd80] [c0000000003c75d0] vfs_write+0xd0/0x250

Question, does this crash only happen when the namespace is not 16MB
aligned? In other words was this bug exposed by the subsection-hotplug
changes and should it contain Fixes: tag for those commits?
Aneesh Kumar K.V Sept. 17, 2019, 7:39 a.m. UTC | #8
On 9/16/19 11:28 PM, Dan Williams wrote:
> On Mon, Sep 9, 2019 at 11:29 PM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
>> area. Some architectures map the memmap area with large page size. On
>> architectures like ppc64, 16MB page for memap mapping can map 262144 pfns.
>> This maps a namespace size of 16G.
>>
>> When populating memmap region with 16MB page from the device area,
>> make sure the allocated space is not used to map resources outside this
>> namespace. Such usage of device area will prevent a namespace destroy.
>>
>> Add resource end pnf in altmap and use that to check if the memmap area
>> allocation can map pfn outside the namespace. On ppc64 in such case we fallback
>> to allocation from memory.
>>
>> This fix kernel crash reported below:
>>
>> [  132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133 devm_memremap_pages_release+0x2d8/0x2e0
>> [  133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000
>> [  133.464760] Faulting instruction address: 0xc00000000007580c
>> [  133.464766] Oops: Kernel access of bad area, sig: 11 [#1]
>> [  133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
>> .....
>> [  133.464901] NIP [c00000000007580c] vmemmap_free+0x2ac/0x3d0
>> [  133.464906] LR [c0000000000757f8] vmemmap_free+0x298/0x3d0
>> [  133.464910] Call Trace:
>> [  133.464914] [c000007cbfd0f7b0] [c0000000000757f8] vmemmap_free+0x298/0x3d0 (unreliable)
>> [  133.464921] [c000007cbfd0f8d0] [c000000000370a44] section_deactivate+0x1a4/0x240
>> [  133.464928] [c000007cbfd0f980] [c000000000386270] __remove_pages+0x3a0/0x590
>> [  133.464935] [c000007cbfd0fa50] [c000000000074158] arch_remove_memory+0x88/0x160
>> [  133.464942] [c000007cbfd0fae0] [c0000000003be8c0] devm_memremap_pages_release+0x150/0x2e0
>> [  133.464949] [c000007cbfd0fb70] [c000000000738ea0] devm_action_release+0x30/0x50
>> [  133.464955] [c000007cbfd0fb90] [c00000000073a5a4] release_nodes+0x344/0x400
>> [  133.464961] [c000007cbfd0fc40] [c00000000073378c] device_release_driver_internal+0x15c/0x250
>> [  133.464968] [c000007cbfd0fc80] [c00000000072fd14] unbind_store+0x104/0x110
>> [  133.464973] [c000007cbfd0fcd0] [c00000000072ee24] drv_attr_store+0x44/0x70
>> [  133.464981] [c000007cbfd0fcf0] [c0000000004a32bc] sysfs_kf_write+0x6c/0xa0
>> [  133.464987] [c000007cbfd0fd10] [c0000000004a1dfc] kernfs_fop_write+0x17c/0x250
>> [  133.464993] [c000007cbfd0fd60] [c0000000003c348c] __vfs_write+0x3c/0x70
>> [  133.464999] [c000007cbfd0fd80] [c0000000003c75d0] vfs_write+0xd0/0x250
> 
> Question, does this crash only happen when the namespace is not 16MB
> aligned? In other words was this bug exposed by the subsection-hotplug
> changes and should it contain Fixes: tag for those commits?
> 

We are able to hit this crash even with older kernels. This happens when 
we have multiple namespaces from the same region of size 26G. In that 
case we need to make sure we don't end up using altmap from one 
namespace for mapping vmemmap of the adjacent namespace.

Considering this impacts ppc64 and we got the ppc64 SCM support in 4.20. 
may be  we can do just
Cc: <stable@vger.kernel.org> # 4.20+

-aneesh
Dan Williams Sept. 17, 2019, 2:24 p.m. UTC | #9
On Tue, Sep 17, 2019 at 12:40 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 9/16/19 11:28 PM, Dan Williams wrote:
> > On Mon, Sep 9, 2019 at 11:29 PM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:
> >>
> >> With PFN_MODE_PMEM namespace, the memmap area is allocated from the device
> >> area. Some architectures map the memmap area with large page size. On
> >> architectures like ppc64, 16MB page for memap mapping can map 262144 pfns.
> >> This maps a namespace size of 16G.
> >>
> >> When populating memmap region with 16MB page from the device area,
> >> make sure the allocated space is not used to map resources outside this
> >> namespace. Such usage of device area will prevent a namespace destroy.
> >>
> >> Add resource end pnf in altmap and use that to check if the memmap area
> >> allocation can map pfn outside the namespace. On ppc64 in such case we fallback
> >> to allocation from memory.
> >>
> >> This fix kernel crash reported below:
> >>
> >> [  132.034989] WARNING: CPU: 13 PID: 13719 at mm/memremap.c:133 devm_memremap_pages_release+0x2d8/0x2e0
> >> [  133.464754] BUG: Unable to handle kernel data access at 0xc00c00010b204000
> >> [  133.464760] Faulting instruction address: 0xc00000000007580c
> >> [  133.464766] Oops: Kernel access of bad area, sig: 11 [#1]
> >> [  133.464771] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> >> .....
> >> [  133.464901] NIP [c00000000007580c] vmemmap_free+0x2ac/0x3d0
> >> [  133.464906] LR [c0000000000757f8] vmemmap_free+0x298/0x3d0
> >> [  133.464910] Call Trace:
> >> [  133.464914] [c000007cbfd0f7b0] [c0000000000757f8] vmemmap_free+0x298/0x3d0 (unreliable)
> >> [  133.464921] [c000007cbfd0f8d0] [c000000000370a44] section_deactivate+0x1a4/0x240
> >> [  133.464928] [c000007cbfd0f980] [c000000000386270] __remove_pages+0x3a0/0x590
> >> [  133.464935] [c000007cbfd0fa50] [c000000000074158] arch_remove_memory+0x88/0x160
> >> [  133.464942] [c000007cbfd0fae0] [c0000000003be8c0] devm_memremap_pages_release+0x150/0x2e0
> >> [  133.464949] [c000007cbfd0fb70] [c000000000738ea0] devm_action_release+0x30/0x50
> >> [  133.464955] [c000007cbfd0fb90] [c00000000073a5a4] release_nodes+0x344/0x400
> >> [  133.464961] [c000007cbfd0fc40] [c00000000073378c] device_release_driver_internal+0x15c/0x250
> >> [  133.464968] [c000007cbfd0fc80] [c00000000072fd14] unbind_store+0x104/0x110
> >> [  133.464973] [c000007cbfd0fcd0] [c00000000072ee24] drv_attr_store+0x44/0x70
> >> [  133.464981] [c000007cbfd0fcf0] [c0000000004a32bc] sysfs_kf_write+0x6c/0xa0
> >> [  133.464987] [c000007cbfd0fd10] [c0000000004a1dfc] kernfs_fop_write+0x17c/0x250
> >> [  133.464993] [c000007cbfd0fd60] [c0000000003c348c] __vfs_write+0x3c/0x70
> >> [  133.464999] [c000007cbfd0fd80] [c0000000003c75d0] vfs_write+0xd0/0x250
> >
> > Question, does this crash only happen when the namespace is not 16MB
> > aligned? In other words was this bug exposed by the subsection-hotplug
> > changes and should it contain Fixes: tag for those commits?
> >
>
> We are able to hit this crash even with older kernels. This happens when
> we have multiple namespaces from the same region of size 26G. In that
> case we need to make sure we don't end up using altmap from one
> namespace for mapping vmemmap of the adjacent namespace.
>
> Considering this impacts ppc64 and we got the ppc64 SCM support in 4.20.
> may be  we can do just
> Cc: <stable@vger.kernel.org> # 4.20+

Ok, thanks!

Patch
diff mbox series

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index a44f6281ca3a..4e08246acd79 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -172,6 +172,21 @@  static __meminit void vmemmap_list_populate(unsigned long phys,
 	vmemmap_list = vmem_back;
 }
 
+static bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
+				unsigned long page_size)
+{
+	unsigned long nr_pfn = page_size / sizeof(struct page);
+	unsigned long start_pfn = page_to_pfn((struct page *)start);
+
+	if ((start_pfn + nr_pfn) > altmap->end_pfn)
+		return true;
+
+	if (start_pfn < altmap->base_pfn)
+		return true;
+
+	return false;
+}
+
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 		struct vmem_altmap *altmap)
 {
@@ -194,7 +209,7 @@  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 		 * fail due to alignment issues when using 16MB hugepages, so
 		 * fall back to system memory if the altmap allocation fail.
 		 */
-		if (altmap) {
+		if (altmap && !altmap_cross_boundary(altmap, start, page_size)) {
 			p = altmap_alloc_block_buf(page_size, altmap);
 			if (!p)
 				pr_debug("altmap block allocation failed, falling back to system memory");
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 3e7b11cf1aae..a616d69c8224 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -618,9 +618,11 @@  static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
 	struct nd_namespace_common *ndns = nd_pfn->ndns;
 	struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
 	resource_size_t base = nsio->res.start + start_pad;
+	resource_size_t end = nsio->res.end - end_trunc;
 	struct vmem_altmap __altmap = {
 		.base_pfn = init_altmap_base(base),
 		.reserve = init_altmap_reserve(base),
+		.end_pfn = PHYS_PFN(end),
 	};
 
 	memcpy(res, &nsio->res, sizeof(*res));
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index f8a5b2a19945..c70996fe48c8 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -17,6 +17,7 @@  struct device;
  */
 struct vmem_altmap {
 	const unsigned long base_pfn;
+	const unsigned long end_pfn;
 	const unsigned long reserve;
 	unsigned long free;
 	unsigned long align;