diff mbox

properly reserve in bootmem the lmb reserved regions that cross numa nodes

Message ID 48E23D6C.4030406@linux.vnet.ibm.com (mailing list archive)
State Superseded, archived
Headers show

Commit Message

Jon Tollefson Sept. 30, 2008, 2:53 p.m. UTC
If there are multiple reserved memory blocks via lmb_reserve() that are 
contiguous addresses and on different numa nodes we are losing track of which 
address ranges to reserve in bootmem on which node.  I discovered this 
when I only recently got to try 16GB huge pages on a system with more 
then 2 nodes.

When scanning the device tree in early boot we call lmb_reserve() with 
the addresses of the 16G pages that we find so that the memory doesn't 
get used for something else.  For example the addresses for the pages 
could be 4000000000, 4400000000, 4800000000, 4C00000000, etc - 8 pages, 
one on each of eight nodes.  In the lmb after all the pages have been 
reserved it will look something like the following:

lmb_dump_all:
    memory.cnt            = 0x2
    memory.size           = 0x3e80000000
    memory.region[0x0].base       = 0x0
                      .size     = 0x1e80000000
    memory.region[0x1].base       = 0x4000000000
                      .size     = 0x2000000000
    reserved.cnt          = 0x5
    reserved.size         = 0x3e80000000
    reserved.region[0x0].base       = 0x0
                      .size     = 0x7b5000
    reserved.region[0x1].base       = 0x2a00000
                      .size     = 0x78c000
    reserved.region[0x2].base       = 0x328c000
                      .size     = 0x43000
    reserved.region[0x3].base       = 0xf4e8000
                      .size     = 0xb18000
    reserved.region[0x4].base       = 0x4000000000
                      .size     = 0x2000000000


The reserved.region[0x4] contains the 16G pages.  In 
arch/powerpc/mm/num.c: do_init_bootmem() we loop through each of the 
node numbers looking for the reserved regions that belong to the 
particular node.  It is not able to identify region 0x4 as being a part 
of each of the 8 nodes.  It is assuming that a reserved region is only
on a single node.

This patch takes out the reserved region loop from inside
the loop that goes over each node.  It looks up the active region containing
the start of the reserved region.  If it extends past that active region then
it adjusts the size and gets the next active region containing it.


Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com>
---


 arch/powerpc/mm/numa.c |   63 ++++++++++++++++++++++++++++---------------------
 include/linux/mm.h     |    2 +
 mm/page_alloc.c        |   19 ++++++++++++++
 3 files changed, 57 insertions(+), 27 deletions(-)

Comments

Adam Litke Sept. 30, 2008, 3:47 p.m. UTC | #1
This seems like the right approach to me.  I have pointed out a few
stylistic issues below.

On Tue, 2008-09-30 at 09:53 -0500, Jon Tollefson wrote:
<snip>
> +	/* Mark reserved regions */
> +	for (i = 0; i < lmb.reserved.cnt; i++) {
> +		unsigned long physbase = lmb.reserved.region[i].base;
> +		unsigned long size = lmb.reserved.region[i].size;
> +		unsigned long start_pfn = physbase >> PAGE_SHIFT;
> +		unsigned long end_pfn = ((physbase+size-1) >> PAGE_SHIFT);

CodingStyle dictates that this should be:
unsigned long end_pfn = ((physbase + size - 1) >> PAGE_SHIFT);

<snip>

> +/**
> + * get_node_active_region - Return active region containing start_pfn
> + * @start_pfn The page to return the region for.
> + *
> + * It will return NULL if active region is not found.
> + */
> +struct node_active_region *get_node_active_region(
> +							unsigned long start_pfn)

Bad style.  I think the convention would be to write it like this:

struct node_active_region *
get_node_active_region(unsigned long start_pfn)

> +{
> +	int i;
> +	for (i = 0; i < nr_nodemap_entries; i++) {
> +		unsigned long node_start_pfn = early_node_map[i].start_pfn;
> +		unsigned long node_end_pfn = early_node_map[i].end_pfn;
> +
> +		if (node_start_pfn <= start_pfn && node_end_pfn > start_pfn)
> +			return &early_node_map[i];
> +	}
> +	return NULL;
> +}

Since this is using the early_node_map[], should we mark the function
__mminit?
Kumar Gala Oct. 1, 2008, 9:02 p.m. UTC | #2
Out of interest how to do you guys represent NUMA regions of memory in  
the device tree?

- k
Jon Tollefson Oct. 2, 2008, 9:52 p.m. UTC | #3
Adam Litke wrote:
> This seems like the right approach to me.  I have pointed out a few
> stylistic issues below.
>   
Thanks.  I'll make those changes.  I assume by __mminit you meant __meminit

Jon

> On Tue, 2008-09-30 at 09:53 -0500, Jon Tollefson wrote:
> <snip>
>   
>> +	/* Mark reserved regions */
>> +	for (i = 0; i < lmb.reserved.cnt; i++) {
>> +		unsigned long physbase = lmb.reserved.region[i].base;
>> +		unsigned long size = lmb.reserved.region[i].size;
>> +		unsigned long start_pfn = physbase >> PAGE_SHIFT;
>> +		unsigned long end_pfn = ((physbase+size-1) >> PAGE_SHIFT);
>>     
>
> CodingStyle dictates that this should be:
> unsigned long end_pfn = ((physbase + size - 1) >> PAGE_SHIFT);
>
> <snip>
>
>   
>> +/**
>> + * get_node_active_region - Return active region containing start_pfn
>> + * @start_pfn The page to return the region for.
>> + *
>> + * It will return NULL if active region is not found.
>> + */
>> +struct node_active_region *get_node_active_region(
>> +							unsigned long start_pfn)
>>     
>
> Bad style.  I think the convention would be to write it like this:
>
> struct node_active_region *
> get_node_active_region(unsigned long start_pfn)
>
>   
>> +{
>> +	int i;
>> +	for (i = 0; i < nr_nodemap_entries; i++) {
>> +		unsigned long node_start_pfn = early_node_map[i].start_pfn;
>> +		unsigned long node_end_pfn = early_node_map[i].end_pfn;
>> +
>> +		if (node_start_pfn <= start_pfn && node_end_pfn > start_pfn)
>> +			return &early_node_map[i];
>> +	}
>> +	return NULL;
>> +}
>>     
>
> Since this is using the early_node_map[], should we mark the function
> __mminit?  
>
>
Jon Tollefson Oct. 6, 2008, 3:42 p.m. UTC | #4
Kumar Gala wrote:
> Out of interest how to do you guys represent NUMA regions of memory in
> the device tree?
>
> - k
Looking at the source code in numa.c I see at the start of
do_init_bootmem() that parse_numa_properties() is called.  It appears to
be looking at memory nodes and getting the node id from it.  It gets an
associativity property for the memory node and indexes that array with a
'min_common_depth' value to get the node id.

This node id is then used to setup the active ranges in the
early_node_map[].

Is this what you are asking about?  There are others I am sure who know
more about it then I though.

Jon
Kumar Gala Oct. 6, 2008, 3:58 p.m. UTC | #5
On Oct 6, 2008, at 10:42 AM, Jon Tollefson wrote:

> Kumar Gala wrote:
>> Out of interest how to do you guys represent NUMA regions of memory  
>> in
>> the device tree?
>>
>> - k
> Looking at the source code in numa.c I see at the start of
> do_init_bootmem() that parse_numa_properties() is called.  It  
> appears to
> be looking at memory nodes and getting the node id from it.  It gets  
> an
> associativity property for the memory node and indexes that array  
> with a
> 'min_common_depth' value to get the node id.
>
> This node id is then used to setup the active ranges in the
> early_node_map[].
>
> Is this what you are asking about?  There are others I am sure who  
> know
> more about it then I though.

I was wondering if this was documented anywhere (like in sPAPR)?

- k
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Jon Tollefson Oct. 6, 2008, 8:48 p.m. UTC | #6
Kumar Gala wrote:
>
> On Oct 6, 2008, at 10:42 AM, Jon Tollefson wrote:
>
>> Kumar Gala wrote:
>>> Out of interest how to do you guys represent NUMA regions of memory in
>>> the device tree?
>>>
>>> - k
>> Looking at the source code in numa.c I see at the start of
>> do_init_bootmem() that parse_numa_properties() is called.  It appears to
>> be looking at memory nodes and getting the node id from it.  It gets an
>> associativity property for the memory node and indexes that array with a
>> 'min_common_depth' value to get the node id.
>>
>> This node id is then used to setup the active ranges in the
>> early_node_map[].
>>
>> Is this what you are asking about?  There are others I am sure who know
>> more about it then I though.
>
> I was wondering if this was documented anywhere (like in sPAPR)?
>
> - k
I see some information on it in section "C.6.6".

Jon
diff mbox

Patch

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index d9a1813..07b8726 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -837,36 +837,45 @@  void __init do_init_bootmem(void)
 				  start_pfn, end_pfn);

 		free_bootmem_with_active_regions(nid, end_pfn);
+	}

-		/* Mark reserved regions on this node */
-		for (i = 0; i < lmb.reserved.cnt; i++) {
-			unsigned long physbase = lmb.reserved.region[i].base;
-			unsigned long size = lmb.reserved.region[i].size;
-			unsigned long start_paddr = start_pfn << PAGE_SHIFT;
-			unsigned long end_paddr = end_pfn << PAGE_SHIFT;
-
-			if (early_pfn_to_nid(physbase >> PAGE_SHIFT) != nid &&
-			    early_pfn_to_nid((physbase+size-1) >> PAGE_SHIFT) != nid)
-				continue;
-
-			if (physbase < end_paddr &&
-			    (physbase+size) > start_paddr) {
-				/* overlaps */
-				if (physbase < start_paddr) {
-					size -= start_paddr - physbase;
-					physbase = start_paddr;
-				}
-
-				if (size > end_paddr - physbase)
-					size = end_paddr - physbase;
-
-				dbg("reserve_bootmem %lx %lx\n", physbase,
-				    size);
-				reserve_bootmem_node(NODE_DATA(nid), physbase,
-						     size, BOOTMEM_DEFAULT);
-			}
+	/* Mark reserved regions */
+	for (i = 0; i < lmb.reserved.cnt; i++) {
+		unsigned long physbase = lmb.reserved.region[i].base;
+		unsigned long size = lmb.reserved.region[i].size;
+		unsigned long start_pfn = physbase >> PAGE_SHIFT;
+		unsigned long end_pfn = ((physbase+size-1) >> PAGE_SHIFT);
+		struct node_active_region *node_ar;
+
+		node_ar = get_node_active_region(start_pfn);
+		while (start_pfn < end_pfn && node_ar != NULL) {
+			/*
+			 * if reserved region extends past active region
+			 * then trim size to active region
+			 */
+			if (end_pfn >= node_ar->end_pfn)
+				size = (node_ar->end_pfn << PAGE_SHIFT)
+					- (start_pfn << PAGE_SHIFT);
+			dbg("reserve_bootmem %lx %lx nid=%d\n", physbase, size,
+				node_ar->nid);
+			reserve_bootmem_node(NODE_DATA(node_ar->nid), physbase,
+						size, BOOTMEM_DEFAULT);
+			/*
+			 * if reserved region extends past the active region
+			 * then get next active region that contains
+			 *        this reserved region
+			 */
+			if (end_pfn >= node_ar->end_pfn) {
+				start_pfn = node_ar->end_pfn;
+				physbase = start_pfn << PAGE_SHIFT;
+				node_ar = get_node_active_region(start_pfn);
+			} else
+				break;
 		}

+	}
+
+	for_each_online_node(nid) {
 		sparse_memory_present_with_active_regions(nid);
 	}
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 72a15dc..d186a7e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1020,6 +1020,8 @@  extern void push_node_boundaries(unsigned int nid, unsigned long start_pfn,
 extern void remove_all_active_ranges(void);
 extern unsigned long absent_pages_in_range(unsigned long start_pfn,
 						unsigned long end_pfn);
+struct node_active_region *get_node_active_region(
+						unsigned long start_pfn);
 extern void get_pfn_range_for_nid(unsigned int nid,
 			unsigned long *start_pfn, unsigned long *end_pfn);
 extern unsigned long find_min_pfn_with_active_regions(void);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e293c58..3a15c49 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3071,6 +3071,25 @@  static void __meminit account_node_boundary(unsigned int nid,
 		unsigned long *start_pfn, unsigned long *end_pfn) {}
 #endif

+/**
+ * get_node_active_region - Return active region containing start_pfn
+ * @start_pfn The page to return the region for.
+ *
+ * It will return NULL if active region is not found.
+ */
+struct node_active_region *get_node_active_region(
+							unsigned long start_pfn)
+{
+	int i;
+	for (i = 0; i < nr_nodemap_entries; i++) {
+		unsigned long node_start_pfn = early_node_map[i].start_pfn;
+		unsigned long node_end_pfn = early_node_map[i].end_pfn;
+
+		if (node_start_pfn <= start_pfn && node_end_pfn > start_pfn)
+			return &early_node_map[i];
+	}
+	return NULL;
+}

 /**
  * get_pfn_range_for_nid - Return the start and end page frames for a node