diff mbox

[v4,4/4] Use 2GB memory block size on large-memory x86-64 systems

Message ID CAE9FiQXH0SxTCaw-J=nEhYtbuZfBDU3R+v_XRZYF6ymeXnPsgg@mail.gmail.com
State Not Applicable
Headers show

Commit Message

Yinghai Lu Aug. 24, 2015, 11:59 p.m. UTC
On Mon, Aug 24, 2015 at 4:41 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> On Mon, Aug 24, 2015 at 3:39 PM, Tony Luck <tony.luck@gmail.com> wrote:
>> On Mon, Aug 24, 2015 at 2:25 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>>
>>> Can you boot with "debug ignore_loglevel" so we can see following print out
>>> for vmemmap?
>>
>> See attached. There are a few extra messages from my own debug printk()
>> calls. It seems that we successfully deal with node 0 from topology_init()
>> but die walking node 1. I see that the NODE_DATA limits for memory
>> on node 1 were from 1d70000 to 3a00000. But when we get into
>> register_mem_sect_under_node() we have rounded the start pfn down to
>> 1d00000 ... and we panic processing that range (which is in a hole in e820).
>>
>> We seem to die here:
>>
>>         for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
>>                 int page_nid;
>>
>>                 page_nid = get_nid_for_pfn(pfn);
>
> oh, no.
> register_mem_sect_under_node() is assuming:
> first section in the block is present and first page in that section is present.

attached should fix the problem:

Comments

Yinghai Lu Aug. 25, 2015, 7:01 p.m. UTC | #1
On Tue, Aug 25, 2015 at 10:03 AM, Tony Luck <tony.luck@gmail.com> wrote:
> On Mon, Aug 24, 2015 at 4:59 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>> attached should fix the problem:
>
> It does ... but this (attached) is simpler.  Your patch and mine both
> allow the system to boot ...

The version that fix with section_nr present checking may save couple thousands
calling to get_nid_for_pfn(). section size / page_size = 128M/4k = 32k

> but it is not happy. See all the chatter from systemd in the attached dmesg.

because of you have "debug ignore_loglevel" ?

>
> x86 doesn't allow me to set CONFIG_HOLES_IN_ZONE ... but now I'm
> worried about all the other places use pfn_valid_within()
>
> Still trying to get an answer from the BIOS folks on whether these
> holes are normal when setting up mirrored areas of memory.

The problem only happens when memory block size is 512M and section
size is 128M.
when you have them both at 128M, the system works. so current kernel
should only has
problem with hole size  > 128M to leave some section not present.

Thanks

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tony Luck Aug. 25, 2015, 10:06 p.m. UTC | #2
On Tue, Aug 25, 2015 at 12:01 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>> It does ... but this (attached) is simpler.  Your patch and mine both
>> allow the system to boot ...
>
> The version that fix with section_nr present checking may save couple thousands
> calling to get_nid_for_pfn(). section size / page_size = 128M/4k = 32k

Actually saves about 1.2 million calls. Your patch wins :-)

Reported-and-tested-by: Tony Luck <tony.luck@intel.com>

-Tony
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ingo Molnar Aug. 26, 2015, 4:17 a.m. UTC | #3
* Yinghai Lu <yinghai@kernel.org> wrote:

> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -390,8 +390,14 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
>  	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
>  	sect_end_pfn += PAGES_PER_SECTION - 1;
>  	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
> -		int page_nid;
> +		int page_nid, scn_nr;
>  
> +		scn_nr = pfn_to_section_nr(pfn);
> +		if (!present_section_nr(scn_nr)) {
> +			pfn = round_down(pfn + PAGES_PER_SECTION,
> +					 PAGES_PER_SECTION) - 1;
> +			continue;
> +		}
>  		page_nid = get_nid_for_pfn(pfn);
>  		if (page_nid < 0)
>  			continue;
> @@ -426,10 +432,18 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
>  		return -ENOMEM;
>  	nodes_clear(*unlinked_nodes);
>  
> -	sect_start_pfn = section_nr_to_pfn(phys_index);
> -	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
> +	sect_start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
> +	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
> +	sect_end_pfn += PAGES_PER_SECTION - 1;
>  	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
> -		int nid;
> +		int nid, scn_nr;
> +
> +		scn_nr = pfn_to_section_nr(pfn);
> +		if (!present_section_nr(scn_nr)) {
> +			pfn = round_down(pfn + PAGES_PER_SECTION,
> +					 PAGES_PER_SECTION) - 1;
> +			continue;
> +		}

NAK due to lack of cleanliness: the two loops look almost identical - this sure 
can be factored out...

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yinghai Lu Aug. 26, 2015, 5:42 a.m. UTC | #4
On Tue, Aug 25, 2015 at 9:17 PM, Ingo Molnar <mingo@kernel.org> wrote:
> NAK due to lack of cleanliness: the two loops look almost identical - this sure
> can be factored out...

Please check complete version at

https://patchwork.kernel.org/patch/7074341/

Andrew,
Ingo NAKed raw version of this patch, so you may need to remove it
from -mm tree.

Thanks

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 31df474d..cc910ad 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -390,8 +390,14 @@  int register_mem_sect_under_node(struct memory_block *mem_blk, int nid)
 	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
 	sect_end_pfn += PAGES_PER_SECTION - 1;
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
-		int page_nid;
+		int page_nid, scn_nr;
 
+		scn_nr = pfn_to_section_nr(pfn);
+		if (!present_section_nr(scn_nr)) {
+			pfn = round_down(pfn + PAGES_PER_SECTION,
+					 PAGES_PER_SECTION) - 1;
+			continue;
+		}
 		page_nid = get_nid_for_pfn(pfn);
 		if (page_nid < 0)
 			continue;
@@ -426,10 +432,18 @@  int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
 		return -ENOMEM;
 	nodes_clear(*unlinked_nodes);
 
-	sect_start_pfn = section_nr_to_pfn(phys_index);
-	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
+	sect_start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
+	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
+	sect_end_pfn += PAGES_PER_SECTION - 1;
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
-		int nid;
+		int nid, scn_nr;
+
+		scn_nr = pfn_to_section_nr(pfn);
+		if (!present_section_nr(scn_nr)) {
+			pfn = round_down(pfn + PAGES_PER_SECTION,
+					 PAGES_PER_SECTION) - 1;
+			continue;
+		}
 
 		nid = get_nid_for_pfn(pfn);
 		if (nid < 0)