diff mbox

[v4,4/4] Use 2GB memory block size on large-memory x86-64 systems

Message ID 20150821202707.GA920@agluck-desk.sc.intel.com
State Not Applicable
Headers show

Commit Message

Tony Luck Aug. 21, 2015, 8:27 p.m. UTC
On Fri, Aug 21, 2015 at 11:38:13AM -0700, Yinghai Lu wrote:
> That commit could be reverted.
> According to
> https://lkml.org/lkml/2014/11/10/123

Do we really need to force the MIN_MEMORY_BLOCK_SIZE on small
systems?

What about this patch - which just uses max_pfn to choose
the block size.

It seems that many systems with large amounts of memory
will have a nicely aligned max_pfn ... so they will get
the 2GB block size.  If they don't have a well aligned
max_pfn, then they need to use a smaller size to avoid
the crash I saw.

-Tony


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Yinghai Lu Aug. 21, 2015, 8:50 p.m. UTC | #1
On Fri, Aug 21, 2015 at 1:27 PM, Luck, Tony <tony.luck@intel.com> wrote:
> On Fri, Aug 21, 2015 at 11:38:13AM -0700, Yinghai Lu wrote:
>> That commit could be reverted.
>> According to
>> https://lkml.org/lkml/2014/11/10/123
>
> Do we really need to force the MIN_MEMORY_BLOCK_SIZE on small
> systems?

That is introduced in commit 982792c7 ("x86, mm: probe memory block
size for generic x86 64bit
").
that patch is used to make boot faster why create less entries
in /sys/device/system/memory/.
On system with less 64G ram, that will not have too many entries
even with MIN_MEMORY_BLOCK_SIZE.

>
> What about this patch - which just uses max_pfn to choose
> the block size.
>
> It seems that many systems with large amounts of memory
> will have a nicely aligned max_pfn ... so they will get
> the 2GB block size.  If they don't have a well aligned
> max_pfn, then they need to use a smaller size to avoid
> the crash I saw.

Good to me.

> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 3fba623e3ba5..e14e90fd1cf8 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1195,15 +1195,6 @@ static unsigned long probe_memory_block_size(void)
>         /* start from 2g */
>         unsigned long bz = 1UL<<31;
>
> -       if (totalram_pages >= (64ULL << (30 - PAGE_SHIFT))) {
> -               pr_info("Using 2GB memory block size for large-memory system\n");
> -               return 2UL * 1024 * 1024 * 1024;
> -       }
> -
> -       /* less than 64g installed */
> -       if ((max_pfn << PAGE_SHIFT) < (16UL << 32))
> -               return MIN_MEMORY_BLOCK_SIZE;
> -
>         /* get the tail size */
>         while (bz > MIN_MEMORY_BLOCK_SIZE) {
>                 if (!((max_pfn << PAGE_SHIFT) & (bz - 1)))
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tony Luck Aug. 21, 2015, 11:54 p.m. UTC | #2
On Fri, Aug 21, 2015 at 1:50 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>> It seems that many systems with large amounts of memory
>> will have a nicely aligned max_pfn ... so they will get
>> the 2GB block size.  If they don't have a well aligned
>> max_pfn, then they need to use a smaller size to avoid
>> the crash I saw.
>
> Good to me.

Still stuff going on that I don't understand here. I increased the amount of
mirrored memory in this machine which moved max_pfn to 0x7560000
and probe_memory_block_size() picked 512MB as the memory_block_size,
which seemed plausible.

But my kernel still crashed during boot with this value. :-(
Forcing the block size to 128M made the system boot.

Maybe all the holes in the e820 map matter too (specifically the
alignment of the holes)?

-Tony
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yinghai Lu Aug. 24, 2015, 5:46 p.m. UTC | #3
On Fri, Aug 21, 2015 at 4:54 PM, Tony Luck <tony.luck@gmail.com> wrote:
> On Fri, Aug 21, 2015 at 1:50 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>
> Still stuff going on that I don't understand here. I increased the amount of
> mirrored memory in this machine which moved max_pfn to 0x7560000
> and probe_memory_block_size() picked 512MB as the memory_block_size,
> which seemed plausible.
>
> But my kernel still crashed during boot with this value. :-(
> Forcing the block size to 128M made the system boot.
>
> Maybe all the holes in the e820 map matter too (specifically the
> alignment of the holes)?

Then, what does the E820 look like?

Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tony Luck Aug. 24, 2015, 8:41 p.m. UTC | #4
On Mon, Aug 24, 2015 at 10:46 AM, Yinghai Lu <yinghai@kernel.org> wrote:
> Then, what does the E820 look like?

See attached serial console log of the latest crash

-Tony
Yinghai Lu Aug. 24, 2015, 9:25 p.m. UTC | #5
On Mon, Aug 24, 2015 at 1:41 PM, Tony Luck <tony.luck@gmail.com> wrote:
> On Mon, Aug 24, 2015 at 10:46 AM, Yinghai Lu <yinghai@kernel.org> wrote:
>> Then, what does the E820 look like?
>
> See attached serial console log of the latest crash

Can you boot with "debug ignore_loglevel" so we can see following print out
for vmemmap?

[    0.352486]  [ffffea0000000000-ffffea0001ffffff] PMD ->
[ffff88007de00000-ffff88007fdfffff] on node 0
[    0.358758]  [ffffea0004000000-ffffea0005ffffff] PMD ->
[ffff88017d600000-ffff88017f5fffff] on node 1
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tony Luck Aug. 24, 2015, 10:39 p.m. UTC | #6
On Mon, Aug 24, 2015 at 2:25 PM, Yinghai Lu <yinghai@kernel.org> wrote:

> Can you boot with "debug ignore_loglevel" so we can see following print out
> for vmemmap?

See attached. There are a few extra messages from my own debug printk()
calls. It seems that we successfully deal with node 0 from topology_init()
but die walking node 1. I see that the NODE_DATA limits for memory
on node 1 were from 1d70000 to 3a00000. But when we get into
register_mem_sect_under_node() we have rounded the start pfn down to
1d00000 ... and we panic processing that range (which is in a hole in e820).

We seem to die here:

        for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
                int page_nid;

                page_nid = get_nid_for_pfn(pfn);

-Tony
Yinghai Lu Aug. 24, 2015, 11:41 p.m. UTC | #7
On Mon, Aug 24, 2015 at 3:39 PM, Tony Luck <tony.luck@gmail.com> wrote:
> On Mon, Aug 24, 2015 at 2:25 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>
>> Can you boot with "debug ignore_loglevel" so we can see following print out
>> for vmemmap?
>
> See attached. There are a few extra messages from my own debug printk()
> calls. It seems that we successfully deal with node 0 from topology_init()
> but die walking node 1. I see that the NODE_DATA limits for memory
> on node 1 were from 1d70000 to 3a00000. But when we get into
> register_mem_sect_under_node() we have rounded the start pfn down to
> 1d00000 ... and we panic processing that range (which is in a hole in e820).
>
> We seem to die here:
>
>         for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
>                 int page_nid;
>
>                 page_nid = get_nid_for_pfn(pfn);

oh, no.
register_mem_sect_under_node() is assuming:
first section in the block is present and first page in that section is present.
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3fba623e3ba5..e14e90fd1cf8 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1195,15 +1195,6 @@  static unsigned long probe_memory_block_size(void)
 	/* start from 2g */
 	unsigned long bz = 1UL<<31;
 
-	if (totalram_pages >= (64ULL << (30 - PAGE_SHIFT))) {
-		pr_info("Using 2GB memory block size for large-memory system\n");
-		return 2UL * 1024 * 1024 * 1024;
-	}
-
-	/* less than 64g installed */
-	if ((max_pfn << PAGE_SHIFT) < (16UL << 32))
-		return MIN_MEMORY_BLOCK_SIZE;
-
 	/* get the tail size */
 	while (bz > MIN_MEMORY_BLOCK_SIZE) {
 		if (!((max_pfn << PAGE_SHIFT) & (bz - 1)))