[1/2] mm: Allow disabling deferred struct page initialisation

Message ID 1470143947-24443-2-git-send-email-srikar@linux.vnet.ibm.com (mailing list archive)
State Not Applicable

Commit Message

Srikar Dronamraju Aug. 2, 2016, 1:19 p.m. UTC
Kernels compiled with CONFIG_DEFERRED_STRUCT_PAGE_INIT initialise only
a limited amount of memory per node during early boot. That amount
takes the dentry and inode cache sizes into account. However, when such
a kernel boots as a secondary kernel, it cannot allocate enough memory
for the dentry and inode caches. This results in crashes like the one
below on large systems, such as those with 32 TB of memory.

Dentry cache hash table entries: 536870912 (order: 16, 4294967296 bytes)
vmalloc: allocation failure, allocated 4097114112 of 17179934720 bytes
swapper/0: page allocation failure: order:0, mode:0x2080020(GFP_ATOMIC)
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.6-master+ #3
Call Trace:
[c00000000108fb10] [c0000000007fac88] dump_stack+0xb0/0xf0 (unreliable)
[c00000000108fb50] [c000000000235264] warn_alloc_failed+0x114/0x160
[c00000000108fbf0] [c000000000281484] __vmalloc_node_range+0x304/0x340
[c00000000108fca0] [c00000000028152c] __vmalloc+0x6c/0x90
[c00000000108fd40] [c000000000aecfb0] alloc_large_system_hash+0x1b8/0x2c0
[c00000000108fe00] [c000000000af7240] inode_init+0x94/0xe4
[c00000000108fe80] [c000000000af6fec] vfs_caches_init+0x8c/0x13c
[c00000000108ff00] [c000000000ac4014] start_kernel+0x50c/0x578
[c00000000108ff90] [c000000000008c6c] start_here_common+0x20/0xa8

Allow such kernels to disable deferred struct page initialisation.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/mmzone.h |  2 +-
 mm/page_alloc.c        | 20 ++++++++++++++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

Comments

Dave Hansen Aug. 2, 2016, 6:09 p.m. UTC | #1
On 08/02/2016 06:19 AM, Srikar Dronamraju wrote:
> Kernels compiled with CONFIG_DEFERRED_STRUCT_PAGE_INIT will initialise
> only certain size memory per node. The certain size takes into account
> the dentry and inode cache sizes. However such a kernel when booting a
> secondary kernel will not be able to allocate the required amount of
> memory to suffice for the dentry and inode caches. This results in
> crashes like the below on large systems such as 32 TB systems.

What's a "secondary kernel"?
Srikar Dronamraju Aug. 3, 2016, 6:38 a.m. UTC | #2
* Dave Hansen <dave.hansen@intel.com> [2016-08-02 11:09:21]:

> On 08/02/2016 06:19 AM, Srikar Dronamraju wrote:
> > Kernels compiled with CONFIG_DEFERRED_STRUCT_PAGE_INIT will initialise
> > only certain size memory per node. The certain size takes into account
> > the dentry and inode cache sizes. However such a kernel when booting a
> > secondary kernel will not be able to allocate the required amount of
> > memory to suffice for the dentry and inode caches. This results in
> > crashes like the below on large systems such as 32 TB systems.
> 
> What's a "secondary kernel"?
> 

I mean the kernel that's booted to collect the crash. On fadump, the
first kernel acts as the secondary kernel, i.e. the same kernel is booted
to collect the crash.
Dave Hansen Aug. 3, 2016, 6:17 p.m. UTC | #3
On 08/02/2016 11:38 PM, Srikar Dronamraju wrote:
> * Dave Hansen <dave.hansen@intel.com> [2016-08-02 11:09:21]:
>> On 08/02/2016 06:19 AM, Srikar Dronamraju wrote:
>>> Kernels compiled with CONFIG_DEFERRED_STRUCT_PAGE_INIT will initialise
>>> only certain size memory per node. The certain size takes into account
>>> the dentry and inode cache sizes. However such a kernel when booting a
>>> secondary kernel will not be able to allocate the required amount of
>>> memory to suffice for the dentry and inode caches. This results in
>>> crashes like the below on large systems such as 32 TB systems.
>>
>> What's a "secondary kernel"?
>>
> I mean the kernel thats booted to collect the crash, On fadump, the
> first kernel acts as the secondary kernel i.e the same kernel is booted
> to collect the crash.

OK, but I'm still not seeing what the problem is.  You've said that it
crashes and that it crashes during inode/dentry cache allocation.

But, *why* does the same kernel image crash when it is used as a
"secondary kernel"?
Srikar Dronamraju Aug. 4, 2016, 5:25 a.m. UTC | #4
* Dave Hansen <dave.hansen@intel.com> [2016-08-03 11:17:43]:

> On 08/02/2016 11:38 PM, Srikar Dronamraju wrote:
> > * Dave Hansen <dave.hansen@intel.com> [2016-08-02 11:09:21]:
> >> On 08/02/2016 06:19 AM, Srikar Dronamraju wrote:
> >>> Kernels compiled with CONFIG_DEFERRED_STRUCT_PAGE_INIT will initialise
> >>> only certain size memory per node. The certain size takes into account
> >>> the dentry and inode cache sizes. However such a kernel when booting a
> >>> secondary kernel will not be able to allocate the required amount of
> >>> memory to suffice for the dentry and inode caches. This results in
> >>> crashes like the below on large systems such as 32 TB systems.
> >>
> >> What's a "secondary kernel"?
> >>
> > I mean the kernel thats booted to collect the crash, On fadump, the
> > first kernel acts as the secondary kernel i.e the same kernel is booted
> > to collect the crash.
> 
> OK, but I'm still not seeing what the problem is.  You've said that it
> crashes and that it crashes during inode/dentry cache allocation.
> 
> But, *why* does the same kernel image crash in when it is used as a
> "secondary kernel"?
> 

I guess you already got it, but let me try to explain it again.

Let's say we have a 32 TB system with 16 nodes, each node having 2 TB of
memory. We are assuming deferred page initialisation is configured.

When the regular kernel boots:
1. It reserves 5% of the memory for fadump.
2. It initialises 8GB per node, i.e. 128GB in total.
3. It allocates the dentry/inode caches, which take around 16GB.
4. It then kicks off the parallel struct page initialisation.

Now let's say the kernel crashed and fadump was triggered.

1. The same kernel boots in the 5% reserved space, which is about 1600GB.
2. It reserves the remaining 95% of memory.
3. It tries to initialise 8GB per node but can only initialise 8GB in total
	(since, except for the first node, all the other nodes are reserved).
4. It tries to allocate the 16GB dentry/inode caches but fails
	(it tries to reclaim, but reclaim needs a spinlock
	and the spinlock is not yet initialised).

Patch

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c60df92..1c55200 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1203,7 +1203,7 @@  unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long);
 #else
 #define pfn_valid_within(pfn) (1)
 #endif
-
+void disable_deferred_meminit(void);
 #ifdef CONFIG_ARCH_HAS_HOLES_MEMORYMODEL
 /*
  * pfn_valid() is meant to be able to tell if a given PFN has valid memmap
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c1069ef..dc6ebac 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -301,6 +301,19 @@  static inline bool early_page_nid_uninitialised(unsigned long pfn, int nid)
 }
 
 /*
+ * Deferred struct page initialisation may not work on a multinode machine,
+ * if a significant amount of memory is reserved at early boot.  Allow apis
+ * that reserve significant memory to disable deferred struct page
+ * initialisation.
+ */
+static bool defer_init_disabled;
+
+void disable_deferred_meminit(void)
+{
+	defer_init_disabled = true;
+}
+
+/*
  * Returns false when the remaining initialisation should be deferred until
  * later in the boot cycle when it can be parallelised.
  */
@@ -313,6 +326,9 @@  static inline bool update_defer_init(pg_data_t *pgdat,
 	/* Always populate low zones for address-contrained allocations */
 	if (zone_end < pgdat_end_pfn(pgdat))
 		return true;
+
+	if (defer_init_disabled)
+		return true;
 	/*
 	 * Initialise at least 2G of a node but also take into account that
 	 * two large system hashes that can take up 1GB for 0.25TB/node.
@@ -350,6 +366,10 @@  static inline bool update_defer_init(pg_data_t *pgdat,
 {
 	return true;
 }
+void disable_deferred_meminit(void)
+{
+}
+
 #endif