
fadump: Register the memory reserved by fadump

Message ID 1470318165-2521-1-git-send-email-srikar@linux.vnet.ibm.com (mailing list archive)
State Not Applicable

Commit Message

Srikar Dronamraju Aug. 4, 2016, 1:42 p.m. UTC
The fadump kernel reserves large chunks of memory even before the pages
are initialized. This can mean that memory corresponding to several
nodes falls within memblock reserved regions.

Kernels compiled with CONFIG_DEFERRED_STRUCT_PAGE_INIT initialize only a
limited amount of memory per node early in boot, and that amount is
chosen with the dentry and inode cache sizes in mind. Currently those
cache sizes are calculated from the total system memory, including the
reserved memory. When such a kernel boots as the fadump capture kernel,
it therefore cannot allocate enough memory for the dentry and inode
caches, resulting in crashes like the one below on large systems such
as 32 TB machines.

Dentry cache hash table entries: 536870912 (order: 16, 4294967296 bytes)
vmalloc: allocation failure, allocated 4097114112 of 17179934720 bytes
swapper/0: page allocation failure: order:0, mode:0x2080020(GFP_ATOMIC)
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.6-master+ #3
Call Trace:
[c00000000108fb10] [c0000000007fac88] dump_stack+0xb0/0xf0 (unreliable)
[c00000000108fb50] [c000000000235264] warn_alloc_failed+0x114/0x160
[c00000000108fbf0] [c000000000281484] __vmalloc_node_range+0x304/0x340
[c00000000108fca0] [c00000000028152c] __vmalloc+0x6c/0x90
[c00000000108fd40] [c000000000aecfb0] alloc_large_system_hash+0x1b8/0x2c0
[c00000000108fe00] [c000000000af7240] inode_init+0x94/0xe4
[c00000000108fe80] [c000000000af6fec] vfs_caches_init+0x8c/0x13c
[c00000000108ff00] [c000000000ac4014] start_kernel+0x50c/0x578
[c00000000108ff90] [c000000000008c6c] start_here_common+0x20/0xa8

Register the memory reserved by fadump, so that the cache sizes are
calculated based on the free memory (i.e. total memory minus the
reserved memory).

Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/fadump.c | 1 +
 1 file changed, 1 insertion(+)
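
For context, a simplified sketch (not the actual mm/page_alloc.c code) of
how the page count registered by this patch feeds into the early hash
sizing; nr_kernel_pages is the baseline that alloc_large_system_hash()
with HASH_EARLY sizes the dentry and inode hash tables against:

/*
 * Simplified sketch of the accounting, with names borrowed from
 * mm/page_alloc.c; not the real implementation.
 */
static unsigned long dma_reserve;	/* pages registered as reserved */
static unsigned long nr_kernel_pages;	/* baseline for HASH_EARLY sizing */

void set_dma_reserve(unsigned long new_dma_reserve)
{
	/* Plain assignment: a later caller overwrites an earlier one. */
	dma_reserve = new_dma_reserve;
}

static void account_zone(unsigned long zone_present_pages)
{
	unsigned long freesize = zone_present_pages;

	/* Pages reserved before boot are not usable for early allocations. */
	if (freesize > dma_reserve)
		freesize -= dma_reserve;

	nr_kernel_pages += freesize;
}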

Comments

Mel Gorman Aug. 4, 2016, 2:09 p.m. UTC | #1
On Thu, Aug 04, 2016 at 07:12:45PM +0530, Srikar Dronamraju wrote:
> Fadump kernel reserves large chunks of memory even before the pages are
> initialized. This could mean memory that corresponds to several nodes might
> fall in memblock reserved regions.
> 
> Kernels compiled with CONFIG_DEFERRED_STRUCT_PAGE_INIT will initialize
> only certain size memory per node. The certain size takes into account
> the dentry and inode cache sizes. Currently the cache sizes are
> calculated based on the total system memory including the reserved
> memory. However such a kernel when booting the same kernel as fadump
> kernel will not be able to allocate the required amount of memory to
> suffice for the dentry and inode caches. This results in crashes like
> the below on large systems such as 32 TB systems.
> 
> Dentry cache hash table entries: 536870912 (order: 16, 4294967296 bytes)
> vmalloc: allocation failure, allocated 4097114112 of 17179934720 bytes
> swapper/0: page allocation failure: order:0, mode:0x2080020(GFP_ATOMIC)
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.6-master+ #3
> Call Trace:
> [c00000000108fb10] [c0000000007fac88] dump_stack+0xb0/0xf0 (unreliable)
> [c00000000108fb50] [c000000000235264] warn_alloc_failed+0x114/0x160
> [c00000000108fbf0] [c000000000281484] __vmalloc_node_range+0x304/0x340
> [c00000000108fca0] [c00000000028152c] __vmalloc+0x6c/0x90
> [c00000000108fd40] [c000000000aecfb0]
> alloc_large_system_hash+0x1b8/0x2c0
> [c00000000108fe00] [c000000000af7240] inode_init+0x94/0xe4
> [c00000000108fe80] [c000000000af6fec] vfs_caches_init+0x8c/0x13c
> [c00000000108ff00] [c000000000ac4014] start_kernel+0x50c/0x578
> [c00000000108ff90] [c000000000008c6c] start_here_common+0x20/0xa8
> 
> Register the memory reserved by fadump, so that the cache sizes are
> calculated based on the free memory (i.e Total memory - reserved
> memory).
> 
> Suggested-by: Mel Gorman <mgorman@techsingularity.net>

I didn't suggest this specifically. While it happens to be safe on ppc64,
it potentially overwrites any future caller of set_dma_reserve. While the
only other one is for the e820 map, it may be better to change the API
to inc_dma_reserve?

It's also unfortunate that it's called dma_reserve because it has
nothing to do with DMA or ZONE_DMA. inc_kernel_reserve may be more
appropriate?
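
For reference, a minimal sketch of the accumulating interface suggested
above; the name inc_dma_reserve comes from the comment, the body is an
assumption:

/* Despite the name, this reserve has nothing to do with ZONE_DMA. */
static unsigned long dma_reserve;

/* Accumulate rather than overwrite, so the e820 code and fadump can
 * both register their reservations without clobbering each other. */
void inc_dma_reserve(unsigned long nr_pages)
{
	dma_reserve += nr_pages;
}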
Srikar Dronamraju Aug. 4, 2016, 3:27 p.m. UTC | #2
* Mel Gorman <mgorman@techsingularity.net> [2016-08-04 15:09:34]:

> > 
> > Suggested-by: Mel Gorman <mgorman@techsingularity.net>
> 
> I didn't suggest this specifically. While it happens to be safe on ppc64,
> it potentially overwrites any future caller of set_dma_reserve. While the
> only other one is for the e820 map, it may be better to change the API
> to inc_dma_reserve?
> 
> It's also unfortunate that it's called dma_reserve because it has
> nothing to do with DMA or ZONE_DMA. inc_kernel_reserve may be more
> appropriate?

Yup, agree. Will redo and send out.

> 
> -- 
> Mel Gorman
> SUSE Labs
>
Michael Ellerman Aug. 5, 2016, 7:07 a.m. UTC | #3
Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:

> Fadump kernel reserves large chunks of memory even before the pages are
> initialized. This could mean memory that corresponds to several nodes might
> fall in memblock reserved regions.
>
...
> Register the memory reserved by fadump, so that the cache sizes are
> calculated based on the free memory (i.e Total memory - reserved
> memory).

The memory is reserved, with memblock_reserve(). Why is that not sufficient?

cheers
Srikar Dronamraju Aug. 5, 2016, 7:28 a.m. UTC | #4
* Michael Ellerman <mpe@ellerman.id.au> [2016-08-05 17:07:01]:

> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> 
> > Fadump kernel reserves large chunks of memory even before the pages are
> > initialized. This could mean memory that corresponds to several nodes might
> > fall in memblock reserved regions.
> >
> ...
> > Register the memory reserved by fadump, so that the cache sizes are
> > calculated based on the free memory (i.e Total memory - reserved
> > memory).
> 
> The memory is reserved, with memblock_reserve(). Why is that not sufficient?
> 
> cheers
> 

Because at page initialization time, the kernel doesn't know how many
pages are reserved. One way to find out would be to walk through the
reserved memblock regions and add up their sizes, but Mel feels that's
an overhead (from his reply to the other thread), especially for just
one use case.
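
The walk in question would look roughly like this (a sketch only, using
the for_each_memblock() iterator available at the time; whether it is
worth doing on every boot is exactly the concern above):

#include <linux/memblock.h>

/* Sketch: sum up every memblock reserved region to learn how much
 * memory is not actually available for early allocations. */
static phys_addr_t __init reserved_size_from_memblock(void)
{
	struct memblock_region *reg;
	phys_addr_t size = 0;

	for_each_memblock(reserved, reg)
		size += reg->size;

	return size;
}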
Michael Ellerman Aug. 5, 2016, 9:25 a.m. UTC | #5
Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:

> * Michael Ellerman <mpe@ellerman.id.au> [2016-08-05 17:07:01]:
>
>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> 
>> > Fadump kernel reserves large chunks of memory even before the pages are
>> > initialized. This could mean memory that corresponds to several nodes might
>> > fall in memblock reserved regions.
>> >
>> ...
>> > Register the memory reserved by fadump, so that the cache sizes are
>> > calculated based on the free memory (i.e Total memory - reserved
>> > memory).
>> 
>> The memory is reserved, with memblock_reserve(). Why is that not sufficient?
>
> Because at page initialization time, the kernel doesnt know how many
> pages are reserved.

The kernel does know: the fadump code that does the memblock reserve
runs before start_kernel(). AFAIK all calls to alloc_large_system_hash()
are after that.

So the problem seems to be just that alloc_large_system_hash() doesn't
know about reserved memory.

> One way to do that would be to walk through the different memory
> reserved blocks and calculate the size. But Mel feels thats an
> overhead (from his reply to the other thread) esp for just one use
> case.

OK. I think you're referring to this:

  If fadump is reserving memory and alloc_large_system_hash(HASH_EARLY)
  does not know about them, then would an arch-specific callback for
  arch_reserved_kernel_pages() be more appropriate?
  ...
  
  That approach would limit the impact to ppc64 and would be less costly than
  doing a memblock walk instead of using nr_kernel_pages for everyone else.

That sounds more robust to me than this solution.

cheers
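
For reference, a sketch of what that hook could look like: a default in
common code plus a ppc64 override from the fadump code. The config
symbol and the override body here are assumptions based on the quoted
suggestion, not necessarily what was eventually merged:

/* mm/page_alloc.c: default for architectures with no hidden reserves. */
#ifndef CONFIG_HAVE_ARCH_RESERVED_KERNEL_PAGES
static unsigned long __init arch_reserved_kernel_pages(void)
{
	return 0;
}
#endif
/* alloc_large_system_hash(..., HASH_EARLY, ...) would then size the
 * table on nr_kernel_pages - arch_reserved_kernel_pages(). */

/* arch/powerpc/kernel/fadump.c: report the fadump reservation to mm. */
unsigned long __init arch_reserved_kernel_pages(void)
{
	return fw_dump.reserve_dump_area_size >> PAGE_SHIFT;
}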
Mel Gorman Aug. 5, 2016, 10:06 a.m. UTC | #6
On Fri, Aug 05, 2016 at 07:25:03PM +1000, Michael Ellerman wrote:
> > One way to do that would be to walk through the different memory
> > reserved blocks and calculate the size. But Mel feels thats an
> > overhead (from his reply to the other thread) esp for just one use
> > case.
> 
> OK. I think you're referring to this:
> 
>   If fadump is reserving memory and alloc_large_system_hash(HASH_EARLY)
>   does not know about them, then would an arch-specific callback for
>   arch_reserved_kernel_pages() be more appropriate?
>   ...
>   
>   That approach would limit the impact to ppc64 and would be less costly than
>   doing a memblock walk instead of using nr_kernel_pages for everyone else.
> 
> That sounds more robust to me than this solution.
> 

It would be the fastest with the least impact but not necessarily the
best. Ultimately that dma_reserve/memory_reserve is used for the sizing
calculation of the large system hashes, but only the e820 map and fadump
are taken into account. That's a bit filthy even if it happens to work out ok.

Conceptually it would be cleaner, if expensive, to calculate the real
memblock reserves if HASH_EARLY and ditch the dma_reserve, memory_reserve
and nr_kernel_pages entirely. Unfortunately, aside from the calculation,
there is a potential cost due to a smaller hash table that affects everyone,
not just ppc64. However, if the hash table is meant to be sized on the
number of available pages then it really should be based on that and not
just a made-up number.
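
Roughly, the idea is to derive the HASH_EARLY page count from the
memblock totals rather than from nr_kernel_pages. A sketch, written as a
hypothetical helper inside mm/memblock.c where the memblock structure is
visible (the name is made up for illustration):

/* Hypothetical helper: total memory minus everything memblock has
 * reserved, i.e. what is really available to the early allocator. */
phys_addr_t __init memblock_available_size(void)
{
	return memblock.memory.total_size - memblock.reserved.total_size;
}

/* alloc_large_system_hash(HASH_EARLY) could then size the hash on
 * memblock_available_size() >> PAGE_SHIFT instead of nr_kernel_pages. */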
Michael Ellerman Aug. 10, 2016, 6:02 a.m. UTC | #7
Mel Gorman <mgorman@techsingularity.net> writes:

> On Fri, Aug 05, 2016 at 07:25:03PM +1000, Michael Ellerman wrote:
>> > One way to do that would be to walk through the different memory
>> > reserved blocks and calculate the size. But Mel feels thats an
>> > overhead (from his reply to the other thread) esp for just one use
>> > case.
>> 
>> OK. I think you're referring to this:
>> 
>>   If fadump is reserving memory and alloc_large_system_hash(HASH_EARLY)
>>   does not know about them, then would an arch-specific callback for
>>   arch_reserved_kernel_pages() be more appropriate?
>>   ...
>>   
>>   That approach would limit the impact to ppc64 and would be less costly than
>>   doing a memblock walk instead of using nr_kernel_pages for everyone else.
>> 
>> That sounds more robust to me than this solution.
>
> It would be the fastest with the least impact but not necessarily the
> best. Ultimately that dma_reserve/memory_reserve is used for the sizing
> calculation of the large system hashes but only the e820 map and fadump
> is taken into account. That's a bit filthy even if it happens to work out ok.

Right.

> Conceptually it would be cleaner, if expensive, to calculate the real
> memblock reserves if HASH_EARLY and ditch the dma_reserve, memory_reserve
> and nr_kernel_pages entirely.

Why is it expensive? memblock tracks the totals for all memory and
reserved memory AFAIK, so it should just be a case of subtracting one
from the other?

> Unfortuantely, aside from the calculation,
> there is a potential cost due to a smaller hash table that affects everyone,
> not just ppc64.

Yeah OK. We could make it an arch hook, or controlled by a CONFIG.

> However, if the hash table is meant to be sized on the
> number of available pages then it really should be based on that and not
> just a made-up number.

Yeah that seems to make sense.

The one complication I think is that we may have memory that's marked
reserved in memblock, but is later freed to the page allocator (eg.
initrd).

I'm not sure if that's actually a concern in practice given the relative
size of the initrd and memory on most systems. But possibly there are
other things that get reserved and then freed which could skew the hash
table size calculation.

cheers
Srikar Dronamraju Aug. 10, 2016, 6:40 a.m. UTC | #8
> 
> > Conceptually it would be cleaner, if expensive, to calculate the real
> > memblock reserves if HASH_EARLY and ditch the dma_reserve, memory_reserve
> > and nr_kernel_pages entirely.
> 
> Why is it expensive? memblock tracks the totals for all memory and
> reserved memory AFAIK, so it should just be a case of subtracting one
> from the other?

Are you suggesting that we use something like
memblock_phys_mem_size() but one which returns
memblock.reserved.total_size? Maybe a new function like
memblock_reserved_mem_size()?

> 
> > Unfortuantely, aside from the calculation,
> > there is a potential cost due to a smaller hash table that affects everyone,
> > not just ppc64.
> 
> Yeah OK. We could make it an arch hook, or controlled by a CONFIG.

If it's based on memblock.reserved.total_size, then should it be
arch-specific?

> 
> > However, if the hash table is meant to be sized on the
> > number of available pages then it really should be based on that and not
> > just a made-up number.
> 
> Yeah that seems to make sense.
> 
> The one complication I think is that we may have memory that's marked
> reserved in memblock, but is later freed to the page allocator (eg.
> initrd).

Yes, this is a possibility; for example, let's say we want fadump to
continue to run instead of rebooting to a new kernel as it does today.

> 
> I'm not sure if that's actually a concern in practice given the relative
> size of the initrd and memory on most systems. But possibly there are
> other things that get reserved and then freed which could skew the hash
> table size calculation.
>
Michael Ellerman Aug. 10, 2016, 6:57 a.m. UTC | #9
Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:

>> 
>> > Conceptually it would be cleaner, if expensive, to calculate the real
>> > memblock reserves if HASH_EARLY and ditch the dma_reserve, memory_reserve
>> > and nr_kernel_pages entirely.
>> 
>> Why is it expensive? memblock tracks the totals for all memory and
>> reserved memory AFAIK, so it should just be a case of subtracting one
>> from the other?
>
> Are you suggesting that we use something like
> memblock_phys_mem_size() but one which returns
> memblock.reserved.total_size ? Maybe a new function like
> memblock_reserved_mem_size()?

Yeah, something like that. I'm not sure if it actually needs a function,
AFAIK you can just look at the structure directly.

>> > Unfortuantely, aside from the calculation,
>> > there is a potential cost due to a smaller hash table that affects everyone,
>> > not just ppc64.
>> 
>> Yeah OK. We could make it an arch hook, or controlled by a CONFIG.
>
> If its based on memblock.reserved.total_size, then should it be arch
> specific?

Yes I think so. Otherwise you have to test it on every architecture :)

>> > However, if the hash table is meant to be sized on the
>> > number of available pages then it really should be based on that and not
>> > just a made-up number.
>> 
>> Yeah that seems to make sense.
>> 
>> The one complication I think is that we may have memory that's marked
>> reserved in memblock, but is later freed to the page allocator (eg.
>> initrd).
>
> Yes, this is a possibility, for example lets say we want fadump to
> continue to run instead of rebooting to a new kernel as it does today.

But that's a bad idea and no one should ever do it.

For starters all your caches will be undersized, and anything that is
allocated per-node early in boot will not be allocated on the nodes
which were reserved, so the system's performance will potentially differ
from a normal boot in weird and unpredictable ways.

cheers
Mel Gorman Aug. 10, 2016, 7:51 a.m. UTC | #10
On Wed, Aug 10, 2016 at 04:02:47PM +1000, Michael Ellerman wrote:
> > Conceptually it would be cleaner, if expensive, to calculate the real
> > memblock reserves if HASH_EARLY and ditch the dma_reserve, memory_reserve
> > and nr_kernel_pages entirely.
> 
> Why is it expensive? memblock tracks the totals for all memory and
> reserved memory AFAIK, so it should just be a case of subtracting one
> from the other?
> 

I didn't actually check that it tracks the totals. If it does, then the cost
will be negligible in comparison to the total cost of initialising memory.

> > Unfortuantely, aside from the calculation,
> > there is a potential cost due to a smaller hash table that affects everyone,
> > not just ppc64.
> 
> Yeah OK. We could make it an arch hook, or controlled by a CONFIG.
> 
> > However, if the hash table is meant to be sized on the
> > number of available pages then it really should be based on that and not
> > just a made-up number.
> 
> Yeah that seems to make sense.
> 
> The one complication I think is that we may have memory that's marked
> reserved in memblock, but is later freed to the page allocator (eg.
> initrd).
> 

It would be ideal if the amount of reserved memory that is freed later
in the normal case was estimated. If it's a small percentage of memory
then the difference is unlikely to be detectable and avoids ppc64 being
special.
Srikar Dronamraju Aug. 10, 2016, 9:21 a.m. UTC | #11
* Michael Ellerman <mpe@ellerman.id.au> [2016-08-10 16:57:57]:

> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> 
> >> 
> >> > Conceptually it would be cleaner, if expensive, to calculate the real
> >> > memblock reserves if HASH_EARLY and ditch the dma_reserve, memory_reserve
> >> > and nr_kernel_pages entirely.
> >> 
> >> Why is it expensive? memblock tracks the totals for all memory and
> >> reserved memory AFAIK, so it should just be a case of subtracting one
> >> from the other?
> >
> > Are you suggesting that we use something like
> > memblock_phys_mem_size() but one which returns
> > memblock.reserved.total_size ? Maybe a new function like
> > memblock_reserved_mem_size()?
> 
> Yeah, something like that. I'm not sure if it actually needs a function,
> AFAIK you can just look at the structure directly.

For now the memblock structure is only visible inside mm/memblock.c.
Every other access to memblock from outside mm/memblock.c is through an
API.
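
A sketch of such an accessor, mirroring memblock_phys_mem_size(); the
name is the one proposed above and should be read as a proposal, not an
existing function:

/* include/linux/memblock.h */
phys_addr_t memblock_reserved_mem_size(void);

/* mm/memblock.c, next to memblock_phys_mem_size() */
phys_addr_t __init_memblock memblock_reserved_mem_size(void)
{
	return memblock.reserved.total_size;
}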

> >
> > Yes, this is a possibility, for example lets say we want fadump to
> > continue to run instead of rebooting to a new kernel as it does today.
> 
> But that's a bad idea and no one should ever do it.
> 
> For starters all your caches will be undersized, and anything that is
> allocated per-node early in boot will not be allocated on the nodes
> which were reserved, so the system's performance will potentially differ
> from a normal boot in weird and unpredictable ways.
> 

Okay

Patch

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 3cb3b02a..eae547d 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -330,6 +330,7 @@  int __init fadump_reserve_mem(void)
 	}
 	fw_dump.reserve_dump_area_start = base;
 	fw_dump.reserve_dump_area_size = size;
+	set_dma_reserve(size/PAGE_SIZE);
 	return 1;
 }