diff mbox series

[v5,1/3] powerpc: make fadump resilient with memory add/remove events

Message ID 20231029124548.12198-2-sourabhjain@linux.ibm.com (mailing list archive)
State Superseded
Headers show
Series powerpc: make fadump resilient with memory add/remove events | expand

Commit Message

Sourabh Jain Oct. 29, 2023, 12:45 p.m. UTC
Due to changes in memory resources caused by either memory hotplug or
online/offline events, the elfcorehdr, which describes the CPUs and
memory of the crashed kernel to the kernel that collects the dump (known
as second/fadump kernel), becomes outdated. Consequently, attempting
dump collection with an outdated elfcorehdr can lead to failed or
inaccurate dump collection.

Memory hotplug or online/offline events is referred as memory add/remove
events in reset of the commit message.

The current solution to address the aforementioned issue is as follows:
Monitor memory add/remove events in userspace using udev rules, and
re-register fadump whenever there are changes in memory resources. This
leads to the creation of a new elfcorehdr with updated system memory
information.

There are several notable issues associated with re-registering fadump
for every memory add/remove events.

1. Bulk memory add/remove events with udev-based fadump re-registration
   can lead to race conditions and, more importantly, it creates a wide
   window during which fadump is inactive until all memory add/remove
   events are settled.
2. Re-registering fadump for every memory add/remove event is
   inefficient.
3. The memory for elfcorehdr is allocated based on the memblock regions
   available during early boot and remains fixed thereafter. However, if
   elfcorehdr is later recreated with to additional memblock regions,
   its size will increase, potentially leading to memory corruption.

Address the aforementioned challenges by shifting the creation of
elfcorehdr from the first kernel (also referred as the crashed kernel),
where it was created and frequently recreated for every memory
add/remove event, to the fadump kernel. As a result, the elfcorehdr only
needs to be created once, thus eliminating the necessity to re-register
fadump during memory add/remove events.

At present, the first kernel prepares the elfcorehdr and stores it in
the fadump reserved area. Subsequently, it includes the start address of
the elfcorehdr in the elfcorehdr_addr attribute of the fadump header.
The fadump header holds information like elfcorehdr address, crashing
CPU details, etc. The first kernel prepares the fadump header and stores
it in the fadump reserved area. In the event of first kernel crash the
second boots and access the fadump header prepared by first kernel and
do the following in a platform-specific function
[rtas|opal]_fadump_process:

1. Sanity check for fadump header
2. Update CPU notes in elfcorehdr
3. Set the global variable elfcorehdr_addr to the address of the
   fadump header's elfcorehdr. For vmcore module to process it later on.

Along with the above, update the setup_fadump()/fadump.c to create
elfcorehdr in second/fadump kernel.

Section below outlines the information required to create the elfcorehdr
and the changes made to make it available to the fadump kernel if it's
not already.

To create elfcorehdr, the following crashed kernel information is
required: CPU notes, vmcoreinfo, and memory ranges.

At present, the CPU notes are already prepared in the fadump kernel, so
no changes are needed in that regard. The fadump kernel has access to
all crashed kernel memory regions, including boot memory regions that
are relocated by firmware to fadump reserved areas, so no changes for
that either. However, it is necessary to add new members to the fadump
header, i.e., the 'fadump_crash_info_header' structure, in order to pass
the crashed kernel's vmcoreinfo address and size to the fadump kernel.

Table 1 below illustrates kernel's ability to collect dump if either the
first/crashed kernel or the second/fadump kernel does not have the
changes introduced here. Consider the 'old kernel' as the kernel without
this patch, and the 'new kernel' as the kernel with this patch included.

+----------+------------------------+------------------------+-------+
| scenario |  first/crashed kernel  |  second/fadump kernel  |  Dump |
+----------+------------------------+------------------------+-------+
|    1     |       old kernel       |        new kernel      |  Yes  |
+----------+------------------------+------------------------+-------+
|    2     |       new kernel       |        old kernel      |  No   |
+----------+------------------------+------------------------+-------+

			      Table 1

Scenario 1:
-----------
Since the magic number of fadump header is updated, the second kernel
can differentiate the crashed kernel is of type 'new kernel' or
'old kernel' and act accordingly. In this scenario, since the crashed
kernel is of type 'old kernel,' the fadump kernel skips elfcorehdr
creation and uses the one prepared in the first kernel itself to collect
the dump.

Scenario 2:
-----------
Since 'old kernel' as the fadump kernel is NOT capable of processing
fadump header with updated magic number from 'new kernel' hence it
gracefully fails with the below error and dump collection fails in this
scenario.

[    0.007365] rtas fadump: Crash info header is not valid.

Add a version field to the fadump_crash_info_header structure to avoid
the need to change its magic number in the future. Adding a version
field to the fadump header was one of the TODO items listed in
Documentation/powerpc/firmware-assisted-dump.rst.

Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Aditya Gupta <adityag@linux.ibm.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/include/asm/fadump-internal.h   |  24 +-
 arch/powerpc/kernel/fadump.c                 | 361 +++++++++++--------
 arch/powerpc/platforms/powernv/opal-fadump.c |  18 +-
 arch/powerpc/platforms/pseries/rtas-fadump.c |  23 +-
 4 files changed, 242 insertions(+), 184 deletions(-)

Comments

Michael Ellerman Nov. 9, 2023, 12:14 p.m. UTC | #1
Hi Sourabh,

This seems like a good change to the design, but I'm confused by
some things, more below ...

Sourabh Jain <sourabhjain@linux.ibm.com> writes:
>
...
>
> Table 1 below illustrates kernel's ability to collect dump if either the
> first/crashed kernel or the second/fadump kernel does not have the
> changes introduced here. Consider the 'old kernel' as the kernel without
> this patch, and the 'new kernel' as the kernel with this patch included.
>
> +----------+------------------------+------------------------+-------+
> | scenario |  first/crashed kernel  |  second/fadump kernel  |  Dump |
> +----------+------------------------+------------------------+-------+
> |    1     |       old kernel       |        new kernel      |  Yes  |
> +----------+------------------------+------------------------+-------+
> |    2     |       new kernel       |        old kernel      |  No   |
> +----------+------------------------+------------------------+-------+
>
> 			      Table 1
>
> Scenario 1:
> -----------
> Since the magic number of fadump header is updated, the second kernel
> can differentiate the crashed kernel is of type 'new kernel' or
> 'old kernel' and act accordingly. In this scenario, since the crashed
> kernel is of type 'old kernel,' the fadump kernel skips elfcorehdr
> creation and uses the one prepared in the first kernel itself to collect
> the dump.
>
> Scenario 2:
> -----------
> Since 'old kernel' as the fadump kernel is NOT capable of processing
> fadump header with updated magic number from 'new kernel' hence it
> gracefully fails with the below error and dump collection fails in this
> scenario.
>
> [    0.007365] rtas fadump: Crash info header is not valid.
>
> Add a version field to the fadump_crash_info_header structure to avoid
> the need to change its magic number in the future. Adding a version
> field to the fadump header was one of the TODO items listed in
> Documentation/powerpc/firmware-assisted-dump.rst.

This is a good analysis of the issues with different kernel versions,
and seems like an OK trade off, ie. that old kernels can't process dumps
from new kernels.

But do we actually support using different kernel versions for the
crash/dump kernel?

Because AFAICS the fadump_crash_info_header is not safe across kernel
versions, in its current form or with your changes.

> diff --git a/arch/powerpc/include/asm/fadump-internal.h b/arch/powerpc/include/asm/fadump-internal.h
> index 27f9e11eda28..7be3d8894520 100644
> --- a/arch/powerpc/include/asm/fadump-internal.h
> +++ b/arch/powerpc/include/asm/fadump-internal.h
> @@ -42,7 +42,25 @@ static inline u64 fadump_str_to_u64(const char *str)
>  
>  #define FADUMP_CPU_UNKNOWN		(~((u32)0))
>  
> -#define FADUMP_CRASH_INFO_MAGIC		fadump_str_to_u64("FADMPINF")
> +/*
> + * The introduction of new fields in the fadump crash info header has
> + * led to a change in the magic key, from `FADMPINF` to `FADMPSIG`.
> + * This alteration ensures backward compatibility, enabling the kernel
> + * with the updated fadump crash info to handle kernel dumps from older
> + * kernels.
> + *
> + * To prevent the need for further changes to the magic number in the
> + * event of future modifications to the fadump header, a version field
> + * has been introduced to track the fadump crash info header version.
> + *
> + * Historically, there was no connection between the magic number and
> + * the fadump crash info header version. However, moving forward, the
> + * `FADMPINF` magic number in header will be treated as version 0, while
> + * the `FADMPSIG` magic number in header will include a version field to
> + * determine its version.
> + */
> +#define FADUMP_CRASH_INFO_MAGIC		fadump_str_to_u64("FADMPSIG")
> +#define FADUMP_VERSION			1
>  
>  /* fadump crash info structure */
>  struct fadump_crash_info_header {
> @@ -51,6 +69,10 @@ struct fadump_crash_info_header {
> 
struct fadump_crash_info_header {
	u64		magic_number;
	u64		elfcorehdr_addr;

>  	u32		crashing_cpu;
>  	struct pt_regs	regs;
>  	struct cpumask	cpu_mask;
> +	u32		version;
> +	u64		elfcorehdr_size;
> +	u64		vmcoreinfo_raddr;
> +	u64		vmcoreinfo_size;
>  };

The reason I say it's not safe is because pt_regs and especially cpumask
can change size depending on the kernel configuration.

pt_regs probably doesn't change size in practice for common kernel
configurations, but some of the fields are under #ifdef.

cpumask on the other hand is directly controlled by CONFIG_NR_CPUS. So
if the first and second kernel have a different value for NR_CPUS they
will disagree on the size of the struct.

That has presumably worked OK so far because folks tend to use the same, or
similar kernels for the first/second kernel. And also the cpumask is the
last element of the struct, so a disagreement about it size doesn't
affect the location of other fields.

However with your new design the version field will move based on the
cpumask size, which will potentially break detection of the version by
the second kernel.

At least that's how it looks to me, I don't see any checks anywhere that
will save us, or did I miss something?

cheers
Sourabh Jain Nov. 13, 2023, 6:42 a.m. UTC | #2
Hello Michael,


On 09/11/23 17:44, Michael Ellerman wrote:
> Hi Sourabh,
>
> This seems like a good change to the design, but I'm confused by
> some things, more below ...

Thanks.

> Sourabh Jain <sourabhjain@linux.ibm.com> writes:
> ...
>> Table 1 below illustrates kernel's ability to collect dump if either the
>> first/crashed kernel or the second/fadump kernel does not have the
>> changes introduced here. Consider the 'old kernel' as the kernel without
>> this patch, and the 'new kernel' as the kernel with this patch included.
>>
>> +----------+------------------------+------------------------+-------+
>> | scenario |  first/crashed kernel  |  second/fadump kernel  |  Dump |
>> +----------+------------------------+------------------------+-------+
>> |    1     |       old kernel       |        new kernel      |  Yes  |
>> +----------+------------------------+------------------------+-------+
>> |    2     |       new kernel       |        old kernel      |  No   |
>> +----------+------------------------+------------------------+-------+
>>
>> 			      Table 1
>>
>> Scenario 1:
>> -----------
>> Since the magic number of fadump header is updated, the second kernel
>> can differentiate the crashed kernel is of type 'new kernel' or
>> 'old kernel' and act accordingly. In this scenario, since the crashed
>> kernel is of type 'old kernel,' the fadump kernel skips elfcorehdr
>> creation and uses the one prepared in the first kernel itself to collect
>> the dump.
>>
>> Scenario 2:
>> -----------
>> Since 'old kernel' as the fadump kernel is NOT capable of processing
>> fadump header with updated magic number from 'new kernel' hence it
>> gracefully fails with the below error and dump collection fails in this
>> scenario.
>>
>> [    0.007365] rtas fadump: Crash info header is not valid.
>>
>> Add a version field to the fadump_crash_info_header structure to avoid
>> the need to change its magic number in the future. Adding a version
>> field to the fadump header was one of the TODO items listed in
>> Documentation/powerpc/firmware-assisted-dump.rst.
> This is a good analysis of the issues with different kernel versions,
> and seems like an OK trade off, ie. that old kernels can't process dumps
> from new kernels.
>
> But do we actually support using different kernel versions for the
> crash/dump kernel?
>
> Because AFAICS the fadump_crash_info_header is not safe across kernel
> versions, in its current form or with your changes.

Yeah, I was also under the impression that it is not supported, but I 
was not aware
that the size of pt_regs and cpumask can change based on the configuration.

>
>> diff --git a/arch/powerpc/include/asm/fadump-internal.h b/arch/powerpc/include/asm/fadump-internal.h
>> index 27f9e11eda28..7be3d8894520 100644
>> --- a/arch/powerpc/include/asm/fadump-internal.h
>> +++ b/arch/powerpc/include/asm/fadump-internal.h
>> @@ -42,7 +42,25 @@ static inline u64 fadump_str_to_u64(const char *str)
>>   
>>   #define FADUMP_CPU_UNKNOWN		(~((u32)0))
>>   
>> -#define FADUMP_CRASH_INFO_MAGIC		fadump_str_to_u64("FADMPINF")
>> +/*
>> + * The introduction of new fields in the fadump crash info header has
>> + * led to a change in the magic key, from `FADMPINF` to `FADMPSIG`.
>> + * This alteration ensures backward compatibility, enabling the kernel
>> + * with the updated fadump crash info to handle kernel dumps from older
>> + * kernels.
>> + *
>> + * To prevent the need for further changes to the magic number in the
>> + * event of future modifications to the fadump header, a version field
>> + * has been introduced to track the fadump crash info header version.
>> + *
>> + * Historically, there was no connection between the magic number and
>> + * the fadump crash info header version. However, moving forward, the
>> + * `FADMPINF` magic number in header will be treated as version 0, while
>> + * the `FADMPSIG` magic number in header will include a version field to
>> + * determine its version.
>> + */
>> +#define FADUMP_CRASH_INFO_MAGIC		fadump_str_to_u64("FADMPSIG")
>> +#define FADUMP_VERSION			1
>>   
>>   /* fadump crash info structure */
>>   struct fadump_crash_info_header {
>> @@ -51,6 +69,10 @@ struct fadump_crash_info_header {
>>
> struct fadump_crash_info_header {
> 	u64		magic_number;
> 	u64		elfcorehdr_addr;
>
>>   	u32		crashing_cpu;
>>   	struct pt_regs	regs;
>>   	struct cpumask	cpu_mask;
>> +	u32		version;
>> +	u64		elfcorehdr_size;
>> +	u64		vmcoreinfo_raddr;
>> +	u64		vmcoreinfo_size;
>>   };
> The reason I say it's not safe is because pt_regs and especially cpumask
> can change size depending on the kernel configuration.
>
> pt_regs probably doesn't change size in practice for common kernel
> configurations, but some of the fields are under #ifdef.
>
> cpumask on the other hand is directly controlled by CONFIG_NR_CPUS. So
> if the first and second kernel have a different value for NR_CPUS they
> will disagree on the size of the struct.
>
> That has presumably worked OK so far because folks tend to use the same, or
> similar kernels for the first/second kernel. And also the cpumask is the
> last element of the struct, so a disagreement about it size doesn't
> affect the location of other fields.
>
> However with your new design the version field will move based on the
> cpumask size, which will potentially break detection of the version by
> the second kernel.
>
> At least that's how it looks to me, I don't see any checks anywhere that
> will save us, or did I miss something?
I agree with you. If the sizes of pt_regs and cpu_mask are different between
the first/crash kernel and the fadump/second kernel, the dump collection
might not work correctly.

Given that dump capture is not supported across kernel versions, I think 
it is
better to keep new attributes introduced in this patch to struct 
fadump_crash_info_header
before pt_regs and cpumask. For example:

/* fadump crash info structure */
struct fadump_crash_info_header {
     u64        magic_number;
+  u32        version;
+  u64        vmcoreinfo_raddr;
+  u64        vmcoreinfo_size;
     u64        elfcorehdr_addr;
+  u64        elfcorehdr_size;
     u32        crashing_cpu;
     struct pt_regs    regs;
     struct cpumask    cpu_mask;
};

Kernel with this patch included will not process the dump if 
fadump_crash_info_header
holds old magic number.

Something like:

     if (fdh->magic_number == fadump_str_to_u64("FADMPINF")) {
         pr_err("Incompatible fadump header, can't process the dump");
     }

Does this approach look good to you?

Thanks,
Sourabh
Aneesh Kumar K V Nov. 15, 2023, 4:44 a.m. UTC | #3
Sourabh Jain <sourabhjain@linux.ibm.com> writes:

....

> diff --git a/arch/powerpc/include/asm/fadump-internal.h b/arch/powerpc/include/asm/fadump-internal.h
> index 27f9e11eda28..7be3d8894520 100644
> --- a/arch/powerpc/include/asm/fadump-internal.h
> +++ b/arch/powerpc/include/asm/fadump-internal.h
> @@ -42,7 +42,25 @@ static inline u64 fadump_str_to_u64(const char *str)
>  
>  #define FADUMP_CPU_UNKNOWN		(~((u32)0))
>  
> -#define FADUMP_CRASH_INFO_MAGIC		fadump_str_to_u64("FADMPINF")
> +/*
> + * The introduction of new fields in the fadump crash info header has
> + * led to a change in the magic key, from `FADMPINF` to `FADMPSIG`.
> + * This alteration ensures backward compatibility, enabling the kernel
> + * with the updated fadump crash info to handle kernel dumps from older
> + * kernels.
> + *
> + * To prevent the need for further changes to the magic number in the
> + * event of future modifications to the fadump header, a version field
> + * has been introduced to track the fadump crash info header version.
> + *
> + * Historically, there was no connection between the magic number and
> + * the fadump crash info header version. However, moving forward, the
> + * `FADMPINF` magic number in header will be treated as version 0, while
> + * the `FADMPSIG` magic number in header will include a version field to
> + * determine its version.
> + */
> +#define FADUMP_CRASH_INFO_MAGIC		fadump_str_to_u64("FADMPSIG")
> +#define FADUMP_VERSION			1
>

Can we keep the old magic details as

#define FADUMP_CRASH_INFO_MAGIC_OLD		fadump_str_to_u64("FADMPINF")
#define FADUMP_CRASH_INFO_MAGIC		fadump_str_to_u64("FADMPSIG")

Also considering the struct need not be backward compatible, can we just
do

struct fadump_crash_info_header {
	u64		magic_number;
	u32		crashing_cpu;
	u64		elfcorehdr_addr;
	u64		elfcorehdr_size;
	u64		vmcoreinfo_raddr;
	u64		vmcoreinfo_size;
	struct pt_regs	regs;
	struct cpumask	cpu_mask;
};

static inline bool fadump_compatible(struct fadump_crash_info_header
*fdh)
{
	return (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC)
}

and fail fadump if we find it not compatible?

-aneesh
Sourabh Jain Nov. 17, 2023, 4:33 a.m. UTC | #4
Hi Aneesh,

Thanks for reviewing the patch.

On 15/11/23 10:14, Aneesh Kumar K.V wrote:
> Sourabh Jain<sourabhjain@linux.ibm.com>  writes:
>
> ....
>
>> diff --git a/arch/powerpc/include/asm/fadump-internal.h b/arch/powerpc/include/asm/fadump-internal.h
>> index 27f9e11eda28..7be3d8894520 100644
>> --- a/arch/powerpc/include/asm/fadump-internal.h
>> +++ b/arch/powerpc/include/asm/fadump-internal.h
>> @@ -42,7 +42,25 @@ static inline u64 fadump_str_to_u64(const char *str)
>>   
>>   #define FADUMP_CPU_UNKNOWN		(~((u32)0))
>>   
>> -#define FADUMP_CRASH_INFO_MAGIC		fadump_str_to_u64("FADMPINF")
>> +/*
>> + * The introduction of new fields in the fadump crash info header has
>> + * led to a change in the magic key, from `FADMPINF` to `FADMPSIG`.
>> + * This alteration ensures backward compatibility, enabling the kernel
>> + * with the updated fadump crash info to handle kernel dumps from older
>> + * kernels.
>> + *
>> + * To prevent the need for further changes to the magic number in the
>> + * event of future modifications to the fadump header, a version field
>> + * has been introduced to track the fadump crash info header version.
>> + *
>> + * Historically, there was no connection between the magic number and
>> + * the fadump crash info header version. However, moving forward, the
>> + * `FADMPINF` magic number in header will be treated as version 0, while
>> + * the `FADMPSIG` magic number in header will include a version field to
>> + * determine its version.
>> + */
>> +#define FADUMP_CRASH_INFO_MAGIC		fadump_str_to_u64("FADMPSIG")
>> +#define FADUMP_VERSION			1
>>
> Can we keep the old magic details as
>
> #define FADUMP_CRASH_INFO_MAGIC_OLD		fadump_str_to_u64("FADMPINF")
> #define FADUMP_CRASH_INFO_MAGIC		fadump_str_to_u64("FADMPSIG")

Sure.

> Also considering the struct need not be backward compatible, can we just
> do
>
> struct fadump_crash_info_header {
> 	u64		magic_number;
> 	u32		crashing_cpu;
> 	u64		elfcorehdr_addr;
> 	u64		elfcorehdr_size;
> 	u64		vmcoreinfo_raddr;
> 	u64		vmcoreinfo_size;
> 	struct pt_regs	regs;
> 	struct cpumask	cpu_mask;
> };
> static inline bool fadump_compatible(struct fadump_crash_info_header
> *fdh)
> {
> 	return (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC)
> }
>
> and fail fadump if we find it not compatible?

Agree that it is unsafe to collect a dump with an incompatible fadump 
crash info header.

Given that I am updating the fadump crash info header, we can make a few 
arrangements
like adding a size filed for the dynamic size attribute like pt_regs and 
cpumask to ensure
better compatibility in the future.

Additionally, let's introduce a version field to the fadump crash info 
header to avoid changing
the magic number in the future.

Thanks,
Sourabh
Aneesh Kumar K V Nov. 17, 2023, 5:31 a.m. UTC | #5
On 11/17/23 10:03 AM, Sourabh Jain wrote:
> Hi Aneesh,
> 
> Thanks for reviewing the patch.
> 
> On 15/11/23 10:14, Aneesh Kumar K.V wrote:
>> Sourabh Jain<sourabhjain@linux.ibm.com>  writes:
>>
>> ....
>>
>>> diff --git a/arch/powerpc/include/asm/fadump-internal.h b/arch/powerpc/include/asm/fadump-internal.h
>>> index 27f9e11eda28..7be3d8894520 100644
>>> --- a/arch/powerpc/include/asm/fadump-internal.h
>>> +++ b/arch/powerpc/include/asm/fadump-internal.h
>>> @@ -42,7 +42,25 @@ static inline u64 fadump_str_to_u64(const char *str)
>>>     #define FADUMP_CPU_UNKNOWN        (~((u32)0))
>>>   -#define FADUMP_CRASH_INFO_MAGIC        fadump_str_to_u64("FADMPINF")
>>> +/*
>>> + * The introduction of new fields in the fadump crash info header has
>>> + * led to a change in the magic key, from `FADMPINF` to `FADMPSIG`.
>>> + * This alteration ensures backward compatibility, enabling the kernel
>>> + * with the updated fadump crash info to handle kernel dumps from older
>>> + * kernels.
>>> + *
>>> + * To prevent the need for further changes to the magic number in the
>>> + * event of future modifications to the fadump header, a version field
>>> + * has been introduced to track the fadump crash info header version.
>>> + *
>>> + * Historically, there was no connection between the magic number and
>>> + * the fadump crash info header version. However, moving forward, the
>>> + * `FADMPINF` magic number in header will be treated as version 0, while
>>> + * the `FADMPSIG` magic number in header will include a version field to
>>> + * determine its version.
>>> + */
>>> +#define FADUMP_CRASH_INFO_MAGIC        fadump_str_to_u64("FADMPSIG")
>>> +#define FADUMP_VERSION            1
>>>
>> Can we keep the old magic details as
>>
>> #define FADUMP_CRASH_INFO_MAGIC_OLD        fadump_str_to_u64("FADMPINF")
>> #define FADUMP_CRASH_INFO_MAGIC        fadump_str_to_u64("FADMPSIG")
> 
> Sure.
> 
>> Also considering the struct need not be backward compatible, can we just
>> do
>>
>> struct fadump_crash_info_header {
>>     u64        magic_number;
>>     u32        crashing_cpu;
>>     u64        elfcorehdr_addr;
>>     u64        elfcorehdr_size;
>>     u64        vmcoreinfo_raddr;
>>     u64        vmcoreinfo_size;
>>     struct pt_regs    regs;
>>     struct cpumask    cpu_mask;
>> };
>> static inline bool fadump_compatible(struct fadump_crash_info_header
>> *fdh)
>> {
>>     return (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC)
>> }
>>
>> and fail fadump if we find it not compatible?
> 
> Agree that it is unsafe to collect a dump with an incompatible fadump crash info header.
> 
> Given that I am updating the fadump crash info header, we can make a few arrangements
> like adding a size filed for the dynamic size attribute like pt_regs and cpumask to ensure
> better compatibility in the future.
> 
> Additionally, let's introduce a version field to the fadump crash info header to avoid changing
> the magic number in the future.
>

I am not sure whether we need to add all the complexity to enable supporting different fadump kernel
version. Is that even a possible use case with fadump? Can't we always assume that with fadump the
crash kernel and fadump kernel will be same version? if yes we can simply fail with a magic number
mismatch because that indicates an user config error? 

-aneesh
Hari Bathini Nov. 17, 2023, 6:14 a.m. UTC | #6
On 17/11/23 11:01 am, Aneesh Kumar K V wrote:
> On 11/17/23 10:03 AM, Sourabh Jain wrote:
>> Hi Aneesh,
>>
>> Thanks for reviewing the patch.
>>
>> On 15/11/23 10:14, Aneesh Kumar K.V wrote:
>>> Sourabh Jain<sourabhjain@linux.ibm.com>  writes:
>>>
>>> ....
>>>
>>>> diff --git a/arch/powerpc/include/asm/fadump-internal.h b/arch/powerpc/include/asm/fadump-internal.h
>>>> index 27f9e11eda28..7be3d8894520 100644
>>>> --- a/arch/powerpc/include/asm/fadump-internal.h
>>>> +++ b/arch/powerpc/include/asm/fadump-internal.h
>>>> @@ -42,7 +42,25 @@ static inline u64 fadump_str_to_u64(const char *str)
>>>>      #define FADUMP_CPU_UNKNOWN        (~((u32)0))
>>>>    -#define FADUMP_CRASH_INFO_MAGIC        fadump_str_to_u64("FADMPINF")
>>>> +/*
>>>> + * The introduction of new fields in the fadump crash info header has
>>>> + * led to a change in the magic key, from `FADMPINF` to `FADMPSIG`.
>>>> + * This alteration ensures backward compatibility, enabling the kernel
>>>> + * with the updated fadump crash info to handle kernel dumps from older
>>>> + * kernels.
>>>> + *
>>>> + * To prevent the need for further changes to the magic number in the
>>>> + * event of future modifications to the fadump header, a version field
>>>> + * has been introduced to track the fadump crash info header version.
>>>> + *
>>>> + * Historically, there was no connection between the magic number and
>>>> + * the fadump crash info header version. However, moving forward, the
>>>> + * `FADMPINF` magic number in header will be treated as version 0, while
>>>> + * the `FADMPSIG` magic number in header will include a version field to
>>>> + * determine its version.
>>>> + */
>>>> +#define FADUMP_CRASH_INFO_MAGIC        fadump_str_to_u64("FADMPSIG")
>>>> +#define FADUMP_VERSION            1
>>>>
>>> Can we keep the old magic details as
>>>
>>> #define FADUMP_CRASH_INFO_MAGIC_OLD        fadump_str_to_u64("FADMPINF")
>>> #define FADUMP_CRASH_INFO_MAGIC        fadump_str_to_u64("FADMPSIG")
>>
>> Sure.
>>
>>> Also considering the struct need not be backward compatible, can we just
>>> do
>>>
>>> struct fadump_crash_info_header {
>>>      u64        magic_number;
>>>      u32        crashing_cpu;
>>>      u64        elfcorehdr_addr;
>>>      u64        elfcorehdr_size;
>>>      u64        vmcoreinfo_raddr;
>>>      u64        vmcoreinfo_size;
>>>      struct pt_regs    regs;
>>>      struct cpumask    cpu_mask;
>>> };
>>> static inline bool fadump_compatible(struct fadump_crash_info_header
>>> *fdh)
>>> {
>>>      return (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC)
>>> }
>>>
>>> and fail fadump if we find it not compatible?
>>
>> Agree that it is unsafe to collect a dump with an incompatible fadump crash info header.
>>
>> Given that I am updating the fadump crash info header, we can make a few arrangements
>> like adding a size filed for the dynamic size attribute like pt_regs and cpumask to ensure
>> better compatibility in the future.
>>
>> Additionally, let's introduce a version field to the fadump crash info header to avoid changing
>> the magic number in the future.
>>
> 
> I am not sure whether we need to add all the complexity to enable supporting different fadump kernel
> version. Is that even a possible use case with fadump? Can't we always assume that with fadump the
> crash kernel and fadump kernel will be same version? if yes we can simply fail with a magic number
> mismatch because that indicates an user config error?

If we decide not to support different kernel versions for production
kernel and capture kernel, We can make that implicit by adding kernel
version info of production kernel in the header and bailing out if
there is kernel version mismatch as magic could still match for two
different kernel versions.

I would personally prefer something like the below though:

	struct fadump_crash_info_header {
		u64		magic_number;
		u32		version
		u32		crashing_cpu;
		u64		elfcorehdr_addr;
		u64		elfcorehdr_size;
		u64		vmcoreinfo_raddr;
		u64		vmcoreinfo_size;
		u8		kernel_version[];
		u32		pt_regs_sz;
		struct pt_regs	regs;
		u32		cpu_mask_sz;
		struct cpumask	cpu_mask;
	};

	if (magic_number != new_magic)
		goto err;	/* Error out */

	if (kernel_version != capture_kernel_version)
	{
		if (cpu_mask_sz == sizeof(struct pt_regs) && cpu_mask_sz == 
sizeof(struct cpumask))
			/*
			 * Warn about the kernel version mismatch and how data can be different
			 * across kernel versions and proceed anyway!
			 */
		else
			goto err;	/* Error out */
	}

This ensures we warn and proceed in cases where it is less likely to
have issues capturing kernel dump. This helps in dev environment where
we are trying to debug an early boot crash - in which case capture
kernel can't be the same kernel as it would likely hit the same problem
while booting..

Thanks
Hari
Michael Ellerman Nov. 22, 2023, 5:17 a.m. UTC | #7
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
...
>
> I am not sure whether we need to add all the complexity to enable supporting different fadump kernel
> version. Is that even a possible use case with fadump? Can't we always assume that with fadump the
> crash kernel and fadump kernel will be same version?

How sure are we of that?

Don't we go through grub when we boot into the 2nd kernel. And so
couldn't it choose to boot a different kernel, for whatever reason.

I don't think we need to support different pt_reg / cpumask sizes, but
requiring the exact same kernel version is too strict I think.

But maybe I'm wrong. Would be good to hear what distro folks think.

cheers
Sourabh Jain Nov. 22, 2023, 10:35 a.m. UTC | #8
Hello Michael,


On 22/11/23 10:47, Michael Ellerman wrote:
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> ...
>> I am not sure whether we need to add all the complexity to enable supporting different fadump kernel
>> version. Is that even a possible use case with fadump? Can't we always assume that with fadump the
>> crash kernel and fadump kernel will be same version?
> How sure are we of that?
>
> Don't we go through grub when we boot into the 2nd kernel. And so
> couldn't it choose to boot a different kernel, for whatever reason.
>
> I don't think we need to support different pt_reg / cpumask sizes, but
> requiring the exact same kernel version is too strict I think.
Agree.
>
> But maybe I'm wrong. Would be good to hear what distro folks think.

How about checking fadump crash info header compatibility in the 
following way?

static bool is_fadump_header_compatible(struct fadump_crash_info_header 
*fdh)
{
     if (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC_OLD) {
         pr_err("Old magic number, can't process the dump.");
         return false;
     }

     if (fdh->magic_number != FADUMP_CRASH_INFO_MAGIC) {
         pr_err("Fadump header is corrupted.");
         return false;
     }

     /*
      * If the kernel version of the first/crashed kernel and the 
second/fadump
      * kernel is not same, then only collect the dump if the size of all
      * non-primitive type members of the fadump header is the same 
across kernels.
      */
     if (strcmp(fdh->kernel_version, init_uts_ns.name.release)) {
         if (fdh->pt_regs_sz != sizeof(struct pt_regs) || 
fdh->cpu_mask_sz != sizeof(struct cpumask)) {
             pr_err("Fadump header size mismatch.\n")
             return false;
         } else
             pr_warn("Kernel version mismatch; dump data is unreliable.\n");
     }

     return true;
}

And the new fadump crash info header will be: As suggested by Hari.

/* fadump crash info structure */
struct fadump_crash_info_header {
     u64        magic_number;
+  u32        version;
     u32        crashing_cpu;
     u64        elfcorehdr_addr;
+  u64        elfcorehdr_size;
+  u64        vmcoreinfo_raddr;
+  u64        vmcoreinfo_size;
+  u8          kernel_version[__NEW_UTS_LEN + 1];
+  u32        pt_regs_sz;
     struct pt_regs    regs;
+  u32        cpu_mask_sz;
     struct cpumask    cpu_mask;
};

Thanks,
Sourabh Jain
Aneesh Kumar K.V (IBM) Nov. 22, 2023, 12:20 p.m. UTC | #9
On 11/22/23 4:05 PM, Sourabh Jain wrote:
> Hello Michael,
> 
> 
> On 22/11/23 10:47, Michael Ellerman wrote:
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>> ...
>>> I am not sure whether we need to add all the complexity to enable supporting different fadump kernel
>>> version. Is that even a possible use case with fadump? Can't we always assume that with fadump the
>>> crash kernel and fadump kernel will be same version?
>> How sure are we of that?
>>
>> Don't we go through grub when we boot into the 2nd kernel. And so
>> couldn't it choose to boot a different kernel, for whatever reason.
>>
>> I don't think we need to support different pt_reg / cpumask sizes, but
>> requiring the exact same kernel version is too strict I think.
> Agree.
>>
>> But maybe I'm wrong. Would be good to hear what distro folks think.
> 
> How about checking fadump crash info header compatibility in the following way?
> 
> static bool is_fadump_header_compatible(struct fadump_crash_info_header *fdh)
> {
>     if (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC_OLD) {
>         pr_err("Old magic number, can't process the dump.");
>         return false;
>     }
> 
>     if (fdh->magic_number != FADUMP_CRASH_INFO_MAGIC) {
>         pr_err("Fadump header is corrupted.");
>         return false;
>     }
> 
>     /*
>      * If the kernel version of the first/crashed kernel and the second/fadump
>      * kernel is not same, then only collect the dump if the size of all
>      * non-primitive type members of the fadump header is the same across kernels.
>      */
>     if (strcmp(fdh->kernel_version, init_uts_ns.name.release)) {
>         if (fdh->pt_regs_sz != sizeof(struct pt_regs) || fdh->cpu_mask_sz != sizeof(struct cpumask)) {
>             pr_err("Fadump header size mismatch.\n")
>             return false;
>         } else
>             pr_warn("Kernel version mismatch; dump data is unreliable.\n");
>     }
> 

You also want a fdh->version check? I am still not sure you need the kernel_version check. IMHO that is too strict
and can hit that check where you have kexec kernel which is not installed in /boot crashing? 


>     return true;
> }
> 
> And the new fadump crash info header will be: As suggested by Hari.
> 
> /* fadump crash info structure */
> struct fadump_crash_info_header {
>     u64        magic_number;
> +  u32        version;
>     u32        crashing_cpu;
>     u64        elfcorehdr_addr;
> +  u64        elfcorehdr_size;
> +  u64        vmcoreinfo_raddr;
> +  u64        vmcoreinfo_size;
> +  u8          kernel_version[__NEW_UTS_LEN + 1];
> +  u32        pt_regs_sz;
>     struct pt_regs    regs;
> +  u32        cpu_mask_sz;
>     struct cpumask    cpu_mask;
> };
> 
> Thanks,
> Sourabh Jain
Michael Ellerman Nov. 22, 2023, 12:52 p.m. UTC | #10
Sourabh Jain <sourabhjain@linux.ibm.com> writes:
> On 22/11/23 10:47, Michael Ellerman wrote:
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>> ...
>>> I am not sure whether we need to add all the complexity to enable supporting different fadump kernel
>>> version. Is that even a possible use case with fadump? Can't we always assume that with fadump the
>>> crash kernel and fadump kernel will be same version?
>> How sure are we of that?
>>
>> Don't we go through grub when we boot into the 2nd kernel. And so
>> couldn't it choose to boot a different kernel, for whatever reason.
>>
>> I don't think we need to support different pt_reg / cpumask sizes, but
>> requiring the exact same kernel version is too strict I think.
> Agree.
>>
>> But maybe I'm wrong. Would be good to hear what distro folks think.
>
> How about checking fadump crash info header compatibility in the 
> following way?
>
> static bool is_fadump_header_compatible(struct fadump_crash_info_header 
> *fdh)
> {
>      if (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC_OLD) {
>          pr_err("Old magic number, can't process the dump.");
>          return false;
>      }
>
>      if (fdh->magic_number != FADUMP_CRASH_INFO_MAGIC) {
>          pr_err("Fadump header is corrupted.");
>          return false;
>      }
>
>      /*
>       * If the kernel version of the first/crashed kernel and the 
> second/fadump
>       * kernel is not same, then only collect the dump if the size of all
>       * non-primitive type members of the fadump header is the same 
> across kernels.
>       */
>      if (strcmp(fdh->kernel_version, init_uts_ns.name.release)) {
 
I don't think the kernel version check is necessary?

>          if (fdh->pt_regs_sz != sizeof(struct pt_regs) || fdh->cpu_mask_sz != sizeof(struct cpumask)) {
>              pr_err("Fadump header size mismatch.\n")
>              return false;

Yeah I think that works.

>          } else
>              pr_warn("Kernel version mismatch; dump data is unreliable.\n");
>      }
>
>      return true;
> }
>
> And the new fadump crash info header will be: As suggested by Hari.
>
> /* fadump crash info structure */
> struct fadump_crash_info_header {
>      u64        magic_number;
> +  u32        version;
 
Do we need version? We're effectively using the magic_number as a
version already. Having an actual version would allow us to make
backward compatible changes in future, but it's not clear we'll need or
want to do that.

>      u32        crashing_cpu;
>      u64        elfcorehdr_addr;
> +  u64        elfcorehdr_size;
> +  u64        vmcoreinfo_raddr;
> +  u64        vmcoreinfo_size;
> +  u8          kernel_version[__NEW_UTS_LEN + 1];
> +  u32        pt_regs_sz;
>      struct pt_regs    regs;
> +  u32        cpu_mask_sz;
 
Typically you would put all the size fields before the variable sized
fields, because otherwise the later size fields may not be where you
expect them to be. But because we're bailing out if any of the sizes
don't match it doesn't actually matter.

>      struct cpumask    cpu_mask;
> };

The other issue is endian. I assume we're just declaring that the
1st/2nd kernel must be the same endian? We should still make that
explicit though.

To make it clearer that this struct is somewhat an ABI, I think we
should move the definition into arch/powerpc/include/uapi/asm/fadump.h

We don't expect userspace to actually use the header, but it will
hopefully remind everyone that the struct needs to be changed with care :)

cheers
Sourabh Jain Nov. 24, 2023, 5:20 a.m. UTC | #11
Hello Aneesh,

On 22/11/23 17:50, Aneesh Kumar K V wrote:
> On 11/22/23 4:05 PM, Sourabh Jain wrote:
>> Hello Michael,
>>
>>
>> On 22/11/23 10:47, Michael Ellerman wrote:
>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>> ...
>>>> I am not sure whether we need to add all the complexity to enable supporting different fadump kernel
>>>> version. Is that even a possible use case with fadump? Can't we always assume that with fadump the
>>>> crash kernel and fadump kernel will be same version?
>>> How sure are we of that?
>>>
>>> Don't we go through grub when we boot into the 2nd kernel. And so
>>> couldn't it choose to boot a different kernel, for whatever reason.
>>>
>>> I don't think we need to support different pt_reg / cpumask sizes, but
>>> requiring the exact same kernel version is too strict I think.
>> Agree.
>>> But maybe I'm wrong. Would be good to hear what distro folks think.
>> How about checking fadump crash info header compatibility in the following way?
>>
>> static bool is_fadump_header_compatible(struct fadump_crash_info_header *fdh)
>> {
>>      if (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC_OLD) {
>>          pr_err("Old magic number, can't process the dump.");
>>          return false;
>>      }
>>
>>      if (fdh->magic_number != FADUMP_CRASH_INFO_MAGIC) {
>>          pr_err("Fadump header is corrupted.");
>>          return false;
>>      }
>>
>>      /*
>>       * If the kernel version of the first/crashed kernel and the second/fadump
>>       * kernel is not same, then only collect the dump if the size of all
>>       * non-primitive type members of the fadump header is the same across kernels.
>>       */
>>      if (strcmp(fdh->kernel_version, init_uts_ns.name.release)) {
>>          if (fdh->pt_regs_sz != sizeof(struct pt_regs) || fdh->cpu_mask_sz != sizeof(struct cpumask)) {
>>              pr_err("Fadump header size mismatch.\n")
>>              return false;
>>          } else
>>              pr_warn("Kernel version mismatch; dump data is unreliable.\n");
>>      }
>>
> You also want a fdh->version check?

Even though we don't have any action against an fdh->version right now, 
I think I should
check the fadump header version. Currently, if the version doesn't 
match, it means the
header is corrupted.

> I am still not sure you need the kernel_version check. IMHO that is too strict
> and can hit that check where you have kexec kernel which is not installed in /boot crashing?

If the kernel versions mismatch, we still collect the dump if the 
`pt_regs` and `cpu_mask`
sizes are the same across the kernels. The kernel version check is just 
to warn users that
the collected dump may be unreliable.

Should I remove the kernel_version filed from fadump crash info header 
and remove the
the kernel version check while processing the kernel dump?

Thanks,
Sourabh Jain
Sourabh Jain Nov. 24, 2023, 7:21 a.m. UTC | #12
Hello Michael,

On 22/11/23 18:22, Michael Ellerman wrote:
> Sourabh Jain <sourabhjain@linux.ibm.com> writes:
>> On 22/11/23 10:47, Michael Ellerman wrote:
>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>> ...
>>>> I am not sure whether we need to add all the complexity to enable supporting different fadump kernel
>>>> version. Is that even a possible use case with fadump? Can't we always assume that with fadump the
>>>> crash kernel and fadump kernel will be same version?
>>> How sure are we of that?
>>>
>>> Don't we go through grub when we boot into the 2nd kernel. And so
>>> couldn't it choose to boot a different kernel, for whatever reason.
>>>
>>> I don't think we need to support different pt_reg / cpumask sizes, but
>>> requiring the exact same kernel version is too strict I think.
>> Agree.
>>> But maybe I'm wrong. Would be good to hear what distro folks think.
>> How about checking fadump crash info header compatibility in the
>> following way?
>>
>> static bool is_fadump_header_compatible(struct fadump_crash_info_header
>> *fdh)
>> {
>>       if (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC_OLD) {
>>           pr_err("Old magic number, can't process the dump.");
>>           return false;
>>       }
>>
>>       if (fdh->magic_number != FADUMP_CRASH_INFO_MAGIC) {
>>           pr_err("Fadump header is corrupted.");
>>           return false;
>>       }
>>
>>       /*
>>        * If the kernel version of the first/crashed kernel and the
>> second/fadump
>>        * kernel is not same, then only collect the dump if the size of all
>>        * non-primitive type members of the fadump header is the same
>> across kernels.
>>        */
>>       if (strcmp(fdh->kernel_version, init_uts_ns.name.release)) {
>   
> I don't think the kernel version check is necessary?

I didn't find a place where pt_regs members are accessed to take
a decision in fadump kernel. we just copy the pt_regs in fadump kernel.

So I think as long as size is same across kernels, we are good.

>
>>           if (fdh->pt_regs_sz != sizeof(struct pt_regs) || fdh->cpu_mask_sz != sizeof(struct cpumask)) {
>>               pr_err("Fadump header size mismatch.\n")
>>               return false;
> Yeah I think that works.
>
>>           } else
>>               pr_warn("Kernel version mismatch; dump data is unreliable.\n");
>>       }
>>
>>       return true;
>> }
>>
>> And the new fadump crash info header will be: As suggested by Hari.
>>
>> /* fadump crash info structure */
>> struct fadump_crash_info_header {
>>       u64        magic_number;
>> +  u32        version;
>   
> Do we need version? We're effectively using the magic_number as a
> version already. Having an actual version would allow us to make
> backward compatible changes in future, but it's not clear we'll need or
> want to do that.
Agree that currently version field is not used but I added a version 
field to
make the fadump header structure compatible with future changes without
changing the magic number.

I will add a comment on how version field should be utilized if one 
changes fadump
header in future.

>
>>       u32        crashing_cpu;
>>       u64        elfcorehdr_addr;
>> +  u64        elfcorehdr_size;
>> +  u64        vmcoreinfo_raddr;
>> +  u64        vmcoreinfo_size;
>> +  u8          kernel_version[__NEW_UTS_LEN + 1];
>> +  u32        pt_regs_sz;
>>       struct pt_regs    regs;
>> +  u32        cpu_mask_sz;
>   
> Typically you would put all the size fields before the variable sized
> fields, because otherwise the later size fields may not be where you
> expect them to be. But because we're bailing out if any of the sizes
> don't match it doesn't actually matter.

Yeah, but I will reorganize fadump header and put the size fields before the
variable sized fields.

>
>>       struct cpumask    cpu_mask;
>> };
> The other issue is endian. I assume we're just declaring that the
> 1st/2nd kernel must be the same endian? We should still make that
> explicit though.

A comment is fine or should we add a explicit check and error out
with relevant error message if endianness is not same across the
kernels?

Something like:

if (fdh->magic_number != FADUMP_CRASH_INFO_MAGIC) {
     if (fdh->magic_number == swab64(FADUMP_CRASH_INFO_MAGIC)) {
         pr_err("Endianness mismatch");
     } else {
         pr_err("Fadump header is corrupted.");
     }
     return false;
}


>
> To make it clearer that this struct is somewhat an ABI, I think we
> should move the definition into arch/powerpc/include/uapi/asm/fadump.h
Sure
>
> We don't expect userspace to actually use the header, but it will
> hopefully remind everyone that the struct needs to be changed with care :)
Agree

Thanks,
Sourabh Jain
diff mbox series

Patch

diff --git a/arch/powerpc/include/asm/fadump-internal.h b/arch/powerpc/include/asm/fadump-internal.h
index 27f9e11eda28..7be3d8894520 100644
--- a/arch/powerpc/include/asm/fadump-internal.h
+++ b/arch/powerpc/include/asm/fadump-internal.h
@@ -42,7 +42,25 @@  static inline u64 fadump_str_to_u64(const char *str)
 
 #define FADUMP_CPU_UNKNOWN		(~((u32)0))
 
-#define FADUMP_CRASH_INFO_MAGIC		fadump_str_to_u64("FADMPINF")
+/*
+ * The introduction of new fields in the fadump crash info header has
+ * led to a change in the magic key, from `FADMPINF` to `FADMPSIG`.
+ * This alteration ensures backward compatibility, enabling the kernel
+ * with the updated fadump crash info to handle kernel dumps from older
+ * kernels.
+ *
+ * To prevent the need for further changes to the magic number in the
+ * event of future modifications to the fadump header, a version field
+ * has been introduced to track the fadump crash info header version.
+ *
+ * Historically, there was no connection between the magic number and
+ * the fadump crash info header version. However, moving forward, the
+ * `FADMPINF` magic number in header will be treated as version 0, while
+ * the `FADMPSIG` magic number in header will include a version field to
+ * determine its version.
+ */
+#define FADUMP_CRASH_INFO_MAGIC		fadump_str_to_u64("FADMPSIG")
+#define FADUMP_VERSION			1
 
 /* fadump crash info structure */
 struct fadump_crash_info_header {
@@ -51,6 +69,10 @@  struct fadump_crash_info_header {
 	u32		crashing_cpu;
 	struct pt_regs	regs;
 	struct cpumask	cpu_mask;
+	u32		version;
+	u64		elfcorehdr_size;
+	u64		vmcoreinfo_raddr;
+	u64		vmcoreinfo_size;
 };
 
 struct fadump_memory_range {
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 3ff2da7b120b..8fd90afe886e 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -53,8 +53,6 @@  static struct kobject *fadump_kobj;
 static atomic_t cpus_in_fadump;
 static DEFINE_MUTEX(fadump_mutex);
 
-static struct fadump_mrange_info crash_mrange_info = { "crash", NULL, 0, 0, 0, false };
-
 #define RESERVED_RNGS_SZ	16384 /* 16K - 128 entries */
 #define RESERVED_RNGS_CNT	(RESERVED_RNGS_SZ / \
 				 sizeof(struct fadump_memory_range))
@@ -373,12 +371,6 @@  static unsigned long __init get_fadump_area_size(void)
 	size = PAGE_ALIGN(size);
 	size += fw_dump.boot_memory_size;
 	size += sizeof(struct fadump_crash_info_header);
-	size += sizeof(struct elfhdr); /* ELF core header.*/
-	size += sizeof(struct elf_phdr); /* place holder for cpu notes */
-	/* Program headers for crash memory regions. */
-	size += sizeof(struct elf_phdr) * (memblock_num_regions(memory) + 2);
-
-	size = PAGE_ALIGN(size);
 
 	/* This is to hold kernel metadata on platforms that support it */
 	size += (fw_dump.ops->fadump_get_metadata_size ?
@@ -931,36 +923,6 @@  static inline int fadump_add_mem_range(struct fadump_mrange_info *mrange_info,
 	return 0;
 }
 
-static int fadump_exclude_reserved_area(u64 start, u64 end)
-{
-	u64 ra_start, ra_end;
-	int ret = 0;
-
-	ra_start = fw_dump.reserve_dump_area_start;
-	ra_end = ra_start + fw_dump.reserve_dump_area_size;
-
-	if ((ra_start < end) && (ra_end > start)) {
-		if ((start < ra_start) && (end > ra_end)) {
-			ret = fadump_add_mem_range(&crash_mrange_info,
-						   start, ra_start);
-			if (ret)
-				return ret;
-
-			ret = fadump_add_mem_range(&crash_mrange_info,
-						   ra_end, end);
-		} else if (start < ra_start) {
-			ret = fadump_add_mem_range(&crash_mrange_info,
-						   start, ra_start);
-		} else if (ra_end < end) {
-			ret = fadump_add_mem_range(&crash_mrange_info,
-						   ra_end, end);
-		}
-	} else
-		ret = fadump_add_mem_range(&crash_mrange_info, start, end);
-
-	return ret;
-}
-
 static int fadump_init_elfcore_header(char *bufp)
 {
 	struct elfhdr *elf;
@@ -997,52 +959,6 @@  static int fadump_init_elfcore_header(char *bufp)
 	return 0;
 }
 
-/*
- * Traverse through memblock structure and setup crash memory ranges. These
- * ranges will be used create PT_LOAD program headers in elfcore header.
- */
-static int fadump_setup_crash_memory_ranges(void)
-{
-	u64 i, start, end;
-	int ret;
-
-	pr_debug("Setup crash memory ranges.\n");
-	crash_mrange_info.mem_range_cnt = 0;
-
-	/*
-	 * Boot memory region(s) registered with firmware are moved to
-	 * different location at the time of crash. Create separate program
-	 * header(s) for this memory chunk(s) with the correct offset.
-	 */
-	for (i = 0; i < fw_dump.boot_mem_regs_cnt; i++) {
-		start = fw_dump.boot_mem_addr[i];
-		end = start + fw_dump.boot_mem_sz[i];
-		ret = fadump_add_mem_range(&crash_mrange_info, start, end);
-		if (ret)
-			return ret;
-	}
-
-	for_each_mem_range(i, &start, &end) {
-		/*
-		 * skip the memory chunk that is already added
-		 * (0 through boot_memory_top).
-		 */
-		if (start < fw_dump.boot_mem_top) {
-			if (end > fw_dump.boot_mem_top)
-				start = fw_dump.boot_mem_top;
-			else
-				continue;
-		}
-
-		/* add this range excluding the reserved dump area. */
-		ret = fadump_exclude_reserved_area(start, end);
-		if (ret)
-			return ret;
-	}
-
-	return 0;
-}
-
 /*
  * If the given physical address falls within the boot memory region then
  * return the relocated address that points to the dump region reserved
@@ -1073,13 +989,28 @@  static inline unsigned long fadump_relocate(unsigned long paddr)
 	return raddr;
 }
 
-static int fadump_create_elfcore_headers(char *bufp)
+static void populate_elf_pt_load(struct elf_phdr *phdr, u64 start,
+			     u64 size, unsigned long long offset)
 {
-	unsigned long long raddr, offset;
-	struct elf_phdr *phdr;
+	phdr->p_align	= 0;
+	phdr->p_memsz	= size;
+	phdr->p_filesz	= size;
+	phdr->p_paddr	= start;
+	phdr->p_offset	= offset;
+	phdr->p_type	= PT_LOAD;
+	phdr->p_flags	= PF_R|PF_W|PF_X;
+	phdr->p_vaddr	= (unsigned long)__va(start);
+}
+
+static void __init fadump_populate_elfcorehdr(struct fadump_crash_info_header *fdh)
+{
+	char *bufp;
 	struct elfhdr *elf;
-	int i, j;
+	struct elf_phdr *phdr;
+	u64 boot_mem_dest_offset;
+	unsigned long long i, ra_start, ra_end, ra_size, mstart, mend;
 
+	bufp = (char *) __va(fdh->elfcorehdr_addr);
 	fadump_init_elfcore_header(bufp);
 	elf = (struct elfhdr *)bufp;
 	bufp += sizeof(struct elfhdr);
@@ -1113,54 +1044,68 @@  static int fadump_create_elfcore_headers(char *bufp)
 	phdr->p_vaddr	= 0;
 	phdr->p_align	= 0;
 
-	phdr->p_paddr	= fadump_relocate(paddr_vmcoreinfo_note());
-	phdr->p_offset	= phdr->p_paddr;
-	phdr->p_memsz	= phdr->p_filesz = VMCOREINFO_NOTE_SIZE;
+	phdr->p_paddr	= phdr->p_offset = fdh->vmcoreinfo_raddr;
+	phdr->p_memsz	= phdr->p_filesz = fdh->vmcoreinfo_size;
 
 	/* Increment number of program headers. */
 	(elf->e_phnum)++;
 
-	/* setup PT_LOAD sections. */
-	j = 0;
-	offset = 0;
-	raddr = fw_dump.boot_mem_addr[0];
-	for (i = 0; i < crash_mrange_info.mem_range_cnt; i++) {
-		u64 mbase, msize;
-
-		mbase = crash_mrange_info.mem_ranges[i].base;
-		msize = crash_mrange_info.mem_ranges[i].size;
-		if (!msize)
-			continue;
-
+	/*
+	 * setup PT_LOAD sections.
+	 * first include boot memory regions and then add rest of the memory
+	 * regions
+	 */
+	boot_mem_dest_offset = fw_dump.boot_mem_dest_addr;
+	for (i = 0; i < fw_dump.boot_mem_regs_cnt; i++) {
 		phdr = (struct elf_phdr *)bufp;
 		bufp += sizeof(struct elf_phdr);
-		phdr->p_type	= PT_LOAD;
-		phdr->p_flags	= PF_R|PF_W|PF_X;
-		phdr->p_offset	= mbase;
-
-		if (mbase == raddr) {
-			/*
-			 * The entire real memory region will be moved by
-			 * firmware to the specified destination_address.
-			 * Hence set the correct offset.
-			 */
-			phdr->p_offset = fw_dump.boot_mem_dest_addr + offset;
-			if (j < (fw_dump.boot_mem_regs_cnt - 1)) {
-				offset += fw_dump.boot_mem_sz[j];
-				raddr = fw_dump.boot_mem_addr[++j];
-			}
+		populate_elf_pt_load(phdr, fw_dump.boot_mem_addr[i],
+				 fw_dump.boot_mem_sz[i],
+				 boot_mem_dest_offset);
+		/* Increment number of program headers. */
+		(elf->e_phnum)++;
+		boot_mem_dest_offset += fw_dump.boot_mem_sz[i];
+	}
+
+	/* Memory reserved for fadump in first kernel */
+	ra_start = fw_dump.reserve_dump_area_start;
+	ra_size = get_fadump_area_size();
+	ra_end = ra_start + ra_size;
+
+	phdr = (struct elf_phdr *)bufp;
+	for_each_mem_range(i, &mstart, &mend) {
+		/*
+		 * Skip the memory regions that already added
+		 * to elfcorehdr.
+		 */
+		if (mstart < fw_dump.boot_mem_top) {
+			if (mend > fw_dump.boot_mem_top)
+				mstart = fw_dump.boot_mem_top;
+			else
+				continue;
 		}
 
-		phdr->p_paddr = mbase;
-		phdr->p_vaddr = (unsigned long)__va(mbase);
-		phdr->p_filesz = msize;
-		phdr->p_memsz = msize;
-		phdr->p_align = 0;
+		/* Handle memblock regions overlaps with reserved area */
+		if ((ra_start < mend) && (ra_end > mstart)) {
+			if ((mstart < ra_start) && (mend > ra_end)) {
+				populate_elf_pt_load(phdr, mstart, ra_start - mstart, mstart);
+				elf->e_phnum++;
+				bufp += sizeof(struct elf_phdr);
+				phdr = (struct elf_phdr *)bufp;
+				populate_elf_pt_load(phdr, ra_end, mend - ra_end, ra_end);
+			} else if (mstart < ra_start) {
+				populate_elf_pt_load(phdr, mstart, ra_start - mstart, mstart);
+			} else if (ra_end < mend) {
+				populate_elf_pt_load(phdr, ra_end, mend - ra_end, ra_end);
+			}
+		} else {
+			populate_elf_pt_load(phdr, mstart, mend - mstart, mstart);
+		}
 
-		/* Increment number of program headers. */
-		(elf->e_phnum)++;
+		elf->e_phnum++;
+		bufp += sizeof(struct elf_phdr);
+		phdr = (struct elf_phdr *) bufp;
 	}
-	return 0;
 }
 
 static unsigned long init_fadump_header(unsigned long addr)
@@ -1175,9 +1120,16 @@  static unsigned long init_fadump_header(unsigned long addr)
 
 	memset(fdh, 0, sizeof(struct fadump_crash_info_header));
 	fdh->magic_number = FADUMP_CRASH_INFO_MAGIC;
-	fdh->elfcorehdr_addr = addr;
+	/* Elfcorehdr will now be prepared in the second kernel. */
+	fdh->elfcorehdr_addr = 0;
 	/* We will set the crashing cpu id in crash_fadump() during crash. */
 	fdh->crashing_cpu = FADUMP_CPU_UNKNOWN;
+	fdh->version = FADUMP_VERSION;
+
+	/* Need below vmcoreinfo in second kernel to prepare elfcorehdr. */
+	fdh->vmcoreinfo_raddr = fadump_relocate(paddr_vmcoreinfo_note());
+	fdh->vmcoreinfo_size = VMCOREINFO_NOTE_SIZE;
+
 	/*
 	 * When LPAR is terminated by PYHP, ensure all possible CPUs'
 	 * register data is processed while exporting the vmcore.
@@ -1190,8 +1142,6 @@  static unsigned long init_fadump_header(unsigned long addr)
 static int register_fadump(void)
 {
 	unsigned long addr;
-	void *vaddr;
-	int ret;
 
 	/*
 	 * If no memory is reserved then we can not register for firmware-
@@ -1200,18 +1150,10 @@  static int register_fadump(void)
 	if (!fw_dump.reserve_dump_area_size)
 		return -ENODEV;
 
-	ret = fadump_setup_crash_memory_ranges();
-	if (ret)
-		return ret;
-
 	addr = fw_dump.fadumphdr_addr;
 
 	/* Initialize fadump crash info header. */
 	addr = init_fadump_header(addr);
-	vaddr = __va(addr);
-
-	pr_debug("Creating ELF core headers at %#016lx\n", addr);
-	fadump_create_elfcore_headers(vaddr);
 
 	/* register the future kernel dump with firmware. */
 	pr_debug("Registering for firmware-assisted kernel dump...\n");
@@ -1230,7 +1172,6 @@  void fadump_cleanup(void)
 	} else if (fw_dump.dump_registered) {
 		/* Un-register Firmware-assisted dump if it was registered. */
 		fw_dump.ops->fadump_unregister(&fw_dump);
-		fadump_free_mem_ranges(&crash_mrange_info);
 	}
 
 	if (fw_dump.ops->fadump_cleanup)
@@ -1416,6 +1357,54 @@  static void fadump_release_memory(u64 begin, u64 end)
 		fadump_release_reserved_area(tstart, end);
 }
 
+/*
+ * To determine fadump header version, the following conditions apply
+ * - If magic_number is FADMPINF, return 0
+ * - If magic_number is FADMPSIG, return the version value.
+ * - Otherwise, return -E (header is not valid).
+ */
+static u32 get_fadump_crash_header_version(struct fadump_crash_info_header *fdh)
+{
+	if (fdh->magic_number == fadump_str_to_u64("FADMPINF"))
+		return 0;
+	else if (fdh->magic_number == FADUMP_CRASH_INFO_MAGIC)
+		return fdh->version;
+
+	return -EINVAL;
+}
+
+static void fadump_free_elfcorehdr_buf(void)
+{
+	struct fadump_crash_info_header *fdh;
+	unsigned long elfcorehdr_vaddr;
+
+	if (!fw_dump.fadumphdr_addr)
+		return;
+
+	fdh = (struct fadump_crash_info_header *) __va(fw_dump.fadumphdr_addr);
+
+	/*
+	 * For fadump version 1 the memory for elfcorehdr is dynamically
+	 * allocated in second kernel so free it.
+	 */
+	if (get_fadump_crash_header_version(fdh) != 1)
+		return;
+
+	if (fdh->elfcorehdr_addr == 0 || fdh->elfcorehdr_size == 0)
+		return;
+
+	/*
+	 * Reset the global `elfcorehdr_addr` before freeing `elfcorehdr`
+	 * memory so that other kernel modules, like `vmcore`, do not
+	 * access the invalid memory.
+	 */
+	elfcorehdr_addr = ELFCORE_ADDR_MAX;
+	elfcorehdr_vaddr = (unsigned long) __va(fdh->elfcorehdr_addr);
+	fadump_free_buffer(elfcorehdr_vaddr, fdh->elfcorehdr_size);
+	fdh->elfcorehdr_addr = 0;
+	fdh->elfcorehdr_size = 0;
+}
+
 static void fadump_invalidate_release_mem(void)
 {
 	mutex_lock(&fadump_mutex);
@@ -1424,6 +1413,8 @@  static void fadump_invalidate_release_mem(void)
 		return;
 	}
 
+	/* Free elfcorehdr buf before cleaning up the fadump reserved area. */
+	fadump_free_elfcorehdr_buf();
 	fadump_cleanup();
 	mutex_unlock(&fadump_mutex);
 
@@ -1632,6 +1623,93 @@  static void __init fadump_init_files(void)
 	return;
 }
 
+static int __init fadump_setup_elfcorehdr_buf(struct fadump_crash_info_header *fdh)
+{
+	char *bufp;
+	int elf_phdr_cnt;
+	unsigned long elfcorehdr_size;
+
+	/*
+	 * Program header for CPU notes comes first, followed by one for
+	 * vmcoreinfo, and the remaining program headers correspond to
+	 * memory regions.
+	 */
+	elf_phdr_cnt = 2 + fw_dump.boot_mem_regs_cnt + memblock_num_regions(memory);
+	elfcorehdr_size = sizeof(struct elfhdr) + (elf_phdr_cnt * sizeof(struct elf_phdr));
+	elfcorehdr_size = PAGE_ALIGN(elfcorehdr_size);
+	bufp = (char *)fadump_alloc_buffer(elfcorehdr_size);
+	if (!bufp) {
+		pr_err("Failed to allocate %lu bytes for elfcorehdr\n",
+		       elfcorehdr_size);
+		return -ENOMEM;
+	}
+	fdh->elfcorehdr_addr = virt_to_phys(bufp);
+	fdh->elfcorehdr_size = elfcorehdr_size;
+	return 0;
+}
+
+/*
+ * Process an active dump in four steps. First, verify the crash info header
+ * signature/magic number for integrity and accuracy. Second, if the fadump
+ * version is greater than 0, prepare the elfcorehdr; for fadump version 0,
+ * it's already created in the first kernel as part of the fadump reserved
+ * area. Third, let the platform update CPU notes in elfcorehdr. Finally,
+ * set elfcorehdr_addr so that the vmcore module can export the elfcore
+ * header through '/proc/vmcore'.
+ */
+static void process_fadump(void)
+{
+	int fadump_version;
+	struct fadump_crash_info_header *fdh;
+
+	fdh = (struct fadump_crash_info_header *) __va(fw_dump.fadumphdr_addr);
+	if (!fdh) {
+		pr_err("Crash info header is empty.\n");
+		goto err_out;
+	}
+
+	fadump_version = get_fadump_crash_header_version(fdh);
+
+	/*
+	 * Negative fadump version indicates that fadump crash info header
+	 * is corrupted.
+	 */
+	if (fadump_version < 0) {
+		pr_err("Crash info header is not valid.\n");
+		goto err_out;
+	}
+
+	/*
+	 * If crashed kernel's fadump crash info header version is 1,
+	 * then create elfcorehdr here.
+	 */
+	if (fadump_version >= 1) {
+		if (fadump_setup_elfcorehdr_buf(fdh))
+			goto err_out;
+
+		fadump_populate_elfcorehdr(fdh);
+	}
+
+	/*
+	 * If dump process fails then invalidate the registration
+	 * and release memory before proceeding for re-registration.
+	 */
+	if (fw_dump.ops->fadump_process(&fw_dump) < 0)
+		goto err_out;
+
+	/*
+	 * elfcore header is now ready to be exported.
+	 *
+	 * set elfcorehdr_addr so that vmcore module will export the
+	 * elfcore header through '/proc/vmcore'.
+	 */
+	elfcorehdr_addr = fdh->elfcorehdr_addr;
+	return;
+
+err_out:
+	fadump_invalidate_release_mem();
+}
+
 /*
  * Prepare for firmware-assisted dump.
  */
@@ -1651,12 +1729,7 @@  int __init setup_fadump(void)
 	 * saving it to the disk.
 	 */
 	if (fw_dump.dump_active) {
-		/*
-		 * if dump process fails then invalidate the registration
-		 * and release memory before proceeding for re-registration.
-		 */
-		if (fw_dump.ops->fadump_process(&fw_dump) < 0)
-			fadump_invalidate_release_mem();
+		process_fadump();
 	}
 	/* Initialize the kernel dump memory structure and register with f/w */
 	else if (fw_dump.reserve_dump_area_size) {
diff --git a/arch/powerpc/platforms/powernv/opal-fadump.c b/arch/powerpc/platforms/powernv/opal-fadump.c
index 964f464b1b0e..a44d75cefc9e 100644
--- a/arch/powerpc/platforms/powernv/opal-fadump.c
+++ b/arch/powerpc/platforms/powernv/opal-fadump.c
@@ -526,12 +526,7 @@  static int __init opal_fadump_process(struct fw_dump *fadump_conf)
 	if (!opal_fdm_active || !fadump_conf->fadumphdr_addr)
 		return rc;
 
-	/* Validate the fadump crash info header */
 	fdh = __va(fadump_conf->fadumphdr_addr);
-	if (fdh->magic_number != FADUMP_CRASH_INFO_MAGIC) {
-		pr_err("Crash info header is not valid.\n");
-		return rc;
-	}
 
 #ifdef CONFIG_OPAL_CORE
 	/*
@@ -545,18 +540,7 @@  static int __init opal_fadump_process(struct fw_dump *fadump_conf)
 		kernel_initiated = true;
 #endif
 
-	rc = opal_fadump_build_cpu_notes(fadump_conf, fdh);
-	if (rc)
-		return rc;
-
-	/*
-	 * We are done validating dump info and elfcore header is now ready
-	 * to be exported. set elfcorehdr_addr so that vmcore module will
-	 * export the elfcore header through '/proc/vmcore'.
-	 */
-	elfcorehdr_addr = fdh->elfcorehdr_addr;
-
-	return rc;
+	return opal_fadump_build_cpu_notes(fadump_conf, fdh);
 }
 
 static void opal_fadump_region_show(struct fw_dump *fadump_conf,
diff --git a/arch/powerpc/platforms/pseries/rtas-fadump.c b/arch/powerpc/platforms/pseries/rtas-fadump.c
index b5853e9fcc3c..3366082705d5 100644
--- a/arch/powerpc/platforms/pseries/rtas-fadump.c
+++ b/arch/powerpc/platforms/pseries/rtas-fadump.c
@@ -394,9 +394,6 @@  static int __init rtas_fadump_build_cpu_notes(struct fw_dump *fadump_conf)
  */
 static int __init rtas_fadump_process(struct fw_dump *fadump_conf)
 {
-	struct fadump_crash_info_header *fdh;
-	int rc = 0;
-
 	if (!fdm_active || !fadump_conf->fadumphdr_addr)
 		return -EINVAL;
 
@@ -415,25 +412,7 @@  static int __init rtas_fadump_process(struct fw_dump *fadump_conf)
 		return -EINVAL;
 	}
 
-	/* Validate the fadump crash info header */
-	fdh = __va(fadump_conf->fadumphdr_addr);
-	if (fdh->magic_number != FADUMP_CRASH_INFO_MAGIC) {
-		pr_err("Crash info header is not valid.\n");
-		return -EINVAL;
-	}
-
-	rc = rtas_fadump_build_cpu_notes(fadump_conf);
-	if (rc)
-		return rc;
-
-	/*
-	 * We are done validating dump info and elfcore header is now ready
-	 * to be exported. set elfcorehdr_addr so that vmcore module will
-	 * export the elfcore header through '/proc/vmcore'.
-	 */
-	elfcorehdr_addr = fdh->elfcorehdr_addr;
-
-	return 0;
+	return rtas_fadump_build_cpu_notes(fadump_conf);
 }
 
 static void rtas_fadump_region_show(struct fw_dump *fadump_conf,