Message ID | 20111107095521.1997.34844.stgit@mars.in.ibm.com (mailing list archive) |
---|---|
State | Superseded |
On 2011-11-07 17:55, Mahesh J Salgaonkar wrote:
> From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
>
> Documentation for firmware-assisted dump. This document is based on the
> original documentation written for phyp assisted dump by Linas Vepstas
> and Manish Ahuja, with few changes to reflect the current implementation.
>
> Change in v3:
> - Modified the documentation to reflect introdunction of fadump_registered
>   sysfs file and few minor changes.
>
> Change in v2:
> - Modified the documentation to reflect the change of fadump_region
>   file under debugfs filesystem.
>
> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>

Please Cc Randy Dunlap <rdunlap@xenotime.net> for kernel documentation
patch.

I have some inline comments below.

> ---
>  Documentation/powerpc/firmware-assisted-dump.txt |  262 ++++++++++++++++++++++
>  1 files changed, 262 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/powerpc/firmware-assisted-dump.txt
>
> diff --git a/Documentation/powerpc/firmware-assisted-dump.txt b/Documentation/powerpc/firmware-assisted-dump.txt
> new file mode 100644
> index 0000000..ba6724a
> --- /dev/null
> +++ b/Documentation/powerpc/firmware-assisted-dump.txt
> @@ -0,0 +1,262 @@
> +
> +                   Firmware-Assisted Dump
> +                   ------------------------
> +                       July 2011
> +
> +The goal of firmware-assisted dump is to enable the dump of
> +a crashed system, and to do so from a fully-reset system, and
> +to minimize the total elapsed time until the system is back
> +in production use.
> +
> +As compared to kdump or other strategies, firmware-assisted
> +dump offers several strong, practical advantages:

Comparing with kdump or...

> +
> +-- Unlike kdump, the system has been reset, and loaded
> +   with a fresh copy of the kernel. In particular,
> +   PCI and I/O devices have been reinitialized and are
> +   in a clean, consistent state.
> +-- Once the dump is copied out, the memory that held the dump
> +   is immediately available to the running kernel. A further
> +   reboot isn't required.
> +
> +The above can only be accomplished by coordination with,
> +and assistance from the Power firmware. The procedure is
> +as follows:
> +
> +-- The first kernel registers the sections of memory with the
> +   Power firmware for dump preservation during OS initialization.
> +   This registered sections of memory is reserved by the first

These registered sections of memory are...

> +   kernel during early boot.
> +
> +-- When a system crashes, the Power firmware will save
> +   the low memory (boot memory of size larger of 5% of system RAM
> +   or 256MB) of RAM to a previously registered save region. It

...to the previous registered region...

> +   will also save system registers, and hardware PTE's.
> +
> +   NOTE: The term 'boot memory' means size of the low memory chunk
> +         that is required for a kernel to boot successfully when
> +         booted with restricted memory. By default, the boot memory
> +         size will be calculated to larger of 5% of system RAM or

will be the larger of...

> +         256MB. Alternatively, user can also specify boot memory
> +         size through boot parameter 'fadump_reserve_mem=' which
> +         will override the default calculated size.
> +
> +-- After the low memory (boot memory) area has been saved, the
> +   firmware will reset PCI and other hardware state. It will
> +   *not* clear the RAM. It will then launch the bootloader, as
> +   normal.
> +
> +-- The freshly booted kernel will notice that there is a new
> +   node (ibm,dump-kernel) in the device tree, indicating that
> +   there is crash data available from a previous boot. During
> +   the early boot OS will reserve rest of the memory above
> +   boot memory size effectively booting with restricted memory
> +   size. This will make sure that the second kernel will not
> +   touch any of the dump memory area.
> +
> +-- Userspace tools will read /proc/vmcore to obtain the contents
> +   of memory, which holds the previous crashed kernel dump in ELF
> +   format. The userspace tools may copy this info to disk, or
> +   network, nas, san, iscsi, etc. as desired.

s/Userspace/User-space/

> +
> +-- Once the userspace tool is done saving dump, it will echo
> +   '1' to /sys/kernel/fadump_release_mem to release the reserved
> +   memory back to general use, except the memory required for
> +   next firmware-assisted dump registration.
> +
> +   e.g.
> +     # echo 1 > /sys/kernel/fadump_release_mem
> +
> +Please note that the firmware-assisted dump feature
> +is only available on Power6 and above systems with recent
> +firmware versions.
> +
> +Implementation details:
> +----------------------
> +
> +During boot, a check is made to see if firmware supports
> +this feature on that particular machine. If it does, then
> +we check to see if an active dump is waiting for us. If yes
> +then everything but boot memory size of RAM is reserved during
> +early boot (See Fig. 2). This area is released once we collect a
> +dump from user land scripts (kdump scripts) that are run. If

This area is released once we finish collecting the dump from user land
scripts (e.g. kdump scripts).

> +there is dump data, then the /sys/kernel/fadump_release_mem
> +file is created, and the reserved memory is held.
> +
> +If there is no waiting dump data, then only the memory required
> +to hold CPU state, HPTE region, boot memory dump and elfcore
> +header, is reserved at the top of memory (see Fig. 1). This area
> +is *not* released: this region will be kept permanently reserved,
> +so that it can act as a receptacle for a copy of the boot memory
> +content in addition to CPU state and HPTE region, in the case a
> +crash does occur.
> +
> +  o Memory Reservation during first kernel
> +
> +  Low memory                                        Top of memory
> +  0      boot memory size                                       |
> +  |           |                       |<--Reserved dump area -->|
> +  V           V                       |   Permanent Reservation V
> +  +-----------+----------/ /----------+---+----+-----------+----+
> +  |           |                       |CPU|HPTE|  DUMP     |ELF |
> +  +-----------+----------/ /----------+---+----+-----------+----+
> +        |                                           ^
> +        |                                           |
> +        \                                           /
> +         -------------------------------------------
> +          Boot memory content gets transferred to
> +          reserved area by firmware at the time of
> +          crash
> +                   Fig. 1
> +
> +  o Memory Reservation during second kernel after crash
> +
> +  Low memory                                        Top of memory
> +  0      boot memory size                                       |
> +  |           |<------------- Reserved dump area ----------- -->|
> +  V           V                                                 V
> +  +-----------+----------/ /----------+---+----+-----------+----+
> +  |           |                       |CPU|HPTE|  DUMP     |ELF |
> +  +-----------+----------/ /----------+---+----+-----------+----+
> +        |                                              |
> +        V                                              V
> +   Used by second                                /proc/vmcore
> +   kernel to boot
> +                   Fig. 2
> +
> +Currently the dump will be copied from /proc/vmcore to a
> +a new file upon user intervention. The dump data available through
> +/proc/vmcore will be in ELF format. Hence the existing kdump
> +infrastructure (kdump scripts) to save the dump works fine
> +with minor modifications. The kdump script requires following
> +modifications:
> +-- During service kdump start if /proc/vmcore entry is not present,
> +   look for the existence of /sys/kernel/fadump_enabled and read
> +   value exported by it. If value is set to '0' then fallback to
> +   existing kexec based kdump. If value is set to '1' then check the
> +   value exported by /sys/kernel/fadump_registered. If value it set
> +   to '1' then print success otherwise register for fadump by
> +   echo'ing 1 > /sys/kernel/fadump_registered file.
> +
> +-- During service kdump start if /proc/vmcore entry is present,
> +   execute the existing routine to save the dump. Once the dump
> +   is saved, echo 1 > /sys/kernel/fadump_release_mem (if the
> +   file exists) to release the reserved memory for general use
> +   and continue without rebooting. At this point the memory
> +   reservation map will look like as shown in Fig. 1. If the file
> +   /sys/kernel/fadump_release_mem is not present then follow
> +   the existing routine to reboot into new kernel.
> +
> +-- During service kdump stop echo 0 > /sys/kernel/fadump_registered
> +   to un-register the fadump.
> +

I don't think you need to document kdump script changes in a kernel doc.

> +The tools to examine the dump will be same as the ones
> +used for kdump.
> +
> +How to enable firmware-assisted dump (fadump):
> +-------------------------------------
> +
> +1. Set config option CONFIG_FA_DUMP=y and build kernel.
> +2. Boot into linux kernel with 'fadump=1' kernel cmdline option.
> +3. Optionally, user can also set 'fadump_reserve_mem=' kernel cmdline
> +   to specify size of the memory to reserve for boot memory dump
> +   preservation.
> +
> +NOTE: If firmware-assisted dump fails to reserve memory then it will
> +      fallback to existing kdump mechanism if 'crashkernel=' option
> +      is set at kernel cmdline.
> +
> +Sysfs/debugfs files:
> +------------
> +
> +Firmware-assisted dump feature uses sysfs file system to hold
> +the control files and debugfs file to display memory reserved region.
> +
> +Here is the list of files under kernel sysfs:
> +
> + /sys/kernel/fadump_enabled
> +
> +    This is used to display the fadump status.
> +    0 = fadump is disabled
> +    1 = fadump is enabled
> +
> + /sys/kernel/fadump_registered
> +
> +    This is used to display the fadump registration status as well
> +    as to control (start/stop) the fadump registration.
> +    0 = fadump is not registered.
> +    1 = fadump is registered and ready to handle system crash.
> +
> +    To register fadump echo 1 > /sys/kernel/fadump_registered and
> +    echo 0 > /sys/kernel/fadump_registered for un-register and stop the
> +    fadump. Once the fadump is un-registered, the system crash will not
> +    be handled and vmcore will not be captured.
> +
> + /sys/kernel/fadump_release_mem
> +
> +    This file is available only when fadump is active during
> +    second kernel. This is used to release the reserved memory
> +    region that are held for saving crash dump. To release the
> +    reserved memory echo 1 to it:
> +
> +    echo 1 > /sys/kernel/fadump_release_mem
> +
> +    After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region
> +    file will change to reflect the new memory reservations.
> +
> +Here is the list of files under powerpc debugfs:
> +(Assuming debugfs is mounted on /sys/kernel/debug directory.)
> +
> + /sys/kernel/debug/powerpc/fadump_region
> +
> +    This file shows the reserved memory regions if fadump is
> +    enabled otherwise this file is empty. The output format
> +    is:
> +    <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>
> +
> +    e.g.
> +    Contents when fadump is registered during first kernel
> +
> +    # cat /sys/kernel/debug/powerpc/fadump_region
> +    CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0
> +    HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0
> +    DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0
> +
> +    Contents when fadump is active during second kernel
> +
> +    # cat /sys/kernel/debug/powerpc/fadump_region
> +    CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020
> +    HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000
> +    DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000
> +        : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000
> +
> +NOTE: Please refer to debugfs documentation on how to mount the debugfs
> +      filesystem.
> +

That is Documentation/filesystems/debugfs.txt.

> +
> +TODO:
> +-----
> + o Need to come up with the better approach to find out more
> +   accurate boot memory size that is required for a kernel to
> +   boot successfully when booted with restricted memory.
> + o The fadump implementation introduces a fadump crash info structure
> +   in the scratch area before the ELF core header. The idea of introducing
> +   this structure is to pass some important crash info data to the second
> +   kernel which will help second kernel to populate ELF core header with
> +   correct data before it gets exported through /proc/vmcore. The current
> +   design implementation does not address a possibility of introducing
> +   additional fields (in future) to this structure without affecting
> +   compatibility. Need to come up with the better approach to address this.
> +   The possible approaches are:
> +     1. Introduce version field for version tracking, bump up the version
> +     whenever a new field is added to the structure in future. The version
> +     field can be used to find out what fields are valid for the current
> +     version of the structure.
> +     2. Reserve the area of predefined size (say PAGE_SIZE) for this
> +     structure and have unused area as reserved (initialized to zero)
> +     for future field additions.
> +   The advantage of approach 1 over 2 is we don't need to reserve extra space.
> +---

Why do we keep TODO in this doc?

Thanks!
On 2011-11-10 17:46:30 Thu, Cong Wang wrote:
> On 2011-11-07 17:55, Mahesh J Salgaonkar wrote:
> >From: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> >
> >Documentation for firmware-assisted dump. This document is based on the
> >original documentation written for phyp assisted dump by Linas Vepstas
> >and Manish Ahuja, with few changes to reflect the current implementation.
> >
> >Change in v3:
> >- Modified the documentation to reflect introdunction of fadump_registered
> >  sysfs file and few minor changes.
> >
> >Change in v2:
> >- Modified the documentation to reflect the change of fadump_region
> >  file under debugfs filesystem.
> >
> >Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
> >
>
> Please Cc Randy Dunlap <rdunlap@xenotime.net> for kernel documentation
> patch.
>
> I have some inline comments below.
>

Thanks for your review. I will incorporate all your comments.

<...>

> >+with minor modifications. The kdump script requires following
> >+modifications:
> >+-- During service kdump start if /proc/vmcore entry is not present,
> >+   look for the existence of /sys/kernel/fadump_enabled and read
> >+   value exported by it. If value is set to '0' then fallback to
> >+   existing kexec based kdump. If value is set to '1' then check the
> >+   value exported by /sys/kernel/fadump_registered. If value it set
> >+   to '1' then print success otherwise register for fadump by
> >+   echo'ing 1 > /sys/kernel/fadump_registered file.
> >+
> >+-- During service kdump start if /proc/vmcore entry is present,
> >+   execute the existing routine to save the dump. Once the dump
> >+   is saved, echo 1 > /sys/kernel/fadump_release_mem (if the
> >+   file exists) to release the reserved memory for general use
> >+   and continue without rebooting. At this point the memory
> >+   reservation map will look like as shown in Fig. 1. If the file
> >+   /sys/kernel/fadump_release_mem is not present then follow
> >+   the existing routine to reboot into new kernel.
> >+
> >+-- During service kdump stop echo 0 > /sys/kernel/fadump_registered
> >+   to un-register the fadump.
> >+
>
> I don't think you need to document kdump script changes in a kernel
> doc.
>

Agreed. I will remove it.

> >+
> >+TODO:
> >+-----
> >+ o Need to come up with the better approach to find out more
> >+   accurate boot memory size that is required for a kernel to
> >+   boot successfully when booted with restricted memory.
> >+ o The fadump implementation introduces a fadump crash info structure
> >+   in the scratch area before the ELF core header. The idea of introducing
> >+   this structure is to pass some important crash info data to the second
> >+   kernel which will help second kernel to populate ELF core header with
> >+   correct data before it gets exported through /proc/vmcore. The current
> >+   design implementation does not address a possibility of introducing
> >+   additional fields (in future) to this structure without affecting
> >+   compatibility. Need to come up with the better approach to address this.
> >+   The possible approaches are:
> >+     1. Introduce version field for version tracking, bump up the version
> >+     whenever a new field is added to the structure in future. The version
> >+     field can be used to find out what fields are valid for the current
> >+     version of the structure.
> >+     2. Reserve the area of predefined size (say PAGE_SIZE) for this
> >+     structure and have unused area as reserved (initialized to zero)
> >+     for future field additions.
> >+   The advantage of approach 1 over 2 is we don't need to reserve extra space.
> >+---
>
> Why do we keep TODO in this doc?
>

I see that most of the kernel docs do contain a TODO section, hence I
added it here.

Thanks,
-Mahesh.
diff --git a/Documentation/powerpc/firmware-assisted-dump.txt b/Documentation/powerpc/firmware-assisted-dump.txt
new file mode 100644
index 0000000..ba6724a
--- /dev/null
+++ b/Documentation/powerpc/firmware-assisted-dump.txt
@@ -0,0 +1,262 @@
+
+                   Firmware-Assisted Dump
+                   ------------------------
+                       July 2011
+
+The goal of firmware-assisted dump is to enable the dump of
+a crashed system, and to do so from a fully-reset system, and
+to minimize the total elapsed time until the system is back
+in production use.
+
+As compared to kdump or other strategies, firmware-assisted
+dump offers several strong, practical advantages:
+
+-- Unlike kdump, the system has been reset, and loaded
+   with a fresh copy of the kernel. In particular,
+   PCI and I/O devices have been reinitialized and are
+   in a clean, consistent state.
+-- Once the dump is copied out, the memory that held the dump
+   is immediately available to the running kernel. A further
+   reboot isn't required.
+
+The above can only be accomplished by coordination with,
+and assistance from the Power firmware. The procedure is
+as follows:
+
+-- The first kernel registers the sections of memory with the
+   Power firmware for dump preservation during OS initialization.
+   This registered sections of memory is reserved by the first
+   kernel during early boot.
+
+-- When a system crashes, the Power firmware will save
+   the low memory (boot memory of size larger of 5% of system RAM
+   or 256MB) of RAM to a previously registered save region. It
+   will also save system registers, and hardware PTE's.
+
+   NOTE: The term 'boot memory' means size of the low memory chunk
+         that is required for a kernel to boot successfully when
+         booted with restricted memory. By default, the boot memory
+         size will be calculated to larger of 5% of system RAM or
+         256MB. Alternatively, user can also specify boot memory
+         size through boot parameter 'fadump_reserve_mem=' which
+         will override the default calculated size.
+
+-- After the low memory (boot memory) area has been saved, the
+   firmware will reset PCI and other hardware state. It will
+   *not* clear the RAM. It will then launch the bootloader, as
+   normal.
+
+-- The freshly booted kernel will notice that there is a new
+   node (ibm,dump-kernel) in the device tree, indicating that
+   there is crash data available from a previous boot. During
+   the early boot OS will reserve rest of the memory above
+   boot memory size effectively booting with restricted memory
+   size. This will make sure that the second kernel will not
+   touch any of the dump memory area.
+
+-- Userspace tools will read /proc/vmcore to obtain the contents
+   of memory, which holds the previous crashed kernel dump in ELF
+   format. The userspace tools may copy this info to disk, or
+   network, nas, san, iscsi, etc. as desired.
+
+-- Once the userspace tool is done saving dump, it will echo
+   '1' to /sys/kernel/fadump_release_mem to release the reserved
+   memory back to general use, except the memory required for
+   next firmware-assisted dump registration.
+
+   e.g.
+     # echo 1 > /sys/kernel/fadump_release_mem
+
+Please note that the firmware-assisted dump feature
+is only available on Power6 and above systems with recent
+firmware versions.
+
+Implementation details:
+----------------------
+
+During boot, a check is made to see if firmware supports
+this feature on that particular machine. If it does, then
+we check to see if an active dump is waiting for us. If yes
+then everything but boot memory size of RAM is reserved during
+early boot (See Fig. 2). This area is released once we collect a
+dump from user land scripts (kdump scripts) that are run. If
+there is dump data, then the /sys/kernel/fadump_release_mem
+file is created, and the reserved memory is held.
+
+If there is no waiting dump data, then only the memory required
+to hold CPU state, HPTE region, boot memory dump and elfcore
+header, is reserved at the top of memory (see Fig. 1). This area
+is *not* released: this region will be kept permanently reserved,
+so that it can act as a receptacle for a copy of the boot memory
+content in addition to CPU state and HPTE region, in the case a
+crash does occur.
+
+  o Memory Reservation during first kernel
+
+  Low memory                                        Top of memory
+  0      boot memory size                                       |
+  |           |                       |<--Reserved dump area -->|
+  V           V                       |   Permanent Reservation V
+  +-----------+----------/ /----------+---+----+-----------+----+
+  |           |                       |CPU|HPTE|  DUMP     |ELF |
+  +-----------+----------/ /----------+---+----+-----------+----+
+        |                                           ^
+        |                                           |
+        \                                           /
+         -------------------------------------------
+          Boot memory content gets transferred to
+          reserved area by firmware at the time of
+          crash
+                   Fig. 1
+
+  o Memory Reservation during second kernel after crash
+
+  Low memory                                        Top of memory
+  0      boot memory size                                       |
+  |           |<------------- Reserved dump area ----------- -->|
+  V           V                                                 V
+  +-----------+----------/ /----------+---+----+-----------+----+
+  |           |                       |CPU|HPTE|  DUMP     |ELF |
+  +-----------+----------/ /----------+---+----+-----------+----+
+        |                                              |
+        V                                              V
+   Used by second                                /proc/vmcore
+   kernel to boot
+                   Fig. 2
+
+Currently the dump will be copied from /proc/vmcore to a
+a new file upon user intervention. The dump data available through
+/proc/vmcore will be in ELF format. Hence the existing kdump
+infrastructure (kdump scripts) to save the dump works fine
+with minor modifications. The kdump script requires following
+modifications:
+-- During service kdump start if /proc/vmcore entry is not present,
+   look for the existence of /sys/kernel/fadump_enabled and read
+   value exported by it. If value is set to '0' then fallback to
+   existing kexec based kdump. If value is set to '1' then check the
+   value exported by /sys/kernel/fadump_registered. If value it set
+   to '1' then print success otherwise register for fadump by
+   echo'ing 1 > /sys/kernel/fadump_registered file.
+
+-- During service kdump start if /proc/vmcore entry is present,
+   execute the existing routine to save the dump. Once the dump
+   is saved, echo 1 > /sys/kernel/fadump_release_mem (if the
+   file exists) to release the reserved memory for general use
+   and continue without rebooting. At this point the memory
+   reservation map will look like as shown in Fig. 1. If the file
+   /sys/kernel/fadump_release_mem is not present then follow
+   the existing routine to reboot into new kernel.
+
+-- During service kdump stop echo 0 > /sys/kernel/fadump_registered
+   to un-register the fadump.
+
+The tools to examine the dump will be same as the ones
+used for kdump.
+
+How to enable firmware-assisted dump (fadump):
+-------------------------------------
+
+1. Set config option CONFIG_FA_DUMP=y and build kernel.
+2. Boot into linux kernel with 'fadump=1' kernel cmdline option.
+3. Optionally, user can also set 'fadump_reserve_mem=' kernel cmdline
+   to specify size of the memory to reserve for boot memory dump
+   preservation.
+
+NOTE: If firmware-assisted dump fails to reserve memory then it will
+      fallback to existing kdump mechanism if 'crashkernel=' option
+      is set at kernel cmdline.
+
+Sysfs/debugfs files:
+------------
+
+Firmware-assisted dump feature uses sysfs file system to hold
+the control files and debugfs file to display memory reserved region.
+
+Here is the list of files under kernel sysfs:
+
+ /sys/kernel/fadump_enabled
+
+    This is used to display the fadump status.
+    0 = fadump is disabled
+    1 = fadump is enabled
+
+ /sys/kernel/fadump_registered
+
+    This is used to display the fadump registration status as well
+    as to control (start/stop) the fadump registration.
+    0 = fadump is not registered.
+    1 = fadump is registered and ready to handle system crash.
+
+    To register fadump echo 1 > /sys/kernel/fadump_registered and
+    echo 0 > /sys/kernel/fadump_registered for un-register and stop the
+    fadump. Once the fadump is un-registered, the system crash will not
+    be handled and vmcore will not be captured.
+
+ /sys/kernel/fadump_release_mem
+
+    This file is available only when fadump is active during
+    second kernel. This is used to release the reserved memory
+    region that are held for saving crash dump. To release the
+    reserved memory echo 1 to it:
+
+    echo 1 > /sys/kernel/fadump_release_mem
+
+    After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region
+    file will change to reflect the new memory reservations.
+
+Here is the list of files under powerpc debugfs:
+(Assuming debugfs is mounted on /sys/kernel/debug directory.)
+
+ /sys/kernel/debug/powerpc/fadump_region
+
+    This file shows the reserved memory regions if fadump is
+    enabled otherwise this file is empty. The output format
+    is:
+    <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>
+
+    e.g.
+    Contents when fadump is registered during first kernel
+
+    # cat /sys/kernel/debug/powerpc/fadump_region
+    CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0
+    HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0
+    DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0
+
+    Contents when fadump is active during second kernel
+
+    # cat /sys/kernel/debug/powerpc/fadump_region
+    CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020
+    HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000
+    DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000
+        : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000
+
+NOTE: Please refer to debugfs documentation on how to mount the debugfs
+      filesystem.
+
+
+TODO:
+-----
+ o Need to come up with the better approach to find out more
+   accurate boot memory size that is required for a kernel to
+   boot successfully when booted with restricted memory.
+ o The fadump implementation introduces a fadump crash info structure
+   in the scratch area before the ELF core header. The idea of introducing
+   this structure is to pass some important crash info data to the second
+   kernel which will help second kernel to populate ELF core header with
+   correct data before it gets exported through /proc/vmcore. The current
+   design implementation does not address a possibility of introducing
+   additional fields (in future) to this structure without affecting
+   compatibility. Need to come up with the better approach to address this.
+   The possible approaches are:
+     1. Introduce version field for version tracking, bump up the version
+     whenever a new field is added to the structure in future. The version
+     field can be used to find out what fields are valid for the current
+     version of the structure.
+     2. Reserve the area of predefined size (say PAGE_SIZE) for this
+     structure and have unused area as reserved (initialized to zero)
+     for future field additions.
+   The advantage of approach 1 over 2 is we don't need to reserve extra space.
+---
+Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
+This document is based on the original documentation written for phyp
+assisted dump by Linas Vepstas and Manish Ahuja.
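
Not part of the patch: for anyone who wants to consume the fadump_region
output from a script, below is a minimal Python sketch that parses lines
in the format the document gives,
"<region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>".
The names Region, parse_fadump_region and SAMPLE_SECOND_KERNEL are
hypothetical helpers made up for this illustration; the sample text is
the second-kernel example from the document, and the parser assumes the
file keeps exactly that layout.

```python
import re
from typing import List, NamedTuple


class Region(NamedTuple):
    """One parsed line of fadump_region output (illustrative type)."""
    name: str      # "CPU", "HPTE", "DUMP", or "" for the unnamed region
    start: int     # first address of the region
    end: int       # last address of the region (inclusive)
    reserved: int  # reserved size in bytes
    dumped: int    # bytes actually dumped


# <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>
_REGION_RE = re.compile(
    r"^\s*(?P<name>\w*)\s*:\s*"
    r"\[(?P<start>0x[0-9a-fA-F]+)-(?P<end>0x[0-9a-fA-F]+)\]\s*"
    r"(?P<reserved>0x[0-9a-fA-F]+)\s+bytes,\s*"
    r"Dumped:\s*(?P<dumped>0x[0-9a-fA-F]+)\s*$"
)


def parse_fadump_region(text: str) -> List[Region]:
    """Return one Region per recognized line; other lines are skipped."""
    regions = []
    for line in text.splitlines():
        match = _REGION_RE.match(line)
        if match is None:
            continue  # blank lines, shell prompts, anything unrecognized
        regions.append(Region(
            name=match.group("name"),
            start=int(match.group("start"), 16),
            end=int(match.group("end"), 16),
            reserved=int(match.group("reserved"), 16),
            dumped=int(match.group("dumped"), 16),
        ))
    return regions


# Sample copied from the "fadump is active during second kernel" example.
SAMPLE_SECOND_KERNEL = """\
CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020
HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000
DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000
    : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000
"""

if __name__ == "__main__":
    for region in parse_fadump_region(SAMPLE_SECOND_KERNEL):
        label = region.name or "(unnamed)"
        print(f"{label:9s} reserved={region.reserved:#x} dumped={region.dumped:#x}")
```

In the sample, each reserved size equals end - start + 1, so the end
address is treated as inclusive here. On a real system the text would
come from reading /sys/kernel/debug/powerpc/fadump_region, which needs
debugfs mounted and usually root privileges.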