Message ID | 20190616171024.22799-21-hegdevasant@linux.vnet.ibm.com
State      | Superseded
Series     | MPIPL support

Context | Check | Description
---|---|---
snowpatch_ozlabs/apply_patch | success | Successfully applied on branch master (dbf27b6c4af84addb36bd3be34f96580aba9c873)
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot | fail | Test snowpatch/job/snowpatch-skiboot on branch master
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco | success | Signed-off-by present
Vasant Hegde's on June 17, 2019 3:10 am:
> Pre-MPIPL kernel saves various information required to create vmcore in
> metadata area and passes metadata area pointer to OPAL. OPAL will preserve
> this pointer across MPIPL. Post MPIPL kernel will request for saved tags
> via this API. Kernel also needs below tags:
>   - Saved CPU registers data to access CPU registers
>   - OPAL metadata area to create opalcore
>
> Format:
>   opal_mpipl_query_tag(uint32_t idx, uint64_t *tag)
>
>   idx : tag index (0..n)
>   tag : OPAL will pass saved tag
>
> Kernel will make this call with increased `index` until OPAL returns
> OPAL_EMPTY.
>
> Return values:
>   OPAL_SUCCESS   : Operation success
>   OPAL_PARAMETER : Invalid parameter
>   OPAL_EMPTY     : OPAL completed sending all tags to kernel

Thanks for doing this API as well. I like giving the client the ability to
just define their own metadata in preserved memory.

Thanks,
Nick
On 06/28/2019 07:16 AM, Nicholas Piggin wrote: > Vasant Hegde's on June 17, 2019 3:10 am: >> Pre-MPIPL kernel saves various information required to create vmcore in >> metadata area and passes metadata area pointer to OPAL. OPAL will preserve >> this pointer across MPIPL. Post MPIPL kernel will request for saved tags >> via this API. Kernel also needs below tags: >> - Saved CPU registers data to access CPU registers >> - OPAL metadata area to create opalcore >> >> Format: >> opal_mpipl_query_tag(uint32_t idx, uint64_t *tag) >> >> idx : tag index (0..n) >> tag : OPAL will pass saved tag >> >> Kernel will make this call with increased `index` until OPAL returns >> OPAL_EMPTY. >> >> Return values: >> OPAL_SUCCESS : Operation success >> OPAL_PARAMETER : Invalid parameter >> OPAL_EMPTY : OPAL completed sending all tags to kernel > > Thanks for doing this API as well. I like giving the client the > ability to just define their own metadata in preserved memory. Thanks! -Vasant
On Sun, 2019-06-16 at 22:40 +0530, Vasant Hegde wrote:
> Pre-MPIPL kernel saves various information required to create vmcore in
> metadata area and passes metadata area pointer to OPAL. OPAL will preserve
> this pointer across MPIPL. Post MPIPL kernel will request for saved tags
> via this API. Kernel also needs below tags:
>   - Saved CPU registers data to access CPU registers
>   - OPAL metadata area to create opalcore
>
> Format:
>   opal_mpipl_query_tag(uint32_t idx, uint64_t *tag)
>
>   idx : tag index (0..n)
>   tag : OPAL will pass saved tag
>
> Kernel will make this call with increased `index` until OPAL returns
> OPAL_EMPTY.
>
> Return values:
>   OPAL_SUCCESS   : Operation success
>   OPAL_PARAMETER : Invalid parameter
>   OPAL_EMPTY     : OPAL completed sending all tags to kernel

After spending a while picking through the linux patches I'll say that the
revised API is mostly fine, BUT:

a) The documentation is *woefully* inadequate, and
b) We need to think a bit more about how tags are used.

My interpretation is that the kernel tag is there to allow the crashing
kernel to define its own dump metadata for the receiving kernel. As a
result firmware, which includes the petitboot kernel, can't make
assumptions about the contents of the tagged data since it could be
populated by a version of Linux with a different metadata ABI or by another
OS entirely.

So as far as I can tell the overall process is something like:

1. We boot an OS. This may or may not be based on a Linux kernel.
2. OS configures MPIPL and sets the kernel tag to some structure it defines.
3. OS triggers a MPIPL. We re-run through the SBE and hostboot. Most of
   memory is left as-is. Some memory used by hostboot, the ranges in the
   MDDT, and the skiboot region at 0x3000_0000 are overwritten.
4. Skiboot parses the MDRT, sets a few DT flags to indicate an MPIPL
   happened, and populates the tags which are defined by OPAL.
5. The petitboot kernel boots, detects that an MPIPL happened, and tries to
   preserve the crashed kernel's memory.
6. We kexec into a capture kernel that knows how to parse the kernel tag's
   metadata and generates a dump from that.

This raises a few questions:

1) How much memory does firmware have to work with? This doesn't seem to be
actually documented anywhere. The closest thing I can find to an answer is
that the Linux patches will preserve at least 0x0...0x3000_0000 since
that's what fadump.ops->get_bootmem_min() returns for powernv. However, I
think the minimum memory that firmware needs is something that it should
tell the OS via the DT rather than having the OS guess. We already
communicate some of that information via the fw-load-area DT property, but
that only contains the ranges where skiboot loads the kernel and initrd
(0x2000_0000...0x2fff_ffff).

Odds are hostboot doesn't use *that* much memory, but the petitboot kernel
can use quite a bit. I checked on a random lab system and it was using
~400MB after a reboot, just sitting at the petitboot shell prompt.

2) What happens if the tag points into a range that is moved by firmware?
If this is considered a kernel bug then that needs to be documented in the
API.

3) How does a kernel know if it can parse the data pointed to by the kernel
tag?

Right now we assume that each tag (kernel tag included) points to a
structure with the following header:

	struct tag_header {
		u8 type;
		u8 version;
		u8 reserved[6];
		/* stuff */
	};

The fact this is a common header is not documented anywhere or described in
any of the relevant structures (mpipl_fadump in OPAL, and
opal_fadump_mem_struct in Linux), but the code in Linux which scans for the
kernel tag assumes it's there.

From arch/powerpc/platforms/powernv/opal-fadump.c, line 597:

	do {
		ret = opal_mpipl_query_tag(idx, &addr);
		if (ret != OPAL_SUCCESS)
			break;

		addr = be64_to_cpu(addr);
		type = *((u8 *)addr);
		switch (type) {
		case MPIPL_FADUMP_TYPE_CPU:
			opal_cpu_metadata = __va(addr);
			break;
		case MPIPL_FADUMP_TYPE_KERNEL:
			opal_fdm_active = __va(addr);
			break;
		}

		pr_debug("idx: %d, addr: %llx, type: %x\n",
			 idx, addr, type);
		idx++;
	} while (ret == OPAL_SUCCESS);

So if the crashing OS doesn't use that header and the first byte happens to
be the same as MPIPL_FADUMP_TYPE_KERNEL, we'll probably blow up horribly
later on while parsing the kernel-defined fadump structure. While we're
here, why is MPIPL_FADUMP_TYPE_KERNEL = 0x80? It seems like a relic of the
old API.

That all said, I think most of these problems can be solved by making the
"type" field an output parameter of opal_mpipl_query_tag() so the common
header can be dropped. Without the common header you could then add an
address-only tag, defined by the crashing kernel, that indicates the upper
limit of preserved memory. That way the crashing kernel can configure the
dump table to make a few GB available for petitboot, and all petitboot
needs to do is query that tag to determine how much memory is available to
it.

Adding a uint64_t structure magic field to the structures defined by OPAL
and Linux is also a good idea since it gives you a reliable way to
determine whether it's a structure you know how to deal with or not.

Thoughts?

p.s.

Can we please settle on some kind of sane naming convention, like:

	opal_mpipl_ - only used for stuff defined by OPAL
	fadump_     - generic fadump in linux
	pnv_fadump_ - powernv specific fadump stuff

One of the things that has made reviewing the mpipl patches painful is that
there's no clear separation between what's defined in OPAL vs the kernel.
It's not a showstopper, but it makes an already hard job even harder. A
particularly egregious example is this bastard structure:

From skiboot's opal-api.h:

	struct mpipl_fadump {
		u8	type;	/* MPIPL_FADUMP_TYPE_* */
		u8	version;
		u8	reserved[6];
		u32	crashing_pir;	/* OPAL crashing CPU PIR */
		u32	cpu_data_version;
		u32	cpu_data_size;
		u32	fadump_region_cnt;
		struct fadump_region region[];
	} __packed;

From the kernel's opal-api.h:

	struct opal_mpipl_fadump {
		u8	type;
		u8	version;
		u8	reserved[6];
		__be32	crashing_pir;
		__be32	cpu_data_version;
		__be32	cpu_data_size;
		__be32	region_cnt;
		struct opal_fadump_region region[OPAL_FADUMP_NR_SECTIONS];
	} __attribute__((packed));

NOT EVEN THE SAME NAME! OR DEFINITION!
On 07/09/2019 03:33 PM, Oliver O'Halloran wrote:
> On Sun, 2019-06-16 at 22:40 +0530, Vasant Hegde wrote:
>> Pre-MPIPL kernel saves various information required to create vmcore in
>> metadata area and passes metadata area pointer to OPAL. OPAL will preserve
>> this pointer across MPIPL. Post MPIPL kernel will request for saved tags
>> via this API. Kernel also needs below tags:
>>   - Saved CPU registers data to access CPU registers
>>   - OPAL metadata area to create opalcore
>>
>> Format:
>>   opal_mpipl_query_tag(uint32_t idx, uint64_t *tag)
>>
>>   idx : tag index (0..n)
>>   tag : OPAL will pass saved tag
>>
>> Kernel will make this call with increased `index` until OPAL returns
>> OPAL_EMPTY.
>>
>> Return values:
>>   OPAL_SUCCESS   : Operation success
>>   OPAL_PARAMETER : Invalid parameter
>>   OPAL_EMPTY    : OPAL completed sending all tags to kernel
>
> After spending a while picking through the linux patches I'll say that the
> revised API is mostly fine, BUT:
>
> a) The documentation is *woefully* inadequate, and

I have documented the API under `doc/opal-api/opal-mpipl-173-174.rst` with
the expected behavior under various scenarios (pretty much similar to how
we documented other APIs).

> b) We need to think a bit more about how tags are used.

We (myself, Ananth, Nick, Hari, Mahesh) had a detailed *offline* discussion
on the tag. We did think about passing the `tag` type as part of the
opal_mpipl_query_tag() API, but Nick didn't want that and wanted to keep
the API simple. So now our API doesn't care about the `tag` type; it's just
an interface to retrieve various tags. So we had to push the tag type field
into the structure.

> My interpretation is that the kernel tag is there to allow the crashing
> kernel to define its own dump metadata for the receiving kernel. As a
> result firmware,

Yes. The kernel will have its metadata area and the tag points to the
metadata area. (Well, actually the tag can be anything using which the
post-MPIPL kernel can retrieve its metadata area.)

> which includes the petitboot kernel, can't make assumptions about the
> contents of the tagged data since it could be populated by a version of
> Linux with a different metadata ABI or by another OS entirely.

OPAL will not care about the tag content. Its job is to preserve the tag
across MPIPL and pass it back to the kernel.

> So as far as I can tell the overall process is something like:
>
> 1. We boot an OS. This may or may not be based on a Linux kernel.
> 2. OS configures MPIPL and sets the kernel tag to some structure it defines.
> 3. OS triggers a MPIPL. We re-run through the SBE and hostboot. Most of
>    memory is left as-is. Some memory used by hostboot, the ranges in the
>    MDDT, and the skiboot region at 0x3000_0000 are overwritten.
> 4. Skiboot parses the MDRT and sets a few DT flags to indicate an MPIPL
>    happened and populates the tags which are defined by OPAL.
> 5. Petitkernel boots and detects that an MPIPL happened and tries to
>    preserve the crashed kernel's memory.
> 6. We kexec into a capture kernel that knows how to parse the kernel tag's
>    metadata and generates a dump from that.

Correct.

> This raises a few questions:
>
> 1) How much memory does firmware have to work with? This doesn't seem to be

So post MPIPL:

- hostboot needs to be loaded and executed. It's done at a predefined
  address and that memory is part of the reserved-memory range. So we don't
  need to worry about this memory.
- OPAL memory: by the time OPAL loads, hostboot has already preserved
  memory based on the MDST and MDDT tables. These tables contain the OPAL
  runtime memory as well. So by the time we load OPAL it's already
  preserved. We are good to reload OPAL on the same memory.
- Memory used by OPAL to load the petitboot kernel: the first kernel
  advertises these ranges via `fw-load-area` so that the kernel takes care
  of preserving this memory.

> actually documented anywhere. The closest thing I can find to an answer is
> that the Linux patches will preserve at least 0x0...0x3000_0000 since
> that's what the fadump.ops->get_bootmem_min() returns for powernv.

The kernel will preserve kernel memory ranges. It won't care about the
firmware memory area.

> However, I think the minimum memory that firmware needs is something that
> it should tell the OS via the DT rather than having the OS guess.

The OS won't guess anything about firmware memory usage. In fact it won't
care about firmware memory except `fw-load-area`.

> We already communicate some of that information via the fw-load-area DT
> property, but that only contains the ranges where skiboot loads the kernel
> and initrd (0x2000_0000...0x2fff_ffff).

The OS needs to know about these memory ranges *only*...as the first kernel
may use this memory. If it uses it, it should take care of reserving
destination memory to preserve this memory and pass it during MPIPL
registration.

> Odds are hostboot doesn't use *that* much memory, but the petitboot kernel
> can

As mentioned above, the kernel need not worry about hostboot memory
usage...as it's part of the reserved-memory ranges.

> use quite a bit. I checked on a random lab system and it was using ~400MB
> after a reboot, just sitting at the petitboot shell prompt.

We are introducing a new kernel config (FADUMP_PRESERVE) and that will take
care of the petitboot kernel part. Hari can elaborate on this.

> 2) What happens if the tag points into a range that is moved by firmware?
> If this is considered a kernel bug then that needs to be documented in the
> API.

Kernel tag: during registration the kernel passes a tag (which points to
the metadata area) and post MPIPL OPAL has to send back this tag. It won't
care whether it's part of moved memory or not. From the kernel's point of
view it should get back what it passed.

OPAL tag: points to the OPAL metadata area. Using this pointer the kernel
can retrieve the OPAL dump area and generate the opalcore. The kernel won't
care whether it's part of original memory or moved memory.

> 3) How does a kernel know if it can parse the data pointed to by the
> kernel tag?

OPAL guarantees to return the tag sent by the kernel during registration
without modification. Additionally the kernel can validate the header.

> Right now we assume that each tag (kernel tag included) points to a
> structure with the following header:
>
> 	struct tag_header {
> 		u8 type;
> 		u8 version;
> 		u8 reserved[6];
> 		/* stuff */
> 	};
>
> The fact this is a common header is not documented anywhere or described
> in any of the relevant structures (mpipl_fadump in OPAL, and
> opal_fadump_mem_struct in Linux), but the code in Linux which scans for
> the kernel tag assumes it's there:

I think we missed syncing opal-api.h before sending the patches. It will be
fixed in the next version.

> From arch/powerpc/platforms/powernv/opal-fadump.c, line 597:
>
> 	do {
> 		ret = opal_mpipl_query_tag(idx, &addr);
> 		if (ret != OPAL_SUCCESS)
> 			break;
>
> 		addr = be64_to_cpu(addr);
> 		type = *((u8 *)addr);
> 		switch (type) {
> 		case MPIPL_FADUMP_TYPE_CPU:
> 			opal_cpu_metadata = __va(addr);
> 			break;
> 		case MPIPL_FADUMP_TYPE_KERNEL:
> 			opal_fdm_active = __va(addr);
> 			break;
> 		}
>
> 		pr_debug("idx: %d, addr: %llx, type: %x\n",
> 			 idx, addr, type);
> 		idx++;
> 	} while (ret == OPAL_SUCCESS);
>
> So if the crashing OS doesn't use that header and the first byte happens
> to be the same as MPIPL_FADUMP_TYPE_KERNEL, we'll probably blow up
> horribly later on

The first OS registers for MPIPL only after setting up its metadata area.
If it crashes before that, it means it is not registered for MPIPL. And
OPAL promises to return the same tag after MPIPL. So I don't see why things
will go wrong here.

> while parsing the kernel-defined fadump structure. While we're here, why
> is MPIPL_FADUMP_TYPE_KERNEL = 0x80? It seems like a relic of the old API.

Well, we need a way to differentiate the tag type: kernel, OPAL, etc. (in
future we may have new tag types as well). And we have a byte. So I just
used 0x01 -> 0x7F for the OPAL range and 0x80 - 0xff for the kernel.

> That all said, I think most of these problems can be solved by making the
> "type" field an output parameter of opal_mpipl_query_tag() so the common
> header can be dropped.

We don't really gain much by moving the "type" field from the structure to
the API. Whether it's in the structure or the API, OPAL will just return
whatever tag the kernel passed (and of course the OPAL tag for parsing the
OPAL core). We cannot drop the common structure anyway; the kernel needs to
know how to parse the OPAL metadata. So we need that.

> Without the common header you could then add an address-only tag that's
> defined by the crashing kernel that indicates the upper limit of preserved
> memory. That way the crashing kernel can configure the dump table to make
> a few GB available for petitboot and all petitboot needs to do is query
> that tag to determine how much memory is available to it.
>
> Adding a uint64_t structure magic field to the structures defined by OPAL
> and Linux is also a good idea since it gives you a reliable way to
> determine if it's a structure you know how to deal with or not.
>
> Thoughts?

I don't think we gain anything with that.

> p.s.
>
> Can we please settle on some kind of sane naming convention, like:
>
> 	opal_mpipl_ - only used for stuff defined by OPAL
> 	fadump_     - generic fadump in linux
> 	pnv_fadump_ - powernv specific fadump stuff.

I think I had mixed up fadump and mpipl in the earlier version. Now I have
settled on opal_mpipl for the OPAL side. On the kernel side I think we are
using the above convention. If we missed it we will fix it in the next
version.

> One of the things that has made reviewing the mpipl patches painful is
> that there's no clear separation between what's defined in OPAL vs the
> kernel. It's not a showstopper, but it makes an already hard job even
> harder. A particularly egregious example is this bastard structure:

As mentioned above, we missed syncing the structures before sending out the
patches. Sorry for that. It will be fixed in the next version.

-Vasant

> From skiboot's opal-api.h:
>
> 	struct mpipl_fadump {
> 		u8	type;	/* MPIPL_FADUMP_TYPE_* */
> 		u8	version;
> 		u8	reserved[6];
> 		u32	crashing_pir;	/* OPAL crashing CPU PIR */
> 		u32	cpu_data_version;
> 		u32	cpu_data_size;
> 		u32	fadump_region_cnt;
> 		struct fadump_region region[];
> 	} __packed;
>
> From the kernel's opal-api.h:
>
> 	struct opal_mpipl_fadump {
> 		u8	type;
> 		u8	version;
> 		u8	reserved[6];
> 		__be32	crashing_pir;
> 		__be32	cpu_data_version;
> 		__be32	cpu_data_size;
> 		__be32	region_cnt;
> 		struct opal_fadump_region region[OPAL_FADUMP_NR_SECTIONS];
> 	} __attribute__((packed));
>
> NOT EVEN THE SAME NAME! OR DEFINITION!
On 09/07/19 3:33 PM, Oliver O'Halloran wrote:
[...]
> Can we please settle on some kind of sane naming convention, like:
>
> opal_mpipl_ - only used for stuff defined by OPAL and

My bad! Will make sure the definitions in
arch/powerpc/include/asm/opal-api.h are consistent with the definitions in
`include/opal-api.h`.

> fadump_ - generic fadump in linux

Yes, this is how I have it currently. Will fix aberrations, if any, in the
next revision.

> pnv_fadump_ - powernv specific fadump stuff.

Using opal_fadump_ for powernv specific stuff, and I should probably use
rtas_fadump_ instead of pseries_fadump_ for pseries stuff.

Thanks
Hari
On Tue, 2019-07-09 at 20:53 +0530, Vasant Hegde wrote: > On 07/09/2019 03:33 PM, Oliver O'Halloran wrote: > > On Sun, 2019-06-16 at 22:40 +0530, Vasant Hegde wrote: > > > Pre-MPIPL kernel saves various information required to create vmcore in > > > metadata area and passes metadata area pointer to OPAL. OPAL will preserve > > > this pointer across MPIPL. Post MPIPL kernel will request for saved tags > > > via this API. Kernel also needs below tags: > > > - Saved CPU registers data to access CPU registers > > > - OPAL metadata area to create opalcore > > > > > > Format: > > > opal_mpipl_query_tag(uint32_t idx, uint64_t *tag) > > > > > > idx : tag index (0..n) > > > tag : OPAL will pass saved tag > > > > > > Kernel will make this call with increased `index` until OPAL returns > > > OPAL_EMPTY. > > > > > > Return values: > > > OPAL_SUCCESS : Operation success > > > OPAL_PARAMETER : Invalid parameter > > > OPAL_EMPTY : OPAL completed sending all tags to kernel > > > > After spending a while picking through the linux patches I'll say that the > > revised API is mostly fine, BUT: > > > > a) The documentation is *woefully* inadequate, and > > I have documented API under `doc/opal-api/opal-mpipl-173-174.rst` with > expected > behavior > under various scenarios (pretty much similar to how we documented other APIs). > > > > b) We need to think a bit more about how tags are used. > > We (Myself, Ananth, Nick, Hari, Mahesh) had a detailed *offline* discussion > on > tag. We did > thought about passing `tag` as part of opal_mpipl_query_tag() API. But Nick > don't wanted > that and wanted to keep API simple. So now our API doesn't care about `tag` > type. Its just > a interface to retrieve various tags. So we had to push the tag field to > structure. > > > > My interpretation is that the kernel tag is there to allow the crashing > > kernel > > to define it's own dump metadata for the recieving kernel. As a result > > firmware, > > Yes. 
Kernel will have its metadata area and tag points to the metadata area. > (well, actually tag > can be anything using which post kernel can retrieve its metadata area). > > > which includes the petitboot kernel, can't make assumptions about the > > contents > > of the tagged data since it could be populated by a version of Linux with a > > different metadata ABI or by another OS entirely. > > OPAL will not care about tag content. Its job is to preserve tag across MPIPL > and > pass it back to kernel. > > > > So as far as I can tell the overall process is something like: > > > > 1. We boot an OS. This may or may not be based on a Linux kernel. > > 2. OS configures MPIPL and sets the kernel tag to some structure it defines. > > 3. OS triggers a MPIPL. We re-run through the SBE and hostboot. Most of > > memory is left as-is. Some memory used by hostboot, the ranges in the > > MDDT, > > and the skiboot region at 0x3000_0000 are overwritten. > > 4. Skiboot parses the MDRT and sets a few DT flags to indicate an MPIPL > > happened and populates the tags which are defined by OPAL. > > 5. Petitkernel boots and detects that an MPIPL happened and tries to > > preserve > > the crashed kernel's memory. > > 6. We kexec into a capture kernel that knows how to parse the kernel tag's > > metadata and generates a dump from that. > > Correct. > > > > > This raises a few questions: > > > > 1) How much memory does firmware have to work with? This doesn't seem to be > > So Post MPIPL, > - hostboot needs to be loaded and executed. Its done at predefined address > and > that memory is part of reserved-memory range. So we don't need to worry > about this memory. > - OPAL memory > By the time OPAL loads, hostboot already preserved memory based on MDST, > MDDT table. > These table contains OPAL runtime memory as well. So by the time we load > OPAL its already > preserved. We are good to reload OPAL on same memory. 
> - Memory used by OPAL to load petitboot kernel
>   First kernel advertises these ranges via `fw-load-area` so that the
>   kernel takes care of preserving this memory.
>
> > actually documented anywhere. The closest thing I can find to an answer
> > is that the Linux patches will preserve at least 0x0...0x3000_0000 since
> > that's what the fadump.ops->get_bootmem_min() returns for powernv.
>
> Kernel will preserve kernel memory ranges. It won't care about firmware
> memory area.
>
> > However, I think the minimum memory that firmware needs is something
> > that it should tell the OS via the DT rather than having the OS guess.
>
> OS won't guess anything about firmware memory usage. In fact it won't care
> about firmware memory except `fw-load-area`.
>
> > We already communicate some of that information via the fw-load-area DT
> > property, but that only contains the ranges where skiboot loads the
> > kernel and initrd (0x2000_0000...0x2fff_ffff).
>
> OS needs to know about these memory ranges *only*...as the first kernel
> may use this memory. If it uses it, it should take care of reserving
> destination memory to preserve this memory and pass it during MPIPL
> registration.
>
> > Odds are hostboot doesn't use *that* much memory, but the petitboot
> > kernel can
>
> As mentioned above kernel need not worry about hostboot memory usage...as
> it's part of reserved-memory ranges.

Vasant,

I think you may have missed Oliver's point in the above.

The Petitboot kernel needs some amount of space to operate. In your current
patch series, the amount of space "firmware" is allowed to use is
hard-wired by the crashing kernel. Oliver's point is that this size should
be defined dynamically by firmware, not hard-wired into the crashing
kernel.
If we ever change petitboot or some other component of the firmware so that
it needs to use more memory, then the crashing kernel won't have any way of
discovering that and it won't allocate enough space for this process to
start.

Mikey
On 10/07/19 9:22 AM, Michael Neuling wrote:
> On Tue, 2019-07-09 at 20:53 +0530, Vasant Hegde wrote:
>> On 07/09/2019 03:33 PM, Oliver O'Halloran wrote:
>>> On Sun, 2019-06-16 at 22:40 +0530, Vasant Hegde wrote:
[...]
>> OS won't guess anything about firmware memory usage. In fact it won't
>> care about firmware memory except `fw-load-area`.
>>
>>> We already communicate some of that information via the fw-load-area DT
>>> property, but that only contains the ranges where skiboot loads the
>>> kernel and initrd (0x2000_0000...0x2fff_ffff).
>>
>> OS needs to know about these memory ranges *only*...as the first kernel
>> may use this memory. If it uses it, it should take care of reserving
>> destination memory to preserve this memory and pass it during MPIPL
>> registration.
>>
>>> Odds are hostboot doesn't use *that* much memory, but the petitboot
>>> kernel can
>>
>> As mentioned above kernel need not worry about hostboot memory
>> usage...as it's part of reserved-memory ranges.
>
> Vasant,
>
> I think you may have missed Oliver's point in the above.
>
> The Petitboot kernel needs some amount of space to operate. In your
> current patch series, the amount of space "firmware" is allowed to use is
> hard-wired by the crashing kernel. Oliver's point is that this size should
> be defined dynamically by firmware, not hard-wired into the crashing
> kernel.
>
> If we ever change petitboot or some other component of the firmware so
> that it needs to use more memory, then the crashing kernel won't have any
> way of discovering that and it won't allocate enough space for this
> process to start.

Mikey, the memory to be used by the petitboot kernel is restricted by the
crashkernel parameter. This is not hard-wired; it has to be configured by
the user (a system admin, generally). The value typically has to be
sufficient to boot a petitboot kernel and also the production kernel to
capture the dump...

Thanks
Hari
On Wed, Jul 10, 2019 at 2:50 PM Hari Bathini <hbathini@linux.ibm.com> wrote: > > > On 10/07/19 9:22 AM, Michael Neuling wrote: > > On Tue, 2019-07-09 at 20:53 +0530, Vasant Hegde wrote: > > On 07/09/2019 03:33 PM, Oliver O'Halloran wrote: > > On Sun, 2019-06-16 at 22:40 +0530, Vasant Hegde wrote: > > [...] > > OS won't guess anything about firmware memory usage. In fact it won't care > about > firmware memory except `fw-load-area`. > > We already communicate some of that > information via the fw-load-area DT property, but that only contains the > ranges where skiboot loads the kernel and initrd (0x2000_000...0x2fff_ffff). > > OS needs to know about these memory ranges *only*...as first kernel may use > these > memory. If it uses it should take care of reserving destination memory to > preserve these > memory and pass it during MPIPL registration. > > Odds are hostboot doesn't use *that* much memory, but the petitboot kernel > can > > As mentioned above kernel need not worry about hostboot memory usage...as its > part of reserved-memory ranges. > > Vasant, > > I think you may have missed Oliver's point in the above. > > The Petitboot kernel needs some amount space to operate. In your current patch > series, the amount of space "firmware" is allowed to use is hard-wired by the > crashing kernel. Olivers point is that this size should be defined dynamically > by firmware, not hard-wired into the crashing kernel. > > If we every change petitboot or some other component of the firmware so that it > needs to use more memory, then the crashing kernel won't have any way of > discovering that and it won't allocate enough space for this process to start. > > Mikey, the memory to be used by petitboot kernel is restricted by crashkernel parameter. > This is not hard-wired. This has to be configured by the user (a systemadmin, generally). 
> The value is typically something that has to be sufficient to boot a
> petitboot kernel and also the production kernel to capture the dump...

Right, the problem is that the metadata set by the production kernel *must*
be in the format that the petitboot kernel expects. The point of making
OPAL oblivious to the contents of the kernel tag was to allow the
production kernel to define its own metadata format rather than having one
imposed on it by the platform.

Previously the platform-imposed structure was defined by the OPAL API and
we "fixed" that by moving to the tags API where OPAL is oblivious of the
kernel metadata. This is an improvement, but the way the tag API is
structured means that we now have a platform-imposed structure coming from
the petitboot kernel, which defeats the point of changing the API in the
first place.

As far as I can tell the only reason we've done this is to allow the
petitboot kernel to parse the metadata to work out how much memory it's
allowed to trash. Why not just solve that very specific problem and let the
production kernel do whatever it likes?

Oliver
On 10/07/19 11:22 AM, Oliver O'Halloran wrote:
> On Wed, Jul 10, 2019 at 2:50 PM Hari Bathini <hbathini@linux.ibm.com> wrote:
>>
>> On 10/07/19 9:22 AM, Michael Neuling wrote:
>>
>> On Tue, 2019-07-09 at 20:53 +0530, Vasant Hegde wrote:
>>
>> On 07/09/2019 03:33 PM, Oliver O'Halloran wrote:
>>
>> On Sun, 2019-06-16 at 22:40 +0530, Vasant Hegde wrote:
>>
>> [...]
>>
>> OS won't guess anything about firmware memory usage. In fact it won't
>> care about firmware memory except `fw-load-area`.
>>
>> We already communicate some of that information via the fw-load-area
>> DT property, but that only contains the ranges where skiboot loads the
>> kernel and initrd (0x2000_000...0x2fff_ffff).
>>
>> OS needs to know about these memory ranges *only*... as the first
>> kernel may use this memory. If it does, it should take care of
>> reserving destination memory to preserve it and pass that during
>> MPIPL registration.
>>
>> Odds are hostboot doesn't use *that* much memory, but the petitboot
>> kernel can
>>
>> As mentioned above, the kernel need not worry about hostboot memory
>> usage... as it's part of the reserved-memory ranges.
>>
>> Vasant,
>>
>> I think you may have missed Oliver's point in the above.
>>
>> The petitboot kernel needs some amount of space to operate. In your
>> current patch series, the amount of space "firmware" is allowed to use
>> is hard-wired by the crashing kernel. Oliver's point is that this size
>> should be defined dynamically by firmware, not hard-wired into the
>> crashing kernel.
>>
>> If we ever change petitboot or some other component of the firmware so
>> that it needs to use more memory, then the crashing kernel won't have
>> any way of discovering that, and it won't allocate enough space for
>> this process to start.
>>
>> Mikey, the memory to be used by the petitboot kernel is restricted by
>> the crashkernel parameter. This is not hard-wired; it has to be
>> configured by the user (a system admin, generally).
>> The value is typically something that has to be sufficient to boot a
>> petitboot kernel and also the production kernel to capture the dump...
> Right, the problem is that the metadata set by the production kernel
> *must* be in the format that the petitboot kernel expects. The point of
> making OPAL oblivious to the contents of the kernel tag was to allow
> the production kernel to define its own metadata format rather than
> having one imposed on it by the platform.
>
> Previously the platform-imposed structure was defined by the OPAL API
> and we "fixed" that by moving to the tags API, where OPAL is oblivious
> of the kernel metadata. This is an improvement, but the way the tag API
> is structured means that we now have a platform-imposed structure
> coming from the petitboot kernel, which defeats the point of changing
> the API in the first place.
>
> As far as I can tell, the only reason we've done this is to allow the
> petitboot kernel to parse the metadata to work out how much memory it's
> allowed to trash. Why not just solve that very specific problem and let
> the production kernel do whatever it likes?
>

Sorry if I was not clear, but that is how it is. Metadata is set up by
the production kernel based on user configuration, and after a crash the
petitboot kernel looks for this metadata to work out how much memory it
is allowed to trash (https://patchwork.ozlabs.org/patch/1122300/).

As for the type field in the metadata, that requirement comes from the
need to keep metadata retrieval simple. Without a type field, we either
need a predefined ordering or the kernel has to remember which tag
number corresponds to which data type. While the former seems naive, the
latter brings with it a need to handle kernel metadata separately, as we
need to get that data before processing tags..

I would prefer the current way as it is simple and we don't do any
special handling for kernel metadata.. We should get that requirement
documented though..

Thanks
Hari
On Wed, Jul 10, 2019 at 5:11 PM Hari Bathini <hbathini@linux.ibm.com> wrote:
>
> Sorry if I was not clear, but that is how it is. Metadata is set up by
> the production kernel based on user configuration, and after a crash
> the petitboot kernel looks for this metadata to work out how much
> memory it is allowed to trash
> (https://patchwork.ozlabs.org/patch/1122300/).
>
> As for the type field in the metadata, that requirement comes from the
> need to keep metadata retrieval simple. Without a type field, we either
> need a predefined ordering or the kernel has to remember which tag
> number corresponds to which data type. While the former seems naive,
> the latter brings with it a need to handle kernel metadata separately,
> as we need to get that data before processing tags.

I don't see a problem with either approach. Say we do the "naive"
thing and change the API to this:

    enum opal_mpipl_tag {
        OPAL_MPIPL_TAG_PROC_DATA,
        OPAL_MPIPL_TAG_CORE,
        OPAL_MPIPL_TAG_KERNEL,
    };

And fetch the tags with:

    int64_t opal_mpipl_query_tag(enum opal_mpipl_tag tag, uint64_t *tag_value);

Is this any more complex, or less flexible, than the current
query-and-iterate approach? The kernel is going to ignore tags that it
doesn't recognise in either case.

Anyway, take that, add an extra tag that's set by the production kernel
to indicate where preserved memory starts, and CONFIG_FADUMP_PRESERVE
boils down to:

    rc = opal_mpipl_query_tag(OPAL_MPIPL_TAG_PRESERVE_LIMIT, &limit);
    if (!rc)
        memblock_reserve(limit, memblock_end_of_DRAM());

> I would prefer the current way as it is simple and we don't do any
> special handling for kernel metadata..

No one is suggesting adding any metadata handling to OPAL, so I don't
understand what you're complaining about here.

> We should get that requirement documented though..
>
> Thanks
> Hari
>
On 07/10/2019 09:22 AM, Michael Neuling wrote:
> On Tue, 2019-07-09 at 20:53 +0530, Vasant Hegde wrote:
>> On 07/09/2019 03:33 PM, Oliver O'Halloran wrote:
>>> On Sun, 2019-06-16 at 22:40 +0530, Vasant Hegde wrote:
>>>> Pre-MPIPL kernel saves various information required to create vmcore in
>>>> metadata area and passes metadata area pointer to OPAL. OPAL will preserve
>>>> this pointer across MPIPL. Post MPIPL kernel will request for saved tags
>>>> via this API. Kernel also needs below tags:
>>>>   - Saved CPU registers data to access CPU registers
>>>>   - OPAL metadata area to create opalcore
>>>>

.../...

>>> We already communicate some of that information via the fw-load-area
>>> DT property, but that only contains the ranges where skiboot loads
>>> the kernel and initrd (0x2000_000...0x2fff_ffff).
>>
>> OS needs to know about these memory ranges *only*... as the first
>> kernel may use this memory. If it does, it should take care of
>> reserving destination memory to preserve it and pass that during
>> MPIPL registration.
>>
>>> Odds are hostboot doesn't use *that* much memory, but the petitboot
>>> kernel can
>>
>> As mentioned above, the kernel need not worry about hostboot memory
>> usage... as it's part of the reserved-memory ranges.

Mikey,

> Vasant,
>
> I think you may have missed Oliver's point in the above.

Yes. I did miss the point of the ABI -OR- endianness difference between
the petitboot and host kernels.

> The petitboot kernel needs some amount of space to operate. In your
> current patch series, the amount of space "firmware" is allowed to use
> is hard-wired by the crashing kernel. Oliver's point is that this size
> should be defined dynamically by firmware, not hard-wired into the
> crashing kernel.

Nope. It's not hard-wired. But the problem with the current patch series
is that the petitboot kernel goes and reads the host Linux metadata area
to find the memory it can use to boot... which is not correct.

> If we ever change petitboot or some other component of the firmware so
> that it needs to use more memory, then the crashing kernel won't have
> any way of discovering that, and it won't allocate enough space for
> this process to start.

We do not have a problem with the rest of the firmware stack... as those
components have their own load area and it is reserved. The only issue
is with the petitboot kernel.

How about something like the below:
  - First boot: the host kernel tells OPAL how much memory firmware
    (including petitboot) can use during MPIPL boot:
        opal_mpipl_update(OPAL_MPIPL_BOOT_MEM_SIZE, <size of memory petitboot can use>)
  - OPAL will preserve this across MPIPL
  - Post MPIPL: the petitboot kernel makes a query to get the usable size:
        opal_mpipl_query_tag(type, *tag)
        type = GET_BOOT_SIZE, *tag = size (assumption is that the kernel
        will load at address 0)

This means we have to change the opal_mpipl_query_tag() API again. We
will replace the `index` field with a `type` field:
    opal_mpipl_query_tag(type, *tag)
    type = OPAL tag, kernel tag, CPU tag, boot mem size tag, etc.

Post MPIPL boot, the kernel will call opal_mpipl_query_tag() with a
specific type field to get the data.

I'm fine with making the above changes.

@Nick, any thoughts on the above changes?

-Vasant
On 10/07/19 3:05 PM, Oliver O'Halloran wrote:
> On Wed, Jul 10, 2019 at 5:11 PM Hari Bathini <hbathini@linux.ibm.com> wrote:
>> Sorry if I was not clear, but that is how it is. Metadata is set up by
>> the production kernel based on user configuration, and after a crash
>> the petitboot kernel looks for this metadata to work out how much
>> memory it is allowed to trash
>> (https://patchwork.ozlabs.org/patch/1122300/).
>>
>> As for the type field in the metadata, that requirement comes from the
>> need to keep metadata retrieval simple. Without a type field, we
>> either need a predefined ordering or the kernel has to remember which
>> tag number corresponds to which data type. While the former seems
>> naive, the latter brings with it a need to handle kernel metadata
>> separately, as we need to get that data before processing tags.
> I don't see a problem with either approach. Say we do the "naive"
> thing and change the API to this:
>
>     enum opal_mpipl_tag {
>         OPAL_MPIPL_TAG_PROC_DATA,
>         OPAL_MPIPL_TAG_CORE,
>         OPAL_MPIPL_TAG_KERNEL,
>     };
>
> And fetch the tags with:
>
>     int64_t opal_mpipl_query_tag(enum opal_mpipl_tag tag, uint64_t *tag_value);

Actually, I would love to have it this way, but I remember Nick was not
for having a type field in the OPAL call..

Thanks
Hari
On 07/10/2019 03:20 PM, Vasant Hegde wrote:
> On 07/10/2019 09:22 AM, Michael Neuling wrote:
>> On Tue, 2019-07-09 at 20:53 +0530, Vasant Hegde wrote:
>>> On 07/09/2019 03:33 PM, Oliver O'Halloran wrote:
>>>> On Sun, 2019-06-16 at 22:40 +0530, Vasant Hegde wrote:
>>>>> Pre-MPIPL kernel saves various information required to create vmcore in
>>>>> metadata area and passes metadata area pointer to OPAL. OPAL will preserve
>>>>> this pointer across MPIPL. Post MPIPL kernel will request for saved tags
>>>>> via this API. Kernel also needs below tags:
>>>>>   - Saved CPU registers data to access CPU registers
>>>>>   - OPAL metadata area to create opalcore
>>>>>
>
> .../...
>
>>>> We already communicate some of that information via the fw-load-area
>>>> DT property, but that only contains the ranges where skiboot loads
>>>> the kernel and initrd (0x2000_000...0x2fff_ffff).
>>>
>>> OS needs to know about these memory ranges *only*... as the first
>>> kernel may use this memory. If it does, it should take care of
>>> reserving destination memory to preserve it and pass that during
>>> MPIPL registration.
>>>
>>>> Odds are hostboot doesn't use *that* much memory, but the petitboot
>>>> kernel can
>>>
>>> As mentioned above, the kernel need not worry about hostboot memory
>>> usage... as it's part of the reserved-memory ranges.
>
> Mikey,
>
>> Vasant,
>>
>> I think you may have missed Oliver's point in the above.
>
> Yes. I did miss the point of the ABI -OR- endianness difference between
> the petitboot and host kernels.
>
>> The petitboot kernel needs some amount of space to operate. In your
>> current patch series, the amount of space "firmware" is allowed to use
>> is hard-wired by the crashing kernel. Oliver's point is that this size
>> should be defined dynamically by firmware, not hard-wired into the
>> crashing kernel.
>
> Nope. It's not hard-wired. But the problem with the current patch
> series is that the petitboot kernel goes and reads the host Linux
> metadata area to find the memory it can use to boot... which is not
> correct.
>
>> If we ever change petitboot or some other component of the firmware so
>> that it needs to use more memory, then the crashing kernel won't have
>> any way of discovering that, and it won't allocate enough space for
>> this process to start.
>
> We do not have a problem with the rest of the firmware stack... as
> those components have their own load area and it is reserved. The only
> issue is with the petitboot kernel.
>
> How about something like the below:
>   - First boot: the host kernel tells OPAL how much memory firmware
>     (including petitboot) can use during MPIPL boot:
>         opal_mpipl_update(OPAL_MPIPL_BOOT_MEM_SIZE, <size of memory petitboot can use>)
>   - OPAL will preserve this across MPIPL
>   - Post MPIPL: the petitboot kernel makes a query to get the usable size:
>         opal_mpipl_query_tag(type, *tag)
>         type = GET_BOOT_SIZE, *tag = size (assumption is that the
>         kernel will load at address 0)
>
> This means we have to change the opal_mpipl_query_tag() API again. We
> will replace the `index` field with a `type` field:
>     opal_mpipl_query_tag(type, *tag)
>     type = OPAL tag, kernel tag, CPU tag, boot mem size tag, etc.
>
> Post MPIPL boot, the kernel will call opal_mpipl_query_tag() with a
> specific type field to get the data.

I think it makes sense to have a separate API for tag registration
instead of overloading the opal_mpipl_update() API.

So we will have 3 APIs:

1 - opal_mpipl_update(ops, src, dest, size)

    enum opal_mpipl_ops {
        OPAL_MPIPL_ADD_RANGE             = 0,
        OPAL_MPIPL_REMOVE_RANGE          = 1,
        OPAL_MPIPL_REMOVE_ALL            = 2,
        OPAL_MPIPL_FREE_PRESERVED_MEMORY = 3,
    };

    enum opal_mpipl_tags {
        OPAL_MPIPL_TAG_CPU      = 0,
        OPAL_MPIPL_TAG_OPAL     = 1,
        OPAL_MPIPL_TAG_KERNEL   = 2,
        OPAL_MPIPL_TAG_BOOT_MEM = 3,
    };

2 - opal_mpipl_register_tag(tag, tag_val)
    The kernel will pass tag = OPAL_MPIPL_TAG_KERNEL and
    OPAL_MPIPL_TAG_BOOT_MEM.

3 - opal_mpipl_query_tag(tag, *tag_val)
    The kernel will make this call for all the tags listed in
    opal_mpipl_tags.

-Vasant

> I'm fine with making the above changes.
>
> @Nick, any thoughts on the above changes?
>
> -Vasant
diff --git a/core/opal-dump.c b/core/opal-dump.c
index 7b8823d84..e53991784 100644
--- a/core/opal-dump.c
+++ b/core/opal-dump.c
@@ -44,6 +44,17 @@
 static struct mpipl_metadata	*mpipl_metadata;
 /* Dump metadata area */
 static struct mpipl_fadump *mpipl_opal_data;
 
+/*
+ * Number of tags passed by OPAL to kernel after MPIPL boot.
+ * Currently it supports below tags:
+ *   - Kernel passed tag during MPIPL registration
+ *   - OPAL metadata area address
+ *   - CPU register data area
+ */
+#define MAX_MPIPL_TAGS	0x03
+static u64 mpipl_tags[MAX_MPIPL_TAGS];
+static int max_tags;
+
 static int opal_mpipl_add_entry(u8 region, u64 src, u64 dest, u64 size)
 {
@@ -308,6 +319,20 @@ static int64_t opal_mpipl_update(enum mpipl_ops ops,
 	return rc;
 }
 
+static uint64_t opal_mpipl_query_tag(uint32_t idx, uint64_t *tag)
+{
+	if (!opal_addr_valid(tag)) {
+		prlog(PR_DEBUG, "Invalid tag address\n");
+		return OPAL_PARAMETER;
+	}
+
+	if (idx >= max_tags)
+		return OPAL_EMPTY;
+
+	*tag = mpipl_tags[idx];
+	return OPAL_SUCCESS;
+}
+
 static void post_mpipl_get_opal_data(void)
 {
 	struct mdrt_table *mdrt = (void *)(MDRT_TABLE_BASE);
@@ -356,6 +381,12 @@ static void post_mpipl_get_opal_data(void)
 		if (j == count)
 			break;
 	}
+
+	/* Update tags */
+	if (mpipl_metadata->kernel_tag)
+		mpipl_tags[max_tags++] = mpipl_metadata->kernel_tag;
+
+	mpipl_tags[max_tags++] = (u64)mpipl_opal_data;
 }
 
 void opal_mpipl_save_crashing_pir(void)
@@ -406,4 +437,5 @@ void opal_mpipl_init(void)
 
 	/* OPAL API for MPIPL update */
 	opal_register(OPAL_MPIPL_UPDATE, opal_mpipl_update, 4);
+	opal_register(OPAL_MPIPL_QUERY_TAG, opal_mpipl_query_tag, 2);
 }
diff --git a/include/opal-api.h b/include/opal-api.h
index 5af1a6d0f..6f6656557 100644
--- a/include/opal-api.h
+++ b/include/opal-api.h
@@ -233,7 +233,8 @@
 #define OPAL_NPU_MEM_ALLOC			171
 #define OPAL_NPU_MEM_RELEASE			172
 #define OPAL_MPIPL_UPDATE			173
-#define OPAL_LAST				173
+#define OPAL_MPIPL_QUERY_TAG			174
+#define OPAL_LAST				174
 
 #define QUIESCE_HOLD			1 /* Spin all calls at entry */
 #define QUIESCE_REJECT			2 /* Fail all calls with OPAL_BUSY */
Pre-MPIPL kernel saves various information required to create vmcore in
the metadata area and passes the metadata area pointer to OPAL. OPAL
will preserve this pointer across MPIPL. Post MPIPL, the kernel will
request the saved tags via this API. The kernel also needs the below
tags:
  - Saved CPU register data, to access CPU registers
  - OPAL metadata area, to create opalcore

Format:
  opal_mpipl_query_tag(uint32_t idx, uint64_t *tag)

  idx : tag index (0..n)
  tag : OPAL will pass the saved tag

The kernel will make this call with an increasing `idx` until OPAL
returns OPAL_EMPTY.

Return values:
  OPAL_SUCCESS   : Operation success
  OPAL_PARAMETER : Invalid parameter
  OPAL_EMPTY     : OPAL completed sending all tags to kernel

Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
---
 core/opal-dump.c   | 32 ++++++++++++++++++++++++++++++++
 include/opal-api.h |  3 ++-
 2 files changed, 34 insertions(+), 1 deletion(-)