diff mbox

[V4] powerpc/prom: Export device tree physical address via proc

Message ID 1279120733-13584-1-git-send-email-msm@freescale.com (mailing list archive)
State Rejected
Headers show

Commit Message

Matthew McClintock July 14, 2010, 3:18 p.m. UTC
To build a proper flat device tree for kexec we need to know which
memreserve region was used for the device tree for the currently
running kernel, so we can remove it and replace it with the new
memreserve for the kexec'ed kernel

Signed-off-by: Matthew McClintock <msm@freescale.com>
---
V4: Fixed misspelling

V3: Remove unneeded cast, and fixed indentation screw up

V2: messed up changes

 arch/powerpc/kernel/prom.c |   45 ++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 45 insertions(+), 0 deletions(-)

Comments

Tabi Timur-B04825 July 14, 2010, 3:33 p.m. UTC | #1
Matthew McClintock wrote:

> +static struct property flat_dt_start_prop = {
> +	.name = "linux,devicetree-start",
> +	.length = sizeof(phys_addr_t),
> +	.value =&flat_dt_start,
> +};
> +
> +static struct property flat_dt_end_prop = {
> +	.name = "linux,devicetree-end",
> +	.length = sizeof(phys_addr_t),
> +	.value =&flat_dt_end,
> +};

I think Segher was suggesting that you use "linux,device-tree-xxx".

> +
> +static int __init export_flat_device_tree_phys_addr(void)
> +{
> +	struct property *prop;
> +	struct device_node *node;
> +
> +	node = of_find_node_by_path("/chosen");
> +	if (!node)
> +		return -ENOENT;
> +
> +	prop = of_find_property(node, "linux,devicetree-start", NULL);

Does this work?

prop = of_find_property(node, flat_dt_start_prop.name, NULL);

> +	if (prop)
> +		prom_remove_property(node, prop);
> +
> +	prop = of_find_property(node, "linux,devicetree-end", NULL);
> +	if (prop)
> +		prom_remove_property(node, prop);
> +
> +	flat_dt_start = virt_to_phys(initial_boot_params);
> +	flat_dt_end = virt_to_phys(initial_boot_params)
> +				+ initial_boot_params->totalsize;

This is better, I think:

	flat_dt_end = flat_dt_start + initial_boot_params->totalsize;
Segher Boessenkool July 14, 2010, 3:35 p.m. UTC | #2
> V4: Fixed misspelling

Any particular reason you fixed only one of the two
mispelings I pointed out?  (device tree is two words,
not one).

> +	prop = of_find_property(node, "linux,devicetree-start", NULL);
> +	if (prop)
> +		prom_remove_property(node, prop);
> +
> +	prop = of_find_property(node, "linux,devicetree-end", NULL);
> +	if (prop)
> +		prom_remove_property(node, prop);
> +
> +	flat_dt_start = virt_to_phys(initial_boot_params);
> +	flat_dt_end = virt_to_phys(initial_boot_params)
> +				+ initial_boot_params->totalsize;
> +	prom_add_property(node, &flat_dt_start_prop);
> +	prom_add_property(node, &flat_dt_end_prop);

You could use one property instead of two; use addr+len
like every other property does.

You also should use a better name for the property; is this
the previous kernel's device tree?  Just "device-tree" makes
no sense, it is not pointing to "the" device tree for sure!


Segher
Matthew McClintock July 14, 2010, 3:42 p.m. UTC | #3
On Wed, 2010-07-14 at 17:35 +0200, Segher Boessenkool wrote:
> > V4: Fixed misspelling
> 
> Any particular reason you fixed only one of the two
> mispelings I pointed out?  (device tree is two words,
> not one).

Ahh, my fault.

> 
> > +	prop = of_find_property(node, "linux,devicetree-start", NULL);
> > +	if (prop)
> > +		prom_remove_property(node, prop);
> > +
> > +	prop = of_find_property(node, "linux,devicetree-end", NULL);
> > +	if (prop)
> > +		prom_remove_property(node, prop);
> > +
> > +	flat_dt_start = virt_to_phys(initial_boot_params);
> > +	flat_dt_end = virt_to_phys(initial_boot_params)
> > +				+ initial_boot_params->totalsize;
> > +	prom_add_property(node, &flat_dt_start_prop);
> > +	prom_add_property(node, &flat_dt_end_prop);
> 
> You could use one property instead of two; use addr+len
> like every other property does.
> 
> You also should use a better name for the property; is this
> the previous kernel's device tree?  Just "device-tree" makes
> no sense, it is not pointing to "the" device tree for sure!
> 

What about just one node called "flat-device-tree"?

-Matthew
Segher Boessenkool July 14, 2010, 3:46 p.m. UTC | #4
>> Any particular reason you fixed only one of the two
>> mispelings I pointed out?  (device tree is two words,
>> not one).
>
> Ahh, my fault.

Well I wasn't terribly clear ;-)

>> You could use one property instead of two; use addr+len
>> like every other property does.
>>
>> You also should use a better name for the property; is this
>> the previous kernel's device tree?  Just "device-tree" makes
>> no sense, it is not pointing to "the" device tree for sure!
>
> What about just one node called "flat-device-tree"?

But *what* flat device tree?  It cannot be "the" flat device tree,
or it would be useless information, since we are already reading it!


Segher
Matthew McClintock July 15, 2010, 6:17 a.m. UTC | #5
On Wed, 2010-07-14 at 17:46 +0200, Segher Boessenkool wrote:
> > What about just one node called "flat-device-tree"?
> 
> But *what* flat device tree?  It cannot be "the" flat device tree,
> or it would be useless information, since we are already reading it! 

I thought about it all day and did not come up with anything terribly
insightful...

"boot-fdt-phys-addr" 
"fdt-phys-addr"
"boot-fdt-phys-reg"
"fdt-phys-addr-reg"

Thoughts?

-M
Grant Likely July 15, 2010, 6:21 a.m. UTC | #6
On Wed, Jul 14, 2010 at 9:18 AM, Matthew McClintock <msm@freescale.com> wrote:
> To build a proper flat device tree for kexec we need to know which
> memreserve region was used for the device tree for the currently
> running kernel, so we can remove it and replace it with the new
> memreserve for the kexec'ed kernel
>
> Signed-off-by: Matthew McClintock <msm@freescale.com>

Hi Matthew.

I don't understand.  Why does userspace need to know about the old
memreserve sections?  Doesn't kexec tear down all of the old
allocations anyway?  How are they relevant for constructing the dtb
for the kexec kernel?  I'll need a lot more details before I consider
merging this.

Also, please cc: me and Ben Herrenschmidt on powerpc related device
tree changes.

Cheers,
g.

> ---
> V4: Fixed misspelling
>
> V3: Remove unneeded cast, and fixed indentation screw up
>
> V2: messed up changes
>
>  arch/powerpc/kernel/prom.c |   45 ++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 45 insertions(+), 0 deletions(-)
>
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index fd9359a..ff3e240 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -32,6 +32,7 @@
>  #include <linux/debugfs.h>
>  #include <linux/irq.h>
>  #include <linux/lmb.h>
> +#include <linux/bootmem.h>
>
>  #include <asm/prom.h>
>  #include <asm/rtas.h>
> @@ -911,3 +912,47 @@ static int __init export_flat_device_tree(void)
>  }
>  __initcall(export_flat_device_tree);
>  #endif
> +
> +#ifdef CONFIG_KEXEC
> +static phys_addr_t flat_dt_start;
> +static phys_addr_t flat_dt_end;
> +
> +static struct property flat_dt_start_prop = {
> +       .name = "linux,devicetree-start",
> +       .length = sizeof(phys_addr_t),
> +       .value = &flat_dt_start,
> +};
> +
> +static struct property flat_dt_end_prop = {
> +       .name = "linux,devicetree-end",
> +       .length = sizeof(phys_addr_t),
> +       .value = &flat_dt_end,
> +};
> +
> +static int __init export_flat_device_tree_phys_addr(void)
> +{
> +       struct property *prop;
> +       struct device_node *node;
> +
> +       node = of_find_node_by_path("/chosen");
> +       if (!node)
> +               return -ENOENT;
> +
> +       prop = of_find_property(node, "linux,devicetree-start", NULL);
> +       if (prop)
> +               prom_remove_property(node, prop);
> +
> +       prop = of_find_property(node, "linux,devicetree-end", NULL);
> +       if (prop)
> +               prom_remove_property(node, prop);
> +
> +       flat_dt_start = virt_to_phys(initial_boot_params);
> +       flat_dt_end = virt_to_phys(initial_boot_params)
> +                               + initial_boot_params->totalsize;
> +       prom_add_property(node, &flat_dt_start_prop);
> +       prom_add_property(node, &flat_dt_end_prop);
> +
> +       return 0;
> +}
> +__initcall(export_flat_device_tree_phys_addr);
> +#endif
> --
> 1.6.6.1
>
>
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
>
Matthew McClintock July 15, 2010, 3:19 p.m. UTC | #7
On Thu, 2010-07-15 at 00:21 -0600, Grant Likely wrote:
> On Wed, Jul 14, 2010 at 9:18 AM, Matthew McClintock
> <msm@freescale.com> wrote:
> > To build a proper flat device tree for kexec we need to know which
> > memreserve region was used for the device tree for the currently
> > running kernel, so we can remove it and replace it with the new
> > memreserve for the kexec'ed kernel
> >
> > Signed-off-by: Matthew McClintock <msm@freescale.com>
> 
> Hi Matthew.
> 
> I don't understand.  Why does userspace need to know about the old
> memreserve sections?  Doesn't kexec tear down all of the old
> allocations anyway?  How are they relevant for constructing the dtb
> for the kexec kernel?  I'll need a lot more details before I consider
> merging this.
> 
> Also, please cc: me and Ben Herrenschmidt on powerpc related device
> tree changes.
> 
> Cheers,
> g. 

Grant,

Thanks for taking a look. My first thought was to just blow away all the
memreserve regions and start over. But, there are reserve regions for
other things that I might not want to blow away. For example, on mpc85xx
SMP systems we have an additional reserve region for our boot page. 

There is already precedence for exporting the initrd physical addresses
in the same fashion, which is mostly why I took this route. So instead I
just choose to find and replace the device tree and initrd reserve
regions.

I'm open to other ideas, just let me know. 

Regards,
Matthew
Grant Likely July 15, 2010, 4:22 p.m. UTC | #8
On Thu, Jul 15, 2010 at 9:19 AM, Matthew McClintock <msm@freescale.com> wrote:
> On Thu, 2010-07-15 at 00:21 -0600, Grant Likely wrote:
>> On Wed, Jul 14, 2010 at 9:18 AM, Matthew McClintock
>> <msm@freescale.com> wrote:
>> > To build a proper flat device tree for kexec we need to know which
>> > memreserve region was used for the device tree for the currently
>> > running kernel, so we can remove it and replace it with the new
>> > memreserve for the kexec'ed kernel
>> >
>> > Signed-off-by: Matthew McClintock <msm@freescale.com>
>>
>> Hi Matthew.
>>
>> I don't understand.  Why does userspace need to know about the old
>> memreserve sections?  Doesn't kexec tear down all of the old
>> allocations anyway?  How are they relevant for constructing the dtb
>> for the kexec kernel?  I'll need a lot more details before I consider
>> merging this.
>>
>> Also, please cc: me and Ben Herrenschmidt on powerpc related device
>> tree changes.
>>
>> Cheers,
>> g.
>
> Grant,
>
> Thanks for taking a look. My first thought was to just blow away all the
> memreserve regions and start over. But, there are reserve regions for
> other things that I might not want to blow away. For example, on mpc85xx
> SMP systems we have an additional reserve region for our boot page.

What is your starting point?  Where does the device tree (and
memreserve list) come from
that you're passing to kexec?  My first impression is that if you have
to scrub the memreserve list, then the source being used to
obtain the memreserves is either faulty or unsuitable to the task.

g.
Matthew McClintock July 15, 2010, 4:39 p.m. UTC | #9
On Thu, 2010-07-15 at 10:22 -0600, Grant Likely wrote:
> > Thanks for taking a look. My first thought was to just blow away all
> the
> > memreserve regions and start over. But, there are reserve regions
> for
> > other things that I might not want to blow away. For example, on
> mpc85xx
> > SMP systems we have an additional reserve region for our boot page.
> 
> What is your starting point?  Where does the device tree (and
> memreserve list) come from
> that you're passing to kexec?  My first impression is that if you have
> to scrub the memreserve list, then the source being used to
> obtain the memreserves is either faulty or unsuitable to the task. 

I'm pulling the device tree passed in via u-boot and passing it to
kexec. It is the most complete device tree and requires the least amount
of fixup.

I have to scrub two items, the ramdisk/initrd and the device tree
because upon kexec'ing the kernel we have the ability to pass in new
ramdisk/initrd and device tree. They can also live at different physical
addresses for the second reboot.

The initrd addresses are already exposed, so we can update/remove/reuse
that entry, we just need a way for kexec to determine the current device
tree address so it can replace the correct memreserve region for the
kexec'ing kernels' device tree.

The whole problem comes from repeatedly kexec'ing, we need to make sure
we don't keep losing blobs of memory to reserve regions (so we can't
just blindly add). We also need to make sure we don't lose other
memreserve regions that might be important for other things (so we can't
just blow them all away).

-M
Grant Likely July 15, 2010, 4:57 p.m. UTC | #10
On Thu, Jul 15, 2010 at 10:39 AM, Matthew McClintock <msm@freescale.com> wrote:
> On Thu, 2010-07-15 at 10:22 -0600, Grant Likely wrote:
>> > Thanks for taking a look. My first thought was to just blow away all
>> the
>> > memreserve regions and start over. But, there are reserve regions
>> for
>> > other things that I might not want to blow away. For example, on
>> mpc85xx
>> > SMP systems we have an additional reserve region for our boot page.
>>
>> What is your starting point?  Where does the device tree (and
>> memreserve list) come from
>> that you're passing to kexec?  My first impression is that if you have
>> to scrub the memreserve list, then the source being used to
>> obtain the memreserves is either faulty or unsuitable to the task.
>
> I'm pulling the device tree passed in via u-boot and passing it to
> kexec.

How?  (what mechanism?)  I hope you're not using the debugfs
flat-device-tree file.

> It is the most complete device tree and requires the least amount
> of fixup.
>
> I have to scrub two items, the ramdisk/initrd and the device tree
> because upon kexec'ing the kernel we have the ability to pass in new
> ramdisk/initrd and device tree. They can also live at different physical
> addresses for the second reboot.

This sounds like the model is backwards.  Rather than scrubbing items,
the memreserve list should be built up from a known good source.

> The initrd addresses are already exposed, so we can update/remove/reuse
> that entry, we just need a way for kexec to determine the current device
> tree address so it can replace the correct memreserve region for the
> kexec'ing kernels' device tree.
>
> The whole problem comes from repeatedly kexec'ing, we need to make sure
> we don't keep losing blobs of memory to reserve regions (so we can't
> just blindly add). We also need to make sure we don't lose other
> memreserve regions that might be important for other things (so we can't
> just blow them all away).

Right, so you need to have a known-good list of reserve sections.
Trying to go the other way sounds very fragile.

g.


>
> -M
>
>
>
Matthew McClintock July 15, 2010, 6:03 p.m. UTC | #11
On Thu, 2010-07-15 at 10:57 -0600, Grant Likely wrote:
> On Thu, Jul 15, 2010 at 10:39 AM, Matthew McClintock <msm@freescale.com> wrote:
> > On Thu, 2010-07-15 at 10:22 -0600, Grant Likely wrote:
> >> > Thanks for taking a look. My first thought was to just blow away all
> >> the
> >> > memreserve regions and start over. But, there are reserve regions
> >> for
> >> > other things that I might not want to blow away. For example, on
> >> mpc85xx
> >> > SMP systems we have an additional reserve region for our boot page.
> >>
> >> What is your starting point?  Where does the device tree (and
> >> memreserve list) come from
> >> that you're passing to kexec?  My first impression is that if you have
> >> to scrub the memreserve list, then the source being used to
> >> obtain the memreserves is either faulty or unsuitable to the task.
> >
> > I'm pulling the device tree passed in via u-boot and passing it to
> > kexec.
> 
> How?  (what mechanism?)  I hope you're not using the debugfs
> flat-device-tree file.

That is one way to get a good working copy. What is wrong with this
mechanism?

Should we duplicate everything u-boot does in kexec to build up a flat
device tree? Or is there another way to get a good tree? Ideally, we
don't make the end user manually edit a device tree.

> 
> > It is the most complete device tree and requires the least amount
> > of fixup.
> >
> > I have to scrub two items, the ramdisk/initrd and the device tree
> > because upon kexec'ing the kernel we have the ability to pass in new
> > ramdisk/initrd and device tree. They can also live at different physical
> > addresses for the second reboot.
> 
> This sounds like the model is backwards.  Rather than scrubbing items,
> the memreserve list should be built up from a known good source.

You can build one up yourself and it will still work out fine. Or you
can pull one from debugfs to get yourself started. Or you can pull it
every time. 

> 
> > The initrd addresses are already exposed, so we can update/remove/reuse
> > that entry, we just need a way for kexec to determine the current device
> > tree address so it can replace the correct memreserve region for the
> > kexec'ing kernels' device tree.
> >
> > The whole problem comes from repeatedly kexec'ing, we need to make sure
> > we don't keep losing blobs of memory to reserve regions (so we can't
> > just blindly add). We also need to make sure we don't lose other
> > memreserve regions that might be important for other things (so we can't
> > just blow them all away).
> 
> Right, so you need to have a known-good list of reserve sections.
> Trying to go the other way sounds very fragile.
> 

Yes. Where would we get a list of memreserve sections? Should we export
the reserve sections instead of the device tree location? We just need a
way to preserve what was there at boot to pass to the new kernel.

-M
Grant Likely July 15, 2010, 6:37 p.m. UTC | #12
On Thu, Jul 15, 2010 at 12:03 PM, Matthew McClintock <msm@freescale.com> wrote:
> On Thu, 2010-07-15 at 10:57 -0600, Grant Likely wrote:
>> On Thu, Jul 15, 2010 at 10:39 AM, Matthew McClintock <msm@freescale.com> wrote:
>> > On Thu, 2010-07-15 at 10:22 -0600, Grant Likely wrote:
>> >> > Thanks for taking a look. My first thought was to just blow away all
>> >> the
>> >> > memreserve regions and start over. But, there are reserve regions
>> >> for
>> >> > other things that I might not want to blow away. For example, on
>> >> mpc85xx
>> >> > SMP systems we have an additional reserve region for our boot page.
>> >>
>> >> What is your starting point?  Where does the device tree (and
>> >> memreserve list) come from
>> >> that you're passing to kexec?  My first impression is that if you have
>> >> to scrub the memreserve list, then the source being used to
>> >> obtain the memreserves is either faulty or unsuitable to the task.
>> >
>> > I'm pulling the device tree passed in via u-boot and passing it to
>> > kexec.
>>
>> How?  (what mechanism?)  I hope you're not using the debugfs
>> flat-device-tree file.
>
> That is one way to get a good working copy. What is wrong with this
> mechanism?

It's unstable.  It is in the debugfs, so there are no guarantees that
the ABI will remain the same.  Plus it doesn't reflect any changes
that the kernel may make to the device tree.  That interface is *debug
only*.  Do not use it.

> Should we duplicate everything u-boot does in kexec to build up a flat
> device tree? Or is there another way to get a good tree?

That is one option.  U-Boot really shouldn't be modifying the tree
very much anyway (I know on some platforms U-Boot is almost creating a
tree from scratch, but that is insane and an entirely different
discussion).  /proc/device-tree always gives the kernel's current view
of the tree.  You can use dtc to extract it and write it into a dtb.

> Ideally, we
> don't make the end user manually edit a device tree.

Of course not, any device tree manipulation is the job of the kexec
tools.  None of this should be manual.  However, the data source is a
significant and important question.

>> > It is the most complete device tree and requires the least amount
>> > of fixup.
>> >
>> > I have to scrub two items, the ramdisk/initrd and the device tree
>> > because upon kexec'ing the kernel we have the ability to pass in new
>> > ramdisk/initrd and device tree. They can also live at different physical
>> > addresses for the second reboot.
>>
>> This sounds like the model is backwards.  Rather than scrubbing items,
>> the memreserve list should be built up from a known good source.
>
> You can build one up yourself and it will still work out fine. Or you
> can pull one from debugfs to get yourself started. Or you can pull it
> every time.

What do you mean by "pull it every time"?

Out of curiosity, what is responsible for building up the memreserve
list?  The userspace portion, or the kernel portion of kexec?  Or is
it done by a totally separate program?

>> > The initrd addresses are already exposed, so we can update/remove/reuse
>> > that entry, we just need a way for kexec to determine the current device
>> > tree address so it can replace the correct memreserve region for the
>> > kexec'ing kernels' device tree.
>> >
>> > The whole problem comes from repeatedly kexec'ing, we need to make sure
>> > we don't keep losing blobs of memory to reserve regions (so we can't
>> > just blindly add). We also need to make sure we don't lose other
>> > memreserve regions that might be important for other things (so we can't
>> > just blow them all away).
>>
>> Right, so you need to have a known-good list of reserve sections.
>> Trying to go the other way sounds very fragile.
>>
>
> Yes. Where would we get a list of memreserve sections?

I would say the list of reserves that are not under the control of
Linux should be explicitly described in the device tree proper.  For
instance, if you have a region that firmware depends on, then have a
node for describing the firmware and a property stating the memory
regions that it depends on.  The memreserve regions can be generated
from that.

> Should we export
> the reserve sections instead of the device tree location?

It shouldn't really be something that the kernel is explicitly
exporting because it is a characteristic of the board design.  It is
something that belongs in the tree-proper.  ie. when you extract the
tree you have data telling what the region is, and why it is reserved.

> We just need a
> way to preserve what was there at boot to pass to the new kernel.

Yet there is no differentiation between the board-dictated memory
reserves and the things that U-Boot/Linux made an arbitrary decision
on.  The solution should focus not on "can I throw this one away?" but
rather "Is this one I should keep?"  :-)  A subtle difference, I know,
but it changes the way you approach the solution.

Cheers,
g.
Matthew McClintock July 15, 2010, 6:58 p.m. UTC | #13
On Thu, 2010-07-15 at 12:37 -0600, Grant Likely wrote:
> On Thu, Jul 15, 2010 at 12:03 PM, Matthew McClintock <msm@freescale.com> wrote:
> > On Thu, 2010-07-15 at 10:57 -0600, Grant Likely wrote:
> >> On Thu, Jul 15, 2010 at 10:39 AM, Matthew McClintock <msm@freescale.com> wrote:
> >> > On Thu, 2010-07-15 at 10:22 -0600, Grant Likely wrote:
> >> >> > Thanks for taking a look. My first thought was to just blow away all
> >> >> the
> >> >> > memreserve regions and start over. But, there are reserve regions
> >> >> for
> >> >> > other things that I might not want to blow away. For example, on
> >> >> mpc85xx
> >> >> > SMP systems we have an additional reserve region for our boot page.
> >> >>
> >> >> What is your starting point?  Where does the device tree (and
> >> >> memreserve list) come from
> >> >> that you're passing to kexec?  My first impression is that if you have
> >> >> to scrub the memreserve list, then the source being used to
> >> >> obtain the memreserves is either faulty or unsuitable to the task.
> >> >
> >> > I'm pulling the device tree passed in via u-boot and passing it to
> >> > kexec.
> >>
> >> How?  (what mechanism?)  I hope you're not using the debugfs
> >> flat-device-tree file.
> >
> > That is one way to get a good working copy. What is wrong with this
> > mechanism?
> 
> It's unstable.  It is in the debugfs, so there are no guarantees that
> the ABI will remain the same.  Plus it doesn't reflect any changes
> that the kernel may make to the device tree.  That interface is *debug
> only*.  Do not use it.

Ok.

> 
> > Should we duplicate everything u-boot does in kexec to build up a flat
> > device tree? Or is there another way to get a good tree?
> 
> That is one option.  U-Boot really shouldn't be modifying the tree
> very much anyway (I know on some platforms U-Boot is almost creating a
> tree from scratch, but that is insane and an entirely different
> discussion).  /proc/device-tree always gives the kernel's current view
> of the tree.  You can use dtc to extract it and write it into a dtb.

Ok wow, I've missed this completely. dtc to extract the device tree is a
very good option. I will pursue that line of thinking.

> 
> > Ideally, we
> > don't make the end user manually edit a device tree.
> 
> Of course not, any device tree manipulation is the job of the kexec
> tools.  None of this should be manual.  However, the data source is a
> significant and important question.

Ideally, we don't duplicate this in kexec and u-boot. Right now there is
nothing specific for say mpc85xx in kexec it's just ppc32. I would
prefer it stay this way.

> 
> >> > It is the most complete device tree and requires the least amount
> >> > of fixup.
> >> >
> >> > I have to scrub two items, the ramdisk/initrd and the device tree
> >> > because upon kexec'ing the kernel we have the ability to pass in new
> >> > ramdisk/initrd and device tree. They can also live at different physical
> >> > addresses for the second reboot.
> >>
> >> This sounds like the model is backwards.  Rather than scrubbing items,
> >> the memreserve list should be built up from a known good source.
> >
> > You can build one up yourself and it will still work out fine. Or you
> > can pull one from debugfs to get yourself started. Or you can pull it
> > every time.
> 
> What do you mean by "pull it every time"?

Exactly what you are saying is bad to do ;-P. Pull it from debugfs. But
the above "dts -I fs" solution practically fixes that issue.

> 
> Out of curiosity, what is responsible for building up the memreserve
> list?  The userspace portion, or the kernel portion of kexec?  Or is
> it done by a totally separate program?

Currently, neither. I have submitted patches for the user space tool to
fixup the memreserve regions.

> 
> >> > The initrd addresses are already exposed, so we can update/remove/reuse
> >> > that entry, we just need a way for kexec to determine the current device
> >> > tree address so it can replace the correct memreserve region for the
> >> > kexec'ing kernels' device tree.
> >> >
> >> > The whole problem comes from repeatedly kexec'ing, we need to make sure
> >> > we don't keep losing blobs of memory to reserve regions (so we can't
> >> > just blindly add). We also need to make sure we don't lose other
> >> > memreserve regions that might be important for other things (so we can't
> >> > just blow them all away).
> >>
> >> Right, so you need to have a known-good list of reserve sections.
> >> Trying to go the other way sounds very fragile.
> >>
> >
> > Yes. Where would we get a list of memreserve sections?
> 
> I would say the list of reserves that are not under the control of
> Linux should be explicitly described in the device tree proper.  For
> instance, if you have a region that firmware depends on, then have a
> node for describing the firmware and a property stating the memory
> regions that it depends on.  The memreserve regions can be generated
> from that.

Ok, so we could traverse the tree node-by-bode for a
persistent-memreserve property and add them to the /memreserve/ list in
the kexec user space tools?

> 
> > Should we export
> > the reserve sections instead of the device tree location?
> 
> It shouldn't really be something that the kernel is explicitly
> exporting because it is a characteristic of the board design.  It is
> something that belongs in the tree-proper.  ie. when you extract the
> tree you have data telling what the region is, and why it is reserved.

Agreed.

> 
> > We just need a
> > way to preserve what was there at boot to pass to the new kernel.
> 
> Yet there is no differentiation between the board-dictated memory
> reserves and the things that U-Boot/Linux made an arbitrary decision
> on.  The solution should focus not on "can I throw this one away?" but
> rather "Is this one I should keep?"  :-)  A subtle difference, I know,
> but it changes the way you approach the solution.

Fair enough. I think the above solution will work nicely, and I can
start implementing something if you agree - if I interpreted your idea
correctly. Although it should not require any changes to the kernel
proper.

-M
Grant Likely July 15, 2010, 7:18 p.m. UTC | #14
On Thu, Jul 15, 2010 at 12:58 PM, Matthew McClintock <msm@freescale.com> wrote:
> On Thu, 2010-07-15 at 12:37 -0600, Grant Likely wrote:
>> On Thu, Jul 15, 2010 at 12:03 PM, Matthew McClintock <msm@freescale.com> wrote:
>> > Yes. Where would we get a list of memreserve sections?
>>
>> I would say the list of reserves that are not under the control of
>> Linux should be explicitly described in the device tree proper.  For
>> instance, if you have a region that firmware depends on, then have a
>> node for describing the firmware and a property stating the memory
>> regions that it depends on.  The memreserve regions can be generated
>> from that.
>
> Ok, so we could traverse the tree node-by-bode for a
> persistent-memreserve property and add them to the /memreserve/ list in
> the kexec user space tools?

I *think* that is okay, but I'd like to hear from Segher, Ben, Mitch,
David Gibson, and other device tree experts on whether or not that
exact property naming is a good one.

Write up a proposed binding (you can use devicetree.org).  Post it for
review (make sure you cc: both devicetree-discuss and linuxppc-dev, as
well as cc'ing the people listed above.)

>> > Should we export
>> > the reserve sections instead of the device tree location?
>>
>> It shouldn't really be something that the kernel is explicitly
>> exporting because it is a characteristic of the board design.  It is
>> something that belongs in the tree-proper.  ie. when you extract the
>> tree you have data telling what the region is, and why it is reserved.
>
> Agreed.
>
>>
>> > We just need a
>> > way to preserve what was there at boot to pass to the new kernel.
>>
>> Yet there is no differentiation between the board-dictated memory
>> reserves and the things that U-Boot/Linux made an arbitrary decision
>> on.  The solution should focus not on "can I throw this one away?" but
>> rather "Is this one I should keep?"  :-)  A subtle difference, I know,
>> but it changes the way you approach the solution.
>
> Fair enough. I think the above solution will work nicely, and I can
> start implementing something if you agree - if I interpreted your idea
> correctly. Although it should not require any changes to the kernel
> proper.

Correct.

g.
Mitch Bradley July 16, 2010, 5:44 a.m. UTC | #15
Grant Likely wrote:
> On Thu, Jul 15, 2010 at 12:58 PM, Matthew McClintock <msm@freescale.com> wrote:
>   
>> On Thu, 2010-07-15 at 12:37 -0600, Grant Likely wrote:
>>     
>>> On Thu, Jul 15, 2010 at 12:03 PM, Matthew McClintock <msm@freescale.com> wrote:
>>>       
>>>> Yes. Where would we get a list of memreserve sections?
>>>>         
>>> I would say the list of reserves that are not under the control of
>>> Linux should be explicitly described in the device tree proper.  For
>>> instance, if you have a region that firmware depends on, then have a
>>> node for describing the firmware and a property stating the memory
>>> regions that it depends on.  The memreserve regions can be generated
>>> from that.
>>>       
>> Ok, so we could traverse the tree node-by-bode for a
>> persistent-memreserve property and add them to the /memreserve/ list in
>> the kexec user space tools?
>>     
>
> I *think* that is okay, but I'd like to hear from Segher, Ben, Mitch,
> David Gibson, and other device tree experts on whether or not that
> exact property naming is a good one.
>   

In the /memory node, the "reg" property specifies all of memory and the 
"available" property specifies those portions that the OS is permitted 
to use.  Subtracting "available" from "reg" gives you the regions that 
are used for other purposes, such as frame buffers or firmware needs.

Often the OS can just look at "available", as it typically wants to know 
what it can use, not what it can't.

The full size as given by "reg" is useful for system configuration 
reporting purposes - the user thinks he bought 2G of memory, so it's 
good to report that 2G is indeed installed in the system.  (As an aside, 
when I first invented Open Boot, 16M was a typical memory size.  I'm 
rather gratified that the overall device tree design has held up 
reasonably well over the scale factors that have happened since then.)

It would be possible to mark the "used" regions with a finer-grained 
distinction than "they are unavailable to the OS", but that quickly gets 
into the diminishing returns realm - a lot of trouble for fairly small 
incremental value. The PC BIOS "E820" memory description scheme has a 
few extra categories of memory.  The one category that seems like it 
might (just barely) be worth the effort is "temporarily used by firmware 
but reclaimable after a certain point" - but then you have to define 
rather carefully the reclamation time and conditions.

> Write up a proposed binding (you can use devicetree.org).  Post it for
> review (make sure you cc: both devicetree-discuss and linuxppc-dev, as
> well as cc'ing the people listed above.)
>
>   
>>>> Should we export
>>>> the reserve sections instead of the device tree location?
>>>>         
>>> It shouldn't really be something that the kernel is explicitly
>>> exporting because it is a characteristic of the board design.  It is
>>> something that belongs in the tree-proper.  ie. when you extract the
>>> tree you have data telling what the region is, and why it is reserved.
>>>       
>> Agreed.
>>
>>     
>>>> We just need a
>>>> way to preserve what was there at boot to pass to the new kernel.
>>>>         
>>> Yet there is no differentiation between the board-dictated memory
>>> reserves and the things that U-Boot/Linux made an arbitrary decision
>>> on.  The solution should focus not on "can I throw this one away?" but
>>> rather "Is this one I should keep?"  :-)  A subtle difference, I know,
>>> but it changes the way you approach the solution.
>>>       
>> Fair enough. I think the above solution will work nicely, and I can
>> start implementing something if you agree - if I interpreted your idea
>> correctly. Although it should not require any changes to the kernel
>> proper.
>>     
>
> Correct.
>
> g.
>
>
>
Segher Boessenkool July 17, 2010, 4:41 p.m. UTC | #16
>>>> Yes. Where would we get a list of memreserve sections?
>>>
>>> I would say the list of reserves that are not under the control of
>>> Linux should be explicitly described in the device tree proper.  For
>>> instance, if you have a region that firmware depends on, then have a
>>> node for describing the firmware and a property stating the memory
>>> regions that it depends on.  The memreserve regions can be generated
>>> from that.
>>
>> Ok, so we could traverse the tree node-by-bode for a
>> persistent-memreserve property and add them to the /memreserve/  
>> list in
>> the kexec user space tools?
>
> I *think* that is okay, but I'd like to hear from Segher, Ben, Mitch,
> David Gibson, and other device tree experts on whether or not that
> exact property naming is a good one.

Let's take a step back: what exactly _are_ "memreserve sections", what
are they used for, who (kernel, firmware, bootloader, etc.) holds what
responsibility wrt them?


Segher
Benjamin Herrenschmidt July 18, 2010, 11:41 p.m. UTC | #17
On Thu, 2010-07-15 at 00:21 -0600, Grant Likely wrote:
> On Wed, Jul 14, 2010 at 9:18 AM, Matthew McClintock <msm@freescale.com> wrote:
> > To build a proper flat device tree for kexec we need to know which
> > memreserve region was used for the device tree for the currently
> > running kernel, so we can remove it and replace it with the new
> > memreserve for the kexec'ed kernel
> >
> > Signed-off-by: Matthew McClintock <msm@freescale.com>
> 
> Hi Matthew.
> 
> I don't understand.  Why does userspace need to know about the old
> memreserve sections?  Doesn't kexec tear down all of the old
> allocations anyway?  How are they relevant for constructing the dtb
> for the kexec kernel?  I'll need a lot more details before I consider
> merging this.
> 
> Also, please cc: me and Ben Herrenschmidt on powerpc related device
> tree changes.

I admit to be the victim of a similar misunderstanding...

Care to explain in more details ? (with examples)

Cheers,
Ben.

> Cheers,
> g.
> 
> > ---
> > V4: Fixed misspelling
> >
> > V3: Remove unneeded cast, and fixed indentation screw up
> >
> > V2: messed up changes
> >
> >  arch/powerpc/kernel/prom.c |   45 ++++++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 45 insertions(+), 0 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> > index fd9359a..ff3e240 100644
> > --- a/arch/powerpc/kernel/prom.c
> > +++ b/arch/powerpc/kernel/prom.c
> > @@ -32,6 +32,7 @@
> >  #include <linux/debugfs.h>
> >  #include <linux/irq.h>
> >  #include <linux/lmb.h>
> > +#include <linux/bootmem.h>
> >
> >  #include <asm/prom.h>
> >  #include <asm/rtas.h>
> > @@ -911,3 +912,47 @@ static int __init export_flat_device_tree(void)
> >  }
> >  __initcall(export_flat_device_tree);
> >  #endif
> > +
> > +#ifdef CONFIG_KEXEC
> > +static phys_addr_t flat_dt_start;
> > +static phys_addr_t flat_dt_end;
> > +
> > +static struct property flat_dt_start_prop = {
> > +       .name = "linux,devicetree-start",
> > +       .length = sizeof(phys_addr_t),
> > +       .value = &flat_dt_start,
> > +};
> > +
> > +static struct property flat_dt_end_prop = {
> > +       .name = "linux,devicetree-end",
> > +       .length = sizeof(phys_addr_t),
> > +       .value = &flat_dt_end,
> > +};
> > +
> > +static int __init export_flat_device_tree_phys_addr(void)
> > +{
> > +       struct property *prop;
> > +       struct device_node *node;
> > +
> > +       node = of_find_node_by_path("/chosen");
> > +       if (!node)
> > +               return -ENOENT;
> > +
> > +       prop = of_find_property(node, "linux,devicetree-start", NULL);
> > +       if (prop)
> > +               prom_remove_property(node, prop);
> > +
> > +       prop = of_find_property(node, "linux,devicetree-end", NULL);
> > +       if (prop)
> > +               prom_remove_property(node, prop);
> > +
> > +       flat_dt_start = virt_to_phys(initial_boot_params);
> > +       flat_dt_end = virt_to_phys(initial_boot_params)
> > +                               + initial_boot_params->totalsize;
> > +       prom_add_property(node, &flat_dt_start_prop);
> > +       prom_add_property(node, &flat_dt_end_prop);
> > +
> > +       return 0;
> > +}
> > +__initcall(export_flat_device_tree_phys_addr);
> > +#endif
> > --
> > 1.6.6.1
> >
> >
> > _______________________________________________
> > Linuxppc-dev mailing list
> > Linuxppc-dev@lists.ozlabs.org
> > https://lists.ozlabs.org/listinfo/linuxppc-dev
> >
> 
> 
>
Benjamin Herrenschmidt July 19, 2010, 12:09 a.m. UTC | #18
On Thu, 2010-07-15 at 10:22 -0600, Grant Likely wrote:
> What is your starting point?  Where does the device tree (and
> memreserve list) come from
> that you're passing to kexec?  My first impression is that if you have
> to scrub the memreserve list, then the source being used to
> obtain the memreserves is either faulty or unsuitable to the task.

The kernel should ultimately pass the thing to userspace I reckon, with
an appropriate hook for platform code to insert/recover reserved
regions.

Ben.
Matthew McClintock July 19, 2010, 4:24 a.m. UTC | #19
On Jul 17, 2010, at 11:41 AM, Segher Boessenkool wrote:

>>>>> Yes. Where would we get a list of memreserve sections?
>>>> 
>>>> I would say the list of reserves that are not under the control of
>>>> Linux should be explicitly described in the device tree proper.  For
>>>> instance, if you have a region that firmware depends on, then have a
>>>> node for describing the firmware and a property stating the memory
>>>> regions that it depends on.  The memreserve regions can be generated
>>>> from that.
>>> 
>>> Ok, so we could traverse the tree node-by-bode for a
>>> persistent-memreserve property and add them to the /memreserve/ list in
>>> the kexec user space tools?
>> 
>> I *think* that is okay, but I'd like to hear from Segher, Ben, Mitch,
>> David Gibson, and other device tree experts on whether or not that
>> exact property naming is a good one.
> 
> Let's take a step back: what exactly _are_ "memreserve sections", what
> are they used for, who (kernel, firmware, bootloader, etc.) holds what
> responsibility wrt them?
> 

On the platform I'm working on (mpc85xx) they can be the following:

1) Boot page (aka cpu-release-addr) - always present on MP
2) Flat Device Tree - always present
3) Initrd - optional

When kexec'ing a kernel, we will provide new memory regions for the flat device tree and the initrd (if present). However, all others will not be replaced by kexec and should either be tossed or preserved. The question is how to decide what to do... save them all by default? Save none of them? If we save them all at a minimum we need to remove/replace the device tree and initrd regions as well. So we need a way to identify which regions correspond to these.

Grant and I talked and though a property that lives in the device tree describing a persistant-memreseve might work. And Mitch talked about an available memory property to go along with the current one which could be used to determine which ranges were invalid and need a memreserve for...

-Matthew

> 
> Segher
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
>
Matthew McClintock July 19, 2010, 4:28 a.m. UTC | #20
On Jul 18, 2010, at 6:41 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2010-07-15 at 00:21 -0600, Grant Likely wrote:
>> On Wed, Jul 14, 2010 at 9:18 AM, Matthew McClintock <msm@freescale.com> wrote:
>>> To build a proper flat device tree for kexec we need to know which
>>> memreserve region was used for the device tree for the currently
>>> running kernel, so we can remove it and replace it with the new
>>> memreserve for the kexec'ed kernel
>>> 
>>> Signed-off-by: Matthew McClintock <msm@freescale.com>
>> 
>> Hi Matthew.
>> 
>> I don't understand.  Why does userspace need to know about the old
>> memreserve sections?  Doesn't kexec tear down all of the old
>> allocations anyway?  How are they relevant for constructing the dtb
>> for the kexec kernel?  I'll need a lot more details before I consider
>> merging this.
>> 
>> Also, please cc: me and Ben Herrenschmidt on powerpc related device
>> tree changes.
> 
> I admit to be the victim of a similar misunderstanding...
> 
> Care to explain in more details ? (with examples)
> 

Upon first examining the details of getting kexec working on our platform I noticed our device tree passed from u-boot contained an additional memreserve section for the boot page. Subsequently, I've been trying to preserve the ones passed in for the kexec'ed kernel thinking anyone could add anything they wanted for their own particular needs and would quite possibly need those regions preserved across reboots.

Recently, I've discovered the boot page address is given in the device tree as a property. So, for our platform (mpc85xx) in particular, I'm back to not needing the read the old memreserve sections... I can completely reconstruct the appropriate memreserve regions for kexec'ing from the information passed in the device tree.

That isn't to say there might not be more memreserve regions that are not there for some valid reason. I'm not sure if we need to attempt to address another scenario where there are other memreserve regions. So this would be a good time to address this issue if anyone believes it is a worthwhile pursuit to have a mechanism to have memreserve regions persistent across kexec'ing - or any other reboot.

-Matthew


> Cheers,
> Ben.
> 
>> Cheers,
>> g.
>> 
>>> ---
>>> V4: Fixed misspelling
>>> 
>>> V3: Remove unneeded cast, and fixed indentation screw up
>>> 
>>> V2: messed up changes
>>> 
>>> arch/powerpc/kernel/prom.c |   45 ++++++++++++++++++++++++++++++++++++++++++++
>>> 1 files changed, 45 insertions(+), 0 deletions(-)
>>> 
>>> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
>>> index fd9359a..ff3e240 100644
>>> --- a/arch/powerpc/kernel/prom.c
>>> +++ b/arch/powerpc/kernel/prom.c
>>> @@ -32,6 +32,7 @@
>>> #include <linux/debugfs.h>
>>> #include <linux/irq.h>
>>> #include <linux/lmb.h>
>>> +#include <linux/bootmem.h>
>>> 
>>> #include <asm/prom.h>
>>> #include <asm/rtas.h>
>>> @@ -911,3 +912,47 @@ static int __init export_flat_device_tree(void)
>>> }
>>> __initcall(export_flat_device_tree);
>>> #endif
>>> +
>>> +#ifdef CONFIG_KEXEC
>>> +static phys_addr_t flat_dt_start;
>>> +static phys_addr_t flat_dt_end;
>>> +
>>> +static struct property flat_dt_start_prop = {
>>> +       .name = "linux,devicetree-start",
>>> +       .length = sizeof(phys_addr_t),
>>> +       .value = &flat_dt_start,
>>> +};
>>> +
>>> +static struct property flat_dt_end_prop = {
>>> +       .name = "linux,devicetree-end",
>>> +       .length = sizeof(phys_addr_t),
>>> +       .value = &flat_dt_end,
>>> +};
>>> +
>>> +static int __init export_flat_device_tree_phys_addr(void)
>>> +{
>>> +       struct property *prop;
>>> +       struct device_node *node;
>>> +
>>> +       node = of_find_node_by_path("/chosen");
>>> +       if (!node)
>>> +               return -ENOENT;
>>> +
>>> +       prop = of_find_property(node, "linux,devicetree-start", NULL);
>>> +       if (prop)
>>> +               prom_remove_property(node, prop);
>>> +
>>> +       prop = of_find_property(node, "linux,devicetree-end", NULL);
>>> +       if (prop)
>>> +               prom_remove_property(node, prop);
>>> +
>>> +       flat_dt_start = virt_to_phys(initial_boot_params);
>>> +       flat_dt_end = virt_to_phys(initial_boot_params)
>>> +                               + initial_boot_params->totalsize;
>>> +       prom_add_property(node, &flat_dt_start_prop);
>>> +       prom_add_property(node, &flat_dt_end_prop);
>>> +
>>> +       return 0;
>>> +}
>>> +__initcall(export_flat_device_tree_phys_addr);
>>> +#endif
>>> --
>>> 1.6.6.1
>>> 
>>> 
>>> _______________________________________________
>>> Linuxppc-dev mailing list
>>> Linuxppc-dev@lists.ozlabs.org
>>> https://lists.ozlabs.org/listinfo/linuxppc-dev
>>> 
>> 
>> 
>> 
> 
> 
>
Grant Likely July 19, 2010, 4:32 a.m. UTC | #21
On Sun, Jul 18, 2010 at 10:28 PM, Matthew McClintock <msm@freescale.com> wrote:
> On Jul 18, 2010, at 6:41 PM, Benjamin Herrenschmidt wrote:
>> On Thu, 2010-07-15 at 00:21 -0600, Grant Likely wrote:
>>> On Wed, Jul 14, 2010 at 9:18 AM, Matthew McClintock <msm@freescale.com> wrote:
>>>> To build a proper flat device tree for kexec we need to know which
>>>> memreserve region was used for the device tree for the currently
>>>> running kernel, so we can remove it and replace it with the new
>>>> memreserve for the kexec'ed kernel
>>>>
>>>> Signed-off-by: Matthew McClintock <msm@freescale.com>
>>>
>>> Hi Matthew.
>>>
>>> I don't understand.  Why does userspace need to know about the old
>>> memreserve sections?  Doesn't kexec tear down all of the old
>>> allocations anyway?  How are they relevant for constructing the dtb
>>> for the kexec kernel?  I'll need a lot more details before I consider
>>> merging this.
>>>
>>> Also, please cc: me and Ben Herrenschmidt on powerpc related device
>>> tree changes.
>>
>> I admit to be the victim of a similar misunderstanding...
>>
>> Care to explain in more details ? (with examples)
>>
>
> Upon first examining the details of getting kexec working on our platform I noticed our device tree passed from u-boot contained an additional memreserve section for the boot page. Subsequently, I've been trying to preserve the ones passed in for the kexec'ed kernel thinking anyone could add anything they wanted for their own particular needs and would quite possibly need those regions preserved across reboots.
>
> Recently, I've discovered the boot page address is given in the device tree as a property. So, for our platform (mpc85xx) in particular, I'm back to not needing the read the old memreserve sections... I can completely reconstruct the appropriate memreserve regions for kexec'ing from the information passed in the device tree.
>
> That isn't to say there might not be more memreserve regions that are not there for some valid reason. I'm not sure if we need to attempt to address another scenario where there are other memreserve regions. So this would be a good time to address this issue if anyone believes it is a worthwhile pursuit to have a mechanism to have memreserve regions persistent across kexec'ing - or any other reboot.

No, we haven't needed anything to date, so no sense trying to design a
solution for a theoretical need.  Leave it be for now.

g.
Matthew McClintock July 19, 2010, 4:34 a.m. UTC | #22
On Jul 18, 2010, at 7:09 PM, Benjamin Herrenschmidt wrote:

> On Thu, 2010-07-15 at 10:22 -0600, Grant Likely wrote:
>> What is your starting point?  Where does the device tree (and
>> memreserve list) come from
>> that you're passing to kexec?  My first impression is that if you have
>> to scrub the memreserve list, then the source being used to
>> obtain the memreserves is either faulty or unsuitable to the task.
> 
> The kernel should ultimately pass the thing to userspace I reckon, with
> an appropriate hook for platform code to insert/recover reserved
> regions.
> 

Or possibly in the device tree itself? As I mentioned in my previous email - my particular case can already be handled by the information passed in the device tree (as I recently discovered), is this something we would want to make generic for the device tree or add platform code to expose these memreserve regions?

-Matthew
Scott Wood July 19, 2010, 4:57 p.m. UTC | #23
On Sun, 18 Jul 2010 22:32:38 -0600
Grant Likely <grant.likely@secretlab.ca> wrote:

> On Sun, Jul 18, 2010 at 10:28 PM, Matthew McClintock <msm@freescale.com> wrote:
> > Upon first examining the details of getting kexec working on our platform I noticed our device tree passed from u-boot contained an additional memreserve section for the boot page. Subsequently, I've been trying to preserve the ones passed in for the kexec'ed kernel thinking anyone could add anything they wanted for their own particular needs and would quite possibly need those regions preserved across reboots.
> >
> > Recently, I've discovered the boot page address is given in the device tree as a property. So, for our platform (mpc85xx) in particular, I'm back to not needing the read the old memreserve sections... I can completely reconstruct the appropriate memreserve regions for kexec'ing from the information passed in the device tree.
> >
> > That isn't to say there might not be more memreserve regions that are not there for some valid reason. I'm not sure if we need to attempt to address another scenario where there are other memreserve regions. So this would be a good time to address this issue if anyone believes it is a worthwhile pursuit to have a mechanism to have memreserve regions persistent across kexec'ing - or any other reboot.
> 
> No, we haven't needed anything to date, so no sense trying to design a
> solution for a theoretical need.  Leave it be for now.

So why do we have memreserve at all?

If we honor it on the first boot, seems like we should keep that
information around on subsequent boots.  I wouldn't want to be the one
to have to debug when that theoretical need becomes actual. :-(

Or am I misunderstanding what this is trying to do?

-Scott
David Gibson July 30, 2010, 1:23 a.m. UTC | #24
On Thu, Jul 15, 2010 at 11:39:21AM -0500, Matthew McClintock wrote:
> On Thu, 2010-07-15 at 10:22 -0600, Grant Likely wrote:
> > > Thanks for taking a look. My first thought was to just blow away all
> > the
> > > memreserve regions and start over. But, there are reserve regions
> > for
> > > other things that I might not want to blow away. For example, on
> > mpc85xx
> > > SMP systems we have an additional reserve region for our boot page.
> > 
> > What is your starting point?  Where does the device tree (and
> > memreserve list) come from
> > that you're passing to kexec?  My first impression is that if you have
> > to scrub the memreserve list, then the source being used to
> > obtain the memreserves is either faulty or unsuitable to the task. 
> 
> I'm pulling the device tree passed in via u-boot and passing it to
> kexec. It is the most complete device tree and requires the least amount
> of fixup.
> 
> I have to scrub two items, the ramdisk/initrd and the device tree
> because upon kexec'ing the kernel we have the ability to pass in new
> ramdisk/initrd and device tree. They can also live at different
> physical addresses for the second reboot.

> The initrd addresses are already exposed, so we can
> update/remove/reuse that entry, we just need a way for kexec to
> determine the current device tree address so it can replace the
> correct memreserve region for the kexec'ing kernels' device tree.

Ok, be careful with this.  You do have the information you need, but
you might have to split an existing entry.  Having a single reserve
entry to cover the initrd would be typical, but it doesn't have to
happen that way - e.g. if a firmware reserves a big region for its own
purposes, and places the initrd within that region.

Also, the latest specs do *not* require the device tree itself to be
mem reserved.

> The whole problem comes from repeatedly kexec'ing, we need to make
> sure we don't keep losing blobs of memory to reserve regions (so we
> can't just blindly add). We also need to make sure we don't lose
> other memreserve regions that might be important for other things
> (so we can't just blow them all away).
David Gibson July 30, 2010, 1:38 a.m. UTC | #25
On Thu, Jul 15, 2010 at 01:18:21PM -0600, Grant Likely wrote:
> On Thu, Jul 15, 2010 at 12:58 PM, Matthew McClintock <msm@freescale.com> wrote:
> > On Thu, 2010-07-15 at 12:37 -0600, Grant Likely wrote:
> >> On Thu, Jul 15, 2010 at 12:03 PM, Matthew McClintock <msm@freescale.com> wrote:
> >> > Yes. Where would we get a list of memreserve sections?
> >>
> >> I would say the list of reserves that are not under the control of
> >> Linux should be explicitly described in the device tree proper.  For
> >> instance, if you have a region that firmware depends on, then have a
> >> node for describing the firmware and a property stating the memory
> >> regions that it depends on.  The memreserve regions can be generated
> >> from that.
> >
> > Ok, so we could traverse the tree node-by-bode for a
> > persistent-memreserve property and add them to the /memreserve/ list in
> > the kexec user space tools?

Well.. I don't think it should be this way as a matter of spec.  But
you could use a property as an interim stash for memreserve
information.

I agree that the precise defined semantics of the memreserve regions
is kind of fuzzy and non-obvious.  Here's how I believe they need to
work:

	memory in a reserved region must *never* be touched by the OS
(or subsequent kexec-invoked OSes) unless something else in the device
tree explicitly instructs it how

There already exist several mechanisms for instructing the OS to use
particular reserved regions for particular purposes: e.g. the initrd
properties, and the spin-table properties.  More such mechanisms might
be added in future ePAPR (or whatever) revisions.  But if the OS
version doesn't understand such a future mechanism, it must fall back
to assuming that the memory is reserved in perpetuity.

Now, some of these mechanisms (implicitly) permit the OS to re-use the
reserved memory after it's done using them as instructed (initrd is
the most obvious one).  In that case the OS can re-add the reserved
space to it's general pools, and excise it from the reserved space for
subsequent kexec()-style boots.  However that's (potentially) a more
complex process than just removing an entry - the initial firmware is
free to combine adjacent reserved regions into one reserve entry, or
even to cover a single reserved region with multiple entries.  So in
order to do this manipulation you will need an allocator of sorts that
does the region reservation/dereservation correctly handling the
semantics on a byte-by-byte basis.

You should also be careful that the regions you're handling do
actually lie in memory space.  Linux doesn't support this right now,
but I do have an experimental patch that allows the initrd properties
to point to (e.g.) flash instead of RAM.  In that case the initrd
wouldn't have to lie in an explicitly reserved region, and obviously
could not be returned to the general pool after use.

> I *think* that is okay, but I'd like to hear from Segher, Ben, Mitch,
> David Gibson, and other device tree experts on whether or not that
> exact property naming is a good one.
> 
> Write up a proposed binding (you can use devicetree.org).  Post it for
> review (make sure you cc: both devicetree-discuss and linuxppc-dev, as
> well as cc'ing the people listed above.)
> 
> >> > Should we export
> >> > the reserve sections instead of the device tree location?
> >>
> >> It shouldn't really be something that the kernel is explicitly
> >> exporting because it is a characteristic of the board design.  It is
> >> something that belongs in the tree-proper.  ie. when you extract the
> >> tree you have data telling what the region is, and why it is reserved.
> >
> > Agreed.
> >
> >>
> >> > We just need a
> >> > way to preserve what was there at boot to pass to the new kernel.
> >>
> >> Yet there is no differentiation between the board-dictated memory
> >> reserves and the things that U-Boot/Linux made an arbitrary decision
> >> on.  The solution should focus not on "can I throw this one away?" but
> >> rather "Is this one I should keep?"  :-)  A subtle difference, I know,
> >> but it changes the way you approach the solution.
> >
> > Fair enough. I think the above solution will work nicely, and I can
> > start implementing something if you agree - if I interpreted your idea
> > correctly. Although it should not require any changes to the kernel
> > proper.
> 
> Correct.
> 
> g.
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
>
diff mbox

Patch

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index fd9359a..ff3e240 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -32,6 +32,7 @@ 
 #include <linux/debugfs.h>
 #include <linux/irq.h>
 #include <linux/lmb.h>
+#include <linux/bootmem.h>
 
 #include <asm/prom.h>
 #include <asm/rtas.h>
@@ -911,3 +912,47 @@  static int __init export_flat_device_tree(void)
 }
 __initcall(export_flat_device_tree);
 #endif
+
+#ifdef CONFIG_KEXEC
+static phys_addr_t flat_dt_start;
+static phys_addr_t flat_dt_end;
+
+static struct property flat_dt_start_prop = {
+	.name = "linux,devicetree-start",
+	.length = sizeof(phys_addr_t),
+	.value = &flat_dt_start,
+};
+
+static struct property flat_dt_end_prop = {
+	.name = "linux,devicetree-end",
+	.length = sizeof(phys_addr_t),
+	.value = &flat_dt_end,
+};
+
+static int __init export_flat_device_tree_phys_addr(void)
+{
+	struct property *prop;
+	struct device_node *node;
+
+	node = of_find_node_by_path("/chosen");
+	if (!node)
+		return -ENOENT;
+
+	prop = of_find_property(node, "linux,devicetree-start", NULL);
+	if (prop)
+		prom_remove_property(node, prop);
+
+	prop = of_find_property(node, "linux,devicetree-end", NULL);
+	if (prop)
+		prom_remove_property(node, prop);
+
+	flat_dt_start = virt_to_phys(initial_boot_params);
+	flat_dt_end = virt_to_phys(initial_boot_params)
+				+ initial_boot_params->totalsize;
+	prom_add_property(node, &flat_dt_start_prop);
+	prom_add_property(node, &flat_dt_end_prop);
+
+	return 0;
+}
+__initcall(export_flat_device_tree_phys_addr);
+#endif