Patchwork [1/2] Allow QEMUMachine to override reset sequencing

login
register
mail settings
Submitter David Gibson
Date Aug. 2, 2012, 2:10 a.m.
Message ID <1343873409-8571-2-git-send-email-david@gibson.dropbear.id.au>
Download mbox | patch
Permalink /patch/174655/
State New
Headers show

Comments

David Gibson - Aug. 2, 2012, 2:10 a.m.
At present the qemu_system_reset() function always performs the same basic
actions on all machines.  This includes running all the reset handler
hooks, however the order in which they run is not controlled by the board
logic.

This is incorrect: any modern real hardware will generally have some sort
of reset controller, sometimes a full microcontroller or even service
processor which will control the order in which components on the board are
reset.

For one specific example, the machine level reset handlers may need to
re-establish certain things in memory to meet the emulated platform's
specified entry conditions, before reexecuting the guest software.
re-establish the correct specified entry conditions for the emulated
platform.  This must happen _after_ resetting peripheral devices, or they
could still be in the middle of DMA operations which would clobber the new
memory content.  However with the normal ordering of reset hooks, the
machine is not able to register a hook which will run after the normal
qdev reset registered by generic code.

This patch allows the machine to take control of the sequence of operations
during reset, by adding a new reset hook to the QEMUMachine.  This hook is
optional - without it we just use the same reset sequence as always.  That
logic is available in qemu_default_system_reset() to make it easy to
implement the case of the machine logic just needing to do some things
either before or after the normal reset.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/boards.h |    3 +++
 sysemu.h    |    1 +
 vl.c        |   10 +++++++++-
 3 files changed, 13 insertions(+), 1 deletion(-)
Anthony Liguori - Aug. 2, 2012, 2:37 a.m.
David Gibson <david@gibson.dropbear.id.au> writes:

> At present the qemu_system_reset() function always performs the same basic
> actions on all machines.  This includes running all the reset handler
> hooks, however the order in which they run is not controlled by the board
> logic.

Let's be careful here in referring to "board logic".

Reset should propagate--there's no argument there.  Having all devices
register in a list with no control over how reset is dispatched is
wrong.  It's workable because of qdev imposes per-bus reset functions
and typically we have a small number of busses and ordering doesn't
matter.

> This is incorrect: any modern real hardware will generally have some sort
> of reset controller, sometimes a full microcontroller or even service
> processor which will control the order in which components on the board are
> reset.

I think this is true only with a loose definition of "modern", but okay.

I don't think this is the right argument.  The machine should serve as
the entry point for reset.  What components get reset in what order will
need to be a machine hook for practical purposes (since not everything
will be fully QOMized all at once).

So I think this is the right approach for now.

But all of the DT initialization stuff that is leading to this
discussion in the first place is a gross hack to make a PV architecture
work that took far too many short cuts.

Building a DT in memory representing hardware instead of making things
discoverable is not how hardware works.  This sort of stuff is either
(1) hard coded in a firmware/flashrom or (2) built dynamically in
firmware.  Let's not pretend like we're doing this because it's needed
for real hardware.

Regards,

Anthony Liguori

> For one specific example, the machine level reset handlers may need to
> re-establish certain things in memory to meet the emulated platform's
> specified entry conditions, before reexecuting the guest software.
> re-establish the correct specified entry conditions for the emulated
> platform.  This must happen _after_ resetting peripheral devices, or they
> could still be in the middle of DMA operations which would clobber the new
> memory content.  However with the normal ordering of reset hooks, the
> machine is not able to register a hook which will run after the normal
> qdev reset registered by generic code.
>
> This patch allows the machine to take control of the sequence of operations
> during reset, by adding a new reset hook to the QEMUMachine.  This hook is
> optional - without it we just use the same reset sequence as always.  That
> logic is available in qemu_default_system_reset() to make it easy to
> implement the case of the machine logic just needing to do some things
> either before or after the normal reset.
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/boards.h |    3 +++
>  sysemu.h    |    1 +
>  vl.c        |   10 +++++++++-
>  3 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/hw/boards.h b/hw/boards.h
> index 59c01d0..f2bfd3e 100644
> --- a/hw/boards.h
> +++ b/hw/boards.h
> @@ -12,11 +12,14 @@ typedef void QEMUMachineInitFunc(ram_addr_t ram_size,
>                                   const char *initrd_filename,
>                                   const char *cpu_model);
>  
> +typedef void QEMUMachineResetFunc(bool report);
> +
>  typedef struct QEMUMachine {
>      const char *name;
>      const char *alias;
>      const char *desc;
>      QEMUMachineInitFunc *init;
> +    QEMUMachineResetFunc *reset;
>      int use_scsi;
>      int max_cpus;
>      unsigned int no_serial:1,
> diff --git a/sysemu.h b/sysemu.h
> index 6540c79..f74319b 100644
> --- a/sysemu.h
> +++ b/sysemu.h
> @@ -62,6 +62,7 @@ int qemu_powerdown_requested(void);
>  void qemu_system_killed(int signal, pid_t pid);
>  void qemu_kill_report(void);
>  extern qemu_irq qemu_system_powerdown;
> +void qemu_default_system_reset(bool report);
>  void qemu_system_reset(bool report);
>  
>  void qemu_add_exit_notifier(Notifier *notify);
> diff --git a/vl.c b/vl.c
> index 9fea320..ac47a7c 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -1396,7 +1396,7 @@ void qemu_unregister_reset(QEMUResetHandler *func, void *opaque)
>      }
>  }
>  
> -void qemu_system_reset(bool report)
> +void qemu_default_system_reset(bool report)
>  {
>      QEMUResetEntry *re, *nre;
>  
> @@ -1410,6 +1410,14 @@ void qemu_system_reset(bool report)
>      cpu_synchronize_all_post_reset();
>  }

Not a huge fan of the naming here.  How about qemu_devices_reset()?

> +void qemu_system_reset(bool report)
> +{
> +    if (current_machine->reset)
> +        current_machine->reset(report);
> +    else
> +        qemu_default_system_reset(report);
> +}

Missing curly braces.

Regards,

Anthony Liguori

> +
>  void qemu_system_reset_request(void)
>  {
>      if (no_reboot) {
> -- 
> 1.7.10.4
Benjamin Herrenschmidt - Aug. 2, 2012, 3:50 a.m.
On Wed, 2012-08-01 at 21:37 -0500, Anthony Liguori wrote:
> 
> But all of the DT initialization stuff that is leading to this
> discussion in the first place is a gross hack to make a PV
> architecture
> work that took far too many short cuts.
> 
> Building a DT in memory representing hardware instead of making things
> discoverable is not how hardware works. 

It depends :-)

Take a Power7 machine for example... the system & processor are actually
initialized by the service processor which loads some kind of similar
data structure into memory before starting the initial firmware :-)

So from the P7 itself point of view, it starts with a memory based data
structure pre-build.

Also some processors have a pretty much turing complete "POR engine"
which executes code from some kind of EEPROM to initialize the system,
cores & busses on reset, which can involve writing things to memory as
well (though it generally doesn't).

>  This sort of stuff is either
> (1) hard coded in a firmware/flashrom or (2) built dynamically in
> firmware.  Let's not pretend like we're doing this because it's needed
> for real hardware.

Doesn't matter, we do things like -kernel which means pre-loading the
kernel from qemu, even on x86. That doesn't match real HW either, but
it's convenient to have. But overall there are real HW reasons to
control the reset as well so the hook makes sense both ways.

Cheers,
Ben.
Lluís Vilanova - Aug. 2, 2012, 3 p.m.
David Gibson writes:
[...]
> diff --git a/vl.c b/vl.c
> index 9fea320..ac47a7c 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -1410,6 +1410,14 @@ void qemu_system_reset(bool report)
>      cpu_synchronize_all_post_reset();
>  }
 
> +void qemu_system_reset(bool report)
> +{
> +    if (current_machine->reset)
> +        current_machine->reset(report);
> +    else
> +        qemu_default_system_reset(report);
> +}
> +
>  void qemu_system_reset_request(void)
>  {
>      if (no_reboot) {

Nitpick: you're missing the braces.


Lluis
Paolo Bonzini - Aug. 2, 2012, 3:17 p.m.
Il 02/08/2012 05:50, Benjamin Herrenschmidt ha scritto:
> 
>> >  This sort of stuff is either
>> > (1) hard coded in a firmware/flashrom or (2) built dynamically in
>> > firmware.  Let's not pretend like we're doing this because it's needed
>> > for real hardware.
> Doesn't matter, we do things like -kernel which means pre-loading the
> kernel from qemu, even on x86. That doesn't match real HW either, but
> it's convenient to have. But overall there are real HW reasons to
> control the reset as well so the hook makes sense both ways.

On x86 we do not pre-load the kernel.  Neither the kernel/initramfs and
the option ROM that loads the kernel are written in memory, they are
passed to the guest via fw_cfg.  Then the option ROM is loaded by the
BIOS, and loads the kernel/initramfs.

Paolo
Benjamin Herrenschmidt - Aug. 2, 2012, 8:46 p.m.
On Thu, 2012-08-02 at 17:17 +0200, Paolo Bonzini wrote:
> 
> On x86 we do not pre-load the kernel.  Neither the kernel/initramfs
> and
> the option ROM that loads the kernel are written in memory, they are
> passed to the guest via fw_cfg.  Then the option ROM is loaded by the
> BIOS, and loads the kernel/initramfs.

How does it do ? IE. how does it access a random pair of files on the
host side ? Via some magic ports ?

Cheers,
Ben.
Anthony Liguori - Aug. 2, 2012, 8:58 p.m.
Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

> On Thu, 2012-08-02 at 17:17 +0200, Paolo Bonzini wrote:
>> 
>> On x86 we do not pre-load the kernel.  Neither the kernel/initramfs
>> and
>> the option ROM that loads the kernel are written in memory, they are
>> passed to the guest via fw_cfg.  Then the option ROM is loaded by the
>> BIOS, and loads the kernel/initramfs.
>
> How does it do ? IE. how does it access a random pair of files on the
> host side ? Via some magic ports ?

Yes, see hw/fw_cfg.c

Regards,

Anthony Liguori

>
> Cheers,
> Ben.
David Gibson - Aug. 3, 2012, 2:25 a.m.
On Thu, Aug 02, 2012 at 06:00:06PM +0300, Lluís Vilanova wrote:
> David Gibson writes:
> [...]
> > diff --git a/vl.c b/vl.c
> > index 9fea320..ac47a7c 100644
> > --- a/vl.c
> > +++ b/vl.c
> > @@ -1410,6 +1410,14 @@ void qemu_system_reset(bool report)
> >      cpu_synchronize_all_post_reset();
> >  }
>  
> > +void qemu_system_reset(bool report)
> > +{
> > +    if (current_machine->reset)
> > +        current_machine->reset(report);
> > +    else
> > +        qemu_default_system_reset(report);
> > +}
> > +
> >  void qemu_system_reset_request(void)
> >  {
> >      if (no_reboot) {
> 
> Nitpick: you're missing the braces.

Ah, oops, thank you.
David Gibson - Aug. 3, 2012, 2:54 a.m.
On Thu, Aug 02, 2012 at 05:17:27PM +0200, Paolo Bonzini wrote:
> Il 02/08/2012 05:50, Benjamin Herrenschmidt ha scritto:
> > 
> >> >  This sort of stuff is either
> >> > (1) hard coded in a firmware/flashrom or (2) built dynamically in
> >> > firmware.  Let's not pretend like we're doing this because it's needed
> >> > for real hardware.
> > Doesn't matter, we do things like -kernel which means pre-loading the
> > kernel from qemu, even on x86. That doesn't match real HW either, but
> > it's convenient to have. But overall there are real HW reasons to
> > control the reset as well so the hook makes sense both ways.
> 
> On x86 we do not pre-load the kernel.  Neither the kernel/initramfs and
> the option ROM that loads the kernel are written in memory, they are
> passed to the guest via fw_cfg.  Then the option ROM is loaded by the
> BIOS, and loads the kernel/initramfs.

On x86 (or rather PC platform), yes.  But spapr is far from the only
one to load the kernel images direct into RAM - arm_load_kernel() does
it too, so does milkymist, sun4[mu] and a bunch of others.
David Gibson - Aug. 3, 2012, 3:08 a.m.
On Wed, Aug 01, 2012 at 09:37:39PM -0500, Anthony Liguori wrote:
> David Gibson <david@gibson.dropbear.id.au> writes:
> 
> > At present the qemu_system_reset() function always performs the same basic
> > actions on all machines.  This includes running all the reset handler
> > hooks, however the order in which they run is not controlled by the board
> > logic.
> 
> Let's be careful here in referring to "board logic".
> 
> Reset should propagate--there's no argument there.  Having all devices
> register in a list with no control over how reset is dispatched is
> wrong.  It's workable because of qdev imposes per-bus reset functions
> and typically we have a small number of busses and ordering doesn't
> matter.
> 
> > This is incorrect: any modern real hardware will generally have some sort
> > of reset controller, sometimes a full microcontroller or even service
> > processor which will control the order in which components on the board are
> > reset.
> 
> I think this is true only with a loose definition of "modern", but okay.
> 
> I don't think this is the right argument.  The machine should serve as
> the entry point for reset.  What components get reset in what order will
> need to be a machine hook for practical purposes (since not everything
> will be fully QOMized all at once).
> 
> So I think this is the right approach for now.
> 
> But all of the DT initialization stuff that is leading to this
> discussion in the first place is a gross hack to make a PV architecture
> work that took far too many short cuts.

Hrm.  The "short cuts" you seem to refer to is the fact that we do a
bunch of our machine initialization direct from qemu, rather than from
firmware executed within the guest.  Doing it this way is both easier
to write, easier to follow and faster to execute, so if that counts as
a gross hack then I think your design priorities need
re-consideration.

Because of the arcane history of the PC boot sequence I can see why
executing a BIOS image within the guest is the preferred model there,
but that's no reason to put the same restriction on platforms that
have a clear and well documented firmware to OS handoff state.

[snip]
> > -void qemu_system_reset(bool report)
> > +void qemu_default_system_reset(bool report)
> >  {
> >      QEMUResetEntry *re, *nre;
> >  
> > @@ -1410,6 +1410,14 @@ void qemu_system_reset(bool report)
> >      cpu_synchronize_all_post_reset();
> >  }
> 
> Not a huge fan of the naming here.  How about qemu_devices_reset()?

Ok, I'll change that.

Patch

diff --git a/hw/boards.h b/hw/boards.h
index 59c01d0..f2bfd3e 100644
--- a/hw/boards.h
+++ b/hw/boards.h
@@ -12,11 +12,14 @@  typedef void QEMUMachineInitFunc(ram_addr_t ram_size,
                                  const char *initrd_filename,
                                  const char *cpu_model);
 
+typedef void QEMUMachineResetFunc(bool report);
+
 typedef struct QEMUMachine {
     const char *name;
     const char *alias;
     const char *desc;
     QEMUMachineInitFunc *init;
+    QEMUMachineResetFunc *reset;
     int use_scsi;
     int max_cpus;
     unsigned int no_serial:1,
diff --git a/sysemu.h b/sysemu.h
index 6540c79..f74319b 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -62,6 +62,7 @@  int qemu_powerdown_requested(void);
 void qemu_system_killed(int signal, pid_t pid);
 void qemu_kill_report(void);
 extern qemu_irq qemu_system_powerdown;
+void qemu_default_system_reset(bool report);
 void qemu_system_reset(bool report);
 
 void qemu_add_exit_notifier(Notifier *notify);
diff --git a/vl.c b/vl.c
index 9fea320..ac47a7c 100644
--- a/vl.c
+++ b/vl.c
@@ -1396,7 +1396,7 @@  void qemu_unregister_reset(QEMUResetHandler *func, void *opaque)
     }
 }
 
-void qemu_system_reset(bool report)
+void qemu_default_system_reset(bool report)
 {
     QEMUResetEntry *re, *nre;
 
@@ -1410,6 +1410,14 @@  void qemu_system_reset(bool report)
     cpu_synchronize_all_post_reset();
 }
 
+void qemu_system_reset(bool report)
+{
+    if (current_machine->reset)
+        current_machine->reset(report);
+    else
+        qemu_default_system_reset(report);
+}
+
 void qemu_system_reset_request(void)
 {
     if (no_reboot) {