Patchwork [5/5] Port apic to new VMState design

login
register
mail settings
Submitter Juan Quintela
Date Aug. 18, 2009, 1:34 p.m.
Message ID <a3855a79d668644c469314af26cc3b4e67ae0054.1250601335.git.quintela@redhat.com>
Download mbox | patch
Permalink /patch/31560/
State Superseded
Headers show

Comments

Juan Quintela - Aug. 18, 2009, 1:34 p.m.
Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 hw/apic.c |   96 +++++++++++++++++-------------------------------------------
 1 files changed, 27 insertions(+), 69 deletions(-)
Reimar Döffinger - Aug. 18, 2009, 2:24 p.m.
Hello,
sorry for replying in the middle of the thread, I was to fast and
deleted the other mails already.
And just in case I mention I am new around here, so feel free to ignore
me if you feel I am completely wrong.
One thing I don't like too much about it is that you can't really handle
"calculated" fields. It seems qemu does not use this (much/yet?), but
it seems good design to me that when a device emulation can handle
multiple devices, you store in one place the device name but to avoid
switch...case in lots of places you also have capability flags stored
somewhere.
Saving both seems a bit like a bad design: the value of one implies the
exact value of the other, so it is at least pointless.
More importantly (though I do not know if qemu intends to care about
this) it might be able to hand-craft a saved vm that after loading then
violates some assumptions of the emulation code, possibly being
exploitable.
If nothing else, I'd at least add support for a "verify" function that
gets a "const state *" and can abort loading the VM in case someone
tries something evil (or can print some useful hint instead of having
qemu crash silently on the user, possibly at some later time).
And yes I see that today almost nothing (of what I saw) verifies
anything, but it feels wrong to me to code that into a API design.

Greetings,
Reimar Döffinger
Reimar Döffinger - Aug. 18, 2009, 3:21 p.m.
On Tue, Aug 18, 2009 at 04:41:42PM +0200, Juan Quintela wrote:
> Reimar Döffinger <Reimar.Doeffinger@gmx.de> wrote:
> > Hello,
> > sorry for replying in the middle of the thread, I was to fast and
> > deleted the other mails already.
> > And just in case I mention I am new around here, so feel free to ignore
> > me if you feel I am completely wrong.
> > One thing I don't like too much about it is that you can't really handle
> > "calculated" fields.
> 
> Calculated fields are not by definition part of the state :)
> The state are the other fields that are used to save the state.  I
> haven't yet seen calculade fields (but I haven't looked throughfully
> yet).  With the current design, basically you can only save things that
> are in one struct (the way it is stored is an offset against a base
> address).

I am not sure we 100% understand each other, so I maybe tell the
specific example.
I made a change to the eepro100 driver to fix dumping the network
statistics.
The main problem is, depending on which device you emulate, the size of
the statistics struct changes.
Since it looked ugly etc. I decided not to calculate the size of these
statistics each time but instead save it in the device state.
But instead of adding it to save_vm and load_vm (which also would change
the version) I just set that statistics size again according to which
device is emulated. This also assures that the emulated device and
the statistics size always fit together, even if someone fiddles with
the saved state.
The "problem" with your approach if I understand it right is that I
couldn't do that since the device never knows when it would have to
re-fill these fields.
Basically what I am asking is if you couldn't just add an optional
callback so some additional stuff can be done after the "basic" state
has been loaded - or if that isn't desired at least a callback that
allows verifying the loaded values and aborting execution.

> > Saving both seems a bit like a bad design: the value of one implies the
> > exact value of the other, so it is at least pointless.
> > More importantly (though I do not know if qemu intends to care about
> > this) it might be able to hand-craft a saved vm that after loading then
> > violates some assumptions of the emulation code, possibly being
> > exploitable.
> 
> It is already that way.  This design don't change anything.  And I am
> not sure how to fix it.  We don't have a "is this value safe for this
> field", around yet.  It is possible to add some support for it, but I
> would like to 1st have an use case.

Well, I meant nowadays it is just possible to add a check in load_vm and
fix any values that are off. While it is quite a bit of work there is
nothing in the API stopping you from doing it, you even can return
-EINVAL and hopefully the core will print some somewhat useful message.

> > If nothing else, I'd at least add support for a "verify" function that
> > gets a "const state *" and can abort loading the VM in case someone
> > tries something evil (or can print some useful hint instead of having
> > qemu crash silently on the user, possibly at some later time).
> 
> This is as different problem that is not solved in qemu either.  I agree
> that it would be nice to have such a function, but I am not sure that I
> know how to do it.  and what is worse.  if you can modify the image, you
> can always change anything in the middle of the RAM.  I don't really see
> too much point trying to get a verify function for devices, when we
> can't have it for RAM.

That is completely different from what I meant.
Changing the RAM compromises the VM and only the VM, an exploit in a
device emulation might allow to compromise the _host_. Is it now clearer
what I meant?

Greetings,
Reimar Döffinger
Juan Quintela - Aug. 18, 2009, 3:38 p.m.
Reply-to: quintela@redhat.com
Reimar Döffinger <Reimar.Doeffinger@gmx.de> wrote:
> On Tue, Aug 18, 2009 at 04:41:42PM +0200, Juan Quintela wrote:
>> Reimar Döffinger <Reimar.Doeffinger@gmx.de> wrote:
>> > Hello,
>> > sorry for replying in the middle of the thread, I was to fast and
>> > deleted the other mails already.
>> > And just in case I mention I am new around here, so feel free to ignore
>> > me if you feel I am completely wrong.
>> > One thing I don't like too much about it is that you can't really handle
>> > "calculated" fields.
>> 
>> Calculated fields are not by definition part of the state :)
>> The state are the other fields that are used to save the state.  I
>> haven't yet seen calculade fields (but I haven't looked throughfully
>> yet).  With the current design, basically you can only save things that
>> are in one struct (the way it is stored is an offset against a base
>> address).
>
> I am not sure we 100% understand each other, so I maybe tell the
> specific example.
> I made a change to the eepro100 driver to fix dumping the network
> statistics.
> The main problem is, depending on which device you emulate, the size of
> the statistics struct changes.
> Since it looked ugly etc. I decided not to calculate the size of these
> statistics each time but instead save it in the device state.
> But instead of adding it to save_vm and load_vm (which also would change
> the version) I just set that statistics size again according to which
> device is emulated. This also assures that the emulated device and
> the statistics size always fit together, even if someone fiddles with
> the saved state.
> The "problem" with your approach if I understand it right is that I
> couldn't do that since the device never knows when it would have to
> re-fill these fields.
> Basically what I am asking is if you couldn't just add an optional
> callback so some additional stuff can be done after the "basic" state
> has been loaded - or if that isn't desired at least a callback that
> allows verifying the loaded values and aborting execution.

Ah, ok.  I guess we are going to need that callbacks, one before we load
state, and another one after we load it.

Are your changes on upstream hw/eepro100.c?  I can't see anything there
that can't be done in a table approach.

>> It is already that way.  This design don't change anything.  And I am
>> not sure how to fix it.  We don't have a "is this value safe for this
>> field", around yet.  It is possible to add some support for it, but I
>> would like to 1st have an use case.
>
> Well, I meant nowadays it is just possible to add a check in load_vm and
> fix any values that are off. While it is quite a bit of work there is
> nothing in the API stopping you from doing it, you even can return
> -EINVAL and hopefully the core will print some somewhat useful message.

I guess we are going to have an optional callback to be called
before/after loading the state.  You should be able to put your verify
there.

>> > If nothing else, I'd at least add support for a "verify" function that
>> > gets a "const state *" and can abort loading the VM in case someone
>> > tries something evil (or can print some useful hint instead of having
>> > qemu crash silently on the user, possibly at some later time).
>> 
>> This is as different problem that is not solved in qemu either.  I agree
>> that it would be nice to have such a function, but I am not sure that I
>> know how to do it.  and what is worse.  if you can modify the image, you
>> can always change anything in the middle of the RAM.  I don't really see
>> too much point trying to get a verify function for devices, when we
>> can't have it for RAM.
>
> That is completely different from what I meant.
> Changing the RAM compromises the VM and only the VM, an exploit in a
> device emulation might allow to compromise the _host_. Is it now clearer
> what I meant?

yes, I see where you are meaning now.  But I guess that one is needed to
be solved, not only for migration.  Not sure what to do about this.

Later, Juan.
Reimar Döffinger - Aug. 18, 2009, 4:06 p.m.
On Tue, Aug 18, 2009 at 05:38:57PM +0200, Juan Quintela wrote:
> Are your changes on upstream hw/eepro100.c?  I can't see anything there
> that can't be done in a table approach.

No, so far noone got around to taking my patches apart (and that is
actually one I have not yet properly submitted, it is mangled into that
patch: http://article.gmane.org/gmane.comp.emulators.qemu/49853

> >> It is already that way.  This design don't change anything.  And I am
> >> not sure how to fix it.  We don't have a "is this value safe for this
> >> field", around yet.  It is possible to add some support for it, but I
> >> would like to 1st have an use case.
> >
> > Well, I meant nowadays it is just possible to add a check in load_vm and
> > fix any values that are off. While it is quite a bit of work there is
> > nothing in the API stopping you from doing it, you even can return
> > -EINVAL and hopefully the core will print some somewhat useful message.
> 
> I guess we are going to have an optional callback to be called
> before/after loading the state.  You should be able to put your verify
> there.

Maybe I'm silly, but what would the callback for before loading state be
good for?

> > That is completely different from what I meant.
> > Changing the RAM compromises the VM and only the VM, an exploit in a
> > device emulation might allow to compromise the _host_. Is it now clearer
> > what I meant?
> 
> yes, I see where you are meaning now.  But I guess that one is needed to
> be solved, not only for migration.  Not sure what to do about this.

I think it is mostly leg-work of finding the assumptions the emulations
do. That really should be left to maintainers where available IMO.
I'm just suggesting that it's better to design the API in a way that
doesn't further discourage fixing this :-).
If the patch is close to being accepted maybe I can help out by writing
such verification code for vmware_vga, there e.g. depth, bypp, wred,
wgreen and wblue must fit together as well as
width/height/new_width/new_height and fb_size (I think)
and width/height/bypp must be limit to ensure no integer overflows...
Juan Quintela - Aug. 18, 2009, 4:37 p.m.
Reply-to: quintela@redhat.com
Reimar Döffinger <Reimar.Doeffinger@gmx.de> wrote:
> On Tue, Aug 18, 2009 at 05:38:57PM +0200, Juan Quintela wrote:
>> Are your changes on upstream hw/eepro100.c?  I can't see anything there
>> that can't be done in a table approach.
>
> No, so far noone got around to taking my patches apart (and that is
> actually one I have not yet properly submitted, it is mangled into that
> patch: http://article.gmane.org/gmane.comp.emulators.qemu/49853
>
>> >> It is already that way.  This design don't change anything.  And I am
>> >> not sure how to fix it.  We don't have a "is this value safe for this
>> >> field", around yet.  It is possible to add some support for it, but I
>> >> would like to 1st have an use case.
>> >
>> > Well, I meant nowadays it is just possible to add a check in load_vm and
>> > fix any values that are off. While it is quite a bit of work there is
>> > nothing in the API stopping you from doing it, you even can return
>> > -EINVAL and hopefully the core will print some somewhat useful message.
>> 
>> I guess we are going to have an optional callback to be called
>> before/after loading the state.  You should be able to put your verify
>> there.
>
> Maybe I'm silly, but what would the callback for before loading state be
> good for?

qemu-kvm has in-kernel apic and pit (at least).  You just need to sync
state with the kernel after loading (the other way for saving).

>> > That is completely different from what I meant.
>> > Changing the RAM compromises the VM and only the VM, an exploit in a
>> > device emulation might allow to compromise the _host_. Is it now clearer
>> > what I meant?
>> 
>> yes, I see where you are meaning now.  But I guess that one is needed to
>> be solved, not only for migration.  Not sure what to do about this.
>
> I think it is mostly leg-work of finding the assumptions the emulations
> do. That really should be left to maintainers where available IMO.
> I'm just suggesting that it's better to design the API in a way that
> doesn't further discourage fixing this :-).
> If the patch is close to being accepted maybe I can help out by writing
> such verification code for vmware_vga, there e.g. depth, bypp, wred,
> wgreen and wblue must fit together as well as
> width/height/new_width/new_height and fb_size (I think)
> and width/height/bypp must be limit to ensure no integer overflows...

If you sent a patch, I will take a look.

Later, Juan.
Gerd Hoffmann - Aug. 19, 2009, 8 a.m.
> Basically what I am asking is if you couldn't just add an optional
> callback so some additional stuff can be done after the "basic" state
> has been loaded - or if that isn't desired at least a callback that
> allows verifying the loaded values and aborting execution.

I think we are going to need post-processing callbacks anyway.  Several 
drivers have to do some actions after loading the state information. 
There you'll be able to set the stats size and also perform sanity 
checks on the loaded state.

> That is completely different from what I meant.
> Changing the RAM compromises the VM and only the VM, an exploit in a
> device emulation might allow to compromise the _host_. Is it now clearer
> what I meant?

When you are able modify the savevm state you already have access to the 
host ...

cheers,
   Gerd
Reimar Döffinger - Aug. 19, 2009, 9:10 a.m.
On Wed, Aug 19, 2009 at 10:00:01AM +0200, Gerd Hoffmann wrote:
> > Basically what I am asking is if you couldn't just add an optional
> > callback so some additional stuff can be done after the "basic" state
> > has been loaded - or if that isn't desired at least a callback that
> > allows verifying the loaded values and aborting execution.
> 
> I think we are going to need post-processing callbacks anyway.  Several 
> drivers have to do some actions after loading the state information. 
> There you'll be able to set the stats size and also perform sanity 
> checks on the loaded state.

I don't care what they are called, I was just pointing out that the
patch was not implementing any of it.
And I by now say that e.g. vmware_vga needs it already in the current
implementation, it checks for example that the depth of the current
display and the one when it was saved match.

> > That is completely different from what I meant.
> > Changing the RAM compromises the VM and only the VM, an exploit in a
> > device emulation might allow to compromise the _host_. Is it now clearer
> > what I meant?
> 
> When you are able modify the savevm state you already have access to the 
> host ...

Huh? Being able to modify the savevm state is not the same as being able
to run arbitrary code on the host. At least ideally it shouldn't be.
Currently there is no way you could even consider running a savevm from
an untrusted source, but I think that is just because of qemu's current
implementation, not because it has to be.
Reimar Döffinger - Aug. 19, 2009, 9:16 a.m.
On Wed, Aug 19, 2009 at 11:07:19AM +0200, Gerd Hoffmann wrote:
> >> When you are able modify the savevm state you already have access to the
> >> host ...
> >
> > Huh? Being able to modify the savevm state is not the same as being able
> > to run arbitrary code on the host.
> 
> Yes, in theory.  And in practice?  What is the point in allowing remote 
> write access to savevm state?

E.g. migration between entities that do not 100% trust each other?
Or debugging, a user does savevm and a developer can look at it and
debug after loadvm?

> > Currently there is no way you could even consider running a savevm from
> > an untrusted source, but I think that is just because of qemu's current
> > implementation, not because it has to be.
> 
> Getting that right is a pretty big job though ...

I said that already, but I don't think that's a valid excuse to not
consider that _for the design of a new API_. Unless you enjoy designing
a new API every few months...

Patch

diff --git a/hw/apic.c b/hw/apic.c
index 1927811..2c1dbf4 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -864,75 +864,33 @@  static void apic_mem_writel(void *opaque, target_phys_addr_t addr, uint32_t val)
     }
 }

-static void apic_save(QEMUFile *f, void *opaque)
-{
-    APICState *s = opaque;
-    int i;
-
-    qemu_put_be32s(f, &s->apicbase);
-    qemu_put_8s(f, &s->id);
-    qemu_put_8s(f, &s->arb_id);
-    qemu_put_8s(f, &s->tpr);
-    qemu_put_be32s(f, &s->spurious_vec);
-    qemu_put_8s(f, &s->log_dest);
-    qemu_put_8s(f, &s->dest_mode);
-    for (i = 0; i < 8; i++) {
-        qemu_put_be32s(f, &s->isr[i]);
-        qemu_put_be32s(f, &s->tmr[i]);
-        qemu_put_be32s(f, &s->irr[i]);
-    }
-    for (i = 0; i < APIC_LVT_NB; i++) {
-        qemu_put_be32s(f, &s->lvt[i]);
+static VMStateDescription vmstate_apic = {
+    .name = "apic",
+    .version_id = 3,
+    .minimum_version_id = 3,
+    .fields      = (VMStateField []) {
+        VMSTATE_UINT32(apicbase, APICState, 0),
+        VMSTATE_UINT8(id, APICState, 0),
+        VMSTATE_UINT8(arb_id, APICState, 0),
+        VMSTATE_UINT8(tpr, APICState, 0),
+        VMSTATE_UINT32(spurious_vec, APICState, 0),
+        VMSTATE_UINT8(log_dest, APICState, 0),
+        VMSTATE_UINT8(dest_mode, APICState, 0),
+        VMSTATE_UINT32_ARRAY(isr, APICState, 0, 8),
+        VMSTATE_UINT32_ARRAY(tmr, APICState, 0, 8),
+        VMSTATE_UINT32_ARRAY(irr, APICState, 0, 8),
+        VMSTATE_UINT32_ARRAY(lvt, APICState, 0, APIC_LVT_NB),
+        VMSTATE_UINT32(esr, APICState, 0),
+        VMSTATE_UINT32_ARRAY(icr, APICState, 0, 2),
+        VMSTATE_UINT32(divide_conf, APICState, 0),
+        VMSTATE_INT32(count_shift, APICState, 0),
+        VMSTATE_UINT32(initial_count, APICState, 0),
+        VMSTATE_INT64(initial_count_load_time, APICState, 0),
+        VMSTATE_INT64(next_time, APICState, 0),
+        VMSTATE_TIMER(timer, APICState, 2),
+        VMSTATE_END_OF_LIST()
     }
-    qemu_put_be32s(f, &s->esr);
-    qemu_put_be32s(f, &s->icr[0]);
-    qemu_put_be32s(f, &s->icr[1]);
-    qemu_put_be32s(f, &s->divide_conf);
-    qemu_put_be32(f, s->count_shift);
-    qemu_put_be32s(f, &s->initial_count);
-    qemu_put_be64(f, s->initial_count_load_time);
-    qemu_put_be64(f, s->next_time);
-
-    qemu_put_timer(f, s->timer);
-}
-
-static int apic_load(QEMUFile *f, void *opaque, int version_id)
-{
-    APICState *s = opaque;
-    int i;
-
-    if (version_id > 2)
-        return -EINVAL;
-
-    /* XXX: what if the base changes? (registered memory regions) */
-    qemu_get_be32s(f, &s->apicbase);
-    qemu_get_8s(f, &s->id);
-    qemu_get_8s(f, &s->arb_id);
-    qemu_get_8s(f, &s->tpr);
-    qemu_get_be32s(f, &s->spurious_vec);
-    qemu_get_8s(f, &s->log_dest);
-    qemu_get_8s(f, &s->dest_mode);
-    for (i = 0; i < 8; i++) {
-        qemu_get_be32s(f, &s->isr[i]);
-        qemu_get_be32s(f, &s->tmr[i]);
-        qemu_get_be32s(f, &s->irr[i]);
-    }
-    for (i = 0; i < APIC_LVT_NB; i++) {
-        qemu_get_be32s(f, &s->lvt[i]);
-    }
-    qemu_get_be32s(f, &s->esr);
-    qemu_get_be32s(f, &s->icr[0]);
-    qemu_get_be32s(f, &s->icr[1]);
-    qemu_get_be32s(f, &s->divide_conf);
-    s->count_shift=qemu_get_be32(f);
-    qemu_get_be32s(f, &s->initial_count);
-    s->initial_count_load_time=qemu_get_be64(f);
-    s->next_time=qemu_get_be64(f);
-
-    if (version_id >= 2)
-        qemu_get_timer(f, s->timer);
-    return 0;
-}
+};

 static void apic_reset(void *opaque)
 {
@@ -996,7 +954,7 @@  int apic_init(CPUState *env)
     }
     s->timer = qemu_new_timer(vm_clock, apic_timer, s);

-    register_savevm("apic", s->idx, 2, apic_save, apic_load, s);
+    vmstate_register(s->idx, &vmstate_apic, s);
     qemu_register_reset(apic_reset, s);

     local_apics[s->idx] = s;