Patchwork Re: [PATCH] rtl8139: IO memory is not part of vmstate

login
register
mail settings
Submitter Michael S. Tsirkin
Date Dec. 12, 2010, 10:28 a.m.
Message ID <20101212102812.GA13033@redhat.com>
Download mbox | patch
Permalink /patch/75224/
State New
Headers show

Comments

Michael S. Tsirkin - Dec. 12, 2010, 10:28 a.m.
On Thu, Dec 09, 2010 at 03:14:17PM -0700, Alex Williamson wrote:
> On Thu, 2010-12-09 at 22:49 +0100, Juan Quintela wrote:
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > The cpu_register_io_memory() value is unique to the VM instance and
> > > should not be restored after migration/save.  Doing so means we
> > > could be pointing at arbitrary device's io regions after migration/restore.
> > >
> > > In this case, if we start a VM with a single rtl8139, hot add a 2nd,
> > > migrate the VM, then hot remove the added NIC, the 1st NIC stops
> > > working and the VM segfaults on reboot.
> > >
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > ---
> > >
> > >  hw/rtl8139.c |    4 ++--
> > >  1 files changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/hw/rtl8139.c b/hw/rtl8139.c
> > > index d92981d..9c5fc84 100644
> > > --- a/hw/rtl8139.c
> > > +++ b/hw/rtl8139.c
> > > @@ -3186,7 +3186,7 @@ static void rtl8139_pre_save(void *opaque)
> > >  
> > >  static const VMStateDescription vmstate_rtl8139 = {
> > >      .name = "rtl8139",
> > > -    .version_id = 4,
> > > +    .version_id = 5,
> > 
> > No need to change version, format is still the same.
> > 
> > >      .minimum_version_id = 3,
> > >      .minimum_version_id_old = 3,
> > >      .post_load = rtl8139_post_load,
> > > @@ -3234,7 +3234,7 @@ static const VMStateDescription vmstate_rtl8139 = {
> > >  
> > >          VMSTATE_UNUSED(4),
> > >          VMSTATE_MACADDR(conf.macaddr, RTL8139State),
> > > -        VMSTATE_INT32(rtl8139_mmio_io_addr, RTL8139State),
> > > +        VMSTATE_UNUSED(4),
> > 
> > If we migrate from an old guest: we just ignore the value.
> > If we migrate to one old guest, we send garbage, but as you told that we
> > were already sending garbage, it looks ok, no?
> 
> NAK, if we don't bump the version, we don't know if we're migration
> to/from a version 4 that includes the io address or not.  We have no
> good way to debug different binaries on different systems.  It seems to
> work today if we don't involve hotplug, so I think we have to maintain
> version 4 as including the io address, and let version 5 drop it.  I
> tested old to new migrations, and as you expect, it does work.
> 
> Alex

How about we keep migrating the index for the benefit of
old versions, but ignore the value on load?
Something like the following:


Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Juan Quintela - Dec. 12, 2010, 11:53 a.m.
"Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Thu, Dec 09, 2010 at 03:14:17PM -0700, Alex Williamson wrote:
>> On Thu, 2010-12-09 at 22:49 +0100, Juan Quintela wrote:
>> > Alex Williamson <alex.williamson@redhat.com> wrote:
>> > > The cpu_register_io_memory() value is unique to the VM instance and
>> > > should not be restored after migration/save.  Doing so means we
>> > > could be pointing at arbitrary device's io regions after migration/restore.
>> > >
>> > > In this case, if we start a VM with a single rtl8139, hot add a 2nd,
>> > > migrate the VM, then hot remove the added NIC, the 1st NIC stops
>> > > working and the VM segfaults on reboot.
>> > >
>> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>> > > ---
>> > >
>> > >  hw/rtl8139.c |    4 ++--
>> > >  1 files changed, 2 insertions(+), 2 deletions(-)
>> > >
>> > > diff --git a/hw/rtl8139.c b/hw/rtl8139.c
>> > > index d92981d..9c5fc84 100644
>> > > --- a/hw/rtl8139.c
>> > > +++ b/hw/rtl8139.c
>> > > @@ -3186,7 +3186,7 @@ static void rtl8139_pre_save(void *opaque)
>> > >  
>> > >  static const VMStateDescription vmstate_rtl8139 = {
>> > >      .name = "rtl8139",
>> > > -    .version_id = 4,
>> > > +    .version_id = 5,
>> > 
>> > No need to change version, format is still the same.
>> > 
>> > >      .minimum_version_id = 3,
>> > >      .minimum_version_id_old = 3,
>> > >      .post_load = rtl8139_post_load,
>> > > @@ -3234,7 +3234,7 @@ static const VMStateDescription vmstate_rtl8139 = {
>> > >  
>> > >          VMSTATE_UNUSED(4),
>> > >          VMSTATE_MACADDR(conf.macaddr, RTL8139State),
>> > > -        VMSTATE_INT32(rtl8139_mmio_io_addr, RTL8139State),
>> > > +        VMSTATE_UNUSED(4),
>> > 
>> > If we migrate from an old guest: we just ignore the value.
>> > If we migrate to one old guest, we send garbage, but as you told that we
>> > were already sending garbage, it looks ok, no?
>> 
>> NAK, if we don't bump the version, we don't know if we're migration
>> to/from a version 4 that includes the io address or not.  We have no
>> good way to debug different binaries on different systems.  It seems to
>> work today if we don't involve hotplug, so I think we have to maintain
>> version 4 as including the io address, and let version 5 drop it.  I
>> tested old to new migrations, and as you expect, it does work.
>> 
>> Alex
>
> How about we keep migrating the index for the benefit of
> old versions, but ignore the value on load?
> Something like the following:

This was my 1st suggestion to Alex O:-)
So, I am in.  he think this is bad for upstream,  I don't think so (but
I understand that it is oppinable).

Later, Juan.


>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>
> diff --git a/hw/rtl8139.c b/hw/rtl8139.c
> index d92981d..e999dd9 100644
> --- a/hw/rtl8139.c
> +++ b/hw/rtl8139.c
> @@ -494,6 +494,13 @@ typedef struct RTL8139State {
>      QEMUTimer *timer;
>      int64_t TimerExpire;
>  
> +    /* Hack for migration compatibility: older
> +     * versions of rtl8139 put mmio index in save image,
> +     * and override the index that we have with that one.
> +     * This does not make any sense but happens to
> +     * work sometimes, if we are lucky and index matches.
> +     * On load, we can simply ignore this value. */
> +    int compat_rtl8139_mmio_io_addr;
>  } RTL8139State;
>  
>  static void rtl8139_set_next_tctr_time(RTL8139State *s, int64_t current_time);
> @@ -3182,6 +3189,7 @@ static void rtl8139_pre_save(void *opaque)
>      rtl8139_set_next_tctr_time(s, current_time);
>      s->TCTR = muldiv64(current_time - s->TCTR_base, PCI_FREQUENCY,
>                         get_ticks_per_sec());
> +    s->compat_rtl8139_mmio_io_addr = s->rtl8139_mmio_io_addr;
>  }
>  
>  static const VMStateDescription vmstate_rtl8139 = {
> @@ -3234,7 +3242,7 @@ static const VMStateDescription vmstate_rtl8139 = {
>  
>          VMSTATE_UNUSED(4),
>          VMSTATE_MACADDR(conf.macaddr, RTL8139State),
> -        VMSTATE_INT32(rtl8139_mmio_io_addr, RTL8139State),
> +        VMSTATE_INT32(compat_rtl8139_mmio_io_addr, RTL8139State),
>  
>          VMSTATE_UINT32(currTxDesc, RTL8139State),
>          VMSTATE_UINT32(currCPlusRxDesc, RTL8139State),
Michael S. Tsirkin - Dec. 12, 2010, 12:01 p.m.
On Sun, Dec 12, 2010 at 05:23:39PM +0530, Juan Quintela wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Thu, Dec 09, 2010 at 03:14:17PM -0700, Alex Williamson wrote:
> >> On Thu, 2010-12-09 at 22:49 +0100, Juan Quintela wrote:
> >> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >> > > The cpu_register_io_memory() value is unique to the VM instance and
> >> > > should not be restored after migration/save.  Doing so means we
> >> > > could be pointing at arbitrary device's io regions after migration/restore.
> >> > >
> >> > > In this case, if we start a VM with a single rtl8139, hot add a 2nd,
> >> > > migrate the VM, then hot remove the added NIC, the 1st NIC stops
> >> > > working and the VM segfaults on reboot.
> >> > >
> >> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> >> > > ---
> >> > >
> >> > >  hw/rtl8139.c |    4 ++--
> >> > >  1 files changed, 2 insertions(+), 2 deletions(-)
> >> > >
> >> > > diff --git a/hw/rtl8139.c b/hw/rtl8139.c
> >> > > index d92981d..9c5fc84 100644
> >> > > --- a/hw/rtl8139.c
> >> > > +++ b/hw/rtl8139.c
> >> > > @@ -3186,7 +3186,7 @@ static void rtl8139_pre_save(void *opaque)
> >> > >  
> >> > >  static const VMStateDescription vmstate_rtl8139 = {
> >> > >      .name = "rtl8139",
> >> > > -    .version_id = 4,
> >> > > +    .version_id = 5,
> >> > 
> >> > No need to change version, format is still the same.
> >> > 
> >> > >      .minimum_version_id = 3,
> >> > >      .minimum_version_id_old = 3,
> >> > >      .post_load = rtl8139_post_load,
> >> > > @@ -3234,7 +3234,7 @@ static const VMStateDescription vmstate_rtl8139 = {
> >> > >  
> >> > >          VMSTATE_UNUSED(4),
> >> > >          VMSTATE_MACADDR(conf.macaddr, RTL8139State),
> >> > > -        VMSTATE_INT32(rtl8139_mmio_io_addr, RTL8139State),
> >> > > +        VMSTATE_UNUSED(4),
> >> > 
> >> > If we migrate from an old guest: we just ignore the value.
> >> > If we migrate to one old guest, we send garbage, but as you told that we
> >> > were already sending garbage, it looks ok, no?
> >> 
> >> NAK, if we don't bump the version, we don't know if we're migration
> >> to/from a version 4 that includes the io address or not.  We have no
> >> good way to debug different binaries on different systems.  It seems to
> >> work today if we don't involve hotplug, so I think we have to maintain
> >> version 4 as including the io address, and let version 5 drop it.  I
> >> tested old to new migrations, and as you expect, it does work.
> >> 
> >> Alex
> >
> > How about we keep migrating the index for the benefit of
> > old versions, but ignore the value on load?
> > Something like the following:
> 
> This was my 1st suggestion to Alex O:-)

The difference here is that instead of sending garbage to the
old version we send an actual index value.

> So, I am in.  he think this is bad for upstream,  I don't think so (but
> I understand that it is oppinable).
> 
> Later, Juan.

I think it makes sense to fix this for the stable branch,
and I think we should try as hard as we can to avoid bumping up the
version number there.

For master we can bump the version number but it might be easier to
just keep the code the same there.

> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >
> > diff --git a/hw/rtl8139.c b/hw/rtl8139.c
> > index d92981d..e999dd9 100644
> > --- a/hw/rtl8139.c
> > +++ b/hw/rtl8139.c
> > @@ -494,6 +494,13 @@ typedef struct RTL8139State {
> >      QEMUTimer *timer;
> >      int64_t TimerExpire;
> >  
> > +    /* Hack for migration compatibility: older
> > +     * versions of rtl8139 put mmio index in save image,
> > +     * and override the index that we have with that one.
> > +     * This does not make any sense but happens to
> > +     * work sometimes, if we are lucky and index matches.
> > +     * On load, we can simply ignore this value. */
> > +    int compat_rtl8139_mmio_io_addr;
> >  } RTL8139State;
> >  
> >  static void rtl8139_set_next_tctr_time(RTL8139State *s, int64_t current_time);
> > @@ -3182,6 +3189,7 @@ static void rtl8139_pre_save(void *opaque)
> >      rtl8139_set_next_tctr_time(s, current_time);
> >      s->TCTR = muldiv64(current_time - s->TCTR_base, PCI_FREQUENCY,
> >                         get_ticks_per_sec());
> > +    s->compat_rtl8139_mmio_io_addr = s->rtl8139_mmio_io_addr;
> >  }
> >  
> >  static const VMStateDescription vmstate_rtl8139 = {
> > @@ -3234,7 +3242,7 @@ static const VMStateDescription vmstate_rtl8139 = {
> >  
> >          VMSTATE_UNUSED(4),
> >          VMSTATE_MACADDR(conf.macaddr, RTL8139State),
> > -        VMSTATE_INT32(rtl8139_mmio_io_addr, RTL8139State),
> > +        VMSTATE_INT32(compat_rtl8139_mmio_io_addr, RTL8139State),
> >  
> >          VMSTATE_UINT32(currTxDesc, RTL8139State),
> >          VMSTATE_UINT32(currCPlusRxDesc, RTL8139State),
Juan Quintela - Dec. 12, 2010, 2:37 p.m.
"Michael S. Tsirkin" <mst@redhat.com> wrote:
> On Sun, Dec 12, 2010 at 05:23:39PM +0530, Juan Quintela wrote:
>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> > On Thu, Dec 09, 2010 at 03:14:17PM -0700, Alex Williamson wrote:

>> > How about we keep migrating the index for the benefit of
>> > old versions, but ignore the value on load?
>> > Something like the following:
>> 
>> This was my 1st suggestion to Alex O:-)
>
> The difference here is that instead of sending garbage to the
> old version we send an actual index value.
>
>> So, I am in.  he think this is bad for upstream,  I don't think so (but
>> I understand that it is oppinable).
>> 
>> Later, Juan.
>
> I think it makes sense to fix this for the stable branch,
> and I think we should try as hard as we can to avoid bumping up the
> version number there.
>
> For master we can bump the version number but it might be easier to
> just keep the code the same there.

I think that your solution is better.  For older versions, it works as
expected.  For new versions, problem is fixed.  Solution is not the
"purest", but you can say the same about uping the version for a state
that is exactly the same length & fields O:-)

Later, Juan.
Alex Williamson - Dec. 12, 2010, 4:29 p.m.
On Sun, 2010-12-12 at 20:07 +0530, Juan Quintela wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Sun, Dec 12, 2010 at 05:23:39PM +0530, Juan Quintela wrote:
> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >> > On Thu, Dec 09, 2010 at 03:14:17PM -0700, Alex Williamson wrote:
> 
> >> > How about we keep migrating the index for the benefit of
> >> > old versions, but ignore the value on load?
> >> > Something like the following:
> >> 
> >> This was my 1st suggestion to Alex O:-)
> >
> > The difference here is that instead of sending garbage to the
> > old version we send an actual index value.
> >
> >> So, I am in.  he think this is bad for upstream,  I don't think so (but
> >> I understand that it is oppinable).
> >> 
> >> Later, Juan.
> >
> > I think it makes sense to fix this for the stable branch,
> > and I think we should try as hard as we can to avoid bumping up the
> > version number there.
> >
> > For master we can bump the version number but it might be easier to
> > just keep the code the same there.
> 
> I think that your solution is better.  For older versions, it works as
> expected.  For new versions, problem is fixed.  Solution is not the
> "purest", but you can say the same about uping the version for a state
> that is exactly the same length & fields O:-)

I disagree, without bumping the version number, we can never guarantee
the problem is behind us.  We can always migrate to the bad version,
which puts our users at risk.  The responsible behavior is to allow
forward migrations and prevent migrations to a version with an issue
known to compromise VM integrity.  Perhaps I feel more strongly about
this because I actually had to debug this problem.  Obvious in
retrospect, but a huge pain in the butt to get there.

I had sent Juan a similar patch to the one Michael proposed, but with
the following change:

+    if (s->dev.qdev.hotplugged) {
+        s->rtl8139_mmio_io_addr_dummy = IO_MEM_UNASSIGNED;
+    } else {
+        s->rtl8139_mmio_io_addr_dummy = s->rtl8139_mmio_io_addr;
+    }

With this, we at least limit the damage that the hotplugged NIC can do,
but all it takes is a reboot of the VM to touch the BARs before the
device becomes unusable (which is better than taking out the whole VM).
If we need something like this for stable, I will begrudgingly agree,
but for master I think we have to bump the version.  Thanks,

Alex
Michael S. Tsirkin - Dec. 12, 2010, 4:41 p.m.
On Sun, Dec 12, 2010 at 09:29:05AM -0700, Alex Williamson wrote:
> On Sun, 2010-12-12 at 20:07 +0530, Juan Quintela wrote:
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > On Sun, Dec 12, 2010 at 05:23:39PM +0530, Juan Quintela wrote:
> > >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >> > On Thu, Dec 09, 2010 at 03:14:17PM -0700, Alex Williamson wrote:
> > 
> > >> > How about we keep migrating the index for the benefit of
> > >> > old versions, but ignore the value on load?
> > >> > Something like the following:
> > >> 
> > >> This was my 1st suggestion to Alex O:-)
> > >
> > > The difference here is that instead of sending garbage to the
> > > old version we send an actual index value.
> > >
> > >> So, I am in.  he think this is bad for upstream,  I don't think so (but
> > >> I understand that it is oppinable).
> > >> 
> > >> Later, Juan.
> > >
> > > I think it makes sense to fix this for the stable branch,
> > > and I think we should try as hard as we can to avoid bumping up the
> > > version number there.
> > >
> > > For master we can bump the version number but it might be easier to
> > > just keep the code the same there.
> > 
> > I think that your solution is better.  For older versions, it works as
> > expected.  For new versions, problem is fixed.  Solution is not the
> > "purest", but you can say the same about uping the version for a state
> > that is exactly the same length & fields O:-)
> 
> I disagree, without bumping the version number, we can never guarantee
> the problem is behind us.  We can always migrate to the bad version,
> which puts our users at risk.

Well, they can also migrate between old versions, right?

>  The responsible behavior is to allow
> forward migrations and prevent migrations to a version with an issue
> known to compromise VM integrity.  Perhaps I feel more strongly about
> this because I actually had to debug this problem.  Obvious in
> retrospect, but a huge pain in the butt to get there.
> 
> I had sent Juan a similar patch to the one Michael proposed, but with
> the following change:
> 
> +    if (s->dev.qdev.hotplugged) {
> +        s->rtl8139_mmio_io_addr_dummy = IO_MEM_UNASSIGNED;
> +    } else {
> +        s->rtl8139_mmio_io_addr_dummy = s->rtl8139_mmio_io_addr;
> +    }
> 
> With this, we at least limit the damage that the hotplugged NIC can do,
> but all it takes is a reboot of the VM to touch the BARs before the
> device becomes unusable (which is better than taking out the whole VM).

Looks good to me.

> If we need something like this for stable, I will begrudgingly agree,
> but for master I think we have to bump the version.  Thanks,
> 
> Alex
Juan Quintela - Dec. 12, 2010, 9:25 p.m.
Alex Williamson <alex.williamson@redhat.com> wrote:
> On Sun, 2010-12-12 at 20:07 +0530, Juan Quintela wrote:
>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> > On Sun, Dec 12, 2010 at 05:23:39PM +0530, Juan Quintela wrote:
>> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> >> > On Thu, Dec 09, 2010 at 03:14:17PM -0700, Alex Williamson wrote:
>> 
>> >> > How about we keep migrating the index for the benefit of
>> >> > old versions, but ignore the value on load?
>> >> > Something like the following:
>> >> 
>> >> This was my 1st suggestion to Alex O:-)
>> >
>> > The difference here is that instead of sending garbage to the
>> > old version we send an actual index value.
>> >
>> >> So, I am in.  he think this is bad for upstream,  I don't think so (but
>> >> I understand that it is oppinable).
>> >> 
>> >> Later, Juan.
>> >
>> > I think it makes sense to fix this for the stable branch,
>> > and I think we should try as hard as we can to avoid bumping up the
>> > version number there.
>> >
>> > For master we can bump the version number but it might be easier to
>> > just keep the code the same there.
>> 
>> I think that your solution is better.  For older versions, it works as
>> expected.  For new versions, problem is fixed.  Solution is not the
>> "purest", but you can say the same about uping the version for a state
>> that is exactly the same length & fields O:-)
>
> I disagree, without bumping the version number, we can never guarantee
> the problem is behind us.

we can, if we use the latest version.

> We can always migrate to the bad version,

That is the whole point.  Bumping the version makes this impossible.

> which puts our users at risk.  The responsible behavior is to allow
> forward migrations and prevent migrations to a version with an issue
> known to compromise VM integrity.  Perhaps I feel more strongly about
> this because I actually had to debug this problem.  Obvious in
> retrospect, but a huge pain in the butt to get there.

Obviously, my point of view is different, and is related with
maintaining a stable migration ABI. So, ... I am also "biased".

We have to make a decission (in general, not just this case):
- we are going to never bump the version:
  this gives an stable ABI, but bugs stay with us forever
- we are not ever going to prettend that we care
  this makes changes trivial, as we don't have to maintain
  backward compatiblity.

And that is it.  Basically anything in the middle don't matter.  If I
have a machine definition, with only a single device that has bumped
version, I can't migrate to the backwards one.

This is the reason why I am against the changes like this, if we are
prettending that we are going to maintain the versions stable.

Notice that there are (at least) two ways to look at this specific
problem:
- don't bump the version.
  * new -> new : works
  * old -> new : works
  * new -> old : works (at least as well as old -> old that existed
                        before)
- bump the version
  * new -> new: works
  * old -> new: works
  * new -> old: fails always

Obviously, never of the options is perfect :(

Thanks, Juan.
Alex Williamson - Dec. 13, 2010, 5:43 p.m.
On Mon, 2010-12-13 at 02:55 +0530, Juan Quintela wrote:
> Alex Williamson <alex.williamson@redhat.com> wrote:
> > On Sun, 2010-12-12 at 20:07 +0530, Juan Quintela wrote:
> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >> > On Sun, Dec 12, 2010 at 05:23:39PM +0530, Juan Quintela wrote:
> >> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >> >> > On Thu, Dec 09, 2010 at 03:14:17PM -0700, Alex Williamson wrote:
> >> 
> >> >> > How about we keep migrating the index for the benefit of
> >> >> > old versions, but ignore the value on load?
> >> >> > Something like the following:
> >> >> 
> >> >> This was my 1st suggestion to Alex O:-)
> >> >
> >> > The difference here is that instead of sending garbage to the
> >> > old version we send an actual index value.
> >> >
> >> >> So, I am in.  he think this is bad for upstream,  I don't think so (but
> >> >> I understand that it is oppinable).
> >> >> 
> >> >> Later, Juan.
> >> >
> >> > I think it makes sense to fix this for the stable branch,
> >> > and I think we should try as hard as we can to avoid bumping up the
> >> > version number there.
> >> >
> >> > For master we can bump the version number but it might be easier to
> >> > just keep the code the same there.
> >> 
> >> I think that your solution is better.  For older versions, it works as
> >> expected.  For new versions, problem is fixed.  Solution is not the
> >> "purest", but you can say the same about uping the version for a state
> >> that is exactly the same length & fields O:-)
> >
> > I disagree, without bumping the version number, we can never guarantee
> > the problem is behind us.
> 
> we can, if we use the latest version.

And we determine we're using the latest version via the vmsd
version_id...

> > We can always migrate to the bad version,
> 
> That is the whole point.  Bumping the version makes this impossible.

Which seems like a good thing to me.  Yes, it sucks that a user may
upgrade a host, migrate a guest to it, and suddenly not be able to
migrate back to the original host.  On the other hand, isn't it better
that we don't allow a migration that could potentially risk the
integrity of the guest?  I think so.

> > which puts our users at risk.  The responsible behavior is to allow
> > forward migrations and prevent migrations to a version with an issue
> > known to compromise VM integrity.  Perhaps I feel more strongly about
> > this because I actually had to debug this problem.  Obvious in
> > retrospect, but a huge pain in the butt to get there.
> 
> Obviously, my point of view is different, and is related with
> maintaining a stable migration ABI. So, ... I am also "biased".
> 
> We have to make a decission (in general, not just this case):
> - we are going to never bump the version:
>   this gives an stable ABI, but bugs stay with us forever

This is impossible.

> - we are not ever going to prettend that we care
>   this makes changes trivial, as we don't have to maintain
>   backward compatiblity.

That's a little dramatic.  If we can come up with a way to not bump the
version number, I'm all for it.  I haven't seen one so far.

> 
> And that is it.  Basically anything in the middle don't matter.  If I
> have a machine definition, with only a single device that has bumped
> version, I can't migrate to the backwards one.

Sorry, it's for your own good.  AIUI, there is plenty of grey between
your criteria above.  Yes we should try to preserve the migration ABI.
However, we will hit bugs where that's impossible.  Then it's good to
have discussions like this and investigate whether we can safely make a
change without bumping the version_id.  IMHO, the integrity of the guest
is always more important than maintaining a static ABI.

> This is the reason why I am against the changes like this, if we are
> prettending that we are going to maintain the versions stable.
> 
> Notice that there are (at least) two ways to look at this specific
> problem:
> - don't bump the version.
>   * new -> new : works
>   * old -> new : works
>   * new -> old : works (at least as well as old -> old that existed
>                         before)

If it worked, I wouldn't be working on this bug ;)  Here are some
failure scenarios:

a)
   1. Boot guest with single rtl8139
   2. Hot add 2nd rtl8139
   3. Migrate guest
   4. Hot remove 2nd rtl8139
   Result: 1st NIC stops working, guest segfaults on reboot

Too complicated?  How about this:

b)
   1. Boot guest with 2 rtl8139 NICs
   2. Boot migration target with NICs listed in reverse order
   3. Migrate
   Result: NICs get swapped at reboot!!

Or how about:

c)
   1. Boot guest with e1000, rtl8139
   2. Boot migration target with rtl8139, e1000
   3. Migrate
   Result: rtl8139 now points at e1000 mmio space, fails on reboot,
e1000 fails if rtl8139 is removed

I don't think it's fair to call any of these working, and in fact, I
retract my patch that sets the mmio space to unassigned if the device is
hotplugged, since issues can clearly happen without hotplug involved.
The index the device uses depends entirely on instantiation ordering,
which is bound to cause confusing, hard to reproduce, and difficult to
debug issues.

> - bump the version
>   * new -> new: works
>   * old -> new: works
>   * new -> old: fails always

Correction here too, new->old is prevented, the migration source will
continue running after the incompatible versions are transmitted and the
migration canceled.

So, unfortunately, I stand by my original patch.  I understand your
desire to avoid bumping the migration ABI, but doing so is the only way
we can guarantee integrity of the guest... and that's what it's there
for.  Thanks,

Alex
Michael S. Tsirkin - Dec. 13, 2010, 5:50 p.m.
On Mon, Dec 13, 2010 at 10:43:22AM -0700, Alex Williamson wrote:
> So, unfortunately, I stand by my original patch.

What about the one that put -1 in saved index for a hotplugged device?
Alex Williamson - Dec. 13, 2010, 6 p.m.
On Mon, 2010-12-13 at 19:50 +0200, Michael S. Tsirkin wrote:
> On Mon, Dec 13, 2010 at 10:43:22AM -0700, Alex Williamson wrote:
> > So, unfortunately, I stand by my original patch.
> 
> What about the one that put -1 in saved index for a hotplugged device?

There are still examples that don't work even without hotplug (example 2
and example 3 after the reboot).  That hack limits the damage, but still
leaves a latent bug for reboot and doesn't address the non-hotplug
scenarios.  So, I don't think it's worthwhile to pursue, and we
shouldn't pretend we can use it to avoid bumping the version_id.
Thanks,

Alex
Michael S. Tsirkin - Dec. 13, 2010, 6:54 p.m.
On Mon, Dec 13, 2010 at 11:00:44AM -0700, Alex Williamson wrote:
> On Mon, 2010-12-13 at 19:50 +0200, Michael S. Tsirkin wrote:
> > On Mon, Dec 13, 2010 at 10:43:22AM -0700, Alex Williamson wrote:
> > > So, unfortunately, I stand by my original patch.
> > 
> > What about the one that put -1 in saved index for a hotplugged device?
> 
> There are still examples that don't work even without hotplug (example 2
> and example 3 after the reboot).  That hack limits the damage, but still
> leaves a latent bug for reboot and doesn't address the non-hotplug
> scenarios.  So, I don't think it's worthwhile to pursue, and we
> shouldn't pretend we can use it to avoid bumping the version_id.
> Thanks,
> 
> Alex

I guess when we bump it we tell users: migration is completely
borken to the old version, don't even try it.

Is there a way for libvirt to discover such incompatibilities
and avoid the migration?
Alex Williamson - Dec. 13, 2010, 6:59 p.m.
On Mon, 2010-12-13 at 20:54 +0200, Michael S. Tsirkin wrote:
> On Mon, Dec 13, 2010 at 11:00:44AM -0700, Alex Williamson wrote:
> > On Mon, 2010-12-13 at 19:50 +0200, Michael S. Tsirkin wrote:
> > > On Mon, Dec 13, 2010 at 10:43:22AM -0700, Alex Williamson wrote:
> > > > So, unfortunately, I stand by my original patch.
> > > 
> > > What about the one that put -1 in saved index for a hotplugged device?
> > 
> > There are still examples that don't work even without hotplug (example 2
> > and example 3 after the reboot).  That hack limits the damage, but still
> > leaves a latent bug for reboot and doesn't address the non-hotplug
> > scenarios.  So, I don't think it's worthwhile to pursue, and we
> > shouldn't pretend we can use it to avoid bumping the version_id.
> > Thanks,
> > 
> > Alex
> 
> I guess when we bump it we tell users: migration is completely
> borken to the old version, don't even try it.
> 
> Is there a way for libvirt to discover such incompatibilities
> and avoid the migration?

I don't know if libvirt has a way to query this in advance.  If a
migration is attempted, the target will report:

savevm: unsupported version 5 for '0000:00:03.0/rtl8139' v4

And the source will continue running.  We waste plenty of bits getting
to that point, but hopefully libvirt understands that it failed.
Thanks,

Alex
Michael S. Tsirkin - Dec. 13, 2010, 7:06 p.m.
On Mon, Dec 13, 2010 at 11:59:16AM -0700, Alex Williamson wrote:
> On Mon, 2010-12-13 at 20:54 +0200, Michael S. Tsirkin wrote:
> > On Mon, Dec 13, 2010 at 11:00:44AM -0700, Alex Williamson wrote:
> > > On Mon, 2010-12-13 at 19:50 +0200, Michael S. Tsirkin wrote:
> > > > On Mon, Dec 13, 2010 at 10:43:22AM -0700, Alex Williamson wrote:
> > > > > So, unfortunately, I stand by my original patch.
> > > > 
> > > > What about the one that put -1 in saved index for a hotplugged device?
> > > 
> > > There are still examples that don't work even without hotplug (example 2
> > > and example 3 after the reboot).  That hack limits the damage, but still
> > > leaves a latent bug for reboot and doesn't address the non-hotplug
> > > scenarios.  So, I don't think it's worthwhile to pursue, and we
> > > shouldn't pretend we can use it to avoid bumping the version_id.
> > > Thanks,
> > > 
> > > Alex
> > 
> > I guess when we bump it we tell users: migration is completely
> > borken to the old version, don't even try it.
> > 
> > Is there a way for libvirt to discover such incompatibilities
> > and avoid the migration?
> 
> I don't know if libvirt has a way to query this in advance.  If a
> migration is attempted, the target will report:
> 
> savevm: unsupported version 5 for '0000:00:03.0/rtl8139' v4
> 
> And the source will continue running.  We waste plenty of bits getting
> to that point,

Yes, this happens after all of memory has been migrated.

> but hopefully libvirt understands that it failed.
> Thanks,
> 
> Alex
Alex Williamson - Dec. 13, 2010, 7:15 p.m.
On Mon, 2010-12-13 at 21:06 +0200, Michael S. Tsirkin wrote:
> On Mon, Dec 13, 2010 at 11:59:16AM -0700, Alex Williamson wrote:
> > On Mon, 2010-12-13 at 20:54 +0200, Michael S. Tsirkin wrote:
> > > On Mon, Dec 13, 2010 at 11:00:44AM -0700, Alex Williamson wrote:
> > > > On Mon, 2010-12-13 at 19:50 +0200, Michael S. Tsirkin wrote:
> > > > > On Mon, Dec 13, 2010 at 10:43:22AM -0700, Alex Williamson wrote:
> > > > > > So, unfortunately, I stand by my original patch.
> > > > > 
> > > > > What about the one that put -1 in saved index for a hotplugged device?
> > > > 
> > > > There are still examples that don't work even without hotplug (example 2
> > > > and example 3 after the reboot).  That hack limits the damage, but still
> > > > leaves a latent bug for reboot and doesn't address the non-hotplug
> > > > scenarios.  So, I don't think it's worthwhile to pursue, and we
> > > > shouldn't pretend we can use it to avoid bumping the version_id.
> > > > Thanks,
> > > > 
> > > > Alex
> > > 
> > > I guess when we bump it we tell users: migration is completely
> > > borken to the old version, don't even try it.
> > > 
> > > Is there a way for libvirt to discover such incompatibilities
> > > and avoid the migration?
> > 
> > I don't know if libvirt has a way to query this in advance.  If a
> > migration is attempted, the target will report:
> > 
> > savevm: unsupported version 5 for '0000:00:03.0/rtl8139' v4
> > 
> > And the source will continue running.  We waste plenty of bits getting
> > to that point,
> 
> Yes, this happens after all of memory has been migrated.

Better late than never :^\
Michael S. Tsirkin - Dec. 14, 2010, 4:43 a.m.
On Mon, Dec 13, 2010 at 12:15:08PM -0700, Alex Williamson wrote:
> On Mon, 2010-12-13 at 21:06 +0200, Michael S. Tsirkin wrote:
> > On Mon, Dec 13, 2010 at 11:59:16AM -0700, Alex Williamson wrote:
> > > On Mon, 2010-12-13 at 20:54 +0200, Michael S. Tsirkin wrote:
> > > > On Mon, Dec 13, 2010 at 11:00:44AM -0700, Alex Williamson wrote:
> > > > > On Mon, 2010-12-13 at 19:50 +0200, Michael S. Tsirkin wrote:
> > > > > > On Mon, Dec 13, 2010 at 10:43:22AM -0700, Alex Williamson wrote:
> > > > > > > So, unfortunately, I stand by my original patch.
> > > > > > 
> > > > > > What about the one that put -1 in saved index for a hotplugged device?
> > > > > 
> > > > > There are still examples that don't work even without hotplug (example 2
> > > > > and example 3 after the reboot).  That hack limits the damage, but still
> > > > > leaves a latent bug for reboot and doesn't address the non-hotplug
> > > > > scenarios.  So, I don't think it's worthwhile to pursue, and we
> > > > > shouldn't pretend we can use it to avoid bumping the version_id.
> > > > > Thanks,
> > > > > 
> > > > > Alex
> > > > 
> > > > I guess when we bump it we tell users: migration is completely
> > > > borken to the old version, don't even try it.
> > > > 
> > > > Is there a way for libvirt to discover such incompatibilities
> > > > and avoid the migration?
> > > 
> > > I don't know if libvirt has a way to query this in advance.  If a
> > > migration is attempted, the target will report:
> > > 
> > > savevm: unsupported version 5 for '0000:00:03.0/rtl8139' v4
> > > 
> > > And the source will continue running.  We waste plenty of bits getting
> > > to that point,
> > 
> > Yes, this happens after all of memory has been migrated.
> 
> Better late than never :^\

One other question: can we do the same by creating a new (empty)
section? As was discussed in the past this is easier for
downstreams to cherry-pick.
Alex Williamson - Dec. 14, 2010, 5 a.m.
On Tue, 2010-12-14 at 06:43 +0200, Michael S. Tsirkin wrote:
> On Mon, Dec 13, 2010 at 12:15:08PM -0700, Alex Williamson wrote:
> > On Mon, 2010-12-13 at 21:06 +0200, Michael S. Tsirkin wrote:
> > > On Mon, Dec 13, 2010 at 11:59:16AM -0700, Alex Williamson wrote:
> > > > On Mon, 2010-12-13 at 20:54 +0200, Michael S. Tsirkin wrote:
> > > > > On Mon, Dec 13, 2010 at 11:00:44AM -0700, Alex Williamson wrote:
> > > > > > On Mon, 2010-12-13 at 19:50 +0200, Michael S. Tsirkin wrote:
> > > > > > > On Mon, Dec 13, 2010 at 10:43:22AM -0700, Alex Williamson wrote:
> > > > > > > > So, unfortunately, I stand by my original patch.
> > > > > > > 
> > > > > > > What about the one that put -1 in saved index for a hotplugged device?
> > > > > > 
> > > > > > There are still examples that don't work even without hotplug (example 2
> > > > > > and example 3 after the reboot).  That hack limits the damage, but still
> > > > > > leaves a latent bug for reboot and doesn't address the non-hotplug
> > > > > > scenarios.  So, I don't think it's worthwhile to pursue, and we
> > > > > > shouldn't pretend we can use it to avoid bumping the version_id.
> > > > > > Thanks,
> > > > > > 
> > > > > > Alex
> > > > > 
> > > > > I guess when we bump it we tell users: migration is completely
> > > > > borken to the old version, don't even try it.
> > > > > 
> > > > > Is there a way for libvirt to discover such incompatibilities
> > > > > and avoid the migration?
> > > > 
> > > > I don't know if libvirt has a way to query this in advance.  If a
> > > > migration is attempted, the target will report:
> > > > 
> > > > savevm: unsupported version 5 for '0000:00:03.0/rtl8139' v4
> > > > 
> > > > And the source will continue running.  We waste plenty of bits getting
> > > > to that point,
> > > 
> > > Yes, this happens after all of memory has been migrated.
> > 
> > Better late than never :^\
> 
> One other question: can we do the same by creating a new (empty)
> section? As was discussed in the past this is easier for
> downstreams to cherry-pick.

The only way I can think to do that would be to have a subsection that
is always included, but saves no data.  That would force a failure on
new->old migration, but I don't think it really matches the intended
purpose of subsections and feels like it's adding cruft for no gain.
Maybe I'm missing something.  Juan, is there any advantage to trapping
this in a subsection?  Thanks,

Alex
Michael S. Tsirkin - Dec. 14, 2010, 12:32 p.m.
On Mon, Dec 13, 2010 at 10:00:48PM -0700, Alex Williamson wrote:
> On Tue, 2010-12-14 at 06:43 +0200, Michael S. Tsirkin wrote:
> > On Mon, Dec 13, 2010 at 12:15:08PM -0700, Alex Williamson wrote:
> > > On Mon, 2010-12-13 at 21:06 +0200, Michael S. Tsirkin wrote:
> > > > On Mon, Dec 13, 2010 at 11:59:16AM -0700, Alex Williamson wrote:
> > > > > On Mon, 2010-12-13 at 20:54 +0200, Michael S. Tsirkin wrote:
> > > > > > On Mon, Dec 13, 2010 at 11:00:44AM -0700, Alex Williamson wrote:
> > > > > > > On Mon, 2010-12-13 at 19:50 +0200, Michael S. Tsirkin wrote:
> > > > > > > > On Mon, Dec 13, 2010 at 10:43:22AM -0700, Alex Williamson wrote:
> > > > > > > > > So, unfortunately, I stand by my original patch.
> > > > > > > > 
> > > > > > > > What about the one that put -1 in saved index for a hotplugged device?
> > > > > > > 
> > > > > > > There are still examples that don't work even without hotplug (example 2
> > > > > > > and example 3 after the reboot).  That hack limits the damage, but still
> > > > > > > leaves a latent bug for reboot and doesn't address the non-hotplug
> > > > > > > scenarios.  So, I don't think it's worthwhile to pursue, and we
> > > > > > > shouldn't pretend we can use it to avoid bumping the version_id.
> > > > > > > Thanks,
> > > > > > > 
> > > > > > > Alex
> > > > > > 
> > > > > > I guess when we bump it we tell users: migration is completely
> > > > > > borken to the old version, don't even try it.
> > > > > > 
> > > > > > Is there a way for libvirt to discover such incompatibilities
> > > > > > and avoid the migration?
> > > > > 
> > > > > I don't know if libvirt has a way to query this in advance.  If a
> > > > > migration is attempted, the target will report:
> > > > > 
> > > > > savevm: unsupported version 5 for '0000:00:03.0/rtl8139' v4
> > > > > 
> > > > > And the source will continue running.  We waste plenty of bits getting
> > > > > to that point,
> > > > 
> > > > Yes, this happens after all of memory has been migrated.
> > > 
> > > Better late than never :^\
> > 
> > One other question: can we do the same by creating a new (empty)
> > section? As was discussed in the past this is easier for
> > downstreams to cherry-pick.
> 
> The only way I can think to do that would be to have a subsection that
> is always included, but saves no data.  That would force a failure on
> new->old migration, but I don't think it really matches the intended
> purpose of subsections and feels like it's adding cruft for no gain.
> Maybe I'm missing something.  Juan, is there any advantage to trapping
> this in a subsection?  Thanks,
> 
> Alex

Maybe in this particular case the advantage is minimal.
But it seems easier to stick to a rule of no more version
bumps than argue about each case.
Alex Williamson - Dec. 14, 2010, 3:41 p.m.
On Tue, 2010-12-14 at 14:32 +0200, Michael S. Tsirkin wrote:
> On Mon, Dec 13, 2010 at 10:00:48PM -0700, Alex Williamson wrote:
> > On Tue, 2010-12-14 at 06:43 +0200, Michael S. Tsirkin wrote:
> > > On Mon, Dec 13, 2010 at 12:15:08PM -0700, Alex Williamson wrote:
> > > > On Mon, 2010-12-13 at 21:06 +0200, Michael S. Tsirkin wrote:
> > > > > On Mon, Dec 13, 2010 at 11:59:16AM -0700, Alex Williamson wrote:
> > > > > > On Mon, 2010-12-13 at 20:54 +0200, Michael S. Tsirkin wrote:
> > > > > > > On Mon, Dec 13, 2010 at 11:00:44AM -0700, Alex Williamson wrote:
> > > > > > > > On Mon, 2010-12-13 at 19:50 +0200, Michael S. Tsirkin wrote:
> > > > > > > > > On Mon, Dec 13, 2010 at 10:43:22AM -0700, Alex Williamson wrote:
> > > > > > > > > > So, unfortunately, I stand by my original patch.
> > > > > > > > > 
> > > > > > > > > What about the one that put -1 in saved index for a hotplugged device?
> > > > > > > > 
> > > > > > > > There are still examples that don't work even without hotplug (example 2
> > > > > > > > and example 3 after the reboot).  That hack limits the damage, but still
> > > > > > > > leaves a latent bug for reboot and doesn't address the non-hotplug
> > > > > > > > scenarios.  So, I don't think it's worthwhile to pursue, and we
> > > > > > > > shouldn't pretend we can use it to avoid bumping the version_id.
> > > > > > > > Thanks,
> > > > > > > > 
> > > > > > > > Alex
> > > > > > > 
> > > > > > > I guess when we bump it we tell users: migration is completely
> > > > > > > borken to the old version, don't even try it.
> > > > > > > 
> > > > > > > Is there a way for libvirt to discover such incompatibilities
> > > > > > > and avoid the migration?
> > > > > > 
> > > > > > I don't know if libvirt has a way to query this in advance.  If a
> > > > > > migration is attempted, the target will report:
> > > > > > 
> > > > > > savevm: unsupported version 5 for '0000:00:03.0/rtl8139' v4
> > > > > > 
> > > > > > And the source will continue running.  We waste plenty of bits getting
> > > > > > to that point,
> > > > > 
> > > > > Yes, this happens after all of memory has been migrated.
> > > > 
> > > > Better late than never :^\
> > > 
> > > One other question: can we do the same by creating a new (empty)
> > > section? As was discussed in the past this is easier for
> > > downstreams to cherry-pick.
> > 
> > The only way I can think to do that would be to have a subsection that
> > is always included, but saves no data.  That would force a failure on
> > new->old migration, but I don't think it really matches the intended
> > purpose of subsections and feels like it's adding cruft for no gain.
> > Maybe I'm missing something.  Juan, is there any advantage to trapping
> > this in a subsection?  Thanks,
> > 
> > Alex
> 
> Maybe in this particular case the advantage is minimal.
> But it seems easier to stick to a rule of no more version
> bumps than argue about each case.

Do we have such a rule?  If we have a subsection who's needed function
is return 1, I think that's a good indication that it's not appropriate
for a subsection and the end result is equivalent to bumping the main
driver vmstate version.  It's convoluted to try to hide a one-way
upgrade in a subsection.  Thanks,

Alex
Paolo Bonzini - Dec. 14, 2010, 3:59 p.m.
On 12/14/2010 04:41 PM, Alex Williamson wrote:
> >  Maybe in this particular case the advantage is minimal.
> >  But it seems easier to stick to a rule of no more version
> >  bumps than argue about each case.
>
> Do we have such a rule?  If we have a subsection who's needed function
> is return 1, I think that's a good indication that it's not appropriate
> for a subsection and the end result is equivalent to bumping the main
> driver vmstate version.  It's convoluted to try to hide a one-way
> upgrade in a subsection.  Thanks,

Indeed, subsections are for data that is rarely needed so that there's 
some chance (sometimes ~100%) of migration working seemlessly.  In this 
case it's either no-bump-and-live-with-the-consequences, or changing the 
version id.

Paolo
Michael S. Tsirkin - Dec. 15, 2010, 10 a.m.
On Tue, Dec 14, 2010 at 04:59:55PM +0100, Paolo Bonzini wrote:
> On 12/14/2010 04:41 PM, Alex Williamson wrote:
> >>  Maybe in this particular case the advantage is minimal.
> >>  But it seems easier to stick to a rule of no more version
> >>  bumps than argue about each case.
> >
> >Do we have such a rule?  If we have a subsection who's needed function
> >is return 1, I think that's a good indication that it's not appropriate
> >for a subsection and the end result is equivalent to bumping the main
> >driver vmstate version.  It's convoluted to try to hide a one-way
> >upgrade in a subsection.  Thanks,
> 
> Indeed, subsections are for data that is rarely needed so that
> there's some chance (sometimes ~100%) of migration working
> seemlessly.

If a subsection arrives that qemu does
not know about, won't migratin fail?

>  In this case it's either
> no-bump-and-live-with-the-consequences, or changing the version id.
> 
> Paolo

This was discussed to death already.  version ids have the problem
that they don't play nicely with downstreams.
Michael S. Tsirkin - Dec. 15, 2010, 10:07 a.m.
On Tue, Dec 14, 2010 at 08:41:55AM -0700, Alex Williamson wrote:
> On Tue, 2010-12-14 at 14:32 +0200, Michael S. Tsirkin wrote:
> > On Mon, Dec 13, 2010 at 10:00:48PM -0700, Alex Williamson wrote:
> > > On Tue, 2010-12-14 at 06:43 +0200, Michael S. Tsirkin wrote:
> > > > On Mon, Dec 13, 2010 at 12:15:08PM -0700, Alex Williamson wrote:
> > > > > On Mon, 2010-12-13 at 21:06 +0200, Michael S. Tsirkin wrote:
> > > > > > On Mon, Dec 13, 2010 at 11:59:16AM -0700, Alex Williamson wrote:
> > > > > > > On Mon, 2010-12-13 at 20:54 +0200, Michael S. Tsirkin wrote:
> > > > > > > > On Mon, Dec 13, 2010 at 11:00:44AM -0700, Alex Williamson wrote:
> > > > > > > > > On Mon, 2010-12-13 at 19:50 +0200, Michael S. Tsirkin wrote:
> > > > > > > > > > On Mon, Dec 13, 2010 at 10:43:22AM -0700, Alex Williamson wrote:
> > > > > > > > > > > So, unfortunately, I stand by my original patch.
> > > > > > > > > > 
> > > > > > > > > > What about the one that put -1 in saved index for a hotplugged device?
> > > > > > > > > 
> > > > > > > > > There are still examples that don't work even without hotplug (example 2
> > > > > > > > > and example 3 after the reboot).  That hack limits the damage, but still
> > > > > > > > > leaves a latent bug for reboot and doesn't address the non-hotplug
> > > > > > > > > scenarios.  So, I don't think it's worthwhile to pursue, and we
> > > > > > > > > shouldn't pretend we can use it to avoid bumping the version_id.
> > > > > > > > > Thanks,
> > > > > > > > > 
> > > > > > > > > Alex
> > > > > > > > 
> > > > > > > > I guess when we bump it we tell users: migration is completely
> > > > > > > > borken to the old version, don't even try it.
> > > > > > > > 
> > > > > > > > Is there a way for libvirt to discover such incompatibilities
> > > > > > > > and avoid the migration?
> > > > > > > 
> > > > > > > I don't know if libvirt has a way to query this in advance.  If a
> > > > > > > migration is attempted, the target will report:
> > > > > > > 
> > > > > > > savevm: unsupported version 5 for '0000:00:03.0/rtl8139' v4
> > > > > > > 
> > > > > > > And the source will continue running.  We waste plenty of bits getting
> > > > > > > to that point,
> > > > > > 
> > > > > > Yes, this happens after all of memory has been migrated.
> > > > > 
> > > > > Better late than never :^\
> > > > 
> > > > One other question: can we do the same by creating a new (empty)
> > > > section? As was discussed in the past this is easier for
> > > > downstreams to cherry-pick.
> > > 
> > > The only way I can think to do that would be to have a subsection that
> > > is always included, but saves no data.  That would force a failure on
> > > new->old migration, but I don't think it really matches the intended
> > > purpose of subsections and feels like it's adding cruft for no gain.
> > > Maybe I'm missing something.  Juan, is there any advantage to trapping
> > > this in a subsection?  Thanks,
> > > 
> > > Alex
> > 
> > Maybe in this particular case the advantage is minimal.
> > But it seems easier to stick to a rule of no more version
> > bumps than argue about each case.
> 
> Do we have such a rule?

I think it's easier to have a hard rule than start arguing whether
each change is a subsection or not for each patch.
What is the downside?

> If we have a subsection who's needed function
> is return 1, I think that's a good indication that it's not appropriate
> for a subsection and the end result is equivalent to bumping the main
> driver vmstate version.

So if you feel strongly about it, go ahead, I'm just concerned
we'll now need to argue version or subsection for each migration-related
change.

> It's convoluted to try to hide a one-way upgrade in a subsection.

It may be convoluted but I think it works better with downstreams.
When we'll bump the version again in the future, it
should be possible for a downstream to pick up
just the next change, and skip this one.

But that's a general comment. In this case, I expect all
downstream to be able to pick this bugfix with no real downsides.

>  Thanks,
> 
> Alex
Alex Williamson - Dec. 15, 2010, 4:05 p.m.
On Wed, 2010-12-15 at 12:07 +0200, Michael S. Tsirkin wrote:
> On Tue, Dec 14, 2010 at 08:41:55AM -0700, Alex Williamson wrote:
> > On Tue, 2010-12-14 at 14:32 +0200, Michael S. Tsirkin wrote:
> > > On Mon, Dec 13, 2010 at 10:00:48PM -0700, Alex Williamson wrote:
> > > > On Tue, 2010-12-14 at 06:43 +0200, Michael S. Tsirkin wrote:
> > > > > On Mon, Dec 13, 2010 at 12:15:08PM -0700, Alex Williamson wrote:
> > > > > > On Mon, 2010-12-13 at 21:06 +0200, Michael S. Tsirkin wrote:
> > > > > > > On Mon, Dec 13, 2010 at 11:59:16AM -0700, Alex Williamson wrote:
> > > > > > > > On Mon, 2010-12-13 at 20:54 +0200, Michael S. Tsirkin wrote:
> > > > > > > > > On Mon, Dec 13, 2010 at 11:00:44AM -0700, Alex Williamson wrote:
> > > > > > > > > > On Mon, 2010-12-13 at 19:50 +0200, Michael S. Tsirkin wrote:
> > > > > > > > > > > On Mon, Dec 13, 2010 at 10:43:22AM -0700, Alex Williamson wrote:
> > > > > > > > > > > > So, unfortunately, I stand by my original patch.
> > > > > > > > > > > 
> > > > > > > > > > > What about the one that put -1 in saved index for a hotplugged device?
> > > > > > > > > > 
> > > > > > > > > > There are still examples that don't work even without hotplug (example 2
> > > > > > > > > > and example 3 after the reboot).  That hack limits the damage, but still
> > > > > > > > > > leaves a latent bug for reboot and doesn't address the non-hotplug
> > > > > > > > > > scenarios.  So, I don't think it's worthwhile to pursue, and we
> > > > > > > > > > shouldn't pretend we can use it to avoid bumping the version_id.
> > > > > > > > > > Thanks,
> > > > > > > > > > 
> > > > > > > > > > Alex
> > > > > > > > > 
> > > > > > > > > I guess when we bump it we tell users: migration is completely
> > > > > > > > > borken to the old version, don't even try it.
> > > > > > > > > 
> > > > > > > > > Is there a way for libvirt to discover such incompatibilities
> > > > > > > > > and avoid the migration?
> > > > > > > > 
> > > > > > > > I don't know if libvirt has a way to query this in advance.  If a
> > > > > > > > migration is attempted, the target will report:
> > > > > > > > 
> > > > > > > > savevm: unsupported version 5 for '0000:00:03.0/rtl8139' v4
> > > > > > > > 
> > > > > > > > And the source will continue running.  We waste plenty of bits getting
> > > > > > > > to that point,
> > > > > > > 
> > > > > > > Yes, this happens after all of memory has been migrated.
> > > > > > 
> > > > > > Better late than never :^\
> > > > > 
> > > > > One other question: can we do the same by creating a new (empty)
> > > > > section? As was discussed in the past this is easier for
> > > > > downstreams to cherry-pick.
> > > > 
> > > > The only way I can think to do that would be to have a subsection that
> > > > is always included, but saves no data.  That would force a failure on
> > > > new->old migration, but I don't think it really matches the intended
> > > > purpose of subsections and feels like it's adding cruft for no gain.
> > > > Maybe I'm missing something.  Juan, is there any advantage to trapping
> > > > this in a subsection?  Thanks,
> > > > 
> > > > Alex
> > > 
> > > Maybe in this particular case the advantage is minimal.
> > > But it seems easier to stick to a rule of no more version
> > > bumps than argue about each case.
> > 
> > Do we have such a rule?
> 
> I think it's easier to have a hard rule than start arguing whether
> each change is a subsection or not for each patch.

We need to have these discussions to make sure we're doing everything we
can do preserve the ABI.  AIUI, subsections are only useful in helping
us with that if there's a worthwhile percentage of the time that the
subsection isn't needed in the migration stream.  Even then it's unclear
to me how high level tools are supposed to deal with this.  Do they
re-try the migration, hoping the subsection was only needed temporarily?
In this case, the subsection would be needed all the time, so I don't
think it's applicable and I don't see a dogmatic rule about using them
to have any value.

> What is the downside?

Some empty structures to do things in a completely convoluted way.
What's the upside?

> > If we have a subsection who's needed function
> > is return 1, I think that's a good indication that it's not appropriate
> > for a subsection and the end result is equivalent to bumping the main
> > driver vmstate version.
> 
> So if you feel strongly about it, go ahead, I'm just concerned
> we'll now need to argue version or subsection for each migration-related
> change.

Subsections don't magically preserve the migration ABI, so we still need
to have these discussions

> > It's convoluted to try to hide a one-way upgrade in a subsection.
> 
> It may be convoluted but I think it works better with downstreams.
> When we'll bump the version again in the future, it
> should be possible for a downstream to pick up
> just the next change, and skip this one.
> 
> But that's a general comment. In this case, I expect all
> downstream to be able to pick this bugfix with no real downsides.

If a subsection can maintain the migration ABI in a worthwhile
percentage of use cases and high level tools are signed up to understand
whether to retry a migration blocked by a subsection describing an
inflight DMA vs avoiding a basic incompatibility, sure, let's use them.
I don't think this problem fits or see how it makes life any easier for
downstreams.  Thanks,

Alex
Paolo Bonzini - Dec. 15, 2010, 5:12 p.m.
On 12/15/2010 11:00 AM, Michael S. Tsirkin wrote:
>> Indeed, subsections are for data that is rarely needed so that
>> there's some chance (sometimes ~100%) of migration working
>> seemlessly.
>
> If a subsection arrives that qemu does
> not know about, won't migratin fail?

Yes, that's why rarely needed => some high chance of migration working 
(though no certainty).

>> In this case it's either
>> no-bump-and-live-with-the-consequences, or changing the version id.
>
> This was discussed to death already.  version ids have the problem
> that they don't play nicely with downstreams.

Downstream version bumps don't play nicely with upstream, so downstream 
does have a reason for always-necessary subsections.  But upstream can 
bump the version id as much as they care.

Paolo
Michael S. Tsirkin - Dec. 15, 2010, 7:04 p.m.
On Wed, Dec 15, 2010 at 06:12:25PM +0100, Paolo Bonzini wrote:
> On 12/15/2010 11:00 AM, Michael S. Tsirkin wrote:
> >>Indeed, subsections are for data that is rarely needed so that
> >>there's some chance (sometimes ~100%) of migration working
> >>seemlessly.
> >
> >If a subsection arrives that qemu does
> >not know about, won't migratin fail?
> 
> Yes, that's why rarely needed => some high chance of migration
> working (though no certainty).
> 
> >>In this case it's either
> >>no-bump-and-live-with-the-consequences, or changing the version id.
> >
> >This was discussed to death already.  version ids have the problem
> >that they don't play nicely with downstreams.
> 
> Downstream version bumps don't play nicely with upstream, so
> downstream does have a reason for always-necessary subsections.  But
> upstream can bump the version id as much as they care.
> 
> Paolo

This assuming upstream developers do not care about downstreams.
To give a chance for downstream to cherry-pick changes, upstream
should use subsections instead of version ids too.
Paolo Bonzini - Dec. 16, 2010, 12:45 p.m.
On 12/15/2010 08:04 PM, Michael S. Tsirkin wrote:
> This assuming upstream developers do not care about downstreams.
> To give a chance for downstream to cherry-pick changes, upstream
> should use subsections instead of version ids too.

Then version ids should be deprecated altogether.  Nothing against it, 
but I never heard about this plan.

Paolo

Patch

diff --git a/hw/rtl8139.c b/hw/rtl8139.c
index d92981d..e999dd9 100644
--- a/hw/rtl8139.c
+++ b/hw/rtl8139.c
@@ -494,6 +494,13 @@  typedef struct RTL8139State {
     QEMUTimer *timer;
     int64_t TimerExpire;
 
+    /* Hack for migration compatibility: older
+     * versions of rtl8139 put mmio index in save image,
+     * and override the index that we have with that one.
+     * This does not make any sense but happens to
+     * work sometimes, if we are lucky and index matches.
+     * On load, we can simply ignore this value. */
+    int compat_rtl8139_mmio_io_addr;
 } RTL8139State;
 
 static void rtl8139_set_next_tctr_time(RTL8139State *s, int64_t current_time);
@@ -3182,6 +3189,7 @@  static void rtl8139_pre_save(void *opaque)
     rtl8139_set_next_tctr_time(s, current_time);
     s->TCTR = muldiv64(current_time - s->TCTR_base, PCI_FREQUENCY,
                        get_ticks_per_sec());
+    s->compat_rtl8139_mmio_io_addr = s->rtl8139_mmio_io_addr;
 }
 
 static const VMStateDescription vmstate_rtl8139 = {
@@ -3234,7 +3242,7 @@  static const VMStateDescription vmstate_rtl8139 = {
 
         VMSTATE_UNUSED(4),
         VMSTATE_MACADDR(conf.macaddr, RTL8139State),
-        VMSTATE_INT32(rtl8139_mmio_io_addr, RTL8139State),
+        VMSTATE_INT32(compat_rtl8139_mmio_io_addr, RTL8139State),
 
         VMSTATE_UINT32(currTxDesc, RTL8139State),
         VMSTATE_UINT32(currCPlusRxDesc, RTL8139State),