diff mbox

migrate: Migration aborts abruptly for machine "none"

Message ID 1485422212-31546-1-git-send-email-ashijeetacharya@gmail.com
State New
Headers show

Commit Message

Ashijeet Acharya Jan. 26, 2017, 9:16 a.m. UTC
Migration of a "none" machine with no RAM crashes abruptly as
bitmap_new() fails and thus aborts. Instead, place a check for
last_ram_offset() being '0' at the start of ram_save_setup() and
error out with a meaningful error message.

Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
---
 migration/ram.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Kashyap Chamarthy Jan. 26, 2017, 11:25 a.m. UTC | #1
On Thu, Jan 26, 2017 at 02:46:52PM +0530, Ashijeet Acharya wrote:
> Migration of a "none" machine with no RAM crashes abruptly as
> bitmap_new() fails and thus aborts. Instead, place a check for
> last_ram_offset() being '0' at the start of ram_save_setup() and
> error out with a meaningful error message.
> 
> Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
> ---
>  migration/ram.c | 5 +++++
>  1 file changed, 5 insertions(+)

You state the problem in the one-line Git commit summary message, it's
usually preferred to summarize the _fix_ that you're making :-).  If the
below variant sounds any better, maybe the maintainer can reword it upon
applying:

  migrate: Gracefully handle crash of a 'none' machine with no RAM

[...]
Dr. David Alan Gilbert Jan. 26, 2017, 7:41 p.m. UTC | #2
* Ashijeet Acharya (ashijeetacharya@gmail.com) wrote:
> Migration of a "none" machine with no RAM crashes abruptly as
> bitmap_new() fails and thus aborts. Instead, place a check for
> last_ram_offset() being '0' at the start of ram_save_setup() and
> error out with a meaningful error message.
> 
> Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
> ---
>  migration/ram.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index ef8fadf..bf05d69 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1947,6 +1947,11 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>  {
>      RAMBlock *block;
>  
> +    if (last_ram_offset() == 0) {
> +        error_report("Failed to migrate: No RAM available!");
> +        return -1;
> +    }
> +

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Dave

>      /* migration has already setup the bitmap, reuse it. */
>      if (!migration_in_colo_state()) {
>          if (ram_save_init_globals() < 0) {
> -- 
> 2.6.2
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Daniel P. Berrangé Jan. 27, 2017, 9:37 a.m. UTC | #3
On Thu, Jan 26, 2017 at 02:46:52PM +0530, Ashijeet Acharya wrote:
> Migration of a "none" machine with no RAM crashes abruptly as
> bitmap_new() fails and thus aborts. Instead, place a check for
> last_ram_offset() being '0' at the start of ram_save_setup() and
> error out with a meaningful error message.
> 
> Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
> ---
>  migration/ram.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index ef8fadf..bf05d69 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1947,6 +1947,11 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>  {
>      RAMBlock *block;
>  
> +    if (last_ram_offset() == 0) {
> +        error_report("Failed to migrate: No RAM available!");
> +        return -1;
> +    }
> +

If we're merely going to block migration, as opposed to making it work,
then IMHO we should use migration blockers registered at machine setup
for this task.

We have a new cli arg added to QEMU to tell it to abort startup if the
machine configuration is not migratable. That only works if using
migration blockers - this check you've added is too late to be detected
at startup.


Regards,
Daniel
Dr. David Alan Gilbert Jan. 27, 2017, 9:46 a.m. UTC | #4
* Daniel P. Berrange (berrange@redhat.com) wrote:
> On Thu, Jan 26, 2017 at 02:46:52PM +0530, Ashijeet Acharya wrote:
> > Migration of a "none" machine with no RAM crashes abruptly as
> > bitmap_new() fails and thus aborts. Instead, place a check for
> > last_ram_offset() being '0' at the start of ram_save_setup() and
> > error out with a meaningful error message.
> > 
> > Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
> > ---
> >  migration/ram.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/migration/ram.c b/migration/ram.c
> > index ef8fadf..bf05d69 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -1947,6 +1947,11 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> >  {
> >      RAMBlock *block;
> >  
> > +    if (last_ram_offset() == 0) {
> > +        error_report("Failed to migrate: No RAM available!");
> > +        return -1;
> > +    }
> > +
> 
> If we're merely going to block migration, as opposed to making it work,
> then IMHO we should use migration blockers registered at machine setup
> for this task.
> 
> We have a new cli arg added to QEMU to tell it to abort startup if the
> machine configuration is not migratable. That only works if using
> migration blockers - this check you've added is too late to be detected
> at startup.

Hmm, yes you're right, that would be better.
Although it does lead to an interesting situation where starting with a 'none'
and constructing it component at a time would get rejected by --only-migratable
even though the final construction is fine.

Dave

> 
> Regards,
> Daniel
> -- 
> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org              -o-             http://virt-manager.org :|
> |: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Ashijeet Acharya Jan. 27, 2017, 9:52 a.m. UTC | #5
Okay

On Friday, 27 January 2017, Daniel P. Berrange <berrange@redhat.com> wrote:

> On Thu, Jan 26, 2017 at 02:46:52PM +0530, Ashijeet Acharya wrote:
> > Migration of a "none" machine with no RAM crashes abruptly as
> > bitmap_new() fails and thus aborts. Instead, place a check for
> > last_ram_offset() being '0' at the start of ram_save_setup() and
> > error out with a meaningful error message.
> >
> > Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com
> <javascript:;>>
> > ---
> >  migration/ram.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/migration/ram.c b/migration/ram.c
> > index ef8fadf..bf05d69 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -1947,6 +1947,11 @@ static int ram_save_setup(QEMUFile *f, void
> *opaque)
> >  {
> >      RAMBlock *block;
> >
> > +    if (last_ram_offset() == 0) {
> > +        error_report("Failed to migrate: No RAM available!");
> > +        return -1;
> > +    }
> > +
>
> If we're merely going to block migration, as opposed to making it work,
> then IMHO we should use migration blockers registered at machine setup
> for this task.
>
> We have a new cli arg added to QEMU to tell it to abort startup if the
> machine configuration is not migratable. That only works if using


Are you referring to the new "--only-migratable" option I added along with
John last week?

migration blockers - this check you've added is too late to be detected
> at startup.


I think that the machine is not completely 'non-migratable' because if I
boot qemu with "-m 1G" (say) the migration actually completes successfully.

Ashijeet


>
> Regards,
> Daniel
> --
> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/
> :|
> |: http://libvirt.org              -o-             http://virt-manager.org
> :|
> |: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/
> :|
>
Daniel P. Berrangé Jan. 27, 2017, 9:57 a.m. UTC | #6
On Fri, Jan 27, 2017 at 09:46:13AM +0000, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrange (berrange@redhat.com) wrote:
> > On Thu, Jan 26, 2017 at 02:46:52PM +0530, Ashijeet Acharya wrote:
> > > Migration of a "none" machine with no RAM crashes abruptly as
> > > bitmap_new() fails and thus aborts. Instead, place a check for
> > > last_ram_offset() being '0' at the start of ram_save_setup() and
> > > error out with a meaningful error message.
> > > 
> > > Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
> > > ---
> > >  migration/ram.c | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > > 
> > > diff --git a/migration/ram.c b/migration/ram.c
> > > index ef8fadf..bf05d69 100644
> > > --- a/migration/ram.c
> > > +++ b/migration/ram.c
> > > @@ -1947,6 +1947,11 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> > >  {
> > >      RAMBlock *block;
> > >  
> > > +    if (last_ram_offset() == 0) {
> > > +        error_report("Failed to migrate: No RAM available!");
> > > +        return -1;
> > > +    }
> > > +
> > 
> > If we're merely going to block migration, as opposed to making it work,
> > then IMHO we should use migration blockers registered at machine setup
> > for this task.
> > 
> > We have a new cli arg added to QEMU to tell it to abort startup if the
> > machine configuration is not migratable. That only works if using
> > migration blockers - this check you've added is too late to be detected
> > at startup.
> 
> Hmm, yes you're right, that would be better.
> Although it does lead to an interesting situation where starting with a 'none'
> and constructing it component at a time would get rejected by --only-migratable
> even though the final construction is fine.

Even in that case I would expect the ram to be specified upfront e.g.

 $qemu -machine none -m 500

thus avoiding addition of the migration blocker

Regards,
Daniel
Daniel P. Berrangé Jan. 27, 2017, 9:57 a.m. UTC | #7
On Fri, Jan 27, 2017 at 03:22:38PM +0530, Ashijeet Acharya wrote:
> Okay
> 
> On Friday, 27 January 2017, Daniel P. Berrange <berrange@redhat.com> wrote:
> 
> > On Thu, Jan 26, 2017 at 02:46:52PM +0530, Ashijeet Acharya wrote:
> > > Migration of a "none" machine with no RAM crashes abruptly as
> > > bitmap_new() fails and thus aborts. Instead, place a check for
> > > last_ram_offset() being '0' at the start of ram_save_setup() and
> > > error out with a meaningful error message.
> > >
> > > Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com
> > <javascript:;>>
> > > ---
> > >  migration/ram.c | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > >
> > > diff --git a/migration/ram.c b/migration/ram.c
> > > index ef8fadf..bf05d69 100644
> > > --- a/migration/ram.c
> > > +++ b/migration/ram.c
> > > @@ -1947,6 +1947,11 @@ static int ram_save_setup(QEMUFile *f, void
> > *opaque)
> > >  {
> > >      RAMBlock *block;
> > >
> > > +    if (last_ram_offset() == 0) {
> > > +        error_report("Failed to migrate: No RAM available!");
> > > +        return -1;
> > > +    }
> > > +
> >
> > If we're merely going to block migration, as opposed to making it work,
> > then IMHO we should use migration blockers registered at machine setup
> > for this task.
> >
> > We have a new cli arg added to QEMU to tell it to abort startup if the
> > machine configuration is not migratable. That only works if using
> 
> 
> Are you referring to the new "--only-migratable" option I added along with
> John last week?
> 
> migration blockers - this check you've added is too late to be detected
> > at startup.
> 
> 
> I think that the machine is not completely 'non-migratable' because if I
> boot qemu with "-m 1G" (say) the migration actually completes successfully.

Of course - you would only add the migration blocker when ram size is 0


Regards,
Daniel
Greg Kurz Jan. 28, 2017, 6:41 p.m. UTC | #8
On Thu, 26 Jan 2017 14:46:52 +0530
Ashijeet Acharya <ashijeetacharya@gmail.com> wrote:

> Migration of a "none" machine with no RAM crashes abruptly as
> bitmap_new() fails and thus aborts. Instead, place a check for
> last_ram_offset() being '0' at the start of ram_save_setup() and
> error out with a meaningful error message.
> 
> Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
> ---

Maybe a naive question: why a "none" machine with zero RAM should fail to
migrate ?

>  migration/ram.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index ef8fadf..bf05d69 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1947,6 +1947,11 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>  {
>      RAMBlock *block;
>  
> +    if (last_ram_offset() == 0) {
> +        error_report("Failed to migrate: No RAM available!");
> +        return -1;
> +    }
> +
>      /* migration has already setup the bitmap, reuse it. */
>      if (!migration_in_colo_state()) {
>          if (ram_save_init_globals() < 0) {
Ashijeet Acharya Jan. 28, 2017, 7:36 p.m. UTC | #9
On Sun, Jan 29, 2017 at 12:11 AM, Greg Kurz <groug@kaod.org> wrote:
> On Thu, 26 Jan 2017 14:46:52 +0530
> Ashijeet Acharya <ashijeetacharya@gmail.com> wrote:
>
>> Migration of a "none" machine with no RAM crashes abruptly as
>> bitmap_new() fails and thus aborts. Instead, place a check for
>> last_ram_offset() being '0' at the start of ram_save_setup() and
>> error out with a meaningful error message.
>>
>> Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
>> ---
>

cc'ing Paolo in : I had an IRC chat with him and he has a very
interesting twist in the tale to add here.

> Maybe a naive question: why a "none" machine with zero RAM should fail to
> migrate ?

Assuming you are referring to why its failing ATM; it fails because
g_try_malloc0() inside bitmap_try_new() returns a NULL pointer for
zero bits and thus the check for NULL inside bitmap_new() becomes true
and it aborts. Check bitmap_new() for convenience.

Ignore the noise if you already knew this! :-)

Ashijeet
Greg Kurz Jan. 29, 2017, 1:03 p.m. UTC | #10
On Sun, 29 Jan 2017 01:06:47 +0530
Ashijeet Acharya <ashijeetacharya@gmail.com> wrote:

> On Sun, Jan 29, 2017 at 12:11 AM, Greg Kurz <groug@kaod.org> wrote:
> > On Thu, 26 Jan 2017 14:46:52 +0530
> > Ashijeet Acharya <ashijeetacharya@gmail.com> wrote:
> >  
> >> Migration of a "none" machine with no RAM crashes abruptly as
> >> bitmap_new() fails and thus aborts. Instead, place a check for
> >> last_ram_offset() being '0' at the start of ram_save_setup() and
> >> error out with a meaningful error message.
> >>
> >> Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
> >> ---  
> >  
> 
> cc'ing Paolo in : I had an IRC chat with him and he has a very
> interesting twist in the tale to add here.
> 
> > Maybe a naive question: why a "none" machine with zero RAM should fail to
> > migrate ?  
> 
> Assuming you are referring to why its failing ATM; it fails because

My question was more: why deciding to fail migration instead of fixing the
crash ? One would naively think that no RAM is *just* less state to
migrate... but maybe the current code assumes that a machine always has
RAM.

> g_try_malloc0() inside bitmap_try_new() returns a NULL pointer for
> zero bits and thus the check for NULL inside bitmap_new() becomes true
> and it aborts. Check bitmap_new() for convenience.
> 
> Ignore the noise if you already knew this! :-)
> 

I hadn't checked, thanks for the details.

Cheers.

--
Greg

> Ashijeet
Ashijeet Acharya Jan. 29, 2017, 1:19 p.m. UTC | #11
On Sun, Jan 29, 2017 at 6:33 PM, Greg Kurz <groug@kaod.org> wrote:
> On Sun, 29 Jan 2017 01:06:47 +0530
> Ashijeet Acharya <ashijeetacharya@gmail.com> wrote:
>
>> On Sun, Jan 29, 2017 at 12:11 AM, Greg Kurz <groug@kaod.org> wrote:
>> > On Thu, 26 Jan 2017 14:46:52 +0530
>> > Ashijeet Acharya <ashijeetacharya@gmail.com> wrote:
>> >
>> >> Migration of a "none" machine with no RAM crashes abruptly as
>> >> bitmap_new() fails and thus aborts. Instead, place a check for
>> >> last_ram_offset() being '0' at the start of ram_save_setup() and
>> >> error out with a meaningful error message.
>> >>
>> >> Signed-off-by: Ashijeet Acharya <ashijeetacharya@gmail.com>
>> >> ---
>> >
>>
>> cc'ing Paolo in : I had an IRC chat with him and he has a very
>> interesting twist in the tale to add here.
>>
>> > Maybe a naive question: why a "none" machine with zero RAM should fail to
>> > migrate ?
>>
>> Assuming you are referring to why its failing ATM; it fails because
>
> My question was more: why deciding to fail migration instead of fixing the
> crash ? One would naively think that no RAM is *just* less state to
> migrate... but maybe the current code assumes that a machine always has
> RAM.

Actually, I think, the current code does assume that the machine
always has RAM which is why it crashes here. I am not aware if we can
also boot machines other than 'none'  with zero RAM so I have not
tested that case.
Also, Paolo has a similar opinion to yours that we should fix it
instead of blocking it. He suggests to migrate only the device states
and skip all the RAM related stuff. Maybe he can explain it better
when he is available.

I have had mixed opinions on this one which is why I am waiting until
we reach a common ground.

Ashijeet
>
>> g_try_malloc0() inside bitmap_try_new() returns a NULL pointer for
>> zero bits and thus the check for NULL inside bitmap_new() becomes true
>> and it aborts. Check bitmap_new() for convenience.
>>
>> Ignore the noise if you already knew this! :-)
>>
>
> I hadn't checked, thanks for the details.
>
> Cheers.
>
> --
> Greg
>
>> Ashijeet
>
Paolo Bonzini Jan. 30, 2017, 8:48 p.m. UTC | #12
On 29/01/2017 08:19, Ashijeet Acharya wrote:
> Also, Paolo has a similar opinion to yours that we should fix it
> instead of blocking it. He suggests to migrate only the device states
> and skip all the RAM related stuff. Maybe he can explain it better
> when he is available.

Yes, there is even support already in ram_migration_cleanup for not
allocating the bitmap:

    struct BitmapRcu *bitmap = migration_bitmap_rcu;
    atomic_rcu_set(&migration_bitmap_rcu, NULL);
    if (bitmap) {
        memory_global_dirty_log_stop();
        call_rcu(bitmap, migration_bitmap_free, rcu);
    }

I think you should just iterate until you find no more misbehaving code.
 The allocation is one, but ram_find_and_save_block should also return 0
if QLIST_FIRST_RCU(&ram_list.blocks) is NULL, for example.

Paolo
Ashijeet Acharya Feb. 1, 2017, 5:53 p.m. UTC | #13
On Tue, Jan 31, 2017 at 2:18 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 29/01/2017 08:19, Ashijeet Acharya wrote:
>> Also, Paolo has a similar opinion to yours that we should fix it
>> instead of blocking it. He suggests to migrate only the device states
>> and skip all the RAM related stuff. Maybe he can explain it better
>> when he is available.
>
> Yes, there is even support already in ram_migration_cleanup for not
> allocating the bitmap:
>
>     struct BitmapRcu *bitmap = migration_bitmap_rcu;
>     atomic_rcu_set(&migration_bitmap_rcu, NULL);
>     if (bitmap) {
>         memory_global_dirty_log_stop();
>         call_rcu(bitmap, migration_bitmap_free, rcu);
>     }
>
> I think you should just iterate until you find no more misbehaving code.
>  The allocation is one, but ram_find_and_save_block should also return 0
> if QLIST_FIRST_RCU(&ram_list.blocks) is NULL, for example.
>

Okay, I will drop the idea of adding a migration blocker in this case
and proceed as you say.

Ashijeet
> Paolo
diff mbox

Patch

diff --git a/migration/ram.c b/migration/ram.c
index ef8fadf..bf05d69 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1947,6 +1947,11 @@  static int ram_save_setup(QEMUFile *f, void *opaque)
 {
     RAMBlock *block;
 
+    if (last_ram_offset() == 0) {
+        error_report("Failed to migrate: No RAM available!");
+        return -1;
+    }
+
     /* migration has already setup the bitmap, reuse it. */
     if (!migration_in_colo_state()) {
         if (ram_save_init_globals() < 0) {