
[PULL,16/16] migration: fix crash when incoming client channel setup fails

Message ID 20180627125604.15275-17-quintela@redhat.com
State New
Series [PULL,01/16] migration: Create multipage support

Commit Message

Juan Quintela June 27, 2018, 12:56 p.m. UTC
From: Daniel P. Berrangé <berrange@redhat.com>

The way we determine if we can start the incoming migration was
changed to use migration_has_all_channels() in:

  commit 428d89084c709e568f9cd301c2f6416a54c53d6d
  Author: Juan Quintela <quintela@redhat.com>
  Date:   Mon Jul 24 13:06:25 2017 +0200

    migration: Create migration_has_all_channels

This method in turn calls multifd_recv_all_channels_created()
which is hardcoded to always return 'true' when multifd is
not in use. This is a latent bug...

...activated in a following commit where that return result
ends up acting as the flag to indicate whether it is possible
to start processing the migration:

  commit 36c2f8be2c4eb0003ac77a14910842b7ddd7337e
  Author: Juan Quintela <quintela@redhat.com>
  Date:   Wed Mar 7 08:40:52 2018 +0100

    migration: Delay start of migration main routines

This means that if channel initialization fails with normal
migration, the failure is never noticed and QEMU attempts to start
the incoming migration regardless, crashing on a NULL pointer.
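
For reference, the early return described above looks roughly like
this (a sketch of the function at the time, not the verbatim QEMU
source):

  bool multifd_recv_all_channels_created(void)
  {
      /* Without multifd there is only the main channel, so report "ready" */
      if (!migrate_use_multifd()) {
          return true;
      }

      /* Otherwise wait until every multifd channel has been created */
      return migrate_multifd_channels() ==
             atomic_read(&multifd_recv_state->count);
  }

Because the non-multifd path unconditionally returns true,
migration_has_all_channels() reports that all channels are ready even
when the only channel failed to be set up.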

This can be seen, for example, if a client connects to a server
requiring TLS, but has an invalid x509 certificate:

qemu-system-x86_64: The certificate hasn't got a known issuer
qemu-system-x86_64: migration/migration.c:386: process_incoming_migration_co: Assertion `mis->from_src_file' failed.

 #0  0x00007fffebd24f2b in raise () at /lib64/libc.so.6
 #1  0x00007fffebd0f561 in abort () at /lib64/libc.so.6
 #2  0x00007fffebd0f431 in _nl_load_domain.cold.0 () at /lib64/libc.so.6
 #3  0x00007fffebd1d692 in  () at /lib64/libc.so.6
 #4  0x0000555555ad027e in process_incoming_migration_co (opaque=<optimized out>) at migration/migration.c:386
 #5  0x0000555555c45e8b in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:116
 #6  0x00007fffebd3a6a0 in __start_context () at /lib64/libc.so.6
 #7  0x0000000000000000 in  ()

To handle the non-multifd case, we check whether mis->from_src_file
is non-NULL. With this in place, the migration server drops the
rejected client and stays around waiting for another, hopefully
valid, client to arrive.

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20180619163552.18206-1-berrange@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 migration/migration.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Balamuruhan S June 28, 2018, 9:55 a.m. UTC | #1
On Wed, Jun 27, 2018 at 02:56:04PM +0200, Juan Quintela wrote:
> From: Daniel P. Berrangé <berrange@redhat.com>
> 
> The way we determine if we can start the incoming migration was
> changed to use migration_has_all_channels() in:
> 
>   commit 428d89084c709e568f9cd301c2f6416a54c53d6d
>   Author: Juan Quintela <quintela@redhat.com>
>   Date:   Mon Jul 24 13:06:25 2017 +0200
> 
>     migration: Create migration_has_all_channels
> 
> This method in turn calls multifd_recv_all_channels_created()
> which is hardcoded to always return 'true' when multifd is
> not in use. This is a latent bug...
> 
> ...activated in a following commit where that return result
> ends up acting as the flag to indicate whether it is possible
> to start processing the migration:
> 
>   commit 36c2f8be2c4eb0003ac77a14910842b7ddd7337e
>   Author: Juan Quintela <quintela@redhat.com>
>   Date:   Wed Mar 7 08:40:52 2018 +0100
> 
>     migration: Delay start of migration main routines
> 
> This means that if channel initialization fails with normal
> migration, the failure is never noticed and QEMU attempts to start
> the incoming migration regardless, crashing on a NULL pointer.
> 
> This can be seen, for example, if a client connects to a server
> requiring TLS, but has an invalid x509 certificate:
> 
> qemu-system-x86_64: The certificate hasn't got a known issuer
> qemu-system-x86_64: migration/migration.c:386: process_incoming_migration_co: Assertion `mis->from_src_file' failed.
> 
>  #0  0x00007fffebd24f2b in raise () at /lib64/libc.so.6
>  #1  0x00007fffebd0f561 in abort () at /lib64/libc.so.6
>  #2  0x00007fffebd0f431 in _nl_load_domain.cold.0 () at /lib64/libc.so.6
>  #3  0x00007fffebd1d692 in  () at /lib64/libc.so.6
>  #4  0x0000555555ad027e in process_incoming_migration_co (opaque=<optimized out>) at migration/migration.c:386
>  #5  0x0000555555c45e8b in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:116
>  #6  0x00007fffebd3a6a0 in __start_context () at /lib64/libc.so.6
>  #7  0x0000000000000000 in  ()
> 
> To handle the non-multifd case, we check whether mis->from_src_file
> is non-NULL. With this in place, the migration server drops the
> rejected client and stays around waiting for another, hopefully
> valid, client to arrive.

Hi Juan,

I tried to perform a multifd-enabled migration and, from the qemu monitor,
enabled the multifd capability on source and target:
(qemu) migrate_set_capability x-multifd on
(qemu) migrate -d tcp:127.0.0.1:4444

The migration succeeds and it's cool to have the feature :)
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: on dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off
Migration status: completed
total time: 1051 milliseconds
downtime: 260 milliseconds
setup: 17 milliseconds
transferred ram: 8270 kbytes
throughput: 143.91 mbps
remaining ram: 0 kbytes
total ram: 4194560 kbytes
duplicate: 940989 pages
skipped: 0 pages
normal: 109635 pages
normal bytes: 438540 kbytes
dirty sync count: 3
page size: 4 kbytes


But when I enable multifd only on the source but not on the target:

source:
x-multifd: on

target:
x-multifd: off

when migration is triggered with,
migrate -d tcp:127.0.0.1:4444 (port I used)

The VM on the source is lost with a segmentation fault.

I think the correct way is to enable multifd on both source and target,
similar to postcopy, but in this negative scenario we should handle it
properly so as not to lose the VM, and instead error out
appropriately.

Please correct me if I missed something.


-- Bala
> 
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> Message-Id: <20180619163552.18206-1-berrange@redhat.com>
> Reviewed-by: Juan Quintela <quintela@redhat.com>
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Signed-off-by: Juan Quintela <quintela@redhat.com>
> ---
>  migration/migration.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index d075c27886..94d71f8b24 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -518,11 +518,12 @@ void migration_ioc_process_incoming(QIOChannel *ioc)
>   */
>  bool migration_has_all_channels(void)
>  {
> +    MigrationIncomingState *mis = migration_incoming_get_current();
>      bool all_channels;
>  
>      all_channels = multifd_recv_all_channels_created();
>  
> -    return all_channels;
> +    return all_channels && mis->from_src_file != NULL;
>  }
>  
>  /*
> -- 
> 2.17.1
> 
>
Juan Quintela June 28, 2018, 11:06 a.m. UTC | #2
Balamuruhan S <bala24@linux.vnet.ibm.com> wrote:
> On Wed, Jun 27, 2018 at 02:56:04PM +0200, Juan Quintela wrote:
>> From: Daniel P. Berrangé <berrange@redhat.com>

....

> Hi Juan,
>
> I tried to perform a multifd-enabled migration and, from the qemu monitor,
> enabled the multifd capability on source and target:
> (qemu) migrate_set_capability x-multifd on
> (qemu) migrate -d tcp:127.0.0.1:4444
>
> The migration succeeds and it's cool to have the feature :)

Thanks.

> (qemu) info migrate
> globals:
> store-global-state: on
> only-migratable: off
> send-configuration: on
> send-section-footer: on
> decompress-error-check: on
> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
> zero-blocks: off compress: off events: off postcopy-ram: off x-colo:
> off release-ram: off block: off return-path: off
> pause-before-switchover: off x-multifd: on dirty-bitmaps: off
> postcopy-blocktime: off late-block-activate: off
> Migration status: completed
> total time: 1051 milliseconds
> downtime: 260 milliseconds
> setup: 17 milliseconds
> transferred ram: 8270 kbytes

What is your setup?  This value looks really small.  I can see that you
have 4GB of RAM, so it should be a bit higher.  And the setup time is also
quite low in my experience.

> throughput: 143.91 mbps

I don't know what networking you are using, but my experience is that
increasing packet_count to 64 or so helps a lot to increase bandwidth.

What is your networking, page_count and number of channels?

> remaining ram: 0 kbytes
> total ram: 4194560 kbytes
> duplicate: 940989 pages
> skipped: 0 pages
> normal: 109635 pages
> normal bytes: 438540 kbytes
> dirty sync count: 3
> page size: 4 kbytes
>
>
> But when I enable multifd only on the source but not on the target:
>
> source:
> x-multifd: on
>
> target:
> x-multifd: off
>
> when migration is triggered with,
> migrate -d tcp:127.0.0.1:4444 (port I used)
>
> The VM on the source is lost with a segmentation fault.
>
> I think the correct way is to enable multifd on both source and target,
> similar to postcopy, but in this negative scenario we should handle it
> properly so as not to lose the VM, and instead error out
> appropriately.

It is necessary to enable it on both sides.  And it "used" to be
detected correctly when multifd was not enabled on one of the sides.
The check must have been lost in some rebase, or some other change.

Will take a look.

> Please correct me if I missed something.

Sure, thanks for the report.

Later, Juan.
Balamuruhan S June 29, 2018, 9:11 a.m. UTC | #3
On Thu, Jun 28, 2018 at 01:06:25PM +0200, Juan Quintela wrote:
> Balamuruhan S <bala24@linux.vnet.ibm.com> wrote:
> > On Wed, Jun 27, 2018 at 02:56:04PM +0200, Juan Quintela wrote:
> >> From: Daniel P. Berrangé <berrange@redhat.com>
> 
> ....
> 
> > Hi Juan,
> >
> > I tried to perform a multifd-enabled migration and, from the qemu monitor,
> > enabled the multifd capability on source and target:
> > (qemu) migrate_set_capability x-multifd on
> > (qemu) migrate -d tcp:127.0.0.1:4444
> >
> > The migration succeeds and it's cool to have the feature :)
> 
> Thanks.
> 
> > (qemu) info migrate
> > globals:
> > store-global-state: on
> > only-migratable: off
> > send-configuration: on
> > send-section-footer: on
> > decompress-error-check: on
> > capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
> > zero-blocks: off compress: off events: off postcopy-ram: off x-colo:
> > off release-ram: off block: off return-path: off
> > pause-before-switchover: off x-multifd: on dirty-bitmaps: off
> > postcopy-blocktime: off late-block-activate: off
> > Migration status: completed
> > total time: 1051 milliseconds
> > downtime: 260 milliseconds
> > setup: 17 milliseconds
> > transferred ram: 8270 kbytes
> 
> What is your setup?  This value looks really small.  I can see that you

I have applied this patchset to upstream qemu to test multifd migration.

The qemu command line is as below:

/home/bala/qemu/ppc64-softmmu/qemu-system-ppc64 --enable-kvm --nographic \
-vga none -machine pseries -m 4G,slots=32,maxmem=32G -smp 16,maxcpus=32 \
-device virtio-blk-pci,drive=rootdisk -drive file=/home/bala/hostos-ppc64le.qcow2,\
if=none,cache=none,format=qcow2,id=rootdisk -monitor telnet:127.0.0.1:1234,\
server,nowait -net nic,model=virtio -net user -redir tcp:2000::22

> have 4GB of RAM, so it should be a bit higher.  And the setup time is also
> quite low in my experience.

Sure, I will try with 32G of memory. I am not sure what the setup time value should be.

> 
> > throughput: 143.91 mbps
> 
> I don't know what networking you are using, but my experience is that
> increasing packet_count to 64 or so helps a lot to increase bandwidth.

How do I configure packet_count to 64?

> 
> What is your networking, page_count and number of channels?

I tried localhost migration, but I still need to try multi-host migration.
The page_count and number of channels are the default values:

x-multifd-channels: 2
x-multifd-page-count: 16
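
I guess these could be raised from the monitor with something like the
following (assuming migrate_set_parameter accepts the x- prefixed names
shown above), but please correct me if that is not the parameter you meant:

(qemu) migrate_set_parameter x-multifd-channels 4
(qemu) migrate_set_parameter x-multifd-page-count 64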

> 
> > remaining ram: 0 kbytes
> > total ram: 4194560 kbytes
> > duplicate: 940989 pages
> > skipped: 0 pages
> > normal: 109635 pages
> > normal bytes: 438540 kbytes
> > dirty sync count: 3
> > page size: 4 kbytes
> >
> >
> > But when I enable multifd only on the source but not on the target:
> >
> > source:
> > x-multifd: on
> >
> > target:
> > x-multifd: off
> >
> > when migration is triggered with,
> > migrate -d tcp:127.0.0.1:4444 (port I used)
> >
> > The VM on the source is lost with a segmentation fault.
> >
> > I think the correct way is to enable multifd on both source and target,
> > similar to postcopy, but in this negative scenario we should handle it
> > properly so as not to lose the VM, and instead error out
> > appropriately.
> 
> It is necessary to enable it on both sides.  And it "used" to be
> detected correctly when multifd was not enabled on one of the sides.
> The check must have been lost in some rebase, or some other change.
> 
> Will take a look.

Thank you.

-- Bala

> 
> > Please correct me if I missed something.
> 
> Sure, thanks for the report.
> 
> Later, Juan.
>

Patch

diff --git a/migration/migration.c b/migration/migration.c
index d075c27886..94d71f8b24 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -518,11 +518,12 @@  void migration_ioc_process_incoming(QIOChannel *ioc)
  */
 bool migration_has_all_channels(void)
 {
+    MigrationIncomingState *mis = migration_incoming_get_current();
     bool all_channels;
 
     all_channels = multifd_recv_all_channels_created();
 
-    return all_channels;
+    return all_channels && mis->from_src_file != NULL;
 }
 
 /*