Message ID | 20180430185943.35714-1-dgilbert@redhat.com |
---|---|
State | New |
Headers | show |
Series | Migration+TLS: Fix crash due to double cleanup | expand |
On Mon, Apr 30, 2018 at 07:59:43PM +0100, Dr. David Alan Gilbert (git) wrote: > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com> > > During a TLS connect we see: > migration_channel_connect calls > migration_tls_channel_connect > (calls after TLS setup) > migration_channel_connect > > My previous error handling fix made migration_channel_connect > call migrate_fd_connect in all cases; unfortunately the above > means it gets called twice and crashes doing double cleanup. > > Fixes: 688a3dcba98 This fixes the crash, but we're still seeing error messages duplicated (qemu) migrate_set_parameter tls-creds tls0 (qemu) migrate tcp:localhost:9000 qemu-system-x86_64: Certificate does not match the hostname localhost qemu-system-x86_64: Certificate does not match the hostname localhost git bisect points to 688a3dcba98 as the cause of these double errors still. > > Reported-by: Peter Krempa <pkrempa@redhat.com> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> > --- > migration/channel.c | 9 +++++++++ > 1 file changed, 9 insertions(+) > > diff --git a/migration/channel.c b/migration/channel.c > index c5eaf0fa0e..7a32b5aca4 100644 > --- a/migration/channel.c > +++ b/migration/channel.c > @@ -71,6 +71,15 @@ void migration_channel_connect(MigrationState *s, > !object_dynamic_cast(OBJECT(ioc), > TYPE_QIO_CHANNEL_TLS)) { > migration_tls_channel_connect(s, ioc, hostname, &error); > + > + if (!error) { > + /* tls_channel_connect will call back to this > + * function after the TLS handshake, > + * so we mustn't call migrate_fd_connect until then > + */ > + > + return; > + } > } else { > QEMUFile *f = qemu_fopen_channel_output(ioc); > > -- > 2.17.0 > Regards, Daniel
On Tue, May 01, 2018 at 11:00:35AM +0100, Daniel P. Berrangé wrote: > On Mon, Apr 30, 2018 at 07:59:43PM +0100, Dr. David Alan Gilbert (git) wrote: > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com> > > > > During a TLS connect we see: > > migration_channel_connect calls > > migration_tls_channel_connect > > (calls after TLS setup) > > migration_channel_connect > > > > My previous error handling fix made migration_channel_connect > > call migrate_fd_connect in all cases; unfortunately the above > > means it gets called twice and crashes doing double cleanup. > > > > Fixes: 688a3dcba98 > > This fixes the crash, but we're still seeing error messages duplicated > > (qemu) migrate_set_parameter tls-creds tls0 > (qemu) migrate tcp:localhost:9000 > qemu-system-x86_64: Certificate does not match the hostname localhost > qemu-system-x86_64: Certificate does not match the hostname localhost > > git bisect points to 688a3dcba98 as the cause of these double > errors still. FYI the stack traces look like this.. The first error message is printed in this context: #0 0x0000555555bb6c20 in error_report (fmt=0x555555d0d77e "%s") at util/qemu-error.c:273 #1 0x0000555555bb5fa5 in error_report_err (err=0x55555749cb00) at util/error.c:228 #2 0x0000555555a6d1f8 in migrate_fd_cleanup (opaque=opaque@entry=0x5555569925a0) at migration/migration.c:1106 #3 0x0000555555a6e2aa in migrate_fd_connect (s=s@entry=0x5555569925a0, error_in=0x55555768de80) at migration/migration.c:2387 #4 0x0000555555a6f8e7 in migration_channel_connect (s=s@entry=0x5555569925a0, ioc=ioc@entry=0x555557e1de10, hostname=hostname@entry=0x0, error=<optimized out>) at migration/channel.c:83 #5 0x0000555555a6f2d6 in migration_tls_outgoing_handshake (task=<optimized out>, opaque=0x5555569925a0) at migration/tls.c:124 #6 0x0000555555b6d2b2 in qio_task_complete (task=task@entry=0x5555576ae620) at io/task.c:142 #7 0x0000555555b68c74 in qio_channel_tls_handshake_task (ioc=ioc@entry=0x555557e1de10, task=task@entry=0x5555576ae620) at io/channel-tls.c:171 #8 0x0000555555b6975a in qio_channel_tls_handshake (ioc=ioc@entry=0x555557e1de10, func=func@entry=0x555555a6f250 <migration_tls_outgoing_handshake>, opaque=opaque@entry=0x5555569925a0, destroy=destroy@entry=0x0) at io/channel-tls.c:215 #9 0x0000555555a6f6ac in migration_tls_channel_connect (s=s@entry=0x5555569925a0, ioc=ioc@entry=0x555556a01000, hostname=hostname@entry=0x555556bc0c30 "localhost", errp=errp@entry=0x7fffffffd778) at migration/tls.c:159 #10 0x0000555555a6f967 in migration_channel_connect (s=0x5555569925a0, ioc=ioc@entry=0x555556a01000, hostname=0x555556bc0c30 "localhost", error=<optimized out>) at migration/channel.c:73 #11 0x0000555555a6e5f4 in socket_outgoing_migration (task=<optimized out>, opaque=0x555556cc7de0) at migration/socket.c:85 #12 0x0000555555b6d2b2 in qio_task_complete (task=task@entry=0x5555576ce3b0) at io/task.c:142 #13 0x0000555555b6d3a2 in gio_task_thread_result (opaque=0x5555578eff50) at io/task.c:88 #14 0x00007ffff5da1577 in g_idle_dispatch () at /lib64/libglib-2.0.so.0 #15 0x00007ffff5da4b77 in g_main_context_dispatch () at /lib64/libglib-2.0.so.0 #16 0x0000555555baf677 in glib_pollfds_poll () at util/main-loop.c:214 #17 0x0000555555baf677 in os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:261 #18 0x0000555555baf677 in main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:515 #19 0x00005555557b9157 in main_loop () at vl.c:1935 #20 0x00005555557b9157 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4767 The second error message is printed in this context: #0 0x0000555555bb6c20 in error_report (fmt=fmt@entry=0x555555d0d77e "%s") at util/qemu-error.c:273 #1 0x0000555555904035 in hmp_migrate_status_cb (opaque=0x555556bc0ba0) at hmp.c:1909 #2 0x0000555555baefac in timerlist_run_timers (timer_list=0x555556a0d760) at util/qemu-timer.c:536 #3 0x0000555555baf1b7 in qemu_clock_run_timers (type=QEMU_CLOCK_REALTIME) at util/qemu-timer.c:547 #4 0x0000555555baf1b7 in qemu_clock_run_all_timers () at util/qemu-timer.c:662 #5 0x0000555555baf69a in main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:521 #6 0x00005555557b9157 in main_loop () at vl.c:1935 #7 0x00005555557b9157 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4767 The second stack trace is the error reporting context that I added originally in commit d59ce6f34434bf47a9b26138c908650bf9a24be1 Author: Daniel P. Berrange <berrange@redhat.com> Date: Wed Apr 27 11:05:00 2016 +0100 migration: add reporting of errors for outgoing migration So the first stack trace is the new duplicate. Which error reporting context is "better" though, I don't know ? My patch was based on the view that, although alot of code uses error_report, long term all migration would eventually need to be able to filter an 'Error *errp' back up the stack, so that we can pass it back to QMP / HMP via 'info migrate' / query-migrate. So I decided to leave the error_report_err call to the hmp.c code, as long term that's the only place that would need to print to the console. Regards, Daniel
* Daniel P. Berrangé (berrange@redhat.com) wrote: > On Tue, May 01, 2018 at 11:00:35AM +0100, Daniel P. Berrangé wrote: > > On Mon, Apr 30, 2018 at 07:59:43PM +0100, Dr. David Alan Gilbert (git) wrote: > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com> > > > > > > During a TLS connect we see: > > > migration_channel_connect calls > > > migration_tls_channel_connect > > > (calls after TLS setup) > > > migration_channel_connect > > > > > > My previous error handling fix made migration_channel_connect > > > call migrate_fd_connect in all cases; unfortunately the above > > > means it gets called twice and crashes doing double cleanup. > > > > > > Fixes: 688a3dcba98 > > > > This fixes the crash, but we're still seeing error messages duplicated > > > > (qemu) migrate_set_parameter tls-creds tls0 > > (qemu) migrate tcp:localhost:9000 > > qemu-system-x86_64: Certificate does not match the hostname localhost > > qemu-system-x86_64: Certificate does not match the hostname localhost > > > > git bisect points to 688a3dcba98 as the cause of these double > > errors still. Can you give me the full sequence to trigger that; I did try yesterday and have failed to get that error out of the TLS code. > FYI the stack traces look like this.. > > The first error message is printed in this context: > > #0 0x0000555555bb6c20 in error_report (fmt=0x555555d0d77e "%s") at util/qemu-error.c:273 > #1 0x0000555555bb5fa5 in error_report_err (err=0x55555749cb00) at util/error.c:228 > #2 0x0000555555a6d1f8 in migrate_fd_cleanup (opaque=opaque@entry=0x5555569925a0) at migration/migration.c:1106 > #3 0x0000555555a6e2aa in migrate_fd_connect (s=s@entry=0x5555569925a0, error_in=0x55555768de80) at migration/migration.c:2387 > #4 0x0000555555a6f8e7 in migration_channel_connect (s=s@entry=0x5555569925a0, ioc=ioc@entry=0x555557e1de10, hostname=hostname@entry=0x0, error=<optimized out>) at migration/channel.c:83 > #5 0x0000555555a6f2d6 in migration_tls_outgoing_handshake (task=<optimized out>, opaque=0x5555569925a0) at migration/tls.c:124 > #6 0x0000555555b6d2b2 in qio_task_complete (task=task@entry=0x5555576ae620) at io/task.c:142 > #7 0x0000555555b68c74 in qio_channel_tls_handshake_task (ioc=ioc@entry=0x555557e1de10, task=task@entry=0x5555576ae620) at io/channel-tls.c:171 > #8 0x0000555555b6975a in qio_channel_tls_handshake (ioc=ioc@entry=0x555557e1de10, func=func@entry=0x555555a6f250 <migration_tls_outgoing_handshake>, opaque=opaque@entry=0x5555569925a0, destroy=destroy@entry=0x0) at io/channel-tls.c:215 > #9 0x0000555555a6f6ac in migration_tls_channel_connect (s=s@entry=0x5555569925a0, ioc=ioc@entry=0x555556a01000, hostname=hostname@entry=0x555556bc0c30 "localhost", errp=errp@entry=0x7fffffffd778) at migration/tls.c:159 > #10 0x0000555555a6f967 in migration_channel_connect (s=0x5555569925a0, ioc=ioc@entry=0x555556a01000, hostname=0x555556bc0c30 "localhost", error=<optimized out>) at migration/channel.c:73 > #11 0x0000555555a6e5f4 in socket_outgoing_migration (task=<optimized out>, opaque=0x555556cc7de0) at migration/socket.c:85 > #12 0x0000555555b6d2b2 in qio_task_complete (task=task@entry=0x5555576ce3b0) at io/task.c:142 > #13 0x0000555555b6d3a2 in gio_task_thread_result (opaque=0x5555578eff50) at io/task.c:88 > #14 0x00007ffff5da1577 in g_idle_dispatch () at /lib64/libglib-2.0.so.0 > #15 0x00007ffff5da4b77 in g_main_context_dispatch () at /lib64/libglib-2.0.so.0 > #16 0x0000555555baf677 in glib_pollfds_poll () at util/main-loop.c:214 > #17 0x0000555555baf677 in os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:261 > #18 0x0000555555baf677 in main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:515 > #19 0x00005555557b9157 in main_loop () at vl.c:1935 > #20 0x00005555557b9157 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4767 Right, that's certainly the one I'm expecting; my rule here is that any connect path that might call migrate_fd_connect, *always* calls migrate_fd_connect and in the error case gives it the error to report. My original reason to do this was to make sure we clean up any 'cancelling's that are in flight - cancel+a broken connect could lead to it getting stuck in cancelling. > The second error message is printed in this context: > > #0 0x0000555555bb6c20 in error_report (fmt=fmt@entry=0x555555d0d77e "%s") at util/qemu-error.c:273 > #1 0x0000555555904035 in hmp_migrate_status_cb (opaque=0x555556bc0ba0) at hmp.c:1909 > #2 0x0000555555baefac in timerlist_run_timers (timer_list=0x555556a0d760) at util/qemu-timer.c:536 > #3 0x0000555555baf1b7 in qemu_clock_run_timers (type=QEMU_CLOCK_REALTIME) at util/qemu-timer.c:547 > #4 0x0000555555baf1b7 in qemu_clock_run_all_timers () at util/qemu-timer.c:662 > #5 0x0000555555baf69a in main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:521 > #6 0x00005555557b9157 in main_loop () at vl.c:1935 > #7 0x00005555557b9157 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4767 > > > The second stack trace is the error reporting context that I added originally > in > > commit d59ce6f34434bf47a9b26138c908650bf9a24be1 > Author: Daniel P. Berrange <berrange@redhat.com> > Date: Wed Apr 27 11:05:00 2016 +0100 > > migration: add reporting of errors for outgoing migration > > > So the first stack trace is the new duplicate. > > Which error reporting context is "better" though, I don't know ? OK, I see why I didn't see this - I almost always use migrate -d and that timer only happens without the -d. > My patch was based on the view that, although alot of code uses error_report, > long term all migration would eventually need to be able to filter an > 'Error *errp' back up the stack, so that we can pass it back to QMP / HMP via > 'info migrate' / query-migrate. So I decided to leave the error_report_err > call to the hmp.c code, as long term that's the only place that would need > to print to the console. I wanted it to land in the /var/log/libvirt/qemu/whatever so we could see what happened. Dave > > Regards, > Daniel > -- > |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o- https://fstop138.berrange.com :| > |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :| -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Tue, May 01, 2018 at 11:57:25AM +0100, Dr. David Alan Gilbert wrote: > * Daniel P. Berrangé (berrange@redhat.com) wrote: > > On Tue, May 01, 2018 at 11:00:35AM +0100, Daniel P. Berrangé wrote: > > > On Mon, Apr 30, 2018 at 07:59:43PM +0100, Dr. David Alan Gilbert (git) wrote: > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com> > > > > > > > > During a TLS connect we see: > > > > migration_channel_connect calls > > > > migration_tls_channel_connect > > > > (calls after TLS setup) > > > > migration_channel_connect > > > > > > > > My previous error handling fix made migration_channel_connect > > > > call migrate_fd_connect in all cases; unfortunately the above > > > > means it gets called twice and crashes doing double cleanup. > > > > > > > > Fixes: 688a3dcba98 > > > > > > This fixes the crash, but we're still seeing error messages duplicated > > > > > > (qemu) migrate_set_parameter tls-creds tls0 > > > (qemu) migrate tcp:localhost:9000 > > > qemu-system-x86_64: Certificate does not match the hostname localhost > > > qemu-system-x86_64: Certificate does not match the hostname localhost > > > > > > git bisect points to 688a3dcba98 as the cause of these double > > > errors still. > > Can you give me the full sequence to trigger that; I did try yesterday > and have failed to get that error out of the TLS code. Essentially I follow this: https://qemu.weilnetz.de/doc/qemu-doc.html#network_005ftls when generating the server-cert.pem file I use this input template: $ cat server.info organization = My Org cn = 127.0.0.1 # Commented out to force failure with 'localhost' name #dns_name = "localhost" dns_name = "localhost.localdomain" ip_address = 127.0.0.1 ip_address = ::1 tls_www_server encryption_key signing_key $ certtool --generate-certificate \ --load-ca-certificate ca-cert.pem \ --load-ca-privkey ca-key.pem \ --load-privkey server-key.pem \ --template server.info \ --outfile server-cert.pem On the server (mig target) side I run: $ ./x86_64-softmmu/qemu-system-x86_64 \ -object tls-creds-x509,id=tls0,endpoint=server,dir=/home/berrange/security/qemutls,verify-peer=no \ -incoming defer -monitor stdio -display none QEMU 2.11.50 monitor - type 'help' for more information (qemu) migrate_set_parameter tls-creds tls0 (qemu) migrate_incoming tcp:localhost:9000 Notice I've set verify-peer=no on the server, since I've not bothered to issue any client-cert.pem on my dev setup here. On the client (mig source) side I run: $ ./x86_64-softmmu/qemu-system-x86_64 \ -object tls-creds-x509,id=tls0,endpoint=client,dir=/home/berrange/security/qemutls\ -monitor stdio -display none QEMU 2.11.50 monitor - type 'help' for more information (qemu) migrate_set_parameter tls-creds tls0 (qemu) migrate tcp:localhost:9000 qemu-system-x86_64: Certificate does not match the hostname localhost qemu-system-x86_64: Certificate does not match the hostname localhost (qemu) info migrate globals: store-global-state: on only-migratable: off send-configuration: on send-section-footer: on capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off Migration status: failed (Certificate does not match the hostname localhost) total time: 0 milliseconds Notice that when generating the server certificate I did *not* include "localhost" as a DNS name, nor in the Common Name. So when I connect with a URI of "tcp:localhost:9000" we correctly see that the "localhost" name doesn't match any DNS name or common name in the server cert. If I instead set the tls-hostname parameter it will succeed: (qemu) migrate_set_parameter tls-hostname 127.0.0.1 (qemu) migrate_set_parameter tls-creds tls0 (qemu) migrate tcp:localhost:9000 (qemu) info migrate globals: store-global-state: on only-migratable: off send-configuration: on send-section-footer: on capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off Migration status: completed total time: 131 milliseconds downtime: 5 milliseconds setup: 1 milliseconds transferred ram: 1369 kbytes throughput: 105.76 mbps remaining ram: 0 kbytes total ram: 148296 kbytes duplicate: 36817 pages skipped: 0 pages normal: 261 pages normal bytes: 1044 kbytes dirty sync count: 2 page size: 4 kbytes Or likewise if I used "tcp:127.0.0.1:9000" it would succeed too, because "127.0.0.1" is listed as a IP address in the server cert. > > > The second stack trace is the error reporting context that I added originally > > in > > > > commit d59ce6f34434bf47a9b26138c908650bf9a24be1 > > Author: Daniel P. Berrange <berrange@redhat.com> > > Date: Wed Apr 27 11:05:00 2016 +0100 > > > > migration: add reporting of errors for outgoing migration > > > > > > So the first stack trace is the new duplicate. > > > > Which error reporting context is "better" though, I don't know ? > > > OK, I see why I didn't see this - I almost always use migrate -d and > that timer only happens without the -d. Oh, yes, I forget to mention I left migration in the foreground. If you background it, then my original patch would only show the error message when you did "info migrate" > > My patch was based on the view that, although alot of code uses error_report, > > long term all migration would eventually need to be able to filter an > > 'Error *errp' back up the stack, so that we can pass it back to QMP / HMP via > > 'info migrate' / query-migrate. So I decided to leave the error_report_err > > call to the hmp.c code, as long term that's the only place that would need > > to print to the console. > > I wanted it to land in the /var/log/libvirt/qemu/whatever so we could > see what happened. In general QMP commands are expected to send errors back to the QMP client, not write them to stderr. Of course we have substantial legacy code so some commands do end up only using error_report, but the general expectation with QMP is that we'd eliminate error_report in codepaths triggered from monitor commands. The only reason we print to stderr with QMP is that it would be impossible to diagnose failures for code that is not converted to "Error *" yet. Places where we have converted to "Error *" don't need to use stderr. Regards, Daniel
On Tue, May 01, 2018 at 12:21:18PM +0100, Daniel P. Berrangé wrote: > On Tue, May 01, 2018 at 11:57:25AM +0100, Dr. David Alan Gilbert wrote: > > * Daniel P. Berrangé (berrange@redhat.com) wrote: > > > On Tue, May 01, 2018 at 11:00:35AM +0100, Daniel P. Berrangé wrote: > > > > On Mon, Apr 30, 2018 at 07:59:43PM +0100, Dr. David Alan Gilbert (git) wrote: > > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com> > > > > > > > > > > During a TLS connect we see: > > > > > migration_channel_connect calls > > > > > migration_tls_channel_connect > > > > > (calls after TLS setup) > > > > > migration_channel_connect > > > > > > > > > > My previous error handling fix made migration_channel_connect > > > > > call migrate_fd_connect in all cases; unfortunately the above > > > > > means it gets called twice and crashes doing double cleanup. > > > > > > > > > > Fixes: 688a3dcba98 > > > > > > > > This fixes the crash, but we're still seeing error messages duplicated > > > > > > > > (qemu) migrate_set_parameter tls-creds tls0 > > > > (qemu) migrate tcp:localhost:9000 > > > > qemu-system-x86_64: Certificate does not match the hostname localhost > > > > qemu-system-x86_64: Certificate does not match the hostname localhost > > > > > > > > git bisect points to 688a3dcba98 as the cause of these double > > > > errors still. > > > > Can you give me the full sequence to trigger that; I did try yesterday > > and have failed to get that error out of the TLS code. > > Essentially I follow this: > > https://qemu.weilnetz.de/doc/qemu-doc.html#network_005ftls > > when generating the server-cert.pem file I use this input template: > > $ cat server.info > organization = My Org > cn = 127.0.0.1 > # Commented out to force failure with 'localhost' name > #dns_name = "localhost" > dns_name = "localhost.localdomain" > ip_address = 127.0.0.1 > ip_address = ::1 > tls_www_server > encryption_key > signing_key > > $ certtool --generate-certificate \ > --load-ca-certificate ca-cert.pem \ > --load-ca-privkey ca-key.pem \ > --load-privkey server-key.pem \ > --template server.info \ > --outfile server-cert.pem > > On the server (mig target) side I run: > > $ ./x86_64-softmmu/qemu-system-x86_64 \ > -object tls-creds-x509,id=tls0,endpoint=server,dir=/home/berrange/security/qemutls,verify-peer=no \ > -incoming defer -monitor stdio -display none > QEMU 2.11.50 monitor - type 'help' for more information > (qemu) migrate_set_parameter tls-creds tls0 > (qemu) migrate_incoming tcp:localhost:9000 > > Notice I've set verify-peer=no on the server, since I've not bothered to > issue any client-cert.pem on my dev setup here. > > On the client (mig source) side I run: > > $ ./x86_64-softmmu/qemu-system-x86_64 \ > -object tls-creds-x509,id=tls0,endpoint=client,dir=/home/berrange/security/qemutls\ > -monitor stdio -display none > QEMU 2.11.50 monitor - type 'help' for more information > (qemu) migrate_set_parameter tls-creds tls0 > (qemu) migrate tcp:localhost:9000 > qemu-system-x86_64: Certificate does not match the hostname localhost > qemu-system-x86_64: Certificate does not match the hostname localhost > (qemu) info migrate > globals: > store-global-state: on > only-migratable: off > send-configuration: on > send-section-footer: on > capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off > Migration status: failed (Certificate does not match the hostname localhost) > total time: 0 milliseconds > > > Notice that when generating the server certificate I did *not* include > "localhost" as a DNS name, nor in the Common Name. > > So when I connect with a URI of "tcp:localhost:9000" we correctly see > that the "localhost" name doesn't match any DNS name or common name > in the server cert. > > If I instead set the tls-hostname parameter it will succeed: > > (qemu) migrate_set_parameter tls-hostname 127.0.0.1 > (qemu) migrate_set_parameter tls-creds tls0 > (qemu) migrate tcp:localhost:9000 > (qemu) info migrate > globals: > store-global-state: on > only-migratable: off > send-configuration: on > send-section-footer: on > capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off > Migration status: completed > total time: 131 milliseconds > downtime: 5 milliseconds > setup: 1 milliseconds > transferred ram: 1369 kbytes > throughput: 105.76 mbps > remaining ram: 0 kbytes > total ram: 148296 kbytes > duplicate: 36817 pages > skipped: 0 pages > normal: 261 pages > normal bytes: 1044 kbytes > dirty sync count: 2 > page size: 4 kbytes > > > Or likewise if I used "tcp:127.0.0.1:9000" it would succeed too, > because "127.0.0.1" is listed as a IP address in the server cert. > > > > > > The second stack trace is the error reporting context that I added originally > > > in > > > > > > commit d59ce6f34434bf47a9b26138c908650bf9a24be1 > > > Author: Daniel P. Berrange <berrange@redhat.com> > > > Date: Wed Apr 27 11:05:00 2016 +0100 > > > > > > migration: add reporting of errors for outgoing migration > > > > > > > > > So the first stack trace is the new duplicate. > > > > > > Which error reporting context is "better" though, I don't know ? > > > > > > OK, I see why I didn't see this - I almost always use migrate -d and > > that timer only happens without the -d. > > Oh, yes, I forget to mention I left migration in the foreground. If you > background it, then my original patch would only show the error message > when you did "info migrate" > > > > My patch was based on the view that, although alot of code uses error_report, > > > long term all migration would eventually need to be able to filter an > > > 'Error *errp' back up the stack, so that we can pass it back to QMP / HMP via > > > 'info migrate' / query-migrate. So I decided to leave the error_report_err > > > call to the hmp.c code, as long term that's the only place that would need > > > to print to the console. > > > > I wanted it to land in the /var/log/libvirt/qemu/whatever so we could > > see what happened. > > In general QMP commands are expected to send errors back to the QMP client, > not write them to stderr. Of course we have substantial legacy code so some > commands do end up only using error_report, but the general expectation with > QMP is that we'd eliminate error_report in codepaths triggered from monitor > commands. The only reason we print to stderr with QMP is that it would be > impossible to diagnose failures for code that is not converted to "Error *" > yet. Places where we have converted to "Error *" don't need to use stderr. On looking further, it seems to me that the real problem was this change:L commit 87db1a7d89677e3dbc8b3763e417b9376009bdbb Author: Juan Quintela <quintela@redhat.com> Date: Tue Sep 5 12:50:22 2017 +0200 migration: Improve migration thread error handling We now report errors also when we finish migration, not only on info migrate. We plan to use this error from several places, and we want the first error to happen to win, so we add an mutex to order it. Signed-off-by: Juan Quintela <quintela@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Which added an "error_report_err" call in migrate_fd_cleanup() method which prints the s->error object. This is not needed for QMP since we feed it back via query-migrate, and also not needed for HMP since we can report it via info migrate, or directly via the status callback if migration is in the foreground. Regards, Daniel
On Mon, Apr 30, 2018 at 19:59:43 +0100, Dr. David Alan Gilbert (git) wrote: > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com> > > During a TLS connect we see: > migration_channel_connect calls > migration_tls_channel_connect > (calls after TLS setup) > migration_channel_connect > > My previous error handling fix made migration_channel_connect > call migrate_fd_connect in all cases; unfortunately the above > means it gets called twice and crashes doing double cleanup. > > Fixes: 688a3dcba98 > > Reported-by: Peter Krempa <pkrempa@redhat.com> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> This fixes both crashes I've observed: 1) if the TLS environment is incorrect and certificate validation fails, I've seen crash in migrate_fd_connect in call to qemu_file_set_blocking 2) if the TLS environment was correct, I've seen a crash in cleanup phase Tested-by: Peter Krempa <pkrempa@redhat.com> Unfortunately I don't feel familiar enough with the migration code to provide a qualified review.
On Mon, Apr 30, 2018 at 07:59:43PM +0100, Dr. David Alan Gilbert (git) wrote: > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com> > > During a TLS connect we see: > migration_channel_connect calls > migration_tls_channel_connect > (calls after TLS setup) > migration_channel_connect > > My previous error handling fix made migration_channel_connect > call migrate_fd_connect in all cases; unfortunately the above > means it gets called twice and crashes doing double cleanup. > > Fixes: 688a3dcba98 > > Reported-by: Peter Krempa <pkrempa@redhat.com> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> > --- > migration/channel.c | 9 +++++++++ > 1 file changed, 9 insertions(+) Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> I've satisfied myself that the double error reporting is not going to cause any functional harm, so that's less urgent to fix. > > diff --git a/migration/channel.c b/migration/channel.c > index c5eaf0fa0e..7a32b5aca4 100644 > --- a/migration/channel.c > +++ b/migration/channel.c > @@ -71,6 +71,15 @@ void migration_channel_connect(MigrationState *s, > !object_dynamic_cast(OBJECT(ioc), > TYPE_QIO_CHANNEL_TLS)) { > migration_tls_channel_connect(s, ioc, hostname, &error); > + > + if (!error) { > + /* tls_channel_connect will call back to this > + * function after the TLS handshake, > + * so we mustn't call migrate_fd_connect until then > + */ > + > + return; > + } > } else { > QEMUFile *f = qemu_fopen_channel_output(ioc); > > -- > 2.17.0 > Regards, Daniel
diff --git a/migration/channel.c b/migration/channel.c index c5eaf0fa0e..7a32b5aca4 100644 --- a/migration/channel.c +++ b/migration/channel.c @@ -71,6 +71,15 @@ void migration_channel_connect(MigrationState *s, !object_dynamic_cast(OBJECT(ioc), TYPE_QIO_CHANNEL_TLS)) { migration_tls_channel_connect(s, ioc, hostname, &error); + + if (!error) { + /* tls_channel_connect will call back to this + * function after the TLS handshake, + * so we mustn't call migrate_fd_connect until then + */ + + return; + } } else { QEMUFile *f = qemu_fopen_channel_output(ioc);