diff mbox series

Migration+TLS: Fix crash due to double cleanup

Message ID 20180430185943.35714-1-dgilbert@redhat.com
State New
Headers show
Series Migration+TLS: Fix crash due to double cleanup | expand

Commit Message

Dr. David Alan Gilbert April 30, 2018, 6:59 p.m. UTC
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

During a TLS connect we see:
  migration_channel_connect calls
  migration_tls_channel_connect
  (calls after TLS setup)
  migration_channel_connect

My previous error handling fix made migration_channel_connect
call migrate_fd_connect in all cases; unfortunately the above
means it gets called twice and crashes doing double cleanup.

Fixes: 688a3dcba98

Reported-by: Peter Krempa <pkrempa@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/channel.c | 9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Daniel P. Berrangé May 1, 2018, 10 a.m. UTC | #1
On Mon, Apr 30, 2018 at 07:59:43PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> During a TLS connect we see:
>   migration_channel_connect calls
>   migration_tls_channel_connect
>   (calls after TLS setup)
>   migration_channel_connect
> 
> My previous error handling fix made migration_channel_connect
> call migrate_fd_connect in all cases; unfortunately the above
> means it gets called twice and crashes doing double cleanup.
> 
> Fixes: 688a3dcba98

This fixes the crash, but we're still seeing error messages duplicated

(qemu) migrate_set_parameter tls-creds tls0
(qemu) migrate tcp:localhost:9000
qemu-system-x86_64: Certificate does not match the hostname localhost
qemu-system-x86_64: Certificate does not match the hostname localhost

git bisect points to 688a3dcba98 as the cause of these double
errors still.

> 
> Reported-by: Peter Krempa <pkrempa@redhat.com>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  migration/channel.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/migration/channel.c b/migration/channel.c
> index c5eaf0fa0e..7a32b5aca4 100644
> --- a/migration/channel.c
> +++ b/migration/channel.c
> @@ -71,6 +71,15 @@ void migration_channel_connect(MigrationState *s,
>              !object_dynamic_cast(OBJECT(ioc),
>                                   TYPE_QIO_CHANNEL_TLS)) {
>              migration_tls_channel_connect(s, ioc, hostname, &error);
> +
> +            if (!error) {
> +                /* tls_channel_connect will call back to this
> +                 * function after the TLS handshake,
> +                 * so we mustn't call migrate_fd_connect until then
> +                 */
> +
> +                return;
> +            }
>          } else {
>              QEMUFile *f = qemu_fopen_channel_output(ioc);
>  
> -- 
> 2.17.0
> 

Regards,
Daniel
Daniel P. Berrangé May 1, 2018, 10:11 a.m. UTC | #2
On Tue, May 01, 2018 at 11:00:35AM +0100, Daniel P. Berrangé wrote:
> On Mon, Apr 30, 2018 at 07:59:43PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > During a TLS connect we see:
> >   migration_channel_connect calls
> >   migration_tls_channel_connect
> >   (calls after TLS setup)
> >   migration_channel_connect
> > 
> > My previous error handling fix made migration_channel_connect
> > call migrate_fd_connect in all cases; unfortunately the above
> > means it gets called twice and crashes doing double cleanup.
> > 
> > Fixes: 688a3dcba98
> 
> This fixes the crash, but we're still seeing error messages duplicated
> 
> (qemu) migrate_set_parameter tls-creds tls0
> (qemu) migrate tcp:localhost:9000
> qemu-system-x86_64: Certificate does not match the hostname localhost
> qemu-system-x86_64: Certificate does not match the hostname localhost
> 
> git bisect points to 688a3dcba98 as the cause of these double
> errors still.

FYI the stack traces look like this..

The first error message is printed in this context:

#0  0x0000555555bb6c20 in error_report (fmt=0x555555d0d77e "%s") at util/qemu-error.c:273
#1  0x0000555555bb5fa5 in error_report_err (err=0x55555749cb00) at util/error.c:228
#2  0x0000555555a6d1f8 in migrate_fd_cleanup (opaque=opaque@entry=0x5555569925a0) at migration/migration.c:1106
#3  0x0000555555a6e2aa in migrate_fd_connect (s=s@entry=0x5555569925a0, error_in=0x55555768de80) at migration/migration.c:2387
#4  0x0000555555a6f8e7 in migration_channel_connect (s=s@entry=0x5555569925a0, ioc=ioc@entry=0x555557e1de10, hostname=hostname@entry=0x0, error=<optimized out>) at migration/channel.c:83
#5  0x0000555555a6f2d6 in migration_tls_outgoing_handshake (task=<optimized out>, opaque=0x5555569925a0) at migration/tls.c:124
#6  0x0000555555b6d2b2 in qio_task_complete (task=task@entry=0x5555576ae620) at io/task.c:142
#7  0x0000555555b68c74 in qio_channel_tls_handshake_task (ioc=ioc@entry=0x555557e1de10, task=task@entry=0x5555576ae620) at io/channel-tls.c:171
#8  0x0000555555b6975a in qio_channel_tls_handshake (ioc=ioc@entry=0x555557e1de10, func=func@entry=0x555555a6f250 <migration_tls_outgoing_handshake>, opaque=opaque@entry=0x5555569925a0, destroy=destroy@entry=0x0) at io/channel-tls.c:215
#9  0x0000555555a6f6ac in migration_tls_channel_connect (s=s@entry=0x5555569925a0, ioc=ioc@entry=0x555556a01000, hostname=hostname@entry=0x555556bc0c30 "localhost", errp=errp@entry=0x7fffffffd778) at migration/tls.c:159
#10 0x0000555555a6f967 in migration_channel_connect (s=0x5555569925a0, ioc=ioc@entry=0x555556a01000, hostname=0x555556bc0c30 "localhost", error=<optimized out>) at migration/channel.c:73
#11 0x0000555555a6e5f4 in socket_outgoing_migration (task=<optimized out>, opaque=0x555556cc7de0) at migration/socket.c:85
#12 0x0000555555b6d2b2 in qio_task_complete (task=task@entry=0x5555576ce3b0) at io/task.c:142
#13 0x0000555555b6d3a2 in gio_task_thread_result (opaque=0x5555578eff50) at io/task.c:88
#14 0x00007ffff5da1577 in g_idle_dispatch () at /lib64/libglib-2.0.so.0
#15 0x00007ffff5da4b77 in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#16 0x0000555555baf677 in glib_pollfds_poll () at util/main-loop.c:214
#17 0x0000555555baf677 in os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:261
#18 0x0000555555baf677 in main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:515
#19 0x00005555557b9157 in main_loop () at vl.c:1935
#20 0x00005555557b9157 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4767


The second error message is printed in this context:

#0  0x0000555555bb6c20 in error_report (fmt=fmt@entry=0x555555d0d77e "%s") at util/qemu-error.c:273
#1  0x0000555555904035 in hmp_migrate_status_cb (opaque=0x555556bc0ba0) at hmp.c:1909
#2  0x0000555555baefac in timerlist_run_timers (timer_list=0x555556a0d760) at util/qemu-timer.c:536
#3  0x0000555555baf1b7 in qemu_clock_run_timers (type=QEMU_CLOCK_REALTIME) at util/qemu-timer.c:547
#4  0x0000555555baf1b7 in qemu_clock_run_all_timers () at util/qemu-timer.c:662
#5  0x0000555555baf69a in main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:521
#6  0x00005555557b9157 in main_loop () at vl.c:1935
#7  0x00005555557b9157 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4767


The second stack trace is the error reporting context that I added originally
in

  commit d59ce6f34434bf47a9b26138c908650bf9a24be1
  Author: Daniel P. Berrange <berrange@redhat.com>
  Date:   Wed Apr 27 11:05:00 2016 +0100

    migration: add reporting of errors for outgoing migration


So the first stack trace is the new duplicate.

Which error reporting context is "better" though, I don't know ?

My patch was based on the view that, although alot of code uses error_report,
long term all migration would eventually need to be able to filter an
'Error *errp' back up the stack, so that we can pass it back to QMP / HMP via
'info migrate' / query-migrate. So I decided to leave the error_report_err
call to the hmp.c code, as long term that's the only place that would need
to print to the console.


Regards,
Daniel
Dr. David Alan Gilbert May 1, 2018, 10:57 a.m. UTC | #3
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Tue, May 01, 2018 at 11:00:35AM +0100, Daniel P. Berrangé wrote:
> > On Mon, Apr 30, 2018 at 07:59:43PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > 
> > > During a TLS connect we see:
> > >   migration_channel_connect calls
> > >   migration_tls_channel_connect
> > >   (calls after TLS setup)
> > >   migration_channel_connect
> > > 
> > > My previous error handling fix made migration_channel_connect
> > > call migrate_fd_connect in all cases; unfortunately the above
> > > means it gets called twice and crashes doing double cleanup.
> > > 
> > > Fixes: 688a3dcba98
> > 
> > This fixes the crash, but we're still seeing error messages duplicated
> > 
> > (qemu) migrate_set_parameter tls-creds tls0
> > (qemu) migrate tcp:localhost:9000
> > qemu-system-x86_64: Certificate does not match the hostname localhost
> > qemu-system-x86_64: Certificate does not match the hostname localhost
> > 
> > git bisect points to 688a3dcba98 as the cause of these double
> > errors still.

Can you give me the full sequence to trigger that; I did try yesterday
and have failed to get that error out of the TLS code.

> FYI the stack traces look like this..
> 
> The first error message is printed in this context:
> 
> #0  0x0000555555bb6c20 in error_report (fmt=0x555555d0d77e "%s") at util/qemu-error.c:273
> #1  0x0000555555bb5fa5 in error_report_err (err=0x55555749cb00) at util/error.c:228
> #2  0x0000555555a6d1f8 in migrate_fd_cleanup (opaque=opaque@entry=0x5555569925a0) at migration/migration.c:1106
> #3  0x0000555555a6e2aa in migrate_fd_connect (s=s@entry=0x5555569925a0, error_in=0x55555768de80) at migration/migration.c:2387
> #4  0x0000555555a6f8e7 in migration_channel_connect (s=s@entry=0x5555569925a0, ioc=ioc@entry=0x555557e1de10, hostname=hostname@entry=0x0, error=<optimized out>) at migration/channel.c:83
> #5  0x0000555555a6f2d6 in migration_tls_outgoing_handshake (task=<optimized out>, opaque=0x5555569925a0) at migration/tls.c:124
> #6  0x0000555555b6d2b2 in qio_task_complete (task=task@entry=0x5555576ae620) at io/task.c:142
> #7  0x0000555555b68c74 in qio_channel_tls_handshake_task (ioc=ioc@entry=0x555557e1de10, task=task@entry=0x5555576ae620) at io/channel-tls.c:171
> #8  0x0000555555b6975a in qio_channel_tls_handshake (ioc=ioc@entry=0x555557e1de10, func=func@entry=0x555555a6f250 <migration_tls_outgoing_handshake>, opaque=opaque@entry=0x5555569925a0, destroy=destroy@entry=0x0) at io/channel-tls.c:215
> #9  0x0000555555a6f6ac in migration_tls_channel_connect (s=s@entry=0x5555569925a0, ioc=ioc@entry=0x555556a01000, hostname=hostname@entry=0x555556bc0c30 "localhost", errp=errp@entry=0x7fffffffd778) at migration/tls.c:159
> #10 0x0000555555a6f967 in migration_channel_connect (s=0x5555569925a0, ioc=ioc@entry=0x555556a01000, hostname=0x555556bc0c30 "localhost", error=<optimized out>) at migration/channel.c:73
> #11 0x0000555555a6e5f4 in socket_outgoing_migration (task=<optimized out>, opaque=0x555556cc7de0) at migration/socket.c:85
> #12 0x0000555555b6d2b2 in qio_task_complete (task=task@entry=0x5555576ce3b0) at io/task.c:142
> #13 0x0000555555b6d3a2 in gio_task_thread_result (opaque=0x5555578eff50) at io/task.c:88
> #14 0x00007ffff5da1577 in g_idle_dispatch () at /lib64/libglib-2.0.so.0
> #15 0x00007ffff5da4b77 in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
> #16 0x0000555555baf677 in glib_pollfds_poll () at util/main-loop.c:214
> #17 0x0000555555baf677 in os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:261
> #18 0x0000555555baf677 in main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:515
> #19 0x00005555557b9157 in main_loop () at vl.c:1935
> #20 0x00005555557b9157 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4767

Right, that's certainly the one I'm expecting; my rule here is that
any connect path that might call migrate_fd_connect, *always* calls
migrate_fd_connect and in the error case gives it the error to report.
My original reason to do this was to make sure we clean up any
'cancelling's that are in flight - cancel+a broken connect could lead
to it getting stuck in cancelling.

> The second error message is printed in this context:
> 
> #0  0x0000555555bb6c20 in error_report (fmt=fmt@entry=0x555555d0d77e "%s") at util/qemu-error.c:273
> #1  0x0000555555904035 in hmp_migrate_status_cb (opaque=0x555556bc0ba0) at hmp.c:1909
> #2  0x0000555555baefac in timerlist_run_timers (timer_list=0x555556a0d760) at util/qemu-timer.c:536
> #3  0x0000555555baf1b7 in qemu_clock_run_timers (type=QEMU_CLOCK_REALTIME) at util/qemu-timer.c:547
> #4  0x0000555555baf1b7 in qemu_clock_run_all_timers () at util/qemu-timer.c:662
> #5  0x0000555555baf69a in main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:521
> #6  0x00005555557b9157 in main_loop () at vl.c:1935
> #7  0x00005555557b9157 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4767
> 
> 
> The second stack trace is the error reporting context that I added originally
> in
> 
>   commit d59ce6f34434bf47a9b26138c908650bf9a24be1
>   Author: Daniel P. Berrange <berrange@redhat.com>
>   Date:   Wed Apr 27 11:05:00 2016 +0100
> 
>     migration: add reporting of errors for outgoing migration
> 
> 
> So the first stack trace is the new duplicate.
>
> Which error reporting context is "better" though, I don't know ?


OK, I see why I didn't see this - I almost always use migrate -d  and
that timer only happens without the -d.


> My patch was based on the view that, although alot of code uses error_report,
> long term all migration would eventually need to be able to filter an
> 'Error *errp' back up the stack, so that we can pass it back to QMP / HMP via
> 'info migrate' / query-migrate. So I decided to leave the error_report_err
> call to the hmp.c code, as long term that's the only place that would need
> to print to the console.

I wanted it to land in the /var/log/libvirt/qemu/whatever so we could
see what happened.

Dave

> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Daniel P. Berrangé May 1, 2018, 11:21 a.m. UTC | #4
On Tue, May 01, 2018 at 11:57:25AM +0100, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > On Tue, May 01, 2018 at 11:00:35AM +0100, Daniel P. Berrangé wrote:
> > > On Mon, Apr 30, 2018 at 07:59:43PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > 
> > > > During a TLS connect we see:
> > > >   migration_channel_connect calls
> > > >   migration_tls_channel_connect
> > > >   (calls after TLS setup)
> > > >   migration_channel_connect
> > > > 
> > > > My previous error handling fix made migration_channel_connect
> > > > call migrate_fd_connect in all cases; unfortunately the above
> > > > means it gets called twice and crashes doing double cleanup.
> > > > 
> > > > Fixes: 688a3dcba98
> > > 
> > > This fixes the crash, but we're still seeing error messages duplicated
> > > 
> > > (qemu) migrate_set_parameter tls-creds tls0
> > > (qemu) migrate tcp:localhost:9000
> > > qemu-system-x86_64: Certificate does not match the hostname localhost
> > > qemu-system-x86_64: Certificate does not match the hostname localhost
> > > 
> > > git bisect points to 688a3dcba98 as the cause of these double
> > > errors still.
> 
> Can you give me the full sequence to trigger that; I did try yesterday
> and have failed to get that error out of the TLS code.

Essentially I follow this:

  https://qemu.weilnetz.de/doc/qemu-doc.html#network_005ftls

when generating the server-cert.pem file I use this input template:

$ cat server.info
organization = My Org
cn = 127.0.0.1
# Commented out to force failure with 'localhost' name
#dns_name = "localhost"
dns_name = "localhost.localdomain"
ip_address = 127.0.0.1
ip_address = ::1
tls_www_server
encryption_key
signing_key

$ certtool --generate-certificate \
   --load-ca-certificate ca-cert.pem \
   --load-ca-privkey ca-key.pem \
   --load-privkey server-key.pem \
   --template server.info \
   --outfile server-cert.pem

On the server (mig target) side I run:

$ ./x86_64-softmmu/qemu-system-x86_64 \
    -object tls-creds-x509,id=tls0,endpoint=server,dir=/home/berrange/security/qemutls,verify-peer=no \
    -incoming defer -monitor stdio -display none
QEMU 2.11.50 monitor - type 'help' for more information
(qemu) migrate_set_parameter tls-creds tls0
(qemu) migrate_incoming tcp:localhost:9000

Notice I've set  verify-peer=no on the server, since I've not bothered to
issue any client-cert.pem on my dev setup here.

On the client (mig source) side I run:

$  ./x86_64-softmmu/qemu-system-x86_64 \
    -object tls-creds-x509,id=tls0,endpoint=client,dir=/home/berrange/security/qemutls\
    -monitor stdio  -display none 
QEMU 2.11.50 monitor - type 'help' for more information
(qemu) migrate_set_parameter tls-creds tls0
(qemu) migrate tcp:localhost:9000
qemu-system-x86_64: Certificate does not match the hostname localhost
qemu-system-x86_64: Certificate does not match the hostname localhost
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off 
Migration status: failed (Certificate does not match the hostname localhost)
total time: 0 milliseconds


Notice that when generating the server certificate I did *not* include
"localhost" as a DNS name, nor in the Common Name.

So when I connect with a URI of "tcp:localhost:9000" we correctly see
that the "localhost" name doesn't match any DNS name or common name
in the server cert.

If I instead set the tls-hostname parameter it will succeed:

(qemu) migrate_set_parameter tls-hostname 127.0.0.1
(qemu) migrate_set_parameter tls-creds tls0
(qemu) migrate tcp:localhost:9000
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off 
Migration status: completed
total time: 131 milliseconds
downtime: 5 milliseconds
setup: 1 milliseconds
transferred ram: 1369 kbytes
throughput: 105.76 mbps
remaining ram: 0 kbytes
total ram: 148296 kbytes
duplicate: 36817 pages
skipped: 0 pages
normal: 261 pages
normal bytes: 1044 kbytes
dirty sync count: 2
page size: 4 kbytes


Or likewise if I used "tcp:127.0.0.1:9000"  it would succeed too,
because "127.0.0.1" is listed as a IP address in the server cert.

>
> > The second stack trace is the error reporting context that I added originally
> > in
> > 
> >   commit d59ce6f34434bf47a9b26138c908650bf9a24be1
> >   Author: Daniel P. Berrange <berrange@redhat.com>
> >   Date:   Wed Apr 27 11:05:00 2016 +0100
> > 
> >     migration: add reporting of errors for outgoing migration
> > 
> > 
> > So the first stack trace is the new duplicate.
> >
> > Which error reporting context is "better" though, I don't know ?
> 
> 
> OK, I see why I didn't see this - I almost always use migrate -d  and
> that timer only happens without the -d.

Oh, yes, I forget to mention I left migration in the foreground. If you
background it, then my original patch would only show the error message
when you did "info migrate"

> > My patch was based on the view that, although alot of code uses error_report,
> > long term all migration would eventually need to be able to filter an
> > 'Error *errp' back up the stack, so that we can pass it back to QMP / HMP via
> > 'info migrate' / query-migrate. So I decided to leave the error_report_err
> > call to the hmp.c code, as long term that's the only place that would need
> > to print to the console.
> 
> I wanted it to land in the /var/log/libvirt/qemu/whatever so we could
> see what happened.

In general QMP commands are expected to send errors back to the QMP client,
not write them to stderr. Of course we have substantial legacy code so some
commands do end up only using error_report, but the general expectation with
QMP is that we'd eliminate error_report in codepaths triggered from monitor
commands. The only reason we print to stderr with QMP is that it would be
impossible to diagnose failures for code that is not converted to "Error *"
yet. Places where we have converted to "Error *" don't need to use stderr.

Regards,
Daniel
Daniel P. Berrangé May 1, 2018, 12:51 p.m. UTC | #5
On Tue, May 01, 2018 at 12:21:18PM +0100, Daniel P. Berrangé wrote:
> On Tue, May 01, 2018 at 11:57:25AM +0100, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > On Tue, May 01, 2018 at 11:00:35AM +0100, Daniel P. Berrangé wrote:
> > > > On Mon, Apr 30, 2018 at 07:59:43PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > > > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > > > > 
> > > > > During a TLS connect we see:
> > > > >   migration_channel_connect calls
> > > > >   migration_tls_channel_connect
> > > > >   (calls after TLS setup)
> > > > >   migration_channel_connect
> > > > > 
> > > > > My previous error handling fix made migration_channel_connect
> > > > > call migrate_fd_connect in all cases; unfortunately the above
> > > > > means it gets called twice and crashes doing double cleanup.
> > > > > 
> > > > > Fixes: 688a3dcba98
> > > > 
> > > > This fixes the crash, but we're still seeing error messages duplicated
> > > > 
> > > > (qemu) migrate_set_parameter tls-creds tls0
> > > > (qemu) migrate tcp:localhost:9000
> > > > qemu-system-x86_64: Certificate does not match the hostname localhost
> > > > qemu-system-x86_64: Certificate does not match the hostname localhost
> > > > 
> > > > git bisect points to 688a3dcba98 as the cause of these double
> > > > errors still.
> > 
> > Can you give me the full sequence to trigger that; I did try yesterday
> > and have failed to get that error out of the TLS code.
> 
> Essentially I follow this:
> 
>   https://qemu.weilnetz.de/doc/qemu-doc.html#network_005ftls
> 
> when generating the server-cert.pem file I use this input template:
> 
> $ cat server.info
> organization = My Org
> cn = 127.0.0.1
> # Commented out to force failure with 'localhost' name
> #dns_name = "localhost"
> dns_name = "localhost.localdomain"
> ip_address = 127.0.0.1
> ip_address = ::1
> tls_www_server
> encryption_key
> signing_key
> 
> $ certtool --generate-certificate \
>    --load-ca-certificate ca-cert.pem \
>    --load-ca-privkey ca-key.pem \
>    --load-privkey server-key.pem \
>    --template server.info \
>    --outfile server-cert.pem
> 
> On the server (mig target) side I run:
> 
> $ ./x86_64-softmmu/qemu-system-x86_64 \
>     -object tls-creds-x509,id=tls0,endpoint=server,dir=/home/berrange/security/qemutls,verify-peer=no \
>     -incoming defer -monitor stdio -display none
> QEMU 2.11.50 monitor - type 'help' for more information
> (qemu) migrate_set_parameter tls-creds tls0
> (qemu) migrate_incoming tcp:localhost:9000
> 
> Notice I've set  verify-peer=no on the server, since I've not bothered to
> issue any client-cert.pem on my dev setup here.
> 
> On the client (mig source) side I run:
> 
> $  ./x86_64-softmmu/qemu-system-x86_64 \
>     -object tls-creds-x509,id=tls0,endpoint=client,dir=/home/berrange/security/qemutls\
>     -monitor stdio  -display none 
> QEMU 2.11.50 monitor - type 'help' for more information
> (qemu) migrate_set_parameter tls-creds tls0
> (qemu) migrate tcp:localhost:9000
> qemu-system-x86_64: Certificate does not match the hostname localhost
> qemu-system-x86_64: Certificate does not match the hostname localhost
> (qemu) info migrate
> globals:
> store-global-state: on
> only-migratable: off
> send-configuration: on
> send-section-footer: on
> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off 
> Migration status: failed (Certificate does not match the hostname localhost)
> total time: 0 milliseconds
> 
> 
> Notice that when generating the server certificate I did *not* include
> "localhost" as a DNS name, nor in the Common Name.
> 
> So when I connect with a URI of "tcp:localhost:9000" we correctly see
> that the "localhost" name doesn't match any DNS name or common name
> in the server cert.
> 
> If I instead set the tls-hostname parameter it will succeed:
> 
> (qemu) migrate_set_parameter tls-hostname 127.0.0.1
> (qemu) migrate_set_parameter tls-creds tls0
> (qemu) migrate tcp:localhost:9000
> (qemu) info migrate
> globals:
> store-global-state: on
> only-migratable: off
> send-configuration: on
> send-section-footer: on
> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off 
> Migration status: completed
> total time: 131 milliseconds
> downtime: 5 milliseconds
> setup: 1 milliseconds
> transferred ram: 1369 kbytes
> throughput: 105.76 mbps
> remaining ram: 0 kbytes
> total ram: 148296 kbytes
> duplicate: 36817 pages
> skipped: 0 pages
> normal: 261 pages
> normal bytes: 1044 kbytes
> dirty sync count: 2
> page size: 4 kbytes
> 
> 
> Or likewise if I used "tcp:127.0.0.1:9000"  it would succeed too,
> because "127.0.0.1" is listed as a IP address in the server cert.
> 
> >
> > > The second stack trace is the error reporting context that I added originally
> > > in
> > > 
> > >   commit d59ce6f34434bf47a9b26138c908650bf9a24be1
> > >   Author: Daniel P. Berrange <berrange@redhat.com>
> > >   Date:   Wed Apr 27 11:05:00 2016 +0100
> > > 
> > >     migration: add reporting of errors for outgoing migration
> > > 
> > > 
> > > So the first stack trace is the new duplicate.
> > >
> > > Which error reporting context is "better" though, I don't know ?
> > 
> > 
> > OK, I see why I didn't see this - I almost always use migrate -d  and
> > that timer only happens without the -d.
> 
> Oh, yes, I forget to mention I left migration in the foreground. If you
> background it, then my original patch would only show the error message
> when you did "info migrate"
> 
> > > My patch was based on the view that, although alot of code uses error_report,
> > > long term all migration would eventually need to be able to filter an
> > > 'Error *errp' back up the stack, so that we can pass it back to QMP / HMP via
> > > 'info migrate' / query-migrate. So I decided to leave the error_report_err
> > > call to the hmp.c code, as long term that's the only place that would need
> > > to print to the console.
> > 
> > I wanted it to land in the /var/log/libvirt/qemu/whatever so we could
> > see what happened.
> 
> In general QMP commands are expected to send errors back to the QMP client,
> not write them to stderr. Of course we have substantial legacy code so some
> commands do end up only using error_report, but the general expectation with
> QMP is that we'd eliminate error_report in codepaths triggered from monitor
> commands. The only reason we print to stderr with QMP is that it would be
> impossible to diagnose failures for code that is not converted to "Error *"
> yet. Places where we have converted to "Error *" don't need to use stderr.

On looking further, it seems to me that the real problem was this change:L

  commit 87db1a7d89677e3dbc8b3763e417b9376009bdbb
  Author: Juan Quintela <quintela@redhat.com>
  Date:   Tue Sep 5 12:50:22 2017 +0200

    migration: Improve migration thread error handling
    
    We now report errors also when we finish migration, not only on info
    migrate.  We plan to use this error from several places, and we want
    the first error to happen to win, so we add an mutex to order it.
    
    Signed-off-by: Juan Quintela <quintela@redhat.com>
    Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>


Which added an "error_report_err" call in migrate_fd_cleanup()
method which prints the s->error object. This is not needed for
QMP since we feed it back via query-migrate, and also not needed
for HMP since we can report it via info migrate, or directly via
the status callback if migration is in the foreground.

Regards,
Daniel
Peter Krempa May 2, 2018, 12:03 p.m. UTC | #6
On Mon, Apr 30, 2018 at 19:59:43 +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> During a TLS connect we see:
>   migration_channel_connect calls
>   migration_tls_channel_connect
>   (calls after TLS setup)
>   migration_channel_connect
> 
> My previous error handling fix made migration_channel_connect
> call migrate_fd_connect in all cases; unfortunately the above
> means it gets called twice and crashes doing double cleanup.
> 
> Fixes: 688a3dcba98
> 
> Reported-by: Peter Krempa <pkrempa@redhat.com>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

This fixes both crashes I've observed:

1) if the TLS environment is incorrect and certificate validation fails,
I've seen crash in migrate_fd_connect in call to qemu_file_set_blocking

2) if the TLS environment was correct, I've seen a crash in cleanup
phase

Tested-by: Peter Krempa <pkrempa@redhat.com>

Unfortunately I don't feel familiar enough with the migration code to
provide a qualified review.
Daniel P. Berrangé May 2, 2018, 12:07 p.m. UTC | #7
On Mon, Apr 30, 2018 at 07:59:43PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> During a TLS connect we see:
>   migration_channel_connect calls
>   migration_tls_channel_connect
>   (calls after TLS setup)
>   migration_channel_connect
> 
> My previous error handling fix made migration_channel_connect
> call migrate_fd_connect in all cases; unfortunately the above
> means it gets called twice and crashes doing double cleanup.
> 
> Fixes: 688a3dcba98
> 
> Reported-by: Peter Krempa <pkrempa@redhat.com>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  migration/channel.c | 9 +++++++++
>  1 file changed, 9 insertions(+)

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

I've satisfied myself that the double error reporting is not going to
cause any functional harm, so that's less urgent to fix.

> 
> diff --git a/migration/channel.c b/migration/channel.c
> index c5eaf0fa0e..7a32b5aca4 100644
> --- a/migration/channel.c
> +++ b/migration/channel.c
> @@ -71,6 +71,15 @@ void migration_channel_connect(MigrationState *s,
>              !object_dynamic_cast(OBJECT(ioc),
>                                   TYPE_QIO_CHANNEL_TLS)) {
>              migration_tls_channel_connect(s, ioc, hostname, &error);
> +
> +            if (!error) {
> +                /* tls_channel_connect will call back to this
> +                 * function after the TLS handshake,
> +                 * so we mustn't call migrate_fd_connect until then
> +                 */
> +
> +                return;
> +            }
>          } else {
>              QEMUFile *f = qemu_fopen_channel_output(ioc);
>  
> -- 
> 2.17.0
> 

Regards,
Daniel
diff mbox series

Patch

diff --git a/migration/channel.c b/migration/channel.c
index c5eaf0fa0e..7a32b5aca4 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -71,6 +71,15 @@  void migration_channel_connect(MigrationState *s,
             !object_dynamic_cast(OBJECT(ioc),
                                  TYPE_QIO_CHANNEL_TLS)) {
             migration_tls_channel_connect(s, ioc, hostname, &error);
+
+            if (!error) {
+                /* tls_channel_connect will call back to this
+                 * function after the TLS handshake,
+                 * so we mustn't call migrate_fd_connect until then
+                 */
+
+                return;
+            }
         } else {
             QEMUFile *f = qemu_fopen_channel_output(ioc);