diff mbox

[v3,09/27] migration: add reporting of errors for outgoing migration

Message ID 1456499430-8558-10-git-send-email-berrange@redhat.com
State New
Headers show

Commit Message

Daniel P. Berrangé Feb. 26, 2016, 3:10 p.m. UTC
Currently if an app initiates an outgoing migration, it
may or may not, get an error reported back on failure. If
the error occurs synchronously to the 'migrate' command
execution, the client app will see the error message. This
is the case for DNS lookup failures. If the error occurs
asynchronously to the monitor command though, the error
will be thrown away and the client left guessing about
what went wrong. This is the case for failure to connect
to the TCP server (eg due to wrong port, or firewall
rules, or other similar errors).

In the future we'll be adding more scope for errors to
happen asynchronously with the TLS protocol handshake.
TLS errors are hard to diagnose even when they are well
reported, so discarding errors entirely will make it
impossible to debug TLS connection problems.

Management apps which do migration are already using
'query-migrate' / 'info migrate' to check up on progress
of background migration operations and to see their end
status. This is a fine place to also include the error
message when things go wrong.

This patch thus adds an 'error-desc' field to the
MigrationInfo struct, which will be populated when
the 'status' is set to 'failed':

(qemu) migrate -d tcp:localhost:9001
(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: failed
total time: 0 milliseconds
error description: Error connecting to socket: Connection refused

In the HMP, when doing non-detached migration, it is
also possible to display this error message directly
to the app.

(qemu) migrate tcp:localhost:9001
Error connecting to socket: Connection refused

Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
---
 hmp.c                         |  8 ++++++++
 include/migration/migration.h |  5 ++++-
 include/qapi/error.h          |  2 +-
 migration/migration.c         | 12 +++++++++---
 migration/rdma.c              | 10 +++-------
 migration/tcp.c               |  2 +-
 migration/unix.c              |  2 +-
 qapi-schema.json              |  5 ++++-
 trace-events                  |  2 +-
 util/error.c                  |  2 +-
 10 files changed, 33 insertions(+), 17 deletions(-)

Comments

Markus Armbruster March 4, 2016, 9:49 a.m. UTC | #1
"Daniel P. Berrange" <berrange@redhat.com> writes:

> Currently if an app initiates an outgoing migration, it

application

> may or may not, get an error reported back on failure. If
> the error occurs synchronously to the 'migrate' command
> execution, the client app will see the error message. This
> is the case for DNS lookup failures. If the error occurs
> asynchronously to the monitor command though, the error
> will be thrown away and the client left guessing about
> what went wrong. This is the case for failure to connect
> to the TCP server (eg due to wrong port, or firewall
> rules, or other similar errors).
>
> In the future we'll be adding more scope for errors to
> happen asynchronously with the TLS protocol handshake.
> TLS errors are hard to diagnose even when they are well
> reported, so discarding errors entirely will make it
> impossible to debug TLS connection problems.
>
> Management apps which do migration are already using
> 'query-migrate' / 'info migrate' to check up on progress
> of background migration operations and to see their end
> status. This is a fine place to also include the error
> message when things go wrong.
>
> This patch thus adds an 'error-desc' field to the
> MigrationInfo struct, which will be populated when
> the 'status' is set to 'failed':
>
> (qemu) migrate -d tcp:localhost:9001
> (qemu) info migrate
> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off x-postcopy-ram: off
> Migration status: failed
> total time: 0 milliseconds
> error description: Error connecting to socket: Connection refused

Perhaps:

  Migration status: failed (Error connecting to socket: Connection refused)

Just an idea, use whatever you like better.

> In the HMP, when doing non-detached migration, it is
> also possible to display this error message directly
> to the app.
>
> (qemu) migrate tcp:localhost:9001
> Error connecting to socket: Connection refused

You could include a QMP example if you like.

> Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
> ---
>  hmp.c                         |  8 ++++++++
>  include/migration/migration.h |  5 ++++-
>  include/qapi/error.h          |  2 +-
>  migration/migration.c         | 12 +++++++++---
>  migration/rdma.c              | 10 +++-------
>  migration/tcp.c               |  2 +-
>  migration/unix.c              |  2 +-
>  qapi-schema.json              |  5 ++++-
>  trace-events                  |  2 +-
>  util/error.c                  |  2 +-
>  10 files changed, 33 insertions(+), 17 deletions(-)
>
> diff --git a/hmp.c b/hmp.c
> index 5b6084a..a239e1e 100644
> --- a/hmp.c
> +++ b/hmp.c
> @@ -34,6 +34,7 @@
>  #include "ui/console.h"
>  #include "block/qapi.h"
>  #include "qemu-io.h"
> +#include "qemu/error-report.h"
>  
>  #ifdef CONFIG_SPICE
>  #include <spice/enums.h>
> @@ -239,6 +240,10 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
>                         info->x_cpu_throttle_percentage);
>      }
>  
> +    if (info->has_error_desc) {
> +        monitor_printf(mon, "error description: %s\n", info->error_desc);
> +    }
> +
>      qapi_free_MigrationInfo(info);
>      qapi_free_MigrationCapabilityStatusList(caps);
>  }
> @@ -1532,6 +1537,9 @@ static void hmp_migrate_status_cb(void *opaque)
>          if (status->is_block_migration) {
>              monitor_printf(status->mon, "\n");
>          }
> +        if (info->has_error_desc) {
> +            error_report("%s", info->error_desc);
> +        }
>          monitor_resume(status->mon);
>          timer_del(status->timer);
>          g_free(status);
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index 35db68a..5d44b07 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -169,6 +169,9 @@ struct MigrationState
>      QSIMPLEQ_HEAD(src_page_requests, MigrationSrcPageRequest) src_page_requests;
>      /* The RAMBlock used in the last src_page_request */
>      RAMBlock *last_req_rb;
> +
> +    /* The last error that occurred */
> +    Error *error;
>  };
>  
>  void migrate_set_state(int *state, int old_state, int new_state);
> @@ -205,7 +208,7 @@ void rdma_start_outgoing_migration(void *opaque, const char *host_port, Error **
>  
>  void rdma_start_incoming_migration(const char *host_port, Error **errp);
>  
> -void migrate_fd_error(MigrationState *s);
> +void migrate_fd_error(MigrationState *s, const Error *error);
>  
>  void migrate_fd_connect(MigrationState *s);
>  
> diff --git a/include/qapi/error.h b/include/qapi/error.h
> index 02e9dd2..c7e2869 100644
> --- a/include/qapi/error.h
> +++ b/include/qapi/error.h
> @@ -139,7 +139,7 @@ typedef enum ErrorClass {
>  /*
>   * Get @err's human-readable error message.
>   */
> -const char *error_get_pretty(Error *err);
> +const char *error_get_pretty(const Error *err);
>  
>  /*
>   * Get @err's error class.

Independent cleanup, separate patch.

> diff --git a/migration/migration.c b/migration/migration.c
> index 76a4aac..08720bc 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -681,6 +681,10 @@ MigrationInfo *qmp_query_migrate(Error **errp)
>          break;
>      case MIGRATION_STATUS_FAILED:
>          info->has_status = true;
> +        if (s->error) {
> +            info->has_error_desc = true;
> +            info->error_desc = g_strdup(error_get_pretty(s->error));
> +        }

Interesting: error_desc may be absent in status 'failed'.

The Error object contains a message, an optional hint, location of the
spot in the source that created it (used only when the error hits
&error_abort), and a class (use discouraged).  We only report the
message.  Okay.

>          break;
>      case MIGRATION_STATUS_CANCELLED:
>          info->has_status = true;
> @@ -853,12 +857,14 @@ static void migrate_fd_cleanup(void *opaque)
>      notifier_list_notify(&migration_state_notifiers, s);
>  }
>  
> -void migrate_fd_error(MigrationState *s)
> +void migrate_fd_error(MigrationState *s, const Error *error)
>  {
> -    trace_migrate_fd_error();
> +    trace_migrate_fd_error(error ? error_get_pretty(error) : "");
>      assert(s->to_dst_file == NULL);
>      migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>                        MIGRATION_STATUS_FAILED);
> +    error_free(s->error);
> +    s->error = error_copy(error);

Can migrate_fd_error() be called more than once per migration attempt?

>      notifier_list_notify(&migration_state_notifiers, s);
>  }
>  
> @@ -1057,7 +1063,7 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
>      }
>  
>      if (local_err) {
> -        migrate_fd_error(s);
> +        migrate_fd_error(s, local_err);
>          error_propagate(errp, local_err);
>          return;
>      }
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 187fc1c..cd33d90 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -3487,16 +3487,14 @@ void rdma_start_outgoing_migration(void *opaque,
>                              const char *host_port, Error **errp)
>  {
>      MigrationState *s = opaque;
> -    Error *local_err = NULL, **temp = &local_err;
> -    RDMAContext *rdma = qemu_rdma_data_init(host_port, &local_err);
> +    RDMAContext *rdma = qemu_rdma_data_init(host_port, errp);
>      int ret = 0;
>  
>      if (rdma == NULL) {
> -        ERROR(temp, "Failed to initialize RDMA data structures! %d", ret);

Good riddance.

ERROR() looks inapprioriate to me in general: it prints to stderr and
sets an error.  If you set an error, deciding whether and how to report
it is by convention 100% up to the caller.  In other words, if you set
an error, you should do it *silently*.  Debug prints and tracing are
okay, of course.

>          goto err;
>      }
>  
> -    ret = qemu_rdma_source_init(rdma, &local_err,
> +    ret = qemu_rdma_source_init(rdma, errp,
>          s->enabled_capabilities[MIGRATION_CAPABILITY_RDMA_PIN_ALL]);
>  
>      if (ret) {
> @@ -3504,7 +3502,7 @@ void rdma_start_outgoing_migration(void *opaque,
>      }
>  
>      trace_rdma_start_outgoing_migration_after_rdma_source_init();
> -    ret = qemu_rdma_connect(rdma, &local_err);
> +    ret = qemu_rdma_connect(rdma, errp);
>  
>      if (ret) {
>          goto err;
> @@ -3516,7 +3514,5 @@ void rdma_start_outgoing_migration(void *opaque,
>      migrate_fd_connect(s);
>      return;
>  err:
> -    error_propagate(errp, local_err);
>      g_free(rdma);
> -    migrate_fd_error(s);
>  }
> diff --git a/migration/tcp.c b/migration/tcp.c
> index e888a4e..48904e0 100644
> --- a/migration/tcp.c
> +++ b/migration/tcp.c
> @@ -40,7 +40,7 @@ static void tcp_wait_for_connect(int fd, Error *err, void *opaque)
>      if (fd < 0) {
>          DPRINTF("migrate connect error: %s\n", error_get_pretty(err));
>          s->to_dst_file = NULL;
> -        migrate_fd_error(s);
> +        migrate_fd_error(s, err);
>      } else {
>          DPRINTF("migrate connect success\n");
>          s->to_dst_file = qemu_fopen_socket(fd, "wb");
> diff --git a/migration/unix.c b/migration/unix.c
> index d9aac36..b3537fd 100644
> --- a/migration/unix.c
> +++ b/migration/unix.c
> @@ -40,7 +40,7 @@ static void unix_wait_for_connect(int fd, Error *err, void *opaque)
>      if (fd < 0) {
>          DPRINTF("migrate connect error: %s\n", error_get_pretty(err));
>          s->to_dst_file = NULL;
> -        migrate_fd_error(s);
> +        migrate_fd_error(s, err);
>      } else {
>          DPRINTF("migrate connect success\n");
>          s->to_dst_file = qemu_fopen_socket(fd, "wb");
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 7b8f2a1..ff89747 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -484,6 +484,8 @@
>  #       throttled during auto-converge. This is only present when auto-converge
>  #       has started throttling guest cpus. (Since 2.5)
>  #
> +# @error-desc: the error description string, when @status == 'failed' (Since 2.6)
> +#

"when @status == 'failed'" is a semantic constraint, not visible in
query-qmp-schema.  Several existing members have similar constraints.

Making this a flat union tagged by @status would turn semantic
constraints into syntax, visible in query-qmp-schema.  Not sure it's
worth the churn now.

Note that qmp_query_migrate() may not actually set @error-desc when
@status is 'failed'.  Bug in either the code or the documentation.

Further note that code uses it without also checking status.  You could
assert the constraint holds.  Your choice.

Let me take a step back and examine the bigger picture.

Migration is a long-running task with a non-trivial live cycle (see enum
MigrationStatus).  Similar tasks exist in the block layer ("block
jobs"), and perhaps we can have a generic "jobs" abstraction some day.

Originally, QMP was designed to permit doing long-running tasks as
asynchronous commands.  We ended up doing them as synchronous commands +
events + status queries.  Two reasons, one accidental, one fundamental.

The accidental one is asynchronous commands never quite worked, so
nobody used them, so nobody bothered to fix them.

The fundamental one is complex life cycles.  A job that starts, runs
silently for a while, then either completes or fails can be done nicely
as asynchronous command.  But when the job's life cycle is more complex,
a single command reply isn't enough.  Thus commands + events + status
queries.

In this model, the reply to a status query in a "failed" state takes the
role of the asynchronous command's error reply.  It therefore makes
sense to compare the two.

The QMP error reply is documented as follows in qmp-spec.txt:

    The format of an error response is:

    { "error": { "class": json-string, "desc": json-string }, "id": json-value }

     Where,

    - The "class" member contains the error class name (eg. "GenericError")
    - The "desc" member is a human-readable error message. Clients should
      not attempt to parse this message.
    - The "id" member contains the transaction identification associated with
      the command execution if issued by the Client

MigrationInfo in a "failed" contains @status and @error-desc, and may
contain additional members like @total-time (not entirely clear from
documentation, which means the documentation is too vague).

MigrationInfo covers @desc, but not @class and @id.

Use of @class is discouraged nowadays, and omitting it here at least
until we have a compelling use for it makes sense.

@id ties the asynchronous reply to the command.  Not necessary as long
as only one migration task can exist.

The only thing I'd like you to add to MigrationInfo is the "Clients
should not attempt to parse this" admonition.  Suggest to describe
@error-desc as "human-readable" while there.

>  # Since: 0.14.0
>  ##
>  { 'struct': 'MigrationInfo',
> @@ -494,7 +496,8 @@
>             '*expected-downtime': 'int',
>             '*downtime': 'int',
>             '*setup-time': 'int',
> -           '*x-cpu-throttle-percentage': 'int'} }
> +           '*x-cpu-throttle-percentage': 'int',
> +           '*error-desc': 'str'} }
>  
>  ##
>  # @query-migrate
> diff --git a/trace-events b/trace-events
> index 61a133f..097f46d 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1486,7 +1486,7 @@ await_return_path_close_on_source_close(void) ""
>  await_return_path_close_on_source_joining(void) ""
>  migrate_set_state(int new_state) "new state %d"
>  migrate_fd_cleanup(void) ""
> -migrate_fd_error(void) ""
> +migrate_fd_error(const char *error_desc) "error=%s"
>  migrate_fd_cancel(void) ""
>  migrate_handle_rp_req_pages(const char *rbname, size_t start, size_t len) "in %s at %zx len %zx"
>  migrate_pending(uint64_t size, uint64_t max, uint64_t post, uint64_t nonpost) "pending size %" PRIu64 " max %" PRIu64 " (post=%" PRIu64 " nonpost=%" PRIu64 ")"
> diff --git a/util/error.c b/util/error.c
> index 471b8b3..f134a6d 100644
> --- a/util/error.c
> +++ b/util/error.c
> @@ -218,7 +218,7 @@ ErrorClass error_get_class(const Error *err)
>      return err->err_class;
>  }
>  
> -const char *error_get_pretty(Error *err)
> +const char *error_get_pretty(const Error *err)
>  {
>      return err->msg;
>  }
Daniel P. Berrangé March 4, 2016, 10:49 a.m. UTC | #2
On Fri, Mar 04, 2016 at 10:49:35AM +0100, Markus Armbruster wrote:
> "Daniel P. Berrange" <berrange@redhat.com> writes:
> 
> > Currently if an app initiates an outgoing migration, it
> 
> application
> 
> > may or may not, get an error reported back on failure. If
> > the error occurs synchronously to the 'migrate' command
> > execution, the client app will see the error message. This
> > is the case for DNS lookup failures. If the error occurs
> > asynchronously to the monitor command though, the error
> > will be thrown away and the client left guessing about
> > what went wrong. This is the case for failure to connect
> > to the TCP server (eg due to wrong port, or firewall
> > rules, or other similar errors).
> >
> > In the future we'll be adding more scope for errors to
> > happen asynchronously with the TLS protocol handshake.
> > TLS errors are hard to diagnose even when they are well
> > reported, so discarding errors entirely will make it
> > impossible to debug TLS connection problems.
> >
> > Management apps which do migration are already using
> > 'query-migrate' / 'info migrate' to check up on progress
> > of background migration operations and to see their end
> > status. This is a fine place to also include the error
> > message when things go wrong.
> >
> > This patch thus adds an 'error-desc' field to the
> > MigrationInfo struct, which will be populated when
> > the 'status' is set to 'failed':
> >
> > (qemu) migrate -d tcp:localhost:9001
> > (qemu) info migrate
> > capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off x-postcopy-ram: off
> > Migration status: failed
> > total time: 0 milliseconds
> > error description: Error connecting to socket: Connection refused
> 
> Perhaps:
> 
>   Migration status: failed (Error connecting to socket: Connection refused)
> 
> Just an idea, use whatever you like better.

Yeah, that is nicer for the HMP.

> > In the HMP, when doing non-detached migration, it is
> > also possible to display this error message directly
> > to the app.
> >
> > (qemu) migrate tcp:localhost:9001
> > Error connecting to socket: Connection refused
> 
> You could include a QMP example if you like.

Sure, will add it.


> > @@ -853,12 +857,14 @@ static void migrate_fd_cleanup(void *opaque)
> >      notifier_list_notify(&migration_state_notifiers, s);
> >  }
> >  
> > -void migrate_fd_error(MigrationState *s)
> > +void migrate_fd_error(MigrationState *s, const Error *error)
> >  {
> > -    trace_migrate_fd_error();
> > +    trace_migrate_fd_error(error ? error_get_pretty(error) : "");
> >      assert(s->to_dst_file == NULL);
> >      migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
> >                        MIGRATION_STATUS_FAILED);
> > +    error_free(s->error);
> > +    s->error = error_copy(error);
> 
> Can migrate_fd_error() be called more than once per migration attempt?

I think so, but I felt it worth being paranoid against mistakes
so I chose to call error_free() just in case.

> > diff --git a/qapi-schema.json b/qapi-schema.json
> > index 7b8f2a1..ff89747 100644
> > --- a/qapi-schema.json
> > +++ b/qapi-schema.json
> > @@ -484,6 +484,8 @@
> >  #       throttled during auto-converge. This is only present when auto-converge
> >  #       has started throttling guest cpus. (Since 2.5)
> >  #
> > +# @error-desc: the error description string, when @status == 'failed' (Since 2.6)
> > +#
> 
> "when @status == 'failed'" is a semantic constraint, not visible in
> query-qmp-schema.  Several existing members have similar constraints.
> 
> Making this a flat union tagged by @status would turn semantic
> constraints into syntax, visible in query-qmp-schema.  Not sure it's
> worth the churn now.
> 
> Note that qmp_query_migrate() may not actually set @error-desc when
> @status is 'failed'.  Bug in either the code or the documentation.

Bug in docs :-)

> Further note that code uses it without also checking status.  You could
> assert the constraint holds.  Your choice.
> 
> Let me take a step back and examine the bigger picture.
> 
> Migration is a long-running task with a non-trivial live cycle (see enum
> MigrationStatus).  Similar tasks exist in the block layer ("block
> jobs"), and perhaps we can have a generic "jobs" abstraction some day.
> 
> Originally, QMP was designed to permit doing long-running tasks as
> asynchronous commands.  We ended up doing them as synchronous commands +
> events + status queries.  Two reasons, one accidental, one fundamental.
> 
> The accidental one is asynchronous commands never quite worked, so
> nobody used them, so nobody bothered to fix them.
> 
> The fundamental one is complex life cycles.  A job that starts, runs
> silently for a while, then either completes or fails can be done nicely
> as asynchronous command.  But when the job's life cycle is more complex,
> a single command reply isn't enough.  Thus commands + events + status
> queries.
> 
> In this model, the reply to a status query in a "failed" state takes the
> role of the asynchronous command's error reply.  It therefore makes
> sense to compare the two.
> 
> The QMP error reply is documented as follows in qmp-spec.txt:
> 
>     The format of an error response is:
> 
>     { "error": { "class": json-string, "desc": json-string }, "id": json-value }
> 
>      Where,
> 
>     - The "class" member contains the error class name (eg. "GenericError")
>     - The "desc" member is a human-readable error message. Clients should
>       not attempt to parse this message.
>     - The "id" member contains the transaction identification associated with
>       the command execution if issued by the Client
> 
> MigrationInfo in a "failed" contains @status and @error-desc, and may
> contain additional members like @total-time (not entirely clear from
> documentation, which means the documentation is too vague).
> 
> MigrationInfo covers @desc, but not @class and @id.
> 
> Use of @class is discouraged nowadays, and omitting it here at least
> until we have a compelling use for it makes sense.
> 
> @id ties the asynchronous reply to the command.  Not necessary as long
> as only one migration task can exist.
> 
> The only thing I'd like you to add to MigrationInfo is the "Clients
> should not attempt to parse this" admonition.  Suggest to describe
> @error-desc as "human-readable" while there.

Yes, makes sense.


Regards,
Daniel
Dr. David Alan Gilbert March 10, 2016, 3:13 p.m. UTC | #3
* Daniel P. Berrange (berrange@redhat.com) wrote:
> Currently if an app initiates an outgoing migration, it
> may or may not, get an error reported back on failure. If
> the error occurs synchronously to the 'migrate' command
> execution, the client app will see the error message. This
> is the case for DNS lookup failures. If the error occurs
> asynchronously to the monitor command though, the error
> will be thrown away and the client left guessing about
> what went wrong. This is the case for failure to connect
> to the TCP server (eg due to wrong port, or firewall
> rules, or other similar errors).
> 
> In the future we'll be adding more scope for errors to
> happen asynchronously with the TLS protocol handshake.
> TLS errors are hard to diagnose even when they are well
> reported, so discarding errors entirely will make it
> impossible to debug TLS connection problems.
> 
> Management apps which do migration are already using
> 'query-migrate' / 'info migrate' to check up on progress
> of background migration operations and to see their end
> status. This is a fine place to also include the error
> message when things go wrong.
> 
> This patch thus adds an 'error-desc' field to the
> MigrationInfo struct, which will be populated when
> the 'status' is set to 'failed':
> 
> (qemu) migrate -d tcp:localhost:9001
> (qemu) info migrate
> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off x-postcopy-ram: off
> Migration status: failed
> total time: 0 milliseconds
> error description: Error connecting to socket: Connection refused

Nice.

I don't see any change for migrate_init() to clear the
error pointer (migrate, fail, migrate again).

> In the HMP, when doing non-detached migration, it is
> also possible to display this error message directly
> to the app.
> 
> (qemu) migrate tcp:localhost:9001
> Error connecting to socket: Connection refused
> 
> Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
> ---
>  hmp.c                         |  8 ++++++++
>  include/migration/migration.h |  5 ++++-
>  include/qapi/error.h          |  2 +-
>  migration/migration.c         | 12 +++++++++---
>  migration/rdma.c              | 10 +++-------
>  migration/tcp.c               |  2 +-
>  migration/unix.c              |  2 +-
>  qapi-schema.json              |  5 ++++-
>  trace-events                  |  2 +-
>  util/error.c                  |  2 +-
>  10 files changed, 33 insertions(+), 17 deletions(-)
> 
> diff --git a/hmp.c b/hmp.c
> index 5b6084a..a239e1e 100644
> --- a/hmp.c
> +++ b/hmp.c
> @@ -34,6 +34,7 @@
>  #include "ui/console.h"
>  #include "block/qapi.h"
>  #include "qemu-io.h"
> +#include "qemu/error-report.h"
>  
>  #ifdef CONFIG_SPICE
>  #include <spice/enums.h>
> @@ -239,6 +240,10 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
>                         info->x_cpu_throttle_percentage);
>      }
>  
> +    if (info->has_error_desc) {
> +        monitor_printf(mon, "error description: %s\n", info->error_desc);
> +    }
> +
>      qapi_free_MigrationInfo(info);
>      qapi_free_MigrationCapabilityStatusList(caps);
>  }
> @@ -1532,6 +1537,9 @@ static void hmp_migrate_status_cb(void *opaque)
>          if (status->is_block_migration) {
>              monitor_printf(status->mon, "\n");
>          }
> +        if (info->has_error_desc) {
> +            error_report("%s", info->error_desc);
> +        }
>          monitor_resume(status->mon);
>          timer_del(status->timer);
>          g_free(status);
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index 35db68a..5d44b07 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -169,6 +169,9 @@ struct MigrationState
>      QSIMPLEQ_HEAD(src_page_requests, MigrationSrcPageRequest) src_page_requests;
>      /* The RAMBlock used in the last src_page_request */
>      RAMBlock *last_req_rb;
> +
> +    /* The last error that occurred */
> +    Error *error;

Is it the last error you actually want or the first?
Normally we find it's the first error that's the most useful,
often we get later errors during cleanup as stuff falls to bits.
Although since you're only using this during the startup stuff you might
be OK, and thus I think that's all on the main thread;  if we
started to try and report errors from the migration thread as well
life would get more interesting.

Dave

>  };
>  
>  void migrate_set_state(int *state, int old_state, int new_state);
> @@ -205,7 +208,7 @@ void rdma_start_outgoing_migration(void *opaque, const char *host_port, Error **
>  
>  void rdma_start_incoming_migration(const char *host_port, Error **errp);
>  
> -void migrate_fd_error(MigrationState *s);
> +void migrate_fd_error(MigrationState *s, const Error *error);
>  
>  void migrate_fd_connect(MigrationState *s);
>  
> diff --git a/include/qapi/error.h b/include/qapi/error.h
> index 02e9dd2..c7e2869 100644
> --- a/include/qapi/error.h
> +++ b/include/qapi/error.h
> @@ -139,7 +139,7 @@ typedef enum ErrorClass {
>  /*
>   * Get @err's human-readable error message.
>   */
> -const char *error_get_pretty(Error *err);
> +const char *error_get_pretty(const Error *err);
>  
>  /*
>   * Get @err's error class.
> diff --git a/migration/migration.c b/migration/migration.c
> index 76a4aac..08720bc 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -681,6 +681,10 @@ MigrationInfo *qmp_query_migrate(Error **errp)
>          break;
>      case MIGRATION_STATUS_FAILED:
>          info->has_status = true;
> +        if (s->error) {
> +            info->has_error_desc = true;
> +            info->error_desc = g_strdup(error_get_pretty(s->error));
> +        }
>          break;
>      case MIGRATION_STATUS_CANCELLED:
>          info->has_status = true;
> @@ -853,12 +857,14 @@ static void migrate_fd_cleanup(void *opaque)
>      notifier_list_notify(&migration_state_notifiers, s);
>  }
>  
> -void migrate_fd_error(MigrationState *s)
> +void migrate_fd_error(MigrationState *s, const Error *error)
>  {
> -    trace_migrate_fd_error();
> +    trace_migrate_fd_error(error ? error_get_pretty(error) : "");
>      assert(s->to_dst_file == NULL);
>      migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
>                        MIGRATION_STATUS_FAILED);
> +    error_free(s->error);
> +    s->error = error_copy(error);
>      notifier_list_notify(&migration_state_notifiers, s);
>  }
>  
> @@ -1057,7 +1063,7 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
>      }
>  
>      if (local_err) {
> -        migrate_fd_error(s);
> +        migrate_fd_error(s, local_err);
>          error_propagate(errp, local_err);
>          return;
>      }
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 187fc1c..cd33d90 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -3487,16 +3487,14 @@ void rdma_start_outgoing_migration(void *opaque,
>                              const char *host_port, Error **errp)
>  {
>      MigrationState *s = opaque;
> -    Error *local_err = NULL, **temp = &local_err;
> -    RDMAContext *rdma = qemu_rdma_data_init(host_port, &local_err);
> +    RDMAContext *rdma = qemu_rdma_data_init(host_port, errp);
>      int ret = 0;
>  
>      if (rdma == NULL) {
> -        ERROR(temp, "Failed to initialize RDMA data structures! %d", ret);
>          goto err;
>      }
>  
> -    ret = qemu_rdma_source_init(rdma, &local_err,
> +    ret = qemu_rdma_source_init(rdma, errp,
>          s->enabled_capabilities[MIGRATION_CAPABILITY_RDMA_PIN_ALL]);
>  
>      if (ret) {
> @@ -3504,7 +3502,7 @@ void rdma_start_outgoing_migration(void *opaque,
>      }
>  
>      trace_rdma_start_outgoing_migration_after_rdma_source_init();
> -    ret = qemu_rdma_connect(rdma, &local_err);
> +    ret = qemu_rdma_connect(rdma, errp);
>  
>      if (ret) {
>          goto err;
> @@ -3516,7 +3514,5 @@ void rdma_start_outgoing_migration(void *opaque,
>      migrate_fd_connect(s);
>      return;
>  err:
> -    error_propagate(errp, local_err);
>      g_free(rdma);
> -    migrate_fd_error(s);
>  }
> diff --git a/migration/tcp.c b/migration/tcp.c
> index e888a4e..48904e0 100644
> --- a/migration/tcp.c
> +++ b/migration/tcp.c
> @@ -40,7 +40,7 @@ static void tcp_wait_for_connect(int fd, Error *err, void *opaque)
>      if (fd < 0) {
>          DPRINTF("migrate connect error: %s\n", error_get_pretty(err));
>          s->to_dst_file = NULL;
> -        migrate_fd_error(s);
> +        migrate_fd_error(s, err);
>      } else {
>          DPRINTF("migrate connect success\n");
>          s->to_dst_file = qemu_fopen_socket(fd, "wb");
> diff --git a/migration/unix.c b/migration/unix.c
> index d9aac36..b3537fd 100644
> --- a/migration/unix.c
> +++ b/migration/unix.c
> @@ -40,7 +40,7 @@ static void unix_wait_for_connect(int fd, Error *err, void *opaque)
>      if (fd < 0) {
>          DPRINTF("migrate connect error: %s\n", error_get_pretty(err));
>          s->to_dst_file = NULL;
> -        migrate_fd_error(s);
> +        migrate_fd_error(s, err);
>      } else {
>          DPRINTF("migrate connect success\n");
>          s->to_dst_file = qemu_fopen_socket(fd, "wb");
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 7b8f2a1..ff89747 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -484,6 +484,8 @@
>  #       throttled during auto-converge. This is only present when auto-converge
>  #       has started throttling guest cpus. (Since 2.5)
>  #
> +# @error-desc: the error description string, when @status == 'failed' (Since 2.6)
> +#
>  # Since: 0.14.0
>  ##
>  { 'struct': 'MigrationInfo',
> @@ -494,7 +496,8 @@
>             '*expected-downtime': 'int',
>             '*downtime': 'int',
>             '*setup-time': 'int',
> -           '*x-cpu-throttle-percentage': 'int'} }
> +           '*x-cpu-throttle-percentage': 'int',
> +           '*error-desc': 'str'} }
>  
>  ##
>  # @query-migrate
> diff --git a/trace-events b/trace-events
> index 61a133f..097f46d 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1486,7 +1486,7 @@ await_return_path_close_on_source_close(void) ""
>  await_return_path_close_on_source_joining(void) ""
>  migrate_set_state(int new_state) "new state %d"
>  migrate_fd_cleanup(void) ""
> -migrate_fd_error(void) ""
> +migrate_fd_error(const char *error_desc) "error=%s"
>  migrate_fd_cancel(void) ""
>  migrate_handle_rp_req_pages(const char *rbname, size_t start, size_t len) "in %s at %zx len %zx"
>  migrate_pending(uint64_t size, uint64_t max, uint64_t post, uint64_t nonpost) "pending size %" PRIu64 " max %" PRIu64 " (post=%" PRIu64 " nonpost=%" PRIu64 ")"
> diff --git a/util/error.c b/util/error.c
> index 471b8b3..f134a6d 100644
> --- a/util/error.c
> +++ b/util/error.c
> @@ -218,7 +218,7 @@ ErrorClass error_get_class(const Error *err)
>      return err->err_class;
>  }
>  
> -const char *error_get_pretty(Error *err)
> +const char *error_get_pretty(const Error *err)
>  {
>      return err->msg;
>  }
> -- 
> 2.5.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
diff mbox

Patch

diff --git a/hmp.c b/hmp.c
index 5b6084a..a239e1e 100644
--- a/hmp.c
+++ b/hmp.c
@@ -34,6 +34,7 @@ 
 #include "ui/console.h"
 #include "block/qapi.h"
 #include "qemu-io.h"
+#include "qemu/error-report.h"
 
 #ifdef CONFIG_SPICE
 #include <spice/enums.h>
@@ -239,6 +240,10 @@  void hmp_info_migrate(Monitor *mon, const QDict *qdict)
                        info->x_cpu_throttle_percentage);
     }
 
+    if (info->has_error_desc) {
+        monitor_printf(mon, "error description: %s\n", info->error_desc);
+    }
+
     qapi_free_MigrationInfo(info);
     qapi_free_MigrationCapabilityStatusList(caps);
 }
@@ -1532,6 +1537,9 @@  static void hmp_migrate_status_cb(void *opaque)
         if (status->is_block_migration) {
             monitor_printf(status->mon, "\n");
         }
+        if (info->has_error_desc) {
+            error_report("%s", info->error_desc);
+        }
         monitor_resume(status->mon);
         timer_del(status->timer);
         g_free(status);
diff --git a/include/migration/migration.h b/include/migration/migration.h
index 35db68a..5d44b07 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -169,6 +169,9 @@  struct MigrationState
     QSIMPLEQ_HEAD(src_page_requests, MigrationSrcPageRequest) src_page_requests;
     /* The RAMBlock used in the last src_page_request */
     RAMBlock *last_req_rb;
+
+    /* The last error that occurred */
+    Error *error;
 };
 
 void migrate_set_state(int *state, int old_state, int new_state);
@@ -205,7 +208,7 @@  void rdma_start_outgoing_migration(void *opaque, const char *host_port, Error **
 
 void rdma_start_incoming_migration(const char *host_port, Error **errp);
 
-void migrate_fd_error(MigrationState *s);
+void migrate_fd_error(MigrationState *s, const Error *error);
 
 void migrate_fd_connect(MigrationState *s);
 
diff --git a/include/qapi/error.h b/include/qapi/error.h
index 02e9dd2..c7e2869 100644
--- a/include/qapi/error.h
+++ b/include/qapi/error.h
@@ -139,7 +139,7 @@  typedef enum ErrorClass {
 /*
  * Get @err's human-readable error message.
  */
-const char *error_get_pretty(Error *err);
+const char *error_get_pretty(const Error *err);
 
 /*
  * Get @err's error class.
diff --git a/migration/migration.c b/migration/migration.c
index 76a4aac..08720bc 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -681,6 +681,10 @@  MigrationInfo *qmp_query_migrate(Error **errp)
         break;
     case MIGRATION_STATUS_FAILED:
         info->has_status = true;
+        if (s->error) {
+            info->has_error_desc = true;
+            info->error_desc = g_strdup(error_get_pretty(s->error));
+        }
         break;
     case MIGRATION_STATUS_CANCELLED:
         info->has_status = true;
@@ -853,12 +857,14 @@  static void migrate_fd_cleanup(void *opaque)
     notifier_list_notify(&migration_state_notifiers, s);
 }
 
-void migrate_fd_error(MigrationState *s)
+void migrate_fd_error(MigrationState *s, const Error *error)
 {
-    trace_migrate_fd_error();
+    trace_migrate_fd_error(error ? error_get_pretty(error) : "");
     assert(s->to_dst_file == NULL);
     migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
                       MIGRATION_STATUS_FAILED);
+    error_free(s->error);
+    s->error = error_copy(error);
     notifier_list_notify(&migration_state_notifiers, s);
 }
 
@@ -1057,7 +1063,7 @@  void qmp_migrate(const char *uri, bool has_blk, bool blk,
     }
 
     if (local_err) {
-        migrate_fd_error(s);
+        migrate_fd_error(s, local_err);
         error_propagate(errp, local_err);
         return;
     }
diff --git a/migration/rdma.c b/migration/rdma.c
index 187fc1c..cd33d90 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -3487,16 +3487,14 @@  void rdma_start_outgoing_migration(void *opaque,
                             const char *host_port, Error **errp)
 {
     MigrationState *s = opaque;
-    Error *local_err = NULL, **temp = &local_err;
-    RDMAContext *rdma = qemu_rdma_data_init(host_port, &local_err);
+    RDMAContext *rdma = qemu_rdma_data_init(host_port, errp);
     int ret = 0;
 
     if (rdma == NULL) {
-        ERROR(temp, "Failed to initialize RDMA data structures! %d", ret);
         goto err;
     }
 
-    ret = qemu_rdma_source_init(rdma, &local_err,
+    ret = qemu_rdma_source_init(rdma, errp,
         s->enabled_capabilities[MIGRATION_CAPABILITY_RDMA_PIN_ALL]);
 
     if (ret) {
@@ -3504,7 +3502,7 @@  void rdma_start_outgoing_migration(void *opaque,
     }
 
     trace_rdma_start_outgoing_migration_after_rdma_source_init();
-    ret = qemu_rdma_connect(rdma, &local_err);
+    ret = qemu_rdma_connect(rdma, errp);
 
     if (ret) {
         goto err;
@@ -3516,7 +3514,5 @@  void rdma_start_outgoing_migration(void *opaque,
     migrate_fd_connect(s);
     return;
 err:
-    error_propagate(errp, local_err);
     g_free(rdma);
-    migrate_fd_error(s);
 }
diff --git a/migration/tcp.c b/migration/tcp.c
index e888a4e..48904e0 100644
--- a/migration/tcp.c
+++ b/migration/tcp.c
@@ -40,7 +40,7 @@  static void tcp_wait_for_connect(int fd, Error *err, void *opaque)
     if (fd < 0) {
         DPRINTF("migrate connect error: %s\n", error_get_pretty(err));
         s->to_dst_file = NULL;
-        migrate_fd_error(s);
+        migrate_fd_error(s, err);
     } else {
         DPRINTF("migrate connect success\n");
         s->to_dst_file = qemu_fopen_socket(fd, "wb");
diff --git a/migration/unix.c b/migration/unix.c
index d9aac36..b3537fd 100644
--- a/migration/unix.c
+++ b/migration/unix.c
@@ -40,7 +40,7 @@  static void unix_wait_for_connect(int fd, Error *err, void *opaque)
     if (fd < 0) {
         DPRINTF("migrate connect error: %s\n", error_get_pretty(err));
         s->to_dst_file = NULL;
-        migrate_fd_error(s);
+        migrate_fd_error(s, err);
     } else {
         DPRINTF("migrate connect success\n");
         s->to_dst_file = qemu_fopen_socket(fd, "wb");
diff --git a/qapi-schema.json b/qapi-schema.json
index 7b8f2a1..ff89747 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -484,6 +484,8 @@ 
 #       throttled during auto-converge. This is only present when auto-converge
 #       has started throttling guest cpus. (Since 2.5)
 #
+# @error-desc: the error description string, when @status == 'failed' (Since 2.6)
+#
 # Since: 0.14.0
 ##
 { 'struct': 'MigrationInfo',
@@ -494,7 +496,8 @@ 
            '*expected-downtime': 'int',
            '*downtime': 'int',
            '*setup-time': 'int',
-           '*x-cpu-throttle-percentage': 'int'} }
+           '*x-cpu-throttle-percentage': 'int',
+           '*error-desc': 'str'} }
 
 ##
 # @query-migrate
diff --git a/trace-events b/trace-events
index 61a133f..097f46d 100644
--- a/trace-events
+++ b/trace-events
@@ -1486,7 +1486,7 @@  await_return_path_close_on_source_close(void) ""
 await_return_path_close_on_source_joining(void) ""
 migrate_set_state(int new_state) "new state %d"
 migrate_fd_cleanup(void) ""
-migrate_fd_error(void) ""
+migrate_fd_error(const char *error_desc) "error=%s"
 migrate_fd_cancel(void) ""
 migrate_handle_rp_req_pages(const char *rbname, size_t start, size_t len) "in %s at %zx len %zx"
 migrate_pending(uint64_t size, uint64_t max, uint64_t post, uint64_t nonpost) "pending size %" PRIu64 " max %" PRIu64 " (post=%" PRIu64 " nonpost=%" PRIu64 ")"
diff --git a/util/error.c b/util/error.c
index 471b8b3..f134a6d 100644
--- a/util/error.c
+++ b/util/error.c
@@ -218,7 +218,7 @@  ErrorClass error_get_class(const Error *err)
     return err->err_class;
 }
 
-const char *error_get_pretty(Error *err)
+const char *error_get_pretty(const Error *err)
 {
     return err->msg;
 }