[v3,00/26] migration: Improve error reporting

Message ID 20240304122844.1888308-1-clg@redhat.com

Message

Cédric Le Goater March 4, 2024, 12:28 p.m. UTC
Hello,

The motivation behind these changes is to report more detailed errors
to the upper management layer (libvirt), so that it can decide,
depending on the reported error, whether to try migration again later.
This would be useful in cases where migration fails due to a lack of HW
resources on the host. For instance, some adapters can only initiate a
limited number of simultaneous dirty tracking requests, which limits
the number of VMs that can be migrated simultaneously.

We are not quite ready for such a mechanism, but what we can do first
is to clean up the error reporting in the early save_setup sequence.
This is what the following changes propose, by adding an Error**
argument to various handlers and propagating it to the core migration
subsystem.
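
As a rough illustration of the pattern (not code from the series), the
sketch below shows a handler that takes an Error** argument and a core
routine that passes it through and returns a bool. The Error struct and
every sketch_* name are hypothetical stand-ins so that the example
compiles on its own; they are not QEMU's actual implementation.

```c
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for an error object; NOT QEMU's real Error type. */
typedef struct Error {
    char msg[256];
} Error;

/*
 * Hypothetical helper, loosely modelled on an error_setg()-style call:
 * allocates an error and stores it in *errp, unless the caller passed
 * NULL to say it does not want the details.
 */
static void sketch_error_set(Error **errp, const char *fmt, ...)
{
    va_list ap;
    Error *err;

    if (!errp) {
        return;
    }
    err = calloc(1, sizeof(*err));
    if (!err) {
        abort();
    }
    va_start(ap, fmt);
    vsnprintf(err->msg, sizeof(err->msg), fmt, ap);
    va_end(ap);
    *errp = err;
}

/*
 * Hypothetical device-level setup handler: fails when the adapter has
 * no free dirty tracking slot, and says so through errp.
 */
static int sketch_save_setup(int free_tracking_slots, Error **errp)
{
    if (free_tracking_slots <= 0) {
        sketch_error_set(errp, "dirty tracking: no free slots on adapter");
        return -1;
    }
    return 0;
}

/*
 * Hypothetical core routine: hands errp straight to the handler, so the
 * detailed message reaches whoever started the migration.
 */
static bool sketch_savevm_setup(int free_tracking_slots, Error **errp)
{
    return sketch_save_setup(free_tracking_slots, errp) == 0;
}
```

A caller would declare "Error *err = NULL;", pass &err, and on a false
return read err->msg and free the object. With such a message in hand,
a management layer could tell a transient resource shortage apart from
a permanent failure.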
 
Thanks,

C.

Changes in v3:

 - New changes to make sure an error is always set in case of failure.
   This is the reason behind the 5/6 extra patches. (Markus)
 - Documentation fixup (Peter + Avihai)
 - Set migration state to MIGRATION_STATUS_FAILED always
 - Fixed error handling in bg_migration_thread() (Peter)
 - Fixed return value of vfio_listener_log_global_start/stop(). 
   Went unnoticed because value is not tested. (Peter)
 - Add ERRP_GUARD() when error_prepend is used 
 - Use error_setg_errno() when possible
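
The last two items can be pictured with a small standalone sketch (not
code from the series). Prepending a prefix needs an actual error object
to modify, which is a problem when the caller passed a NULL errp. In
QEMU, ERRP_GUARD() takes care of that case; the sketch does the same
thing by hand with a local pointer. The errno-style setter appends the
strerror() text to the message. The Error struct and all sketch_* names
are hypothetical stand-ins, not QEMU's implementation.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for an error object; NOT QEMU's real Error type. */
typedef struct Error {
    char msg[256];
} Error;

/* Hypothetical setter: stores "<what>: <strerror(os_errno)>" in *errp. */
static void sketch_error_set_errno(Error **errp, int os_errno,
                                   const char *what)
{
    Error *err;

    if (!errp) {
        return;
    }
    err = calloc(1, sizeof(*err));
    if (!err) {
        abort();
    }
    snprintf(err->msg, sizeof(err->msg), "%s: %s", what,
             strerror(os_errno));
    *errp = err;
}

/* Hypothetical prepend: needs a real error object to rewrite. */
static void sketch_error_prepend(Error *err, const char *prefix)
{
    char tmp[sizeof(err->msg)];

    snprintf(tmp, sizeof(tmp), "%s%s", prefix, err->msg);
    memcpy(err->msg, tmp, sizeof(tmp));
}

/* Hypothetical low-level routine that always fails with ENOSPC. */
static bool sketch_log_start(Error **errp)
{
    sketch_error_set_errno(errp, ENOSPC, "failed to start dirty tracking");
    return false;
}

/*
 * Hypothetical setup routine. The local pointer plays the role of the
 * guard: the callee always gets a non-NULL place to store the error, so
 * prepending is safe even when our own caller passed errp == NULL.
 */
static bool sketch_setup(Error **errp)
{
    Error *local = NULL;

    if (!sketch_log_start(&local)) {
        sketch_error_prepend(local, "vfio: ");
        if (errp) {
            *errp = local;      /* hand the error to the caller */
        } else {
            free(local);        /* caller did not ask for details */
        }
        return false;
    }
    return true;
}
```

Calling sketch_setup(&err) leaves a message that starts with
"vfio: failed to start dirty tracking: " followed by the platform's
text for ENOSPC; calling sketch_setup(NULL) fails the same way without
dereferencing a NULL pointer.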
    
Changes in v2:

- Removed v1 patches addressing the return-path thread termination as
  they are now superseded by:
  https://lore.kernel.org/qemu-devel/20240226203122.22894-1-farosas@suse.de/
- Documentation updates of handlers
- Removed call to PRECOPY_NOTIFY_SETUP notifiers in case of errors
- Modified routines taking an Error** argument to return a bool when
  possible and made adjustments in callers.
- New MEMORY_LISTENER_CALL_LOG_GLOBAL macro for .log_global*()
  handlers
- Handled SETUP state when migration terminates
- Modified memory_get_xlat_addr() to take an Error** argument
- Various refinements on error handling

Cédric Le Goater (26):
  s390/stattrib: Add Error** argument to set_migrationmode() handler
  vfio: Always report an error in vfio_save_setup()
  migration: Always report an error in block_save_setup()
  migration: Always report an error in ram_save_setup()
  migration: Add Error** argument to vmstate_save()
  migration: Report error when shutdown fails
  migration: Remove SaveStateHandler and LoadStateHandler typedefs
  migration: Add documentation for SaveVMHandlers
  migration: Do not call PRECOPY_NOTIFY_SETUP notifiers in case of error
  migration: Move cleanup after error reporting in
    qemu_savevm_state_setup()
  migration: Add Error** argument to qemu_savevm_state_setup()
  migration: Add Error** argument to .save_setup() handler
  migration: Add Error** argument to .load_setup() handler
  memory: Add Error** argument to .log_global*() handlers
  memory: Add Error** argument to the global_dirty_log routines
  migration: Modify ram_init_bitmaps() to report dirty tracking errors
  vfio: Add Error** argument to .set_dirty_page_tracking() handler
  vfio: Add Error** argument to vfio_devices_dma_logging_start()
  vfio: Add Error** argument to vfio_devices_dma_logging_stop()
  vfio: Use new Error** argument in vfio_save_setup()
  vfio: Add Error** argument to .vfio_save_config() handler
  vfio: Reverse test on vfio_get_dirty_bitmap()
  memory: Add Error** argument to memory_get_xlat_addr()
  vfio: Add Error** argument to .get_dirty_bitmap() handler
  vfio: Also trace event failures in vfio_save_complete_precopy()
  vfio: Extend vfio_set_migration_error() with Error* argument

 include/exec/memory.h                 |  40 +++-
 include/hw/s390x/storage-attributes.h |   2 +-
 include/hw/vfio/vfio-common.h         |  29 ++-
 include/hw/vfio/vfio-container-base.h |  35 +++-
 include/migration/register.h          | 273 +++++++++++++++++++++++---
 include/qemu/typedefs.h               |   2 -
 migration/savevm.h                    |   2 +-
 hw/i386/xen/xen-hvm.c                 |  10 +-
 hw/ppc/spapr.c                        |   2 +-
 hw/s390x/s390-stattrib-kvm.c          |  12 +-
 hw/s390x/s390-stattrib.c              |  14 +-
 hw/vfio/common.c                      | 162 +++++++++------
 hw/vfio/container-base.c              |   9 +-
 hw/vfio/container.c                   |  19 +-
 hw/vfio/migration.c                   |  99 ++++++----
 hw/vfio/pci.c                         |   5 +-
 hw/virtio/vhost-vdpa.c                |   5 +-
 hw/virtio/vhost.c                     |   6 +-
 migration/block-dirty-bitmap.c        |   4 +-
 migration/block.c                     |  15 +-
 migration/dirtyrate.c                 |  21 +-
 migration/migration.c                 |  27 ++-
 migration/qemu-file.c                 |   5 +-
 migration/ram.c                       |  58 ++++--
 migration/savevm.c                    |  59 +++---
 system/memory.c                       |  95 +++++++--
 system/physmem.c                      |   5 +-
 27 files changed, 772 insertions(+), 243 deletions(-)

Comments

Peter Xu March 5, 2024, 8:06 a.m. UTC | #1
On Mon, Mar 04, 2024 at 01:28:18PM +0100, Cédric Le Goater wrote:
>   migration: Report error when shutdown fails
>   migration: Remove SaveStateHandler and LoadStateHandler typedefs
>   migration: Add documentation for SaveVMHandlers
>   migration: Do not call PRECOPY_NOTIFY_SETUP notifiers in case of error

These four patches seem to be pretty standalone ones and got at least one
ACK.  I queued them for 9.0, thanks.

Cédric Le Goater March 5, 2024, 8:30 a.m. UTC | #2
On 3/5/24 09:06, Peter Xu wrote:
> On Mon, Mar 04, 2024 at 01:28:18PM +0100, Cédric Le Goater wrote:
>>    migration: Report error when shutdown fails
>>    migration: Remove SaveStateHandler and LoadStateHandler typedefs
>>    migration: Add documentation for SaveVMHandlers
>>    migration: Do not call PRECOPY_NOTIFY_SETUP notifiers in case of error
> 
> These four patches seem to be pretty standalone ones and got at least one
> ACK.  I queued them for 9.0, thanks.

OK.

I will try to have the first 5 ready before 9.0:

   s390/stattrib: Add Error** argument to set_migrationmode() handler
   vfio: Always report an error in vfio_save_setup()
   migration: Always report an error in block_save_setup()
   migration: Always report an error in ram_save_setup()
   migration: Add Error** argument to vmstate_save()

So that we only have the core changes in log_global_start() and
ram_init_bitmaps() to address in the next cycle.

As for the VFIO part coming after, I will see which initial cleanups
we can merge before soft freeze.

Thanks,

C.