[v6,00/23] migration: File based migration with multifd and mapped-ram

Message ID 20240229153017.2221-1-farosas@suse.de

Fabiano Rosas Feb. 29, 2024, 3:29 p.m. UTC
Based-on: 74aa0fb297 (migration: options incompatible with cpr) # peterx/migration-next

Hi,

In this v6:

- Minor fixes to 17/23 and 19/23

CI run: https://gitlab.com/farosas/qemu/-/pipelines/1195796010

Series structure
================

This series enables mapped-ram in steps:

0) Cleanups                           [1]
1) QIOChannel interfaces              [2-6]
2) Mapped-ram format for precopy      [7-11]
3) Multifd adaptation without packets [12-15]
4) Mapped-ram format for multifd      [16-23]

* below will be sent separately *
5) Direct-io generic support          [TODO]
6) Direct-io for mapped-ram multifd with file: URI  [TODO]
7) Fdset interface for mapped-ram multifd  [TODO]

About mapped-ram
================

Mapped-ram is a new stream format for the RAM section designed to
supplement the existing ``file:`` migration and make it compatible
with ``multifd``. This enables parallel migration of a guest's RAM to
a file.

The core of the feature is to map RAM pages to migration file
offsets. This enables the ``multifd`` threads to write exclusively to
those offsets even if the guest is constantly dirtying pages
(i.e. live migration).

Another benefit is that the resulting file will have a bounded size,
since pages which are dirtied multiple times will always go to a fixed
location in the file, rather than constantly being added to a
sequential stream.
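
As a rough illustration of the fixed-offset idea (not the code from this series), the sketch below writes each page with pwrite() at an offset derived from its index. The helper names, the 4 KiB page size and the `pages_start` parameter are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE_BYTES 4096u /* assumed page size for this example */

/* Hypothetical helper: file offset of page `page_index` in the page area. */
static off_t page_file_offset(off_t pages_start, uint64_t page_index)
{
    return pages_start + (off_t)(page_index * PAGE_SIZE_BYTES);
}

/*
 * Write one page at its fixed location. Sending a page again after it was
 * dirtied overwrites the previous copy instead of growing the file.
 * Returns 0 on success, -1 on error or short write.
 */
static int write_page(int fd, off_t pages_start, uint64_t page_index,
                      const void *page)
{
    assert(page != NULL);
    ssize_t n = pwrite(fd, page, PAGE_SIZE_BYTES,
                       page_file_offset(pages_start, page_index));
    return n == (ssize_t)PAGE_SIZE_BYTES ? 0 : -1;
}
```

Because every writer targets a disjoint, precomputed offset, several threads could call a helper like this on the same fd without coordinating a shared file position.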

Having the pages at fixed offsets also allows the usage of O_DIRECT
for save/restore of the migration stream as the pages are ensured to
be written respecting O_DIRECT alignment restrictions.
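
A minimal sketch of the alignment check implied above, assuming a 4096-byte requirement (the real value depends on the filesystem and device). The names `DIO_ALIGN`, `dio_aligned` and `dio_align_up` are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed alignment; the actual O_DIRECT requirement varies per setup. */
#define DIO_ALIGN 4096u

static_assert((DIO_ALIGN & (DIO_ALIGN - 1)) == 0,
              "alignment must be a power of two");

/* True if buffer address, length and file offset are all aligned. */
static bool dio_aligned(const void *buf, size_t len, uint64_t offset)
{
    return ((uintptr_t)buf % DIO_ALIGN) == 0 &&
           (len % DIO_ALIGN) == 0 &&
           (offset % DIO_ALIGN) == 0;
}

/* Round an offset up to the next aligned boundary. */
static uint64_t dio_align_up(uint64_t offset)
{
    return (offset + DIO_ALIGN - 1) / DIO_ALIGN * DIO_ALIGN;
}
```

With pages stored at multiples of the page size from an aligned starting offset, the offset and length checks hold by construction; a buffer obtained from posix_memalign() with the same alignment would satisfy the address check.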

Latest numbers (unchanged from v4)
==================================

=> guest: 128 GB RAM - 120 GB dirty - 1 vcpu in tight loop dirtying memory
=> host: 128 CPU AMD EPYC 7543 - 2 NVMe disks in RAID0 (8586 MiB/s) - xfs
=> pinned vcpus w/ NUMA shortest distances - average of 3 runs - results
   from query-migrate

non-live           | time (ms)   pages/s Mbit/s   MB/s
-------------------+-----------------------------------
file               |    110512    256258   9549   1193
  + bg-snapshot    |    245660    119581   4303    537
-------------------+-----------------------------------
mapped-ram         |    157975    216877   6672    834
  + multifd 8 ch.  |     95922    292178  10982   1372
     + direct-io   |     23268   1936897  45330   5666
-------------------------------------------------------

live               | time (ms)   pages/s Mbit/s   MB/s
-------------------+-----------------------------------
file               |         -         -      -      - (file grew 4x the VM size)
  + bg-snapshot    |    357635    141747   2974    371
-------------------+-----------------------------------
mapped-ram         |         -         -      -      - (no convergence in 5 min)
  + multifd 8 ch.  |    230812    497551  14900   1862
     + direct-io   |     27475   1788025  46736   5842
-------------------------------------------------------
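
The two throughput columns appear to report the same measurement in different units: the mb/s values look like megabits per second and the MB/s values like megabytes per second. That reading is an inference from the numbers, not something stated in the tables; the sketch below only checks that an integer division by 8 reproduces the table values.

```c
#include <stdint.h>

/* Convert megabits per second to megabytes per second, truncating. */
static uint64_t mbit_to_mbyte(uint64_t mbit_per_s)
{
    return mbit_per_s / 8;
}
```

For example, 9549 maps to 1193 and 46736 maps to 5842, matching the "file" non-live row and the direct-io live row.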

v5:
https://lore.kernel.org/r/20240228152127.18769-1-farosas@suse.de
v4:
https://lore.kernel.org/r/20240220224138.24759-1-farosas@suse.de
v3:
https://lore.kernel.org/r/20231127202612.23012-1-farosas@suse.de
v2:
https://lore.kernel.org/r/20231023203608.26370-1-farosas@suse.de
v1:
https://lore.kernel.org/r/20230330180336.2791-1-farosas@suse.de

Fabiano Rosas (20):
  migration/multifd: Cleanup multifd_recv_sync_main
  io: fsync before closing a file channel
  migration/qemu-file: add utility methods for working with seekable
    channels
  migration/ram: Introduce 'mapped-ram' migration capability
  migration: Add mapped-ram URI compatibility check
  migration/ram: Add outgoing 'mapped-ram' migration
  migration/ram: Add incoming 'mapped-ram' migration
  tests/qtest/migration: Add tests for mapped-ram file-based migration
  migration/multifd: Rename MultiFDSend|RecvParams::data to
    compress_data
  migration/multifd: Decouple recv method from pages
  migration/multifd: Allow multifd without packets
  migration/multifd: Allow receiving pages without packets
  migration/multifd: Add a wrapper for channels_created
  migration/multifd: Add outgoing QIOChannelFile support
  migration/multifd: Add incoming QIOChannelFile support
  migration/multifd: Prepare multifd sync for mapped-ram migration
  migration/multifd: Support outgoing mapped-ram stream format
  migration/multifd: Support incoming mapped-ram stream format
  migration/multifd: Add mapped-ram support to fd: URI
  tests/qtest/migration: Add a multifd + mapped-ram migration test

Nikolay Borisov (3):
  io: add and implement QIO_CHANNEL_FEATURE_SEEKABLE for channel file
  io: Add generic pwritev/preadv interface
  io: implement io_pwritev/preadv for QIOChannelFile

 docs/devel/migration/features.rst   |   1 +
 docs/devel/migration/mapped-ram.rst | 138 ++++++++++
 include/exec/ramblock.h             |  13 +
 include/io/channel.h                |  83 ++++++
 include/migration/qemu-file-types.h |   2 +
 include/qemu/bitops.h               |  13 +
 io/channel-file.c                   |  69 +++++
 io/channel.c                        |  58 ++++
 migration/fd.c                      |  44 +++
 migration/fd.h                      |   2 +
 migration/file.c                    | 149 +++++++++-
 migration/file.h                    |   8 +
 migration/migration.c               |  56 +++-
 migration/multifd-zlib.c            |  26 +-
 migration/multifd-zstd.c            |  26 +-
 migration/multifd.c                 | 405 ++++++++++++++++++++++------
 migration/multifd.h                 |  27 +-
 migration/options.c                 |  35 +++
 migration/options.h                 |   1 +
 migration/qemu-file.c               | 106 ++++++++
 migration/qemu-file.h               |   6 +
 migration/ram.c                     | 345 ++++++++++++++++++++++--
 migration/ram.h                     |   1 +
 migration/savevm.c                  |   1 +
 migration/trace-events              |   2 +-
 qapi/migration.json                 |   6 +-
 tests/qtest/migration-test.c        | 127 +++++++++
 27 files changed, 1607 insertions(+), 143 deletions(-)
 create mode 100644 docs/devel/migration/mapped-ram.rst

Comments

Peter Xu March 1, 2024, 1:50 a.m. UTC | #1
On Thu, Feb 29, 2024 at 12:29:54PM -0300, Fabiano Rosas wrote:
> Based-on: 74aa0fb297 (migration: options incompatible with cpr) # peterx/migration-next
> 
> Hi,
> 
> In this v6:
> 
> - Minor fixes to 17/23 and 19/23

The whole set looks good to me now.  I plan to queue it before the
direct-io stuff.  Any other comments / concerns from anyone?

Dan, would it be fine if I queue the IO patches together?

Thanks,
Markus Armbruster March 1, 2024, 7:18 a.m. UTC | #2
Peter Xu <peterx@redhat.com> writes:

> On Thu, Feb 29, 2024 at 12:29:54PM -0300, Fabiano Rosas wrote:
>> Based-on: 74aa0fb297 (migration: options incompatible with cpr) # peterx/migration-next
>> 
>> Hi,
>> 
>> In this v6:
>> 
>> - Minor fixes to 17/23 and 19/23
>
> The whole set looks good to me now.  I plan to queue it before the
> direct-io stuff.  Any other comments / concerns from anyone?

No.  My remaining review comments all apply to the direct-io part, which
got split off from this series.

> Dan, would it be fine I queue the IO patches together?
>
> Thanks,
Daniel P. Berrangé March 1, 2024, 8:11 a.m. UTC | #3
On Fri, Mar 01, 2024 at 09:50:32AM +0800, Peter Xu wrote:
> On Thu, Feb 29, 2024 at 12:29:54PM -0300, Fabiano Rosas wrote:
> > Based-on: 74aa0fb297 (migration: options incompatible with cpr) # peterx/migration-next
> > 
> > Hi,
> > 
> > In this v6:
> > 
> > - Minor fixes to 17/23 and 19/23
> 
> The whole set looks good to me now.  I plan to queue it before the
> direct-io stuff.  Any other comments / concerns from anyone?
> 
> Dan, would it be fine I queue the IO patches together?

Yes, that's fine, when the series is ready.


With regards,
Daniel
Peter Xu March 1, 2024, 8:37 a.m. UTC | #4
On Thu, Feb 29, 2024 at 12:29:54PM -0300, Fabiano Rosas wrote:
> Based-on: 74aa0fb297 (migration: options incompatible with cpr) # peterx/migration-next
> 
> Hi,
> 
> In this v6:
> 
> - Minor fixes to 17/23 and 19/23

Thanks both for confirming, queued now.
Peter Xu March 4, 2024, 12:35 p.m. UTC | #5
Fabiano,

On Thu, Feb 29, 2024 at 12:29:54PM -0300, Fabiano Rosas wrote:
> => guest: 128 GB RAM - 120 GB dirty - 1 vcpu in tight loop dirtying memory

I'm curious how long the final fdatasync() normally takes for you in this
test.

I finally got a relatively large system today and gave it a quick shot with
a 128G (100G busy dirty) mapped-ram snapshot using 8 multifd channels.  The
migration save/load works fine, so I don't think there's anything wrong
with the patchset.  However, when the save completes (I need to stop the
workload first, as I guess my disk isn't fast enough..) I always hit a very
long hang of QEMU in fdatasync() on XFS, during which the main thread is in
the UNINTERRUPTIBLE state.

[<0>] rq_qos_wait+0xbb/0x130
[<0>] wbt_wait+0x9c/0x100
[<0>] __rq_qos_throttle+0x23/0x40
[<0>] blk_mq_submit_bio+0x183/0x580
[<0>] __submit_bio_noacct+0x7e/0x1e0
[<0>] iomap_submit_ioend+0x4e/0x80
[<0>] iomap_writepage_map+0x22a/0x400
[<0>] write_cache_pages+0x17c/0x4c0
[<0>] iomap_writepages+0x1c/0x40
[<0>] xfs_vm_writepages+0x7a/0xb0 [xfs]
[<0>] do_writepages+0xcf/0x1d0
[<0>] filemap_fdatawrite_wbc+0x66/0x90
[<0>] __filemap_fdatawrite_range+0x54/0x80
[<0>] file_write_and_wait_range+0x48/0xb0
[<0>] xfs_file_fsync+0x5a/0x240 [xfs]
[<0>] __x64_sys_fdatasync+0x46/0x80
[<0>] do_syscall_64+0x5c/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x72/0xdc

Do you also see it, or is it just my host kernel / other config that is
different?
Daniel P. Berrangé March 4, 2024, 12:42 p.m. UTC | #6
On Mon, Mar 04, 2024 at 08:35:36PM +0800, Peter Xu wrote:
> Fabiano,
> 
> On Thu, Feb 29, 2024 at 12:29:54PM -0300, Fabiano Rosas wrote:
> > => guest: 128 GB RAM - 120 GB dirty - 1 vcpu in tight loop dirtying memory
> 
> I'm curious normally how much time does it take to do the final fdatasync()
> for you when you did this test.
> 
> I finally got a relatively large system today and gave it a quick shot over
> 128G (100G busy dirty) mapped-ram snapshot with 8 multifd channels.  The
> migration save/load does all fine, so I don't think there's anything wrong
> with the patchset, however when save completes (I'll need to stop the
> workload as my disk isn't fast enough I guess..) I'll always hit a super
> long hang of QEMU on fdatasync() on XFS during which the main thread is in
> UNINTERRUPTIBLE state.

That isn't very surprising. If you don't have O_DIRECT enabled, then
all that disk I/O from the migration is going to be in RAM, and thus the
fdatasync() is likely to trigger writing out a lot of data.

Blocking the main QEMU thread though is pretty unhelpful. That suggests
the data sync needs to be moved to a non-main thread.

With O_DIRECT meanwhile there should be essentially no hit from fdatasync.
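
One way to picture "moving the data sync to a non-main thread" is the pthread sketch below. It is only an illustration under assumed names (`sync_job`, `start_background_sync`, `finish_background_sync`); it is not how QEMU organises its threads, and the thread calling fdatasync() still blocks until writeback finishes.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical job description shared with the worker thread. */
struct sync_job {
    int fd;  /* file to flush */
    int err; /* 0 on success, otherwise the errno from fdatasync() */
};

/* Runs in the worker thread, so only this thread blocks on writeback. */
static void *sync_worker(void *opaque)
{
    struct sync_job *job = opaque;

    job->err = fdatasync(job->fd) == 0 ? 0 : errno;
    return NULL;
}

/* Start the flush; returns 0 or the pthread_create() error number. */
static int start_background_sync(pthread_t *thread, struct sync_job *job)
{
    assert(thread != NULL && job != NULL);
    return pthread_create(thread, NULL, sync_worker, job);
}

/* Wait for the flush; returns 0, a join error, or the fdatasync() errno. */
static int finish_background_sync(pthread_t thread, const struct sync_job *job)
{
    int ret = pthread_join(thread, NULL);

    return ret ? ret : job->err;
}
```

The caller can keep servicing other work between the start and finish calls, and `job` must stay valid until the join returns.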

> 
> [<0>] rq_qos_wait+0xbb/0x130
> [<0>] wbt_wait+0x9c/0x100
> [<0>] __rq_qos_throttle+0x23/0x40
> [<0>] blk_mq_submit_bio+0x183/0x580
> [<0>] __submit_bio_noacct+0x7e/0x1e0
> [<0>] iomap_submit_ioend+0x4e/0x80
> [<0>] iomap_writepage_map+0x22a/0x400
> [<0>] write_cache_pages+0x17c/0x4c0
> [<0>] iomap_writepages+0x1c/0x40
> [<0>] xfs_vm_writepages+0x7a/0xb0 [xfs]
> [<0>] do_writepages+0xcf/0x1d0
> [<0>] filemap_fdatawrite_wbc+0x66/0x90
> [<0>] __filemap_fdatawrite_range+0x54/0x80
> [<0>] file_write_and_wait_range+0x48/0xb0
> [<0>] xfs_file_fsync+0x5a/0x240 [xfs]
> [<0>] __x64_sys_fdatasync+0x46/0x80
> [<0>] do_syscall_64+0x5c/0x90
> [<0>] entry_SYSCALL_64_after_hwframe+0x72/0xdc
> 
> Do you also have it, or it's just my host kernel / other config that is
> different?
> 
> -- 
> Peter Xu
> 

With regards,
Daniel
Peter Xu March 4, 2024, 12:53 p.m. UTC | #7
On Mon, Mar 04, 2024 at 12:42:25PM +0000, Daniel P. Berrangé wrote:
> On Mon, Mar 04, 2024 at 08:35:36PM +0800, Peter Xu wrote:
> > Fabiano,
> > 
> > On Thu, Feb 29, 2024 at 12:29:54PM -0300, Fabiano Rosas wrote:
> > > => guest: 128 GB RAM - 120 GB dirty - 1 vcpu in tight loop dirtying memory
> > 
> > I'm curious normally how much time does it take to do the final fdatasync()
> > for you when you did this test.
> > 
> > I finally got a relatively large system today and gave it a quick shot over
> > 128G (100G busy dirty) mapped-ram snapshot with 8 multifd channels.  The
> > migration save/load does all fine, so I don't think there's anything wrong
> > with the patchset, however when save completes (I'll need to stop the
> > workload as my disk isn't fast enough I guess..) I'll always hit a super
> > long hang of QEMU on fdatasync() on XFS during which the main thread is in
> > UNINTERRUPTIBLE state.
> 
> That isn't very surprising. If you don't have O_DIRECT enabled, then
> all that disk I/O from the migrate is going to be in RAM, and thus the
> fdatasync() is likely to trigger writing out alot of data.
> 
> Blocking the main QEMU thread though is pretty unhelpful. That suggests
> the data sync needs to be moved to a non-main thread.

Perhaps the migration thread itself could also be a candidate, then.

> 
> With O_DIRECT meanwhile there should be essentially no hit from fdatasync.

From a user's point of view, the switch to the COMPLETED status intuitively
looks like a good marker that the flush is done.  If that makes sense,
maybe we can do the sync before setting COMPLETED.

No matter which thread does the sync, it's still a pity that it'll go into
UNINTERRUPTIBLE state during fdatasync(); whoever wants to e.g. attach gdb
to it to have a look will also hang.

Thanks,
Fabiano Rosas March 4, 2024, 1:09 p.m. UTC | #8
Daniel P. Berrangé <berrange@redhat.com> writes:

> On Mon, Mar 04, 2024 at 08:35:36PM +0800, Peter Xu wrote:
>> Fabiano,
>> 
>> On Thu, Feb 29, 2024 at 12:29:54PM -0300, Fabiano Rosas wrote:
>> > => guest: 128 GB RAM - 120 GB dirty - 1 vcpu in tight loop dirtying memory
>> 
>> I'm curious normally how much time does it take to do the final fdatasync()
>> for you when you did this test.

I haven't looked at the fdatasync() in isolation. I'll do some
measurements soon.

>> 
>> I finally got a relatively large system today and gave it a quick shot over
>> 128G (100G busy dirty) mapped-ram snapshot with 8 multifd channels.  The
>> migration save/load does all fine, so I don't think there's anything wrong
>> with the patchset, however when save completes (I'll need to stop the
>> workload as my disk isn't fast enough I guess..) I'll always hit a super
>> long hang of QEMU on fdatasync() on XFS during which the main thread is in
>> UNINTERRUPTIBLE state.

> That isn't very surprising. If you don't have O_DIRECT enabled, then
> all that disk I/O from the migrate is going to be in RAM, and thus the
> fdatasync() is likely to trigger writing out alot of data.
>
> Blocking the main QEMU thread though is pretty unhelpful. That suggests
> the data sync needs to be moved to a non-main thread.

Perhaps if we move the fsync to the same spot as the multifd thread sync
instead of having a big one at the end? Not sure how that looks with
concurrency in the mix.

I'll have to experiment a bit.
Peter Xu March 4, 2024, 1:12 p.m. UTC | #9
On Mon, Mar 04, 2024 at 08:53:24PM +0800, Peter Xu wrote:
> On Mon, Mar 04, 2024 at 12:42:25PM +0000, Daniel P. Berrangé wrote:
> > On Mon, Mar 04, 2024 at 08:35:36PM +0800, Peter Xu wrote:
> > > Fabiano,
> > > 
> > > On Thu, Feb 29, 2024 at 12:29:54PM -0300, Fabiano Rosas wrote:
> > > > => guest: 128 GB RAM - 120 GB dirty - 1 vcpu in tight loop dirtying memory
> > > 
> > > I'm curious normally how much time does it take to do the final fdatasync()
> > > for you when you did this test.
> > > 
> > > I finally got a relatively large system today and gave it a quick shot over
> > > 128G (100G busy dirty) mapped-ram snapshot with 8 multifd channels.  The
> > > migration save/load does all fine, so I don't think there's anything wrong
> > > with the patchset, however when save completes (I'll need to stop the
> > > workload as my disk isn't fast enough I guess..) I'll always hit a super
> > > long hang of QEMU on fdatasync() on XFS during which the main thread is in
> > > UNINTERRUPTIBLE state.
> > 
> > That isn't very surprising. If you don't have O_DIRECT enabled, then
> > all that disk I/O from the migrate is going to be in RAM, and thus the
> > fdatasync() is likely to trigger writing out alot of data.
> > 
> > Blocking the main QEMU thread though is pretty unhelpful. That suggests
> > the data sync needs to be moved to a non-main thread.
> 
> Perhaps migration thread itself can also be a candidate, then.
> 
> > 
> > With O_DIRECT meanwhile there should be essentially no hit from fdatasync.
> 
> The update of COMPLETED status can be a good place of a marker point to
> show such flush done if from the gut feeling of a user POV.  If that makes
> sense, maybe we can do that sync before setting COMPLETED.
> 
> No matter which thread does that sync, it's still a pity that it'll go into
> UNINTERRUPTIBLE during fdatasync(), then whoever wants to e.g. attach a gdb
> onto it to have a look will also hang.

Or... would it be nicer to get rid of the fdatasync() and leave that to the
upper layers?  QEMU already supported file: migration before this, and it
never managed cache behavior.  Thinking about it, this does smell like
something that shouldn't be done in QEMU; at least mapped-ram is nothing
special to me in this regard.

The user should be able to control that either manually (sync), or Libvirt
can do it after QEMU quits; after all, Libvirt holds the fd itself?  That
should make the UNINTERRUPTIBLE / un-debuggable period of QEMU above go
away.  Another side benefit: rather than holding all of QEMU's resources
(especially guest RAM) while waiting for a super slow disk flush, Libvirt /
the upper layer can do the flush separately after releasing all the QEMU
resources first.
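
To make the proposal concrete, here is a minimal sketch of what an upper layer holding the fd could do: wait for the writer process to exit, then flush. The function name and the assumption that the writer is a direct child are made up for the example; this is not Libvirt code.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical management-layer helper: wait for the process that wrote
 * the file to exit, then flush through the fd the caller still holds.
 * Returns 0 on success or a negative errno value.
 */
static int sync_after_writer_exits(pid_t writer, int fd)
{
    int status;

    assert(writer > 0);
    while (waitpid(writer, &status, 0) < 0) {
        if (errno != EINTR) {
            return -errno;
        }
    }
    /* The writer's memory is already released while this flush runs. */
    return fdatasync(fd) == 0 ? 0 : -errno;
}
```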

Thanks,
Peter Xu March 4, 2024, 1:17 p.m. UTC | #10
On Mon, Mar 04, 2024 at 10:09:25AM -0300, Fabiano Rosas wrote:
> Perhaps if we move the fsync to the same spot as the multifd thread sync
> instead of having a big one at the end? Not sure how that looks with
> concurrency in the mix.

We can try, but I think the bottleneck should normally be the block
backend, in which case I wouldn't be surprised if concurrent flushing
doesn't help.

Please see my other proposal on removing fdatasync() from qemu; I could be
overlooking some reason to have it, though..
Fabiano Rosas March 4, 2024, 8:15 p.m. UTC | #11
Peter Xu <peterx@redhat.com> writes:

> On Mon, Mar 04, 2024 at 08:53:24PM +0800, Peter Xu wrote:
>> On Mon, Mar 04, 2024 at 12:42:25PM +0000, Daniel P. Berrangé wrote:
>> > On Mon, Mar 04, 2024 at 08:35:36PM +0800, Peter Xu wrote:
>> > > Fabiano,
>> > > 
>> > > On Thu, Feb 29, 2024 at 12:29:54PM -0300, Fabiano Rosas wrote:
>> > > > => guest: 128 GB RAM - 120 GB dirty - 1 vcpu in tight loop dirtying memory
>> > > 
>> > > I'm curious normally how much time does it take to do the final fdatasync()
>> > > for you when you did this test.

I measured and it takes ~4s for the live migration and ~2s for the
non-live. I didn't notice this before because the VM goes into
postmigrate, so it's paused anyway.

>> > > 
>> > > I finally got a relatively large system today and gave it a quick shot over
>> > > 128G (100G busy dirty) mapped-ram snapshot with 8 multifd channels.  The
>> > > migration save/load does all fine, so I don't think there's anything wrong
>> > > with the patchset, however when save completes (I'll need to stop the
>> > > workload as my disk isn't fast enough I guess..) I'll always hit a super
>> > > long hang of QEMU on fdatasync() on XFS during which the main thread is in
>> > > UNINTERRUPTIBLE state.
>> > 
>> > That isn't very surprising. If you don't have O_DIRECT enabled, then
>> > all that disk I/O from the migrate is going to be in RAM, and thus the
>> > fdatasync() is likely to trigger writing out alot of data.
>> > 
>> > Blocking the main QEMU thread though is pretty unhelpful. That suggests
>> > the data sync needs to be moved to a non-main thread.
>> 
>> Perhaps migration thread itself can also be a candidate, then.
>> 
>> > 
>> > With O_DIRECT meanwhile there should be essentially no hit from fdatasync.
>> 
>> The update of COMPLETED status can be a good place of a marker point to
>> show such flush done if from the gut feeling of a user POV.  If that makes
>> sense, maybe we can do that sync before setting COMPLETED.

At migration completion I believe the multifd threads will have already
cleaned up and dropped the reference to the channel, so it might be too
late by then.

In the multifd threads, we'll be wasting (like we are today) the extra
syscalls after the first sync succeeds.

>> 
>> No matter which thread does that sync, it's still a pity that it'll go into
>> UNINTERRUPTIBLE during fdatasync(), then whoever wants to e.g. attach a gdb
>> onto it to have a look will also hang.
>
> Or... would it be nicer we get rid of the fdatasync() but leave that for
> upper layers?  QEMU used to support file: migration already, it never
> manage cache behavior; it does smell like something shouldn't be done in
> QEMU when thinking about it, at least mapped-ram is nothing special to me
> from this regard.
>
> User should be able to control that either manually (sync), or Libvirt can
> do that after QEMU quits; after all Libvirt holds the fd itself?  It should
> allow us to get rid of above UNINTERRUPTIBLE / un-debuggable period of QEMU
> went away.  Another side benefit: rather than holding all of QEMU resources
> (especially, guest RAM) when waiting for a super slow disk flush, Libvirt /
> upper layer can do that separately after releasing all the QEMU resources
> first.

I like the idea of QEMU having a self-contained
implementation. Especially since we'll add O_DIRECT support, which is
already quite heavy-handed if we're talking about managing cache
behavior.

However, it's not trivial to find the right place to add the sync.
Wherever we put it there will be some implications, such as ensuring the
sync works even after migration failure, avoiding concurrent cleanup,
etc.

In any case, I don't think it's correct to have the sync at
qio_channel_close(), now that we've seen it might block for a long
time. We could at the very least have a qio_channel_flush()[1] which the
QIOChannelFile implements with fdatasync(). Then the clients can choose
when to sync.

[1] We actually already have that, but it's coupled to zero_copy. I don't
see why we couldn't make the function more generic.
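
The decoupling can be pictured with the simplified sketch below, where flushing is an explicit operation and closing does not request a sync. The `file_chan` type and its functions are hypothetical stand-ins for the example; they are NOT QEMU's QIOChannel API.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical, simplified file channel; not QEMU's QIOChannelFile. */
struct file_chan {
    int fd;
};

/* Open (or create) the backing file. Returns 0 or a negative errno. */
static int file_chan_open(struct file_chan *c, const char *path)
{
    assert(c != NULL && path != NULL);
    c->fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    return c->fd >= 0 ? 0 : -errno;
}

/* Explicit flush: the caller chooses when to pay for the writeback. */
static int file_chan_flush(struct file_chan *c)
{
    return fdatasync(c->fd) == 0 ? 0 : -errno;
}

/* Close does not request a sync itself. Returns 0 or a negative errno. */
static int file_chan_close(struct file_chan *c)
{
    int ret = close(c->fd) == 0 ? 0 : -errno;

    c->fd = -1;
    return ret;
}
```

A client that wants durability calls the flush before closing; a client that prefers to leave writeback to the kernel or to an upper layer simply skips it.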
Daniel P. Berrangé March 4, 2024, 9:04 p.m. UTC | #12
On Mon, Mar 04, 2024 at 05:15:05PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Mon, Mar 04, 2024 at 08:53:24PM +0800, Peter Xu wrote:
> >> On Mon, Mar 04, 2024 at 12:42:25PM +0000, Daniel P. Berrangé wrote:
> >> > On Mon, Mar 04, 2024 at 08:35:36PM +0800, Peter Xu wrote:
> >> > > Fabiano,
> >> > > 
> >> > > On Thu, Feb 29, 2024 at 12:29:54PM -0300, Fabiano Rosas wrote:
> >> > > > => guest: 128 GB RAM - 120 GB dirty - 1 vcpu in tight loop dirtying memory
> >> > > 
> >> > > I'm curious normally how much time does it take to do the final fdatasync()
> >> > > for you when you did this test.
> 
> I measured and it takes ~4s for the live migration and ~2s for the
> non-live. I didn't notice this before because the VM goes into
> postmigrate, so it's paused anyway.
> 
> >> > > 
> >> > > I finally got a relatively large system today and gave it a quick shot over
> >> > > 128G (100G busy dirty) mapped-ram snapshot with 8 multifd channels.  The
> >> > > migration save/load does all fine, so I don't think there's anything wrong
> >> > > with the patchset, however when save completes (I'll need to stop the
> >> > > workload as my disk isn't fast enough I guess..) I'll always hit a super
> >> > > long hang of QEMU on fdatasync() on XFS during which the main thread is in
> >> > > UNINTERRUPTIBLE state.
> >> > 
> >> > That isn't very surprising. If you don't have O_DIRECT enabled, then
> >> > all that disk I/O from the migrate is going to be in RAM, and thus the
> >> > fdatasync() is likely to trigger writing out alot of data.
> >> > 
> >> > Blocking the main QEMU thread though is pretty unhelpful. That suggests
> >> > the data sync needs to be moved to a non-main thread.
> >> 
> >> Perhaps migration thread itself can also be a candidate, then.
> >> 
> >> > 
> >> > With O_DIRECT meanwhile there should be essentially no hit from fdatasync.
> >> 
> >> The update of COMPLETED status can be a good place of a marker point to
> >> show such flush done if from the gut feeling of a user POV.  If that makes
> >> sense, maybe we can do that sync before setting COMPLETED.
> 
> At the migration completion I believe the multifd threads will have
> already cleaned up and dropped the reference to the channel, it might be
> too late then.
> 
> In the multifd threads, we'll be wasting (like we are today) the extra
> syscalls after the first sync succeeds.
> 
> >> 
> >> No matter which thread does that sync, it's still a pity that it'll go into
> >> UNINTERRUPTIBLE during fdatasync(), then whoever wants to e.g. attach a gdb
> >> onto it to have a look will also hang.
> >
> > Or... would it be nicer we get rid of the fdatasync() but leave that for
> > upper layers?  QEMU used to support file: migration already, it never
> > manage cache behavior; it does smell like something shouldn't be done in
> > QEMU when thinking about it, at least mapped-ram is nothing special to me
> > from this regard.
> >
> > User should be able to control that either manually (sync), or Libvirt can
> > do that after QEMU quits; after all Libvirt holds the fd itself?  It should
> > allow us to get rid of above UNINTERRUPTIBLE / un-debuggable period of QEMU
> > went away.  Another side benefit: rather than holding all of QEMU resources
> > (especially, guest RAM) when waiting for a super slow disk flush, Libvirt /
> > upper layer can do that separately after releasing all the QEMU resources
> > first.
> 
> I like the idea of QEMU having a self-contained
> implementation. Specially since we'll add O_DIRECT support, which is
> already quite heavy-handed if we're talking about managing cache
> behavior.
> 
> However, it's not trivial to find the right place to add the sync.
> Wherever we put it there will be some implications, such as ensuring the
> sync works even after migration failure, avoiding concurrent cleanup,
> etc.
> 
> In any case, I don't think it's correct to have the sync at
> qio_channel_close(), now that we've seen it might block for a long
> time. We could at the very least have a qio_channel_flush()[1] which the
> QIOChannelFile implements with fdatasync(). Then the clients can choose
> when to sync.

Yes, I agree with de-coupling it.

With regards,
Daniel
Peter Xu March 5, 2024, 1:51 a.m. UTC | #13
On Mon, Mar 04, 2024 at 09:04:51PM +0000, Daniel P. Berrangé wrote:
> On Mon, Mar 04, 2024 at 05:15:05PM -0300, Fabiano Rosas wrote:
> > Peter Xu <peterx@redhat.com> writes:
> > 
> > > On Mon, Mar 04, 2024 at 08:53:24PM +0800, Peter Xu wrote:
> > >> On Mon, Mar 04, 2024 at 12:42:25PM +0000, Daniel P. Berrangé wrote:
> > >> > On Mon, Mar 04, 2024 at 08:35:36PM +0800, Peter Xu wrote:
> > >> > > Fabiano,
> > >> > > 
> > >> > > On Thu, Feb 29, 2024 at 12:29:54PM -0300, Fabiano Rosas wrote:
> > >> > > > => guest: 128 GB RAM - 120 GB dirty - 1 vcpu in tight loop dirtying memory
> > >> > > 
> > >> > > I'm curious normally how much time does it take to do the final fdatasync()
> > >> > > for you when you did this test.
> > 
> > I measured and it takes ~4s for the live migration and ~2s for the
> > non-live. I didn't notice this before because the VM goes into
> > postmigrate, so it's paused anyway.

In my case it took tens of seconds at least, if not minutes; I didn't
measure it.

I could have dirtied harder, or I just had a slower disk.  IIUC the worst
case is that all of the cache is dirty (not yet written back by the
kernel), say 100GB.  Assuming a disk bandwidth of 1GB/s (that's what my test
machine's hard drive gives for a dd of a 10GB file in 1M chunks, even
without a sync..), it could take 1min or more in reality.
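
The estimate above is a simple division; a small sketch of it follows, using the figures from this mail (100GB dirty, 1GB/s) as example inputs. The function name is made up for the example, and the result is a lower bound that ignores any other load on the disk.

```c
/*
 * Estimated seconds to write back `dirty_gb` gigabytes at `disk_gb_per_s`.
 * Returns -1.0 if the bandwidth is not positive.
 */
static double flush_seconds(double dirty_gb, double disk_gb_per_s)
{
    if (disk_gb_per_s <= 0.0) {
        return -1.0;
    }
    return dirty_gb / disk_gb_per_s;
}
```

With 100GB of dirty cache and 1GB/s this gives 100 seconds, which is consistent with "1min or more".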

> > 
> > >> > > 
> > >> > > I finally got a relatively large system today and gave it a quick shot over
> > >> > > 128G (100G busy dirty) mapped-ram snapshot with 8 multifd channels.  The
> > >> > > migration save/load does all fine, so I don't think there's anything wrong
> > >> > > with the patchset, however when save completes (I'll need to stop the
> > >> > > workload as my disk isn't fast enough I guess..) I'll always hit a super
> > >> > > long hang of QEMU on fdatasync() on XFS during which the main thread is in
> > >> > > UNINTERRUPTIBLE state.
> > >> > 
> > >> > That isn't very surprising. If you don't have O_DIRECT enabled, then
> > >> > all that disk I/O from the migrate is going to be in RAM, and thus the
> > >> > fdatasync() is likely to trigger writing out alot of data.
> > >> > 
> > >> > Blocking the main QEMU thread though is pretty unhelpful. That suggests
> > >> > the data sync needs to be moved to a non-main thread.
> > >> 
> > >> Perhaps migration thread itself can also be a candidate, then.
> > >> 
> > >> > 
> > >> > With O_DIRECT meanwhile there should be essentially no hit from fdatasync.
> > >> 
> > >> The update of COMPLETED status can be a good place of a marker point to
> > >> show such flush done if from the gut feeling of a user POV.  If that makes
> > >> sense, maybe we can do that sync before setting COMPLETED.
> > 
> > At the migration completion I believe the multifd threads will have
> > already cleaned up and dropped the reference to the channel, it might be
> > too late then.
> > 
> > In the multifd threads, we'll be wasting (like we are today) the extra
> > syscalls after the first sync succeeds.
> > 
> > >> 
> > >> No matter which thread does that sync, it's still a pity that it'll go into
> > >> UNINTERRUPTIBLE during fdatasync(), then whoever wants to e.g. attach a gdb
> > >> onto it to have a look will also hang.
> > >
> > > Or... would it be nicer we get rid of the fdatasync() but leave that for
> > > upper layers?  QEMU used to support file: migration already, it never
> > > manage cache behavior; it does smell like something shouldn't be done in
> > > QEMU when thinking about it, at least mapped-ram is nothing special to me
> > > from this regard.
> > >
> > > User should be able to control that either manually (sync), or Libvirt can
> > > do that after QEMU quits; after all Libvirt holds the fd itself?  It should
> > > allow us to get rid of above UNINTERRUPTIBLE / un-debuggable period of QEMU
> > > went away.  Another side benefit: rather than holding all of QEMU resources
> > > (especially, guest RAM) when waiting for a super slow disk flush, Libvirt /
> > > upper layer can do that separately after releasing all the QEMU resources
> > > first.
> > 
> > I like the idea of QEMU having a self-contained
> > implementation. Specially since we'll add O_DIRECT support, which is
> > already quite heavy-handed if we're talking about managing cache
> > behavior.

O_DIRECT is optionally selected by the user by setting the new parameter
first, so the user is still in full control - it's still the user's
decision how the cache should be managed, even if QEMU needs explicit
changes to support and expose the new parameter.

For fdatasync(), I think it's slightly different in that it doesn't require
anything to be implemented in QEMU: the snapshot is always in the form of a
file, and a file is a pretty common concept that supports sync semantics
well on its own.  Instead of providing yet another parameter to control it,
we can just avoid that datasync.

Besides the reasons I already described above, I think it's also legitimate
for a user to temporarily flush a VM to disk (in paused state), run some
RAM-intensive loads (which can immediately make use of the guest's RAM that
was just freed, but may _not_ always require a page cache flush), then
relaunch the VM.  In that case keeping some cache around might already help
speed up the relaunch by avoiding unnecessary swap-ins/swap-outs.

> > 
> > However, it's not trivial to find the right place to add the sync.
> > Wherever we put it there will be some implications, such as ensuring the
> > sync works even after migration failure, avoiding concurrent cleanup,
> > etc.
> > 
> > In any case, I don't think it's correct to have the sync at
> > qio_channel_close(), now that we've seen it might block for a long
> > time. We could at the very least have a qio_channel_flush()[1] which the
> > QIOChannelFile implements with fdatasync(). Then the clients can choose
> > when to sync.
> 
> Yes, I agree with de-coupling it.

Yes, that decoupling makes sense to me.  That definitely clears up some of my
previous confusion.
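
As a rough sketch of that decoupling (not the actual QIOChannelFile code):
the DemoFileChannel type and the demo_channel_*() names below are
hypothetical stand-ins.  The only point is that close no longer syncs
implicitly, and the flush is a separate call that the client issues when it
chooses to.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical stand-in for a file-backed channel; not QEMU's type. */
typedef struct DemoFileChannel {
    int fd;
} DemoFileChannel;

/* Open (or create) the backing file for writing. */
static int demo_channel_open(DemoFileChannel *ch, const char *path)
{
    ch->fd = open(path, O_CREAT | O_WRONLY, 0600);
    return ch->fd < 0 ? -errno : 0;
}

/* Explicit flush: the caller decides when to pay the (possibly long) cost. */
static int demo_channel_flush(DemoFileChannel *ch)
{
    if (fdatasync(ch->fd) < 0) {
        return -errno;
    }
    return 0;
}

/* Close only releases the descriptor; there is no implicit sync here. */
static int demo_channel_close(DemoFileChannel *ch)
{
    if (ch->fd < 0) {
        return 0;
    }
    int ret = close(ch->fd) < 0 ? -errno : 0;
    ch->fd = -1;
    return ret;
}
```

With this split, a client that wants durability calls demo_channel_flush()
before demo_channel_close(); one that does not simply skips the flush.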

The next question is whether we should require a qio_channel_flush() by
default anywhere around the end of migration for mapped-ram; I lean towards
removing it completely.  In any case, considering how long it could hang
QEMU (possibly for minutes), we may want to change that behavior for 9.0 if
possible.
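
For scale, the worst case mentioned in this thread (about 100GB of dirty page
cache against about 1GB/s of disk bandwidth) is a simple division.  A small
sketch, where estimate_flush_seconds() is a hypothetical helper and the model
assumes the whole cache is written back at a constant rate:

```c
/*
 * Rough estimate of how long a flush takes when all dirty page cache has
 * to be written back at a fixed disk bandwidth.
 * Returns seconds, or -1.0 if the bandwidth is not positive.
 */
static double estimate_flush_seconds(double dirty_gb, double bandwidth_gb_per_s)
{
    if (bandwidth_gb_per_s <= 0.0) {
        return -1.0;
    }
    return dirty_gb / bandwidth_gb_per_s;
}
```

estimate_flush_seconds(100.0, 1.0) gives 100 seconds, i.e. more than a minute
during which the syncing thread sits in UNINTERRUPTIBLE state.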

Thanks,
Fabiano Rosas March 5, 2024, 3:23 p.m. UTC | #14
Peter Xu <peterx@redhat.com> writes:

> On Mon, Mar 04, 2024 at 09:04:51PM +0000, Daniel P. Berrangé wrote:
>> On Mon, Mar 04, 2024 at 05:15:05PM -0300, Fabiano Rosas wrote:
>> > Peter Xu <peterx@redhat.com> writes:
>> > 
>> > > On Mon, Mar 04, 2024 at 08:53:24PM +0800, Peter Xu wrote:
>> > >> On Mon, Mar 04, 2024 at 12:42:25PM +0000, Daniel P. Berrangé wrote:
>> > >> > On Mon, Mar 04, 2024 at 08:35:36PM +0800, Peter Xu wrote:
>> > >> > > Fabiano,
>> > >> > > 
>> > >> > > On Thu, Feb 29, 2024 at 12:29:54PM -0300, Fabiano Rosas wrote:
>> > >> > > > => guest: 128 GB RAM - 120 GB dirty - 1 vcpu in tight loop dirtying memory
>> > >> > > 
>> > >> > > I'm curious how much time the final fdatasync() normally took for
>> > >> > > you when you did this test.
>> > 
>> > I measured and it takes ~4s for the live migration and ~2s for the
>> > non-live. I didn't notice this before because the VM goes into
>> > postmigrate, so it's paused anyway.
>
> In my case it took tens of seconds at least, if not minutes; I didn't
> measure it.
>
> I could have dirtied harder, or I just had a slower disk.  IIUC the worst
> case is when all the cache is dirty (not yet written back by the kernel),
> say 100GB.  Assuming a disk bandwidth of 1GB/s (that's the bandwidth of my
> test machine's hard drive for a dd of a 10GB file in 1M chunks, even
> without a sync..), IIUC it means it could take 1min or more in reality.
>
>> > 
>> > >> > > 
>> > >> > > I finally got a relatively large system today and gave it a quick shot with
>> > >> > > a 128G (100G busy dirty) mapped-ram snapshot with 8 multifd channels.  The
>> > >> > > migration save/load works fine, so I don't think there's anything wrong
>> > >> > > with the patchset.  However, when the save completes (I need to stop the
>> > >> > > workload, as my disk isn't fast enough I guess..) I always hit a super
>> > >> > > long hang of QEMU in fdatasync() on XFS, during which the main thread is in
>> > >> > > UNINTERRUPTIBLE state.
>> > >> > 
>> > >> > That isn't very surprising. If you don't have O_DIRECT enabled, then
>> > >> > all that disk I/O from the migrate is going to be in RAM, and thus the
>> > >> > fdatasync() is likely to trigger writing out a lot of data.
>> > >> > 
>> > >> > Blocking the main QEMU thread though is pretty unhelpful. That suggests
>> > >> > the data sync needs to be moved to a non-main thread.
>> > >> 
>> > >> Perhaps migration thread itself can also be a candidate, then.
>> > >> 
>> > >> > 
>> > >> > With O_DIRECT meanwhile there should be essentially no hit from fdatasync.
>> > >> 
>> > >> From a user's POV, my gut feeling is that the update to COMPLETED status
>> > >> is a good marker point to show that such a flush is done.  If that makes
>> > >> sense, maybe we can do that sync before setting COMPLETED.
>> > 
>> > At the migration completion I believe the multifd threads will have
>> > already cleaned up and dropped the reference to the channel, it might be
>> > too late then.
>> > 
>> > In the multifd threads, we'll be wasting (like we are today) the extra
>> > syscalls after the first sync succeeds.
>> > 
>> > >> 
>> > >> No matter which thread does that sync, it's still a pity that it'll go into
>> > >> UNINTERRUPTIBLE during fdatasync(), then whoever wants to e.g. attach a gdb
>> > >> onto it to have a look will also hang.
>> > >
>> > > Or... would it be nicer if we got rid of the fdatasync() and left that to
>> > > the upper layers?  QEMU already supported file: migration and it never
>> > > managed cache behavior; thinking about it, this does smell like something
>> > > that shouldn't be done in QEMU, and mapped-ram is nothing special to me in
>> > > this regard.
>> > >
>> > > The user should be able to control that either manually (sync), or Libvirt
>> > > can do that after QEMU quits; after all, Libvirt holds the fd itself?  That
>> > > should let us get rid of the UNINTERRUPTIBLE / un-debuggable period of QEMU
>> > > described above.  Another side benefit: rather than holding all of QEMU's
>> > > resources (especially guest RAM) while waiting for a super slow disk flush,
>> > > Libvirt / the upper layer can do that separately after releasing all the
>> > > QEMU resources first.
>> > 
>> > I like the idea of QEMU having a self-contained
>> > implementation. Especially since we'll add O_DIRECT support, which is
>> > already quite heavy-handed if we're talking about managing cache
>> > behavior.
>
> O_DIRECT is optionally selected by the user by setting the new parameter
> first, so the user is still in full control - it's still the user's decision
> how the cache should be managed, even if QEMU needs explicit changes to
> support and expose the new parameter.
>
> For fdatasync(), I think it's slightly different in that it doesn't require
> anything to be implemented in QEMU: the snapshot is always in the form of a
> file, and a file is a pretty common concept that already supports sync
> semantics on its own.  Instead of providing yet another parameter to control
> it, we can just avoid that fdatasync().
>
> Besides the reasons I already described above, I think it's also legitimate
> for a user to temporarily flush a VM to disk (in paused state), run some
> RAM-intensive loads (which can immediately make use of the guest's RAM once
> it is freed, but may _not_ always require a page cache flush), then relaunch
> the VM.  In that case keeping some cache around might already help to speed
> up relaunching by avoiding unnecessary swap-ins/swap-outs.
>
>> > 
>> > However, it's not trivial to find the right place to add the sync.
>> > Wherever we put it there will be some implications, such as ensuring the
>> > sync works even after migration failure, avoiding concurrent cleanup,
>> > etc.
>> > 
>> > In any case, I don't think it's correct to have the sync at
>> > qio_channel_close(), now that we've seen it might block for a long
>> > time. We could at the very least have a qio_channel_flush()[1] which the
>> > QIOChannelFile implements with fdatasync(). Then the clients can choose
>> > when to sync.
>> 
>> Yes, I agree with de-coupling it.
>
> Yes, that decoupling makes sense to me.  That definitely clears up some of my
> previous confusion.
>
> The next question is whether we should require a qio_channel_flush() by
> default anywhere around the end of migration for mapped-ram; I lean towards
> removing it completely.  In any case, considering how long it could hang
> QEMU (possibly for minutes), we may want to change that behavior for 9.0 if
> possible.

Ok, I'll remove it for 9.0 then. And I guess I'll also remove the flush
completely since there are no other users except for migration.