
migration: support file: uri for source migration

Message ID 20220908102633.123536-1-nborisov@suse.com
State New

Commit Message

Nikolay Borisov Sept. 8, 2022, 10:26 a.m. UTC
This is a prototype of supporting a 'file:' based uri protocol for
writing out the migration stream of qemu. Currently the code always
opens the file in DIO mode and adheres to an alignment of 64k to be
generic enough. However this comes with a problem - it requires copying
all data that we are writing (qemu metadata + guest ram pages) to a
bounce buffer so that we adhere to this alignment. With this code I get
the following performance results:

          DIO      exec: cat > file    virsh --bypass-cache
          82       77                  81
          82       78                  80
          80       80                  82
          82       82                  77
          77       79                  77

AVG:      80.6     79.2                79.4
stddev:   1.959    1.720               2.05

All numbers are in seconds.

Those results are somewhat surprising to me as I'd expected doing the
writeout directly within qemu and avoiding copying between qemu and
virsh's iohelper process would result in a speed up. Clearly that's not
the case; I attribute this to the fact that all memory pages have to be
copied into the bounce buffer. There is more measurement/profiling
work that I'd have to do in order to (dis)prove this hypothesis and I will
report back when I have the data.

However I'm sending the code now as I'd like to facilitate a discussion
as to whether this is an approach that would be acceptable for upstream
merging. Any ideas/comments would be much appreciated.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 include/io/channel-file.h |   1 +
 include/io/channel.h      |   1 +
 io/channel-file.c         |  17 +++++++
 migration/meson.build     |   1 +
 migration/migration.c     |   4 ++
 migration/migration.h     |   2 +
 migration/qemu-file.c     | 104 +++++++++++++++++++++++++++++++++++++-
 7 files changed, 129 insertions(+), 1 deletion(-)

--
2.25.1

Comments

Daniel P. Berrangé Sept. 12, 2022, 3:41 p.m. UTC | #1
On Thu, Sep 08, 2022 at 01:26:32PM +0300, Nikolay Borisov wrote:
> This is a prototype of supporting a 'file:' based uri protocol for
> writing out the migration stream of qemu. Currently the code always
> opens the file in DIO mode and adheres to an alignment of 64k to be
> generic enough. However this comes with a problem - it requires copying
> all data that we are writing (qemu metadata + guest ram pages) to a
> bounce buffer so that we adhere to this alignment.

The ad hoc device metadata clearly needs bounce buffers since it
is splattered all over RAM with no concern for alignment. The use
of bounce buffers for this shouldn't be a performance issue though
as metadata is small relative to the size of the snapshot as a whole.

The guest RAM pages should not need bounce buffers at all when using
huge pages, as alignment will already be way larger than we require.
Guests with huge pages are the ones which are likely to have huge
RAM sizes and thus need the DIO mode, so we should be sorted for that.

When using small pages for guest RAM, if it is not already allocated
with suitable alignment, I feel like we should be able to make it
so that we allocate the RAM block with good alignment to avoid the
need for bounce buffers. This would address the less common case of
a guest with huge RAM size but not huge pages.

Thus if we assume guest RAM is suitably aligned, then we can avoid
bounce buffers for RAM pages, while still using bounce buffers for
the metadata.
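
As a rough illustration of the split described above (not code from this patch), whether a write can bypass the bounce buffer comes down to an alignment check on the buffer address, the length and the file offset. The helper name `dio_can_write_directly` and the `DIO_ALIGN` constant are hypothetical; 64 KiB mirrors the alignment the prototype uses, while the actual O_DIRECT requirement depends on the filesystem and device.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed alignment: the prototype uses 64 KiB to be generic. */
#define DIO_ALIGN ((size_t)64 * 1024)

/*
 * Hypothetical helper (not part of the patch): returns true when a buffer
 * could be handed to an O_DIRECT write as-is, i.e. its address, its length
 * and the target file offset are all multiples of DIO_ALIGN. Anything else
 * (such as device metadata) would still go through a bounce buffer.
 */
static bool dio_can_write_directly(const void *buf, size_t len,
                                   uint64_t file_off)
{
    return ((uintptr_t)buf % DIO_ALIGN) == 0 &&
           (len % DIO_ALIGN) == 0 &&
           (file_off % DIO_ALIGN) == 0;
}
```

With a check like this, only the misaligned writes would pay for the extra copy.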

>                                                    With this code I get
> the following performance results:
> 
>           DIO      exec: cat > file    virsh --bypass-cache
>           82       77                  81
>           82       78                  80
>           80       80                  82
>           82       82                  77
>           77       79                  77
> 
> AVG:      80.6     79.2                79.4
> stddev:   1.959    1.720               2.05
> 
> All numbers are in seconds.
> 
> Those results are somewhat surprising to me as I'd expected doing the
> writeout directly within qemu and avoiding copying between qemu and
> virsh's iohelper process would result in a speed up. Clearly that's not
> the case; I attribute this to the fact that all memory pages have to be
> copied into the bounce buffer. There is more measurement/profiling
> work that I'd have to do in order to (dis)prove this hypothesis and I will
> report back when I have the data.

When using the libvirt iohelper we have multiple CPUs involved. IOW the
bounce buffer copy is taking place on a separate CPU from the QEMU
migration loop. This ability to use multiple CPUs may well have balanced
out any benefit from doing DIO on the QEMU side.

If you eliminate bounce buffers for guest RAM and write it directly to
the fixed location on disk, then we should see the benefit - and if not
then something is really wrong in our thoughts.

> However I'm sending the code now as I'd like to facilitate a discussion
> as to whether this is an approach that would be acceptable for upstream
> merging. Any ideas/comments would be much appreciated.

AFAICT this impl is still using the existing on-disk format, where RAM
pages are just written inline to the stream. For DIO benefit to be
maximised we need the on-disk format to be changed, so that the guest
RAM regions can be directly associated with fixed locations on disk.
This also means that if the guest dirties RAM while it's being saved, then we
overwrite the existing content on disk, such that restore only ever
needs to restore each RAM page once, instead of restoring every
dirtied version.


With regards,
Daniel
Nikolay Borisov Sept. 12, 2022, 4:30 p.m. UTC | #2
On 12.09.22 г. 18:41 ч., Daniel P. Berrangé wrote:
> On Thu, Sep 08, 2022 at 01:26:32PM +0300, Nikolay Borisov wrote:
>> This is a prototype of supporting a 'file:' based uri protocol for
>> writing out the migration stream of qemu. Currently the code always
>> opens the file in DIO mode and adheres to an alignment of 64k to be
>> generic enough. However this comes with a problem - it requires copying
>> all data that we are writing (qemu metadata + guest ram pages) to a
>> bounce buffer so that we adhere to this alignment.
> 
> The ad hoc device metadata clearly needs bounce buffers since it
> is splattered all over RAM with no concern for alignment. The use
> of bounce buffers for this shouldn't be a performance issue though
> as metadata is small relative to the size of the snapshot as a whole.

Bounce buffers can be eliminated altogether so long as we simply switch 
between buffered/DIO mode via fcntl.
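
A minimal sketch of what switching modes via fcntl could look like, assuming Linux, where `F_SETFL` is allowed to change `O_DIRECT` on an open descriptor. The helper name `fd_set_direct_io` is hypothetical; the patch itself only adds `qio_channel_file_disable_dio()`.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* O_DIRECT is a Linux extension */
#endif
#include <fcntl.h>
#include <stdbool.h>

/*
 * Hypothetical helper: switch O_DIRECT on or off for an open descriptor.
 * Returns 0 on success, -1 on failure with errno set by fcntl().
 */
static int fd_set_direct_io(int fd, bool enable)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1) {
        return -1;
    }
    flags = enable ? (flags | O_DIRECT) : (flags & ~O_DIRECT);
    return fcntl(fd, F_SETFL, flags) == -1 ? -1 : 0;
}
```

The idea would be to disable O_DIRECT around unaligned metadata writes and enable it again before writing aligned RAM pages. Note that enabling it can fail on filesystems that do not support direct I/O.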

> 
> The guest RAM pages should not need bounce buffers at all when using
> huge pages, as alignment will already be way larger than we require.
> Guests with huge pages are the ones which are likely to have huge
> RAM sizes and thus need the DIO mode, so we should be sorted for that.
> 
> When using small pages for guest RAM, if it is not already allocated
> with suitable alignment, I feel like we should be able to make it
> so that we allocate the RAM block with good alignment to avoid the
> need for bounce buffers. This would address the less common case of
> a guest with huge RAM size but not huge pages.

RAM blocks are generally allocated with good alignment since they are
mmap()ed. However, as I was toying with eliminating bounce buffers for
RAM, I hit an issue: the page headers being written (8 bytes each)
aren't aligned (naturally). I think the on-disk format can be
changed in the following way:


<ramblock header, containing base address of ramblock>; each subsequent
page is then written at an offset from the base address of the ramblock,
that is, its index would be:

page_offset = page_addr - ramblock_base. The page is then written at
ramblock_base (in the file) + page_offset. This would eliminate the page
headers altogether, which leaves only the initial ramblock header to be
aligned. However, this would lead to us potentially having to issue 1
lseek per page written, to adjust the file position, which might
not be a problem in itself but who knows. How does that sound to you?
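
The offset arithmetic proposed above can be sketched as follows. This is an illustration only: `page_file_offset` and its parameter names are hypothetical, and `block_file_base` stands for wherever the ramblock's data region would start in the file (assumed to be suitably aligned).

```c
#include <stdint.h>

/*
 * Hypothetical helper: file position of a guest page under the proposed
 * layout. ramblock_base is the base address of the RAM block, page_addr is
 * the address of the page being saved, and block_file_base is the (assumed
 * aligned) position in the file where that RAM block's data begins.
 */
static uint64_t page_file_offset(uint64_t block_file_base,
                                 uint64_t ramblock_base,
                                 uint64_t page_addr)
{
    uint64_t page_offset = page_addr - ramblock_base;

    return block_file_base + page_offset;
}
```

Because the position depends only on the page's address, saving the same page again targets the same location in the file.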

> 
> Thus if we assume guest RAM is suitably aligned, then we can avoid
> bounce buffers for RAM pages, while still using bounce buffers for
> the metadata.
> 
>>                                                     With this code I get
>> the following performance results:
>>
>>           DIO      exec: cat > file    virsh --bypass-cache
>>           82       77                  81
>>           82       78                  80
>>           80       80                  82
>>           82       82                  77
>>           77       79                  77
>>
>> AVG:      80.6     79.2                79.4
>> stddev:   1.959    1.720               2.05
>>
>> All numbers are in seconds.
>>
>> Those results are somewhat surprising to me as I'd expected doing the
>> writeout directly within qemu and avoiding copying between qemu and
>> virsh's iohelper process would result in a speed up. Clearly that's not
>> the case; I attribute this to the fact that all memory pages have to be
>> copied into the bounce buffer. There is more measurement/profiling
>> work that I'd have to do in order to (dis)prove this hypothesis and I will
>> report back when I have the data.
> 
> When using the libvirt iohelper we have multiple CPUs involved. IOW the
> bounce buffer copy is taking place on a separate CPU from the QEMU
> migration loop. This ability to use multiple CPUs may well have balanced
> out any benefit from doing DIO on the QEMU side.
> 
> If you eliminate bounce buffers for guest RAM and write it directly to
> the fixed location on disk, then we should see the benefit - and if not
> then something is really wrong in our thoughts.
> 
>> However I'm sending the code now as I'd like to facilitate a discussion
>> as to whether this is an approach that would be acceptable for upstream
>> merging. Any ideas/comments would be much appreciated.
> 
> AFAICT this impl is still using the existing on-disk format, where RAM
> pages are just written inline to the stream. For DIO benefit to be
> maximised we need the on-disk format to be changed, so that the guest
> RAM regions can be directly associated with fixed locations on disk.
> This also means that if the guest dirties RAM while it's being saved, then we
> overwrite the existing content on disk, such that restore only ever
> needs to restore each RAM page once, instead of restoring every
> dirtied version.
> 
> 
> With regards,
> Daniel
Daniel P. Berrangé Sept. 12, 2022, 4:43 p.m. UTC | #3
On Mon, Sep 12, 2022 at 07:30:50PM +0300, Nikolay Borisov wrote:
> 
> 
> On 12.09.22 г. 18:41 ч., Daniel P. Berrangé wrote:
> > On Thu, Sep 08, 2022 at 01:26:32PM +0300, Nikolay Borisov wrote:
> > > This is a prototype of supporting a 'file:' based uri protocol for
> > > writing out the migration stream of qemu. Currently the code always
> > > opens the file in DIO mode and adheres to an alignment of 64k to be
> > > generic enough. However this comes with a problem - it requires copying
> > > all data that we are writing (qemu metadata + guest ram pages) to a
> > > bounce buffer so that we adhere to this alignment.
> > 
> > The ad hoc device metadata clearly needs bounce buffers since it
> > is splattered all over RAM with no concern for alignment. The use
> > of bounce buffers for this shouldn't be a performance issue though
> > as metadata is small relative to the size of the snapshot as a whole.
> 
> Bounce buffers can be eliminated altogether so long as we simply switch
> between buffered/DIO mode via fcntl.
> 
> > 
> > The guest RAM pages should not need bounce buffers at all when using
> > huge pages, as alignment will already be way larger than we require.
> > Guests with huge pages are the ones which are likely to have huge
> > RAM sizes and thus need the DIO mode, so we should be sorted for that.
> > 
> > When using small pages for guest RAM, if it is not already allocated
> > with suitable alignment, I feel like we should be able to make it
> > so that we allocate the RAM block with good alignment to avoid the
> > need for bounce buffers. This would address the less common case of
> > a guest with huge RAM size but not huge pages.
> 
> RAM blocks are generally allocated with good alignment since they are
> mmap()ed. However, as I was toying with eliminating bounce buffers for RAM, I
> hit an issue: the page headers being written (8 bytes each) aren't
> aligned (naturally). I think the on-disk format can be changed in the
> following way:
> 
> 
> <ramblock header, containing base address of ramblock>; each subsequent page
> is then written at an offset from the base address of the ramblock, that is,
> its index would be:
> 
> page_offset = page_addr - ramblock_base. The page is then written at
> ramblock_base (in the file) + page_offset. This would eliminate the page
> headers altogether, which leaves only the initial ramblock header to be
> aligned. However, this would lead to us potentially having to issue 1
> lseek per page written, to adjust the file position, which might not
> be a problem in itself but who knows. How does that sound to you?

Yes, definitely. We don't want the headers in front of each page,
just one single large block. Looking forward to multi-fd, we don't
want to be using lseek at all, because that changes the file offset
for all threads using the FD. Instead we need to be able to use
pread/pwrite for writing the RAM blocks.
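
To illustrate the pwrite approach, here is a minimal sketch, not taken from the patch, of writing one page at an explicit file offset. The helper name `write_page_at` is hypothetical. Since `pwrite()` takes the offset as an argument and leaves the descriptor's shared file position untouched, several threads can write to the same FD without interfering with each other.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700 /* for pwrite() */
#endif
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes of a page at a fixed file offset,
 * retrying on EINTR and short writes. Returns 0 on success, -1 on error.
 * (With O_DIRECT, resuming after a short write would also have to respect
 * the alignment rules; that is ignored in this sketch.)
 */
static int write_page_at(int fd, const void *page, size_t len, off_t file_off)
{
    const uint8_t *p = page;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, file_off);

        if (n < 0 && errno == EINTR) {
            continue; /* interrupted before writing anything, retry */
        }
        if (n <= 0) {
            return -1;
        }
        p += n;
        len -= (size_t)n;
        file_off += n;
    }
    return 0;
}
```

Writing a dirtied page again with the same offset simply overwrites the earlier copy, so a restore would only need to read each page once.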

With regards,
Daniel

Patch

diff --git a/include/io/channel-file.h b/include/io/channel-file.h
index 50e8eb113868..6cb0b698c62c 100644
--- a/include/io/channel-file.h
+++ b/include/io/channel-file.h
@@ -89,4 +89,5 @@  qio_channel_file_new_path(const char *path,
                           mode_t mode,
                           Error **errp);

+void qio_channel_file_disable_dio(QIOChannelFile *ioc);
 #endif /* QIO_CHANNEL_FILE_H */
diff --git a/include/io/channel.h b/include/io/channel.h
index c680ee748021..6127ff6c0626 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -41,6 +41,7 @@  enum QIOChannelFeature {
     QIO_CHANNEL_FEATURE_SHUTDOWN,
     QIO_CHANNEL_FEATURE_LISTEN,
     QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY,
+    QIO_CHANNEL_FEATURE_DIO,
 };


diff --git a/io/channel-file.c b/io/channel-file.c
index b67687c2aa64..5c7211b128f1 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -59,6 +59,10 @@  qio_channel_file_new_path(const char *path,
         return NULL;
     }

+    if (flags & O_DIRECT) {
+        qio_channel_set_feature(QIO_CHANNEL(ioc), QIO_CHANNEL_FEATURE_DIO);
+    }
+
     trace_qio_channel_file_new_path(ioc, path, flags, mode, ioc->fd);

     return ioc;
@@ -109,6 +113,19 @@  static ssize_t qio_channel_file_readv(QIOChannel *ioc,
     return ret;
 }

+void qio_channel_file_disable_dio(QIOChannelFile *ioc)
+{
+    int flags = fcntl(ioc->fd, F_GETFL);
+    if (flags == -1) {
+        error_report("Can't read file status flags");
+        return;
+    }
+
+    if (fcntl(ioc->fd, F_SETFL, (flags & ~O_DIRECT)) == -1) {
+        error_report("Can't disable O_DIRECT");
+    }
+}
+
 static ssize_t qio_channel_file_writev(QIOChannel *ioc,
                                        const struct iovec *iov,
                                        size_t niov,
diff --git a/migration/meson.build b/migration/meson.build
index 690487cf1a81..30a8392701c3 100644
--- a/migration/meson.build
+++ b/migration/meson.build
@@ -17,6 +17,7 @@  softmmu_ss.add(files(
   'colo.c',
   'exec.c',
   'fd.c',
+  'file.c',
   'global_state.c',
   'migration.c',
   'multifd.c',
diff --git a/migration/migration.c b/migration/migration.c
index bb8bbddfe467..e7e84ae12066 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -20,6 +20,7 @@ 
 #include "migration/blocker.h"
 #include "exec.h"
 #include "fd.h"
+#include "file.h"
 #include "socket.h"
 #include "sysemu/runstate.h"
 #include "sysemu/sysemu.h"
@@ -2414,6 +2415,8 @@  void qmp_migrate(const char *uri, bool has_blk, bool blk,
         exec_start_outgoing_migration(s, p, &local_err);
     } else if (strstart(uri, "fd:", &p)) {
         fd_start_outgoing_migration(s, p, &local_err);
+    } else if (strstart(uri, "file:", &p)) {
+        file_start_outgoing_migration(s, p, &local_err);
     } else {
         if (!(has_resume && resume)) {
             yank_unregister_instance(MIGRATION_YANK_INSTANCE);
@@ -4307,6 +4310,7 @@  void migration_global_dump(Monitor *mon)
 static Property migration_properties[] = {
     DEFINE_PROP_BOOL("store-global-state", MigrationState,
                      store_global_state, true),
+    DEFINE_PROP_BOOL("use-direct", MigrationState, use_dio, false),
     DEFINE_PROP_BOOL("send-configuration", MigrationState,
                      send_configuration, true),
     DEFINE_PROP_BOOL("send-section-footer", MigrationState,
diff --git a/migration/migration.h b/migration/migration.h
index cdad8aceaaab..fa1a996bdd4e 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -336,6 +336,8 @@  struct MigrationState {
      */
     bool store_global_state;

+    bool use_dio;
+
     /* Whether we send QEMU_VM_CONFIGURATION during migration */
     bool send_configuration;
     /* Whether we send section footer during migration */
diff --git a/migration/qemu-file.c b/migration/qemu-file.c
index 4f400c2e5265..18a2fefccd00 100644
--- a/migration/qemu-file.c
+++ b/migration/qemu-file.c
@@ -30,9 +30,14 @@ 
 #include "qemu-file.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "qemu/memalign.h"
+#include "qemu/error-report.h"
+#include "migration.h"
+#include "io/channel-file.h"

 #define IO_BUF_SIZE 32768
 #define MAX_IOV_SIZE MIN_CONST(IOV_MAX, 64)
+#define DIO_BUF_SIZE (8*IO_BUF_SIZE)

 struct QEMUFile {
     const QEMUFileHooks *hooks;
@@ -56,6 +61,8 @@  struct QEMUFile {
     int buf_index;
     int buf_size; /* 0 when writing */
     uint8_t buf[IO_BUF_SIZE];
+    uint8_t *dio_buf;
+    int dio_buf_index;

     DECLARE_BITMAP(may_free, MAX_IOV_SIZE);
     struct iovec iov[MAX_IOV_SIZE];
@@ -65,6 +72,7 @@  struct QEMUFile {
     Error *last_error_obj;
     /* has the file has been shutdown */
     bool shutdown;
+    bool closing;
 };

 /*
@@ -115,6 +123,7 @@  static QEMUFile *qemu_file_new_impl(QIOChannel *ioc, bool is_writable)
     object_ref(ioc);
     f->ioc = ioc;
     f->is_writable = is_writable;
+    f->dio_buf = qemu_memalign(64*1024, DIO_BUF_SIZE);

     return f;
 }
@@ -260,6 +269,76 @@  static void qemu_iovec_release_ram(QEMUFile *f)
     memset(f->may_free, 0, sizeof(f->may_free));
 }

+#define in_range(b, first, len) ((uintptr_t)(b) >= (uintptr_t)(first) && (uintptr_t)(b) < (uintptr_t)(first) + (len))
+
+static void qemu_fflush_dio(QEMUFile *f)
+{
+    do {
+        int i;
+        int new_ioveccnt = 0;
+        for (i = 0; i < f->iovcnt && f->dio_buf_index < DIO_BUF_SIZE; i++) {
+            struct iovec *vec = &f->iov[i];
+            size_t copy_len = MIN(vec->iov_len, DIO_BUF_SIZE - f->dio_buf_index);
+
+            /* if the iovec contains inline buf, adjust buf_index
+             * accordingly
+             */
+            if (in_range(vec->iov_base, f->buf, IO_BUF_SIZE)) {
+                f->buf_index -= copy_len;
+            }
+
+            memcpy(f->dio_buf + f->dio_buf_index, vec->iov_base, copy_len);
+            f->dio_buf_index += copy_len;
+            /* In case we couldn't fit the full iovec */
+            if (copy_len < vec->iov_len) {
+                /* partial write or no write at all */
+                vec->iov_base += copy_len;
+                vec->iov_len -= copy_len;
+                break;
+            }
+        }
+
+        new_ioveccnt = f->iovcnt - i;
+        /*
+         * DIO buf has been filled but we still have outstanding iovecs
+         * so shift them to the beginning of iov array for subsequent
+         * flushing
+         */
+        for (int j = 0; i < f->iovcnt; j++, i++) {
+            f->iov[j] = f->iov[i];
+        }
+        f->iovcnt = new_ioveccnt;
+
+
+        /*
+         * DIO BUFF is either full or this is the final flush, in the
+         * latter case it's guaranteed that the fd is now in buffered
+         * mode so we simply write anything which is outstanding
+         */
+        if (f->dio_buf_index == DIO_BUF_SIZE || f->closing) {
+            Error *local_error = NULL;
+            struct iovec dio_iovec = {.iov_base = f->dio_buf,
+                .iov_len = f->dio_buf_index };
+
+            /*
+             * This is the last flush so revert back to buffered
+             * write
+             */
+            if (f->closing) {
+                qio_channel_file_disable_dio(QIO_CHANNEL_FILE(f->ioc));
+            }
+
+            if (qio_channel_writev_all(f->ioc, &dio_iovec, 1, &local_error) < 0) {
+                qemu_file_set_error_obj(f, -EIO, local_error);
+            } else {
+                f->total_transferred += dio_iovec.iov_len;
+            }
+
+            f->dio_buf_index = 0;
+        }
+    } while (f->iovcnt);
+
+}

 /**
  * Flushes QEMUFile buffer
@@ -276,6 +355,12 @@  void qemu_fflush(QEMUFile *f)
     if (f->shutdown) {
         return;
     }
+
+    if (qio_channel_has_feature(f->ioc, QIO_CHANNEL_FEATURE_DIO)) {
+        qemu_fflush_dio(f);
+        return;
+    }
+
     if (f->iovcnt > 0) {
         Error *local_error = NULL;
         if (qio_channel_writev_all(f->ioc,
@@ -434,6 +519,8 @@  void qemu_file_credit_transfer(QEMUFile *f, size_t size)
 int qemu_fclose(QEMUFile *f)
 {
     int ret, ret2;
+
+    f->closing = true;
     qemu_fflush(f);
     ret = qemu_file_get_error(f);

@@ -450,6 +537,7 @@  int qemu_fclose(QEMUFile *f)
         ret = f->last_error;
     }
     error_free(f->last_error_obj);
+    qemu_vfree(f->dio_buf);
     g_free(f);
     trace_qemu_file_fclose();
     return ret;
@@ -706,6 +794,10 @@  int64_t qemu_file_total_transferred_fast(QEMUFile *f)
     int64_t ret = f->total_transferred;
     int i;

+    if (qio_channel_has_feature(f->ioc, QIO_CHANNEL_FEATURE_DIO)) {
+        ret += f->dio_buf_index;
+    }
+
     for (i = 0; i < f->iovcnt; i++) {
         ret += f->iov[i].iov_len;
     }
@@ -715,8 +807,18 @@  int64_t qemu_file_total_transferred_fast(QEMUFile *f)

 int64_t qemu_file_total_transferred(QEMUFile *f)
 {
+    int64_t total_transferred = 0;
     qemu_fflush(f);
-    return f->total_transferred;
+    total_transferred += f->total_transferred;
+    /*
+     * If we are a DIO channel then adjust total transferred with possible bytes
+     * which might not have been totally written but are in the staging dio
+     * buffer
+     */
+    if (qio_channel_has_feature(f->ioc, QIO_CHANNEL_FEATURE_DIO)) {
+        total_transferred += f->dio_buf_index;
+    }
+    return total_transferred;
 }

 int qemu_file_rate_limit(QEMUFile *f)