Patchwork [7/8] block: add witness argument to drive-reopen

Submitter Paolo Bonzini
Date April 13, 2012, 4:23 p.m.
Message ID <1334334198-30899-8-git-send-email-pbonzini@redhat.com>
Permalink /patch/152346/
State New

Comments

Paolo Bonzini - April 13, 2012, 4:23 p.m.
Management needs a way for QEMU to confirm that no I/O has been sent to the
target, only to the source.  To provide this guarantee we rely on a file
in local persistent storage.  QEMU receives a file descriptor via SCM_RIGHTS
and writes a single byte to it.  If the write fails, QEMU fails the
drive-reopen command too, and management knows that no I/O request has been
issued to the new destination.  Likewise, if management finds the file to
have nonzero size, it knows that the target is valid and that I/O requests
could indeed have been submitted to it.

The argument does not have an HMP equivalent.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 blockdev.c       |   24 +++++++++++++++++++++---
 hmp.c            |    2 +-
 qapi-schema.json |   10 +++++++++-
 qmp-commands.hx  |   12 +++++++++++-
 4 files changed, 42 insertions(+), 6 deletions(-)
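[Editor's note: for reference, a management application would drive the new argument roughly as follows: first hand the open witness file descriptor to QEMU over the monitor's UNIX socket with getfd (the fd itself travels as SCM_RIGHTS ancillary data), then name it in drive-reopen. This is a sketch; the fd name, device name, image path, and replies are illustrative rather than taken from the patch.]

```
-> { "execute": "getfd", "arguments": { "fdname": "witness0" } }
<- { "return": {} }
-> { "execute": "drive-reopen",
     "arguments": { "device": "ide0-hd0",
                    "new-image-file": "/mnt/target.qcow2",
                    "witness": "witness0" } }
<- { "return": {} }
```

A nonzero-sized witness file afterwards means the byte was written and the reopen completed; an empty file means no I/O ever reached the new image.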
Eric Blake - April 13, 2012, 10:32 p.m.
On 04/13/2012 10:23 AM, Paolo Bonzini wrote:
> Management needs a way for QEMU to confirm that no I/O has been sent to the
> target and not to the source.  To provide this guarantee we rely on a file
> in local persistent storage.  QEMU receives a file descriptor via SCM_RIGHTS
> and writes a single byte to it.  If it fails, it will fail the drive-reopen
> command too and management knows that no I/O request has been issued to the
> new destination.  Likewise, if management finds the file to have nonzero
> size it knows that the target is valid and that indeed I/O requests could
> have been submitted to it.

So if I understand correctly, the idea is that if just libvirtd goes
down after issuing 'drive-reopen' but before getting a response, then I
will miss both the return value and any event associated with
drive-reopen (whether that event is success or failure), and when
libvirtd restarts, 'query-block-jobs' no longer lists the job, so I
don't know whether the pivot happened.  However, if qemu stayed alive
during that time, I can at least use 'query-block' to see what qemu
thinks is open, and deduce from that.

But if the world conspires against me, such as libvirt going down, then
qemu completing the reopen, then the guest VM halting itself so that the
qemu process goes away, all before libvirt restarts, then I'm stuck
figuring out whether qemu finished the job (so that when I restart the
guest, I want to pivot the filename) or failed the job (so that when I
restart the guest, I want to revert to the source).  To do this, I now
have to create a new file on disk (not a pipe), pass in the fd in
advance, and then call drive-reopen, as well as record that filename as
the location where I will look as part of trying to re-establish
connections with the guest when libvirtd restarts.
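[Editor's note: the recovery decision described above can be sketched in C. The witness file path and function names are hypothetical, for illustration only; the policy, pivot on nonzero size and revert on an empty file, follows the guarantee stated in the commit message.]

```c
/* Sketch of the management-side recovery check after a libvirtd restart.
 * A nonzero witness file size means drive-reopen wrote its byte, so the
 * pivot to the new image completed; an empty file means no I/O was ever
 * issued to the target, so it is safe to revert to the source image. */
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

enum recovery_action {
    REVERT_TO_SOURCE,   /* witness empty: target untouched */
    PIVOT_TO_TARGET,    /* witness byte present: reopen completed */
    RECOVERY_UNKNOWN    /* witness file missing: cannot decide */
};

static enum recovery_action check_witness(const char *path)
{
    struct stat st;

    if (stat(path, &st) < 0) {
        return RECOVERY_UNKNOWN;
    }
    return st.st_size > 0 ? PIVOT_TO_TARGET : REVERT_TO_SOURCE;
}
```

The same check works no matter how the qemu process went away, which is exactly the point of keeping the witness in persistent storage rather than in a pipe.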

I'm not quite sure how to expose this to upper-layer management
applications when they are using libvirt transient guests, but that's
not qemu's problem.  [In particular, it sounds like the sort of thing
that I can't cram into virDomainBlockRebase for RHEL 6.3 based on
libvirt 0.9.10, but would have to consider for a more powerful
virDomainBlockCopy for libvirt 0.9.12 or later.]  Overall, the idea
sounds workable, and does seem to offer an extra measure of protection
for recovery to uncorrupted data after a worst-case drive-reopen failure.

> +++ b/hmp.c
> @@ -744,7 +744,7 @@ void hmp_drive_reopen(Monitor *mon, const QDict *qdict)
>      const char *format = qdict_get_try_str(qdict, "format");
>      Error *errp = NULL;
>  
> -    qmp_drive_reopen(device, filename, !!format, format, &errp);
> +    qmp_drive_reopen(device, filename, !!format, format, false, NULL, &errp);
>      hmp_handle_error(mon, &errp);
>  }
>  
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 0bf3a25..2e5a925 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -1228,6 +1228,13 @@
>  #
>  # @format: #optional the format of the new image, default is 'qcow2'.
>  #
> +# @witness: A file descriptor name that was passed via getfd.  QEMU will write

Mark this #optional

> +#   a single byte to this file descriptor before completing the command
> +#   successfully.  If the byte is not written to the file, it is
> +#   guaranteed that the guest has not issued any I/O to the new image.
> +#   Failure to write the byte is fatal just like failure to open the new
> +#   image, and will cause the guest to revert to the currently open file.

Still seems like something that could fit if we get 'drive-reopen'
shoehorned into 'transaction' in the future.

Question - I know that 'drive-reopen' forces a block_job_cancel_sync()
call before closing the source; how long can that take?  After all, we
recently made block-job-cancel asynchronous (with block_job_cancel_sync
a wrapper around the asynchronous version that waits for things to
settle).  So that does mean that a call to 'drive-reopen' could indeed
take a very long time from initially sending the monitor command before
I finally get a response of success or failure, and that while the
response will be accurate, the whole intent of this patch is that
libvirt might not be around to get the response, so we want something a
bit more persistent.  Does this mean that if we add 'drive-reopen' to
'transaction', that transaction will be forced to wait for
block_job_cancel_sync?  And while it is waiting, are we locked out from
all other monitor commands?  Does this argue that 'transaction' needs to
gain an asynchronous mode of operation, where you request a transaction
but with immediate return, and then can issue other monitor commands
while waiting for acknowledgment of whether the transaction could be
acted on?  But these questions start to sound like they are geared more
to qemu 1.2, when we spend more time thinking about the situation of
adding asynchronous commands and properly managing which commands are
long-running vs. asynchronous.
Paolo Bonzini - April 16, 2012, 9:03 a.m.
Il 14/04/2012 00:32, Eric Blake ha scritto:
> But if the world conspires against me, such as libvirt going down, then
> qemu completing the reopen, then the guest VM halting itself so that the
> qemu process goes away, all before libvirt restarts, then I'm stuck
> figuring out whether qemu finished the job (so that when I restart the
> guest, I want to pivot the filename) or failed the job (so that when I
> restart the guest, I want to revert to the source).  To do this, I now
> have to create a new file on disk (not a pipe), pass in the fd in
> advance, and then call drive-reopen, as well as record that filename as
> the location where I will look as part of trying to re-establish
> connections with the guest when libvirtd restarts.

Yes.

> I'm not quite sure how to expose this to upper-layer management
> applications when they are using libvirt transient guests, but that's
> not qemu's problem.

Do transient guests have persistent storage for them in /var while they
are running?

>> diff --git a/qapi-schema.json b/qapi-schema.json
>> index 0bf3a25..2e5a925 100644
>> --- a/qapi-schema.json
>> +++ b/qapi-schema.json
>> @@ -1228,6 +1228,13 @@
>>  #
>>  # @format: #optional the format of the new image, default is 'qcow2'.
>>  #
>> +# @witness: A file descriptor name that was passed via getfd.  QEMU will write
> 
> Mark this #optional

Yes.

> Question - I know that 'drive-reopen' forces a block_job_cancel_sync()
> call before closing the source; how long can that take?

Not much; the mirroring job polls the dirty bitmap for new I/O every 100
ms, so it should be more or less comparable with the bdrv_drain_all that
is also performed by drive-reopen and blockdev-snapshot-sync.

> So that does mean that a call to 'drive-reopen' could indeed
> take a very long time from initially sending the monitor command before
> I finally get a response of success or failure, and that while the
> response will be accurate, the whole intent of this patch is that
> libvirt might not be around to get the response, so we want something a
> bit more persistent.  Does this mean that if we add 'drive-reopen' to
> 'transaction', that transaction will be forced to wait for
> block_job_cancel_sync?

Transactions are already waiting for pending I/O to complete
(bdrv_drain_all), and mirroring that I/O to the target
(block_job_cancel_sync) won't take much longer.

Paolo
Eric Blake - April 16, 2012, 9:55 p.m.
On 04/16/2012 03:03 AM, Paolo Bonzini wrote:
> Il 14/04/2012 00:32, Eric Blake ha scritto:
>> But if the world conspires against me, such as libvirt going down, then
>> qemu completing the reopen, then the guest VM halting itself so that the
>> qemu process goes away, all before libvirt restarts, then I'm stuck
>> figuring out whether qemu finished the job (so that when I restart the
>> guest, I want to pivot the filename) or failed the job (so that when I
>> restart the guest, I want to revert to the source).  To do this, I now
>> have to create a new file on disk (not a pipe), pass in the fd in
>> advance, and then call drive-reopen, as well as record that filename as
>> the location where I will look as part of trying to re-establish
>> connections with the guest when libvirtd restarts.
> 
> Yes.
> 
>> I'm not quite sure how to expose this to upper-layer management
>> applications when they are using libvirt transient guests, but that's
>> not qemu's problem.
> 
> Do transient guests have persistent storage for them in /var while they
> are running?

Yes - that's how libvirt tracks the pid of the qemu process that it
should be re-attaching to on a libvirtd restart, as well as several
other aspects (for example, the capabilities of the qemu binary running
that pid, in case qemu has been upgraded in the meantime).  It's just
that right now, the files that libvirt stores in /var are intended to be
internal to libvirt, so management apps shouldn't be poking around in
there, so much as having an interface to ask libvirt what state things
are in.

>> Question - I know that 'drive-reopen' forces a block_job_cancel_sync()
>> call before closing the source; how long can that take?
> 
> Not much; the mirroring job polls the dirty bitmap for new I/O every 100
> ms, so it should be more or less comparable with the bdrv_drain_all that
> is also performed by drive-reopen and blockdev-snapshot-sync.
> 
>> So that does mean that a call to 'drive-reopen' could indeed
>> take a very long time from initially sending the monitor command before
>> I finally get a response of success or failure, and that while the
>> response will be accurate, the whole intent of this patch is that
>> libvirt might not be around to get the response, so we want something a
>> bit more persistent.  Does this mean that if we add 'drive-reopen' to
>> 'transaction', that transaction will be forced to wait for
>> block_job_cancel_sync?
> 
> Transactions are already waiting for pending I/O to complete
> (bdrv_drain_all), and mirroring that I/O to the target
> (block_job_cancel_sync) won't take much longer.

In other words, 'block-job-cancel' (and commands built on top of it,
like 'drive-reopen') isn't instantaneous, but also is still easily in
the sub-second range.  Good to know.
Paolo Bonzini - April 17, 2012, 6:47 a.m.
Il 16/04/2012 23:55, Eric Blake ha scritto:
> > Do transient guests have persistent storage for them in /var while they
> > are running?
> 
> Yes - that's how libvirt tracks the pid of the qemu process that it
> should be re-attaching to on a libvirtd restart, as well as several
> other aspects (for example, the capabilities of the qemu binary running
> that pid, in case qemu has been upgraded in the meantime).  It's just
> that right now, the files that libvirt stores in /var are intended to be
> internal to libvirt, so management apps shouldn't be poking around in
> there, so much as having an interface to ask libvirt what state things
> are in.

Can libvirtd report the block-job-completed event to vdsm before the
transient-domain-died event?

Paolo
Eric Blake - April 17, 2012, 12:09 p.m.
[adding libvir-list for some interesting discussions on potential
libvirt design issues]

On 04/17/2012 12:47 AM, Paolo Bonzini wrote:
> Il 16/04/2012 23:55, Eric Blake ha scritto:
>>> Do transient guests have persistent storage for them in /var while they
>>> are running?
>>
>> Yes - that's how libvirt tracks the pid of the qemu process that it
>> should be re-attaching to on a libvirtd restart, as well as several
>> other aspects (for example, the capabilities of the qemu binary running
>> that pid, in case qemu has been upgraded in the meantime).  It's just
>> that right now, the files that libvirt stores in /var are intended to be
>> internal to libvirt, so management apps shouldn't be poking around in
>> there, so much as having an interface to ask libvirt what state things
>> are in.
> 
> Can libvirtd report the block-job-completed event to vdsm before the
> transient-domain-died event?

Yes, it should be possible to organize the code so that on libvirtd
restart, libvirt is careful to emit all other events, such as block-job
completed/failed, prior to the final domain died event as libvirt cleans
out its records of the dead domain.  But it probably involves quite a
bit of code cleanup on libvirt's side to reach that state.  There's also
the question of whether vdsm will be listening to the events soon
enough; after all, when libvirtd restarts, libvirt is doing quite a bit
of bookkeeping before even accepting connections, and if vdsm isn't
connected to the restarted libvirt at the time libvirt queues up an
event to be sent, then the event is lost.  Maybe that implies that
libvirt needs to have a configurable-depth array of prior events that
can be played back to the user on request.
Michal Privoznik - April 17, 2012, 12:42 p.m.
On 17.04.2012 14:09, Eric Blake wrote:
> [adding libvir-list for some interesting discussions on potential
> libvirt design issues]
> 
> On 04/17/2012 12:47 AM, Paolo Bonzini wrote:
>> Il 16/04/2012 23:55, Eric Blake ha scritto:
>>>> Do transient guests have persistent storage for them in /var while they
>>>> are running?
>>>
>>> Yes - that's how libvirt tracks the pid of the qemu process that it
>>> should be re-attaching to on a libvirtd restart, as well as several
>>> other aspects (for example, the capabilities of the qemu binary running
>>> that pid, in case qemu has been upgraded in the meantime).  It's just
>>> that right now, the files that libvirt stores in /var are intended to be
>>> internal to libvirt, so management apps shouldn't be poking around in
>>> there, so much as having an interface to ask libvirt what state things
>>> are in.
>>
>> Can libvirtd report the block-job-completed event to vdsm before the
>> transient-domain-died event?
> 
> Yes, it should be possible to organize the code so that on libvirtd
> restart, libvirt is careful to emit all other events, such as block-job
> completed/failed, prior to the final domain died event as libvirt cleans
> out its records of the dead domain.  But it probably involves quite a
> bit of code cleanup on libvirt's side to reach that state.  There's also
> the question of whether vdsm will be listening to the events soon
> enough; after all, when libvirtd restarts, libvirt is doing quite a bit
> of bookkeeping before even accepting connections, and if vdsm isn't
> connected to the restarted libvirt at the time libvirt queues up an
> event to be sent, then the event is lost.  Maybe that implies that
> libvirt needs to have a configurable-depth array of prior events that
> can be played back to the user on request.

That implies we need an API to hold back emitting events so the user can
re-register callbacks, that is, one API to store them in an internal array,
and another to replay the events, i.e. to flush the internal array.
Or, even better, replace the first API with a new switch passed to the
daemon, because on libvirtd's reconnection to the qemu sockets we will
instantly get all pending events, leaving user apps only a very short
window in which to issue such an API call (if any).

What I mean:
1. libvirtd --listen --hold-up-events
2. conn = virConnectOpen*(...);
3. virConnectDomainEventRegister*(conn, ...);
4. virConnectReplayEvents(conn);

Michal

Patch

diff --git a/blockdev.c b/blockdev.c
index 08953fa..78b72f2 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -660,7 +660,7 @@  void do_commit(Monitor *mon, const QDict *qdict)
 }
 
 static void change_blockdev_image(BlockDriverState *bs, const char *new_image_file,
-                                  const char *format, Error **errp)
+                                  const char *format, int fd, Error **errp)
 {
     BlockDriver *old_drv, *proto_drv;
     BlockDriver *drv = NULL;
@@ -702,6 +702,16 @@  static void change_blockdev_image(BlockDriverState *bs, const char *new_image_fi
 
     bdrv_close(bs);
     ret = bdrv_open(bs, new_image_file, flags, drv);
+
+    if (ret == 0 && fd != -1) {
+        ret = write(fd, "", 1) == 1 ? 0 : -1;
+        qemu_fdatasync(fd);
+        close(fd);
+        if (ret < 0) {
+            bdrv_close(bs);
+        }
+    }
+
     /*
      * If reopening the image file we just created fails, fall back
      * and try to re-open the original image. If that fails too, we
@@ -718,9 +725,20 @@  static void change_blockdev_image(BlockDriverState *bs, const char *new_image_fi
 }
 
 void qmp_drive_reopen(const char *device, const char *new_image_file,
-                      bool has_format, const char *format, Error **errp)
+                      bool has_format, const char *format,
+                      bool has_witness, const char *witness,
+                      Error **errp)
 {
     BlockDriverState *bs;
+    int fd = -1;
+
+    if (has_witness) {
+        fd = monitor_get_fd(cur_mon, witness);
+        if (fd == -1) {
+            error_set(errp, QERR_FD_NOT_FOUND, witness);
+            return;
+        }
+    }
 
     bs = bdrv_find(device);
     if (!bs) {
@@ -731,7 +749,7 @@  void qmp_drive_reopen(const char *device, const char *new_image_file,
         block_job_cancel_sync(bs->job);
     }
     change_blockdev_image(bs, new_image_file,
-                          has_format ? format : NULL, errp);
+                          has_format ? format : NULL, fd, errp);
 }
 
 static void blockdev_do_action(int kind, void *data, Error **errp)
diff --git a/hmp.c b/hmp.c
index 28697ec..f67c441 100644
--- a/hmp.c
+++ b/hmp.c
@@ -744,7 +744,7 @@  void hmp_drive_reopen(Monitor *mon, const QDict *qdict)
     const char *format = qdict_get_try_str(qdict, "format");
     Error *errp = NULL;
 
-    qmp_drive_reopen(device, filename, !!format, format, &errp);
+    qmp_drive_reopen(device, filename, !!format, format, false, NULL, &errp);
     hmp_handle_error(mon, &errp);
 }
 
diff --git a/qapi-schema.json b/qapi-schema.json
index 0bf3a25..2e5a925 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -1228,6 +1228,13 @@ 
 #
 # @format: #optional the format of the new image, default is 'qcow2'.
 #
+# @witness: A file descriptor name that was passed via getfd.  QEMU will write
+#   a single byte to this file descriptor before completing the command
+#   successfully.  If the byte is not written to the file, it is
+#   guaranteed that the guest has not issued any I/O to the new image.
+#   Failure to write the byte is fatal just like failure to open the new
+#   image, and will cause the guest to revert to the currently open file.
+#
 # Returns: nothing on success
 #          If @device is not a valid block device, DeviceNotFound
 #          If @new-image-file can't be opened, OpenFileFailed
@@ -1236,7 +1243,8 @@ 
 # Since 1.1
 ##
 { 'command': 'drive-reopen',
-  'data': { 'device': 'str', 'new-image-file': 'str', '*format': 'str' } }
+  'data': { 'device': 'str', 'new-image-file': 'str', '*format': 'str',
+            '*witness': 'str' } }
 
 ##
 # @human-monitor-command:
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 6ea0ef5..fedfc36 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -829,7 +829,7 @@  EQMP
 
     {
         .name       = "drive-reopen",
-        .args_type  = "device:B,new-image-file:s,format:s?",
+        .args_type  = "device:B,new-image-file:s,format:s?,witness:s?",
         .mhandler.cmd_new = qmp_marshal_input_drive_reopen,
     },
 
@@ -842,11 +842,21 @@  guest is expecting the drive to change its content, the new image should
 contain the same data of the current one.  One use case is to terminate
 a drive-mirror command.
 
+The command can optionally write a single byte to a file descriptor name
+that was passed via SCM rights (getfd).  QEMU will write a single byte
+to this file descriptor before completing the command successfully.
+If the byte is not written to the file, it is guaranteed that the
+guest has not issued any I/O to the new image.  Failure to write the
+byte is fatal just like failure to open the new image, and will cause
+the guest to revert to the currently open file.
+
+
 Arguments:
 
 - "device": device name to operate on (json-string)
 - "new-image-file": name of new image file (json-string)
 - "format": format of new image (json-string, optional)
+- "witness": file descriptor previously passed via SCM rights (json-string, optional)
 
 Example: