Message ID: 1334334198-30899-8-git-send-email-pbonzini@redhat.com
State: New
On 04/13/2012 10:23 AM, Paolo Bonzini wrote:
> Management needs a way for QEMU to confirm that no I/O has been sent to the target and not to the source. To provide this guarantee we rely on a file in local persistent storage. QEMU receives a file descriptor via SCM_RIGHTS and writes a single byte to it. If it fails, it will fail the drive-reopen command too and management knows that no I/O request has been issued to the new destination. Likewise, if management finds the file to have nonzero size it knows that the target is valid and that indeed I/O requests could have been submitted to it.

So if I understand correctly, the idea is that if just libvirtd goes down after issuing 'drive-reopen' but before getting a response, then I will miss both the return value and any event associated with drive-reopen (whether that event is success or failure), and when libvirtd restarts, 'query-block-jobs' no longer lists the job, so I don't know whether the pivot happened. However, if qemu stayed alive during that time, I can at least use 'query-block' to see what qemu thinks is open, and deduce from that.

But if the world conspires against me, such as libvirt going down, then qemu completing the reopen, then the guest VM halting itself so that the qemu process goes away, all before libvirt restarts, then I'm stuck figuring out whether qemu finished the job (so that when I restart the guest, I want to pivot the filename) or failed the job (so that when I restart the guest, I want to revert to the source).

To do this, I now have to create a new file on disk (not a pipe), pass in the fd in advance, and then call drive-reopen, as well as record that filename as the location where I will look as part of trying to re-establish connections with the guest when libvirtd restarts. I'm not quite sure how to expose this to upper-layer management applications when they are using libvirt transient guests, but that's not qemu's problem.
[In particular, it sounds like the sort of thing that I can't cram into virDomainBlockRebase for RHEL 6.3 based on libvirt 0.9.10, but would have to consider for a more powerful virDomainBlockCopy for libvirt 0.9.12 or later.]

Overall, the idea sounds workable, and does seem to offer an extra measure of protection for recovery to uncorrupted data after a worst-case drive-reopen failure.

> +++ b/hmp.c
> @@ -744,7 +744,7 @@ void hmp_drive_reopen(Monitor *mon, const QDict *qdict)
>      const char *format = qdict_get_try_str(qdict, "format");
>      Error *errp = NULL;
>
> -    qmp_drive_reopen(device, filename, !!format, format, &errp);
> +    qmp_drive_reopen(device, filename, !!format, format, false, NULL, &errp);
>      hmp_handle_error(mon, &errp);
>  }
>
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 0bf3a25..2e5a925 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -1228,6 +1228,13 @@
>  #
>  # @format: #optional the format of the new image, default is 'qcow2'.
>  #
> +# @witness: A file descriptor name that was passed via getfd. QEMU will write

Mark this #optional.

> +#           a single byte to this file descriptor before completing the command
> +#           successfully. If the byte is not written to the file, it is
> +#           guaranteed that the guest has not issued any I/O to the new image.
> +#           Failure to write the byte is fatal just like failure to open the new
> +#           image, and will cause the guest to revert to the currently open file.

Still seems like something that could fit if we get 'drive-reopen' shoehorned into 'transaction' in the future.

Question - I know that 'drive-reopen' forces a block_job_cancel_sync() call before closing the source; how long can that take? After all, we recently made block-job-cancel asynchronous (with block_job_cancel_sync a wrapper around the asynchronous version that waits for things to settle).
So that does mean that a call to 'drive-reopen' could indeed take a very long time from initially sending the monitor command before I finally get a response of success or failure, and that while the response will be accurate, the whole intent of this patch is that libvirt might not be around to get the response, so we want something a bit more persistent.

Does this mean that if we add 'drive-reopen' to 'transaction', that transaction will be forced to wait for block_job_cancel_sync? And while it is waiting, are we locked out from all other monitor commands? Does this argue that 'transaction' needs to gain an asynchronous mode of operation, where you request a transaction but with immediate return, and then can issue other monitor commands while waiting for acknowledgment of whether the transaction could be acted on?

But these questions start to sound like they are geared more to qemu 1.2, when we spend more time thinking about the situation of adding asynchronous commands and properly managing which commands are long-running vs. asynchronous.
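The recovery check Eric describes could be sketched as follows. This is an illustrative fragment, not libvirt code; the function name and witness path are hypothetical. It encodes the rule from the patch description: a nonzero-size witness file means QEMU wrote its byte before reporting success, so the pivot to the destination completed; an empty (or missing) file guarantees no I/O reached the target and it is safe to revert to the source.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical recovery-time helper: inspect the witness file whose fd
 * was handed to QEMU (via getfd/SCM_RIGHTS) before drive-reopen. */
static bool pivot_committed(const char *witness_path)
{
    struct stat st;

    if (stat(witness_path, &st) < 0) {
        /* No witness file at all: treat as "no pivot recorded". */
        return false;
    }
    /* QEMU writes exactly one byte before completing the command
     * successfully, so a nonzero size proves the reopen committed. */
    return st.st_size > 0;
}
```

After a libvirtd restart, a single stat() of the recorded witness path answers the "did the pivot happen" question even when both the QMP response and the qemu process itself are gone.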
On 14/04/2012 00:32, Eric Blake wrote:
> But if the world conspires against me, such as libvirt going down, then qemu completing the reopen, then the guest VM halting itself so that the qemu process goes away, all before libvirt restarts, then I'm stuck figuring out whether qemu finished the job (so that when I restart the guest, I want to pivot the filename) or failed the job (so that when I restart the guest, I want to revert to the source). To do this, I now have to create a new file on disk (not a pipe), pass in the fd in advance, and then call drive-reopen, as well as record that filename as the location where I will look as part of trying to re-establish connections with the guest when libvirtd restarts.

Yes.

> I'm not quite sure how to expose this to upper-layer management applications when they are using libvirt transient guests, but that's not qemu's problem.

Do transient guests have persistent storage for them in /var while they are running?

>> diff --git a/qapi-schema.json b/qapi-schema.json
>> index 0bf3a25..2e5a925 100644
>> --- a/qapi-schema.json
>> +++ b/qapi-schema.json
>> @@ -1228,6 +1228,13 @@
>>  #
>>  # @format: #optional the format of the new image, default is 'qcow2'.
>>  #
>> +# @witness: A file descriptor name that was passed via getfd. QEMU will write
>
> Mark this #optional

Yes.

> Question - I know that 'drive-reopen' forces a block_job_cancel_sync() call before closing the source; how long can that take?

Not much; the mirroring job polls the dirty bitmap for new I/O every 100 ms, so it should be more or less comparable with the bdrv_drain_all that is also performed by drive-reopen and blockdev-snapshot-sync.
> So that does mean that a call to 'drive-reopen' could indeed take a very long time from initially sending the monitor command before I finally get a response of success or failure, and that while the response will be accurate, the whole intent of this patch is that libvirt might not be around to get the response, so we want something a bit more persistent. Does this mean that if we add 'drive-reopen' to 'transaction', that transaction will be forced to wait for block_job_cancel_sync?

Transactions are already waiting for pending I/O to complete (bdrv_drain_all), and mirroring that I/O to the target (block_job_cancel_sync) won't take much longer.

Paolo
On 04/16/2012 03:03 AM, Paolo Bonzini wrote:
> On 14/04/2012 00:32, Eric Blake wrote:
>> But if the world conspires against me, such as libvirt going down, then qemu completing the reopen, then the guest VM halting itself so that the qemu process goes away, all before libvirt restarts, then I'm stuck figuring out whether qemu finished the job (so that when I restart the guest, I want to pivot the filename) or failed the job (so that when I restart the guest, I want to revert to the source). To do this, I now have to create a new file on disk (not a pipe), pass in the fd in advance, and then call drive-reopen, as well as record that filename as the location where I will look as part of trying to re-establish connections with the guest when libvirtd restarts.
>
> Yes.
>
>> I'm not quite sure how to expose this to upper-layer management applications when they are using libvirt transient guests, but that's not qemu's problem.
>
> Do transient guests have persistent storage for them in /var while they are running?

Yes - that's how libvirt tracks the pid of the qemu process that it should be re-attaching to on a libvirtd restart, as well as several other aspects (for example, the capabilities of the qemu binary running that pid, in case qemu has been upgraded in the meantime). It's just that right now, the files that libvirt stores in /var are intended to be internal to libvirt, so management apps shouldn't be poking around in there, so much as having an interface to ask libvirt what state things are in.

>> Question - I know that 'drive-reopen' forces a block_job_cancel_sync() call before closing the source; how long can that take?
>
> Not much; the mirroring job polls the dirty bitmap for new I/O every 100 ms, so it should be more or less comparable with the bdrv_drain_all that is also performed by drive-reopen and blockdev-snapshot-sync.
>> So that does mean that a call to 'drive-reopen' could indeed take a very long time from initially sending the monitor command before I finally get a response of success or failure, and that while the response will be accurate, the whole intent of this patch is that libvirt might not be around to get the response, so we want something a bit more persistent. Does this mean that if we add 'drive-reopen' to 'transaction', that transaction will be forced to wait for block_job_cancel_sync?
>
> Transactions are already waiting for pending I/O to complete (bdrv_drain_all), and mirroring that I/O to the target (block_job_cancel_sync) won't take much longer.

In other words, 'block-job-cancel' (and commands built on top of it, like 'drive-reopen') isn't instantaneous, but is still easily in the sub-second range. Good to know.
On 16/04/2012 23:55, Eric Blake wrote:
>> Do transient guests have persistent storage for them in /var while they are running?
>
> Yes - that's how libvirt tracks the pid of the qemu process that it should be re-attaching to on a libvirtd restart, as well as several other aspects (for example, the capabilities of the qemu binary running that pid, in case qemu has been upgraded in the meantime). It's just that right now, the files that libvirt stores in /var are intended to be internal to libvirt, so management apps shouldn't be poking around in there, so much as having an interface to ask libvirt what state things are in.

Can libvirtd report the block-job-completed event to vdsm before the transient-domain-died event?

Paolo
[adding libvir-list for some interesting discussions on potential libvirt design issues]

On 04/17/2012 12:47 AM, Paolo Bonzini wrote:
> On 16/04/2012 23:55, Eric Blake wrote:
>>> Do transient guests have persistent storage for them in /var while they are running?
>>
>> Yes - that's how libvirt tracks the pid of the qemu process that it should be re-attaching to on a libvirtd restart, as well as several other aspects (for example, the capabilities of the qemu binary running that pid, in case qemu has been upgraded in the meantime). It's just that right now, the files that libvirt stores in /var are intended to be internal to libvirt, so management apps shouldn't be poking around in there, so much as having an interface to ask libvirt what state things are in.
>
> Can libvirtd report the block-job-completed event to vdsm before the transient-domain-died event?

Yes, it should be possible to organize the code so that on libvirtd restart, libvirt is careful to emit all other events, such as block-job completed/failed, prior to the final domain-died event as libvirt cleans out its records of the dead domain. But it probably involves quite a bit of code cleanup on libvirt's side to reach that state.

There's also the question of whether vdsm will be listening for the events soon enough; after all, when libvirtd restarts, libvirt is doing quite a bit of bookkeeping before even accepting connections, and if vdsm isn't connected to the restarted libvirtd at the time libvirt queues up an event to be sent, then the event is lost. Maybe that implies that libvirt needs to have a configurable-depth array of prior events that can be played back to the user on request.
On 17.04.2012 14:09, Eric Blake wrote:
> [adding libvir-list for some interesting discussions on potential libvirt design issues]
>
> On 04/17/2012 12:47 AM, Paolo Bonzini wrote:
>> On 16/04/2012 23:55, Eric Blake wrote:
>>>> Do transient guests have persistent storage for them in /var while they are running?
>>>
>>> Yes - that's how libvirt tracks the pid of the qemu process that it should be re-attaching to on a libvirtd restart, as well as several other aspects (for example, the capabilities of the qemu binary running that pid, in case qemu has been upgraded in the meantime). It's just that right now, the files that libvirt stores in /var are intended to be internal to libvirt, so management apps shouldn't be poking around in there, so much as having an interface to ask libvirt what state things are in.
>>
>> Can libvirtd report the block-job-completed event to vdsm before the transient-domain-died event?
>
> Yes, it should be possible to organize the code so that on libvirtd restart, libvirt is careful to emit all other events, such as block-job completed/failed, prior to the final domain-died event as libvirt cleans out its records of the dead domain. But it probably involves quite a bit of code cleanup on libvirt's side to reach that state. There's also the question of whether vdsm will be listening for the events soon enough; after all, when libvirtd restarts, libvirt is doing quite a bit of bookkeeping before even accepting connections, and if vdsm isn't connected to the restarted libvirtd at the time libvirt queues up an event to be sent, then the event is lost. Maybe that implies that libvirt needs to have a configurable-depth array of prior events that can be played back to the user on request.

That implies we need an API to hold up emitting events so users can re-register callbacks - that is, one to store them in an internal array - and another API to replay events, i.e. flush the internal array.
Or even better, replace the first API with a new switch passed to the daemon, because on libvirtd's reconnection to the qemu sockets we will instantly get all pending events, which leaves user apps only a really short window to issue such an API call (if any). What I mean is:

1. libvirtd --listen --hold-up-events
2. conn = virConnectOpen*(...);
3. virConnectDomainEventRegister*(conn, ...);
4. virConnectReplayEvents(conn);

Michal
diff --git a/blockdev.c b/blockdev.c
index 08953fa..78b72f2 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -660,7 +660,7 @@ void do_commit(Monitor *mon, const QDict *qdict)
 }
 
 static void change_blockdev_image(BlockDriverState *bs, const char *new_image_file,
-                                  const char *format, Error **errp)
+                                  const char *format, int fd, Error **errp)
 {
     BlockDriver *old_drv, *proto_drv;
     BlockDriver *drv = NULL;
@@ -702,6 +702,16 @@ static void change_blockdev_image(BlockDriverState *bs, const char *new_image_fi
 
     bdrv_close(bs);
     ret = bdrv_open(bs, new_image_file, flags, drv);
+
+    if (ret == 0 && fd != -1) {
+        ret = write(fd, "", 1) == 1 ? 0 : -1;
+        qemu_fdatasync(fd);
+        close(fd);
+        if (ret < 0) {
+            bdrv_close(bs);
+        }
+    }
+
     /*
      * If reopening the image file we just created fails, fall back
      * and try to re-open the original image. If that fails too, we
@@ -718,9 +725,20 @@ static void change_blockdev_image(BlockDriverState *bs, const char *new_image_fi
 }
 
 void qmp_drive_reopen(const char *device, const char *new_image_file,
-                      bool has_format, const char *format, Error **errp)
+                      bool has_format, const char *format,
+                      bool has_witness, const char *witness,
+                      Error **errp)
 {
     BlockDriverState *bs;
+    int fd = -1;
+
+    if (has_witness) {
+        fd = monitor_get_fd(cur_mon, witness);
+        if (fd == -1) {
+            error_set(errp, QERR_FD_NOT_FOUND, witness);
+            return;
+        }
+    }
 
     bs = bdrv_find(device);
     if (!bs) {
@@ -731,7 +749,7 @@ void qmp_drive_reopen(const char *device, const char *new_image_file,
         block_job_cancel_sync(bs->job);
     }
     change_blockdev_image(bs, new_image_file,
-                          has_format ? format : NULL, errp);
+                          has_format ? format : NULL, fd, errp);
 }
 
 static void blockdev_do_action(int kind, void *data, Error **errp)
diff --git a/hmp.c b/hmp.c
index 28697ec..f67c441 100644
--- a/hmp.c
+++ b/hmp.c
@@ -744,7 +744,7 @@ void hmp_drive_reopen(Monitor *mon, const QDict *qdict)
     const char *format = qdict_get_try_str(qdict, "format");
     Error *errp = NULL;
 
-    qmp_drive_reopen(device, filename, !!format, format, &errp);
+    qmp_drive_reopen(device, filename, !!format, format, false, NULL, &errp);
     hmp_handle_error(mon, &errp);
 }
 
diff --git a/qapi-schema.json b/qapi-schema.json
index 0bf3a25..2e5a925 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -1228,6 +1228,13 @@
 #
 # @format: #optional the format of the new image, default is 'qcow2'.
 #
+# @witness: A file descriptor name that was passed via getfd. QEMU will write
+#           a single byte to this file descriptor before completing the command
+#           successfully. If the byte is not written to the file, it is
+#           guaranteed that the guest has not issued any I/O to the new image.
+#           Failure to write the byte is fatal just like failure to open the new
+#           image, and will cause the guest to revert to the currently open file.
+#
 # Returns: nothing on success
 #          If @device is not a valid block device, DeviceNotFound
 #          If @new-image-file can't be opened, OpenFileFailed
@@ -1236,7 +1243,8 @@
 # Since 1.1
 ##
 { 'command': 'drive-reopen',
-  'data': { 'device': 'str', 'new-image-file': 'str', '*format': 'str' } }
+  'data': { 'device': 'str', 'new-image-file': 'str', '*format': 'str',
+            '*witness': 'str' } }
 
 ##
 # @human-monitor-command:
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 6ea0ef5..fedfc36 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -829,7 +829,7 @@ EQMP
 
     {
         .name       = "drive-reopen",
-        .args_type  = "device:B,new-image-file:s,format:s?",
+        .args_type  = "device:B,new-image-file:s,format:s?,witness:s?",
         .mhandler.cmd_new = qmp_marshal_input_drive_reopen,
     },
 
@@ -842,11 +842,21 @@ guest is expecting the drive to change its content, the new image should
 contain the same data of the current one. One use case is to terminate
 a drive-mirror command.
 
+The command can optionally write a single byte to a file descriptor name
+that was passed via SCM rights (getfd). QEMU will write a single byte
+to this file descriptor before completing the command successfully.
+If the byte is not written to the file, it is guaranteed that the
+guest has not issued any I/O to the new image. Failure to write the
+byte is fatal just like failure to open the new image, and will cause
+the guest to revert to the currently open file.
+
 Arguments:
 
 - "device": device name to operate on (json-string)
 - "new-image-file": name of new image file (json-string)
 - "format": format of new image (json-string, optional)
+- "witness": file descriptor previously passed via SCM rights (json-string, optional)
 
 Example:
Management needs a way for QEMU to confirm that no I/O has been sent to the target and not to the source. To provide this guarantee we rely on a file in local persistent storage. QEMU receives a file descriptor via SCM_RIGHTS and writes a single byte to it. If it fails, it will fail the drive-reopen command too and management knows that no I/O request has been issued to the new destination. Likewise, if management finds the file to have nonzero size it knows that the target is valid and that indeed I/O requests could have been submitted to it.

The argument does not have an HMP equivalent.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 blockdev.c       |   24 +++++++++++++++++++++---
 hmp.c            |    2 +-
 qapi-schema.json |   10 +++++++++-
 qmp-commands.hx  |   12 +++++++++++-
 4 files changed, 42 insertions(+), 6 deletions(-)
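The SCM_RIGHTS handoff that the commit message relies on looks roughly like this on the management side. This is a sketch, not libvirt or QEMU code: the `send_fd` name is invented, and the QMP framing plus the getfd command that names the received descriptor are omitted. Only the ancillary-data plumbing for passing one fd over a Unix socket is shown.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Sender side of the fd handoff: attach one fd as SCM_RIGHTS ancillary
 * data to a one-byte message on the Unix socket carrying the monitor
 * connection.  QEMU's getfd command then binds the received descriptor
 * to a name usable as the @witness argument of drive-reopen. */
static int send_fd(int sock, int fd)
{
    struct msghdr msg;
    struct iovec iov;
    struct cmsghdr *cmsg;
    char byte = 0;                          /* must send >= 1 byte of data */
    char ctrl[CMSG_SPACE(sizeof(int))];

    memset(&msg, 0, sizeof(msg));
    memset(ctrl, 0, sizeof(ctrl));
    iov.iov_base = &byte;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;           /* pass an open fd, not data */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}
```

The fd passed here should refer to an empty file on local persistent storage, created in advance, so that its size can later serve as the witness described above.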