
[4/5] disk_deadlines: add control of requests time expiration

Message ID 1441699228-25767-5-git-send-email-den@openvz.org
State New

Commit Message

Denis V. Lunev Sept. 8, 2015, 8 a.m. UTC
From: Raushaniya Maksudova <rmaksudova@virtuozzo.com>

If the disk-deadlines option is enabled for a drive, the completion time of
that drive's requests is monitored. The method is as follows (assume below
that the option is enabled).

Every drive has its own red-black tree for tracking its requests. The
expiration time of a request is the key, and its cookie (the request id) is
the corresponding node. By default every request has 8 seconds to complete.
If a request is not completed in time for some reason (server crash or
similar), the drive's timer fires and a callback requests that the Virtual
Machine (VM) be stopped.

The VM remains stopped until all requests from the disk that caused the stop
have completed. Furthermore, if there are other disks whose requests are
still waiting to complete, the VM is not restarted: it waits for the
completion of all "late" requests from all disks.

Signed-off-by: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
---
 block/accounting.c             |   8 ++
 block/disk-deadlines.c         | 167 +++++++++++++++++++++++++++++++++++++++++
 include/block/disk-deadlines.h |  11 +++
 3 files changed, 186 insertions(+)
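
The structure described in the commit message (a per-drive tree keyed by each
request's deadline, with the accounting cookie as the value) can be modelled
with a plain GLib GTree. The sketch below is a self-contained approximation
under those assumptions, not the patch code itself: the helper names mirror
insert_request()/remove_request() and the 8-second default mirrors
EXPIRE_DEFAULT_NS, but the real patch looks requests up by cookie and arms a
per-drive timer for the deadlines.

/* Illustrative sketch only (not the patch code): models the per-drive tree
 * described above.  Keys are absolute expiration times in nanoseconds,
 * values are the accounting cookies.  Assumes a 64-bit host so that an
 * int64_t key fits into a pointer, as the posted patch also does. */
#include <glib.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define EXPIRE_DEFAULT_NS 8000000000LL       /* 8 seconds, as in the patch */

typedef struct {
    int64_t start_time_ns;                   /* stands in for BlockAcctCookie */
} Cookie;

static gint expire_cmp(gconstpointer a, gconstpointer b)
{
    int64_t ka = (int64_t)(intptr_t)a, kb = (int64_t)(intptr_t)b;
    return (ka < kb) ? -1 : (ka > kb);       /* don't truncate a 64-bit difference */
}

static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Roughly what insert_request() does: key = completion deadline. */
static void insert_request(GTree *tree, Cookie *cookie)
{
    int64_t expire = cookie->start_time_ns + EXPIRE_DEFAULT_NS;
    g_tree_insert(tree, (gpointer)(intptr_t)expire, cookie);
}

/* Roughly what remove_request() does when block_acct_done() runs.  The real
 * patch searches the tree for the cookie; here we remove by key directly. */
static void remove_request(GTree *tree, int64_t expire_key)
{
    g_tree_remove(tree, (gpointer)(intptr_t)expire_key);
}

int main(void)
{
    GTree *tree = g_tree_new(expire_cmp);
    Cookie c = { .start_time_ns = now_ns() };

    insert_request(tree, &c);
    printf("requests in flight: %d\n", g_tree_nnodes(tree));
    remove_request(tree, c.start_time_ns + EXPIRE_DEFAULT_NS);
    printf("requests in flight: %d\n", g_tree_nnodes(tree));
    g_tree_destroy(tree);
    return 0;
}

The sketch compiles standalone with: gcc sketch.c $(pkg-config --cflags --libs glib-2.0)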

Comments

Fam Zheng Sept. 8, 2015, 9:35 a.m. UTC | #1
On Tue, 09/08 11:00, Denis V. Lunev wrote:
>  typedef struct DiskDeadlines {
>      bool enabled;
> +    bool expired_tree;
> +    pthread_mutex_t mtx_tree;

This won't compile on win32, probably use QemuMutex instead?

In file included from /tmp/qemu-build/include/block/accounting.h:30:0,
                 from /tmp/qemu-build/include/block/block.h:8,
                 from /tmp/qemu-build/include/monitor/monitor.h:6,
                 from /tmp/qemu-build/util/osdep.c:51:
/tmp/qemu-build/include/block/disk-deadlines.h:38:5: error: unknown type name 'pthread_mutex_t'
     pthread_mutex_t mtx_tree;
     ^
/tmp/qemu-build/rules.mak:57: recipe for target 'util/osdep.o' failed
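
A sketch of Fam's suggestion is below: the same struct built on QEMU's
portable mutex wrapper from include/qemu/thread.h. The field names follow the
posted patch, while the GTree member and the helper function are assumptions
added only for illustration.

#include <glib.h>
#include <stdbool.h>
#include "qemu/thread.h"            /* QemuMutex: works on POSIX and win32 builds */

typedef struct DiskDeadlines {
    bool enabled;
    bool expired_tree;
    QemuMutex mtx_tree;             /* was: pthread_mutex_t mtx_tree; */
    GTree *requests;                /* assumed: the per-drive tree of pending requests */
} DiskDeadlines;

/* Callers switch from pthread_mutex_*() to the qemu_mutex_*() wrappers. */
static void disk_deadlines_lock_example(DiskDeadlines *disk_deadlines)
{
    qemu_mutex_lock(&disk_deadlines->mtx_tree);
    /* ... inspect or update disk_deadlines->requests / expired_tree ... */
    qemu_mutex_unlock(&disk_deadlines->mtx_tree);
}

Likewise, qemu_mutex_init() would replace pthread_mutex_init() wherever the
patch initialises the mutex.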
Denis V. Lunev Sept. 8, 2015, 9:42 a.m. UTC | #2
On 09/08/2015 12:35 PM, Fam Zheng wrote:
> On Tue, 09/08 11:00, Denis V. Lunev wrote:
>>   typedef struct DiskDeadlines {
>>       bool enabled;
>> +    bool expired_tree;
>> +    pthread_mutex_t mtx_tree;
> This won't compile on win32, probably use QemuMutex instead?
>
> In file included from /tmp/qemu-build/include/block/accounting.h:30:0,
>                   from /tmp/qemu-build/include/block/block.h:8,
>                   from /tmp/qemu-build/include/monitor/monitor.h:6,
>                   from /tmp/qemu-build/util/osdep.c:51:
> /tmp/qemu-build/include/block/disk-deadlines.h:38:5: error: unknown type name 'pthread_mutex_t'
>       pthread_mutex_t mtx_tree;
>       ^
> /tmp/qemu-build/rules.mak:57: recipe for target 'util/osdep.o' failed
>
got this. Thank you
Kevin Wolf Sept. 8, 2015, 11:06 a.m. UTC | #3
On 08.09.2015 at 10:00, Denis V. Lunev wrote:
> From: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
> 
> [...]

> +    disk_deadlines->expired_tree = true;
> +    need_vmstop = !atomic_fetch_inc(&num_requests_vmstopped);
> +    pthread_mutex_unlock(&disk_deadlines->mtx_tree);
> +
> +    if (need_vmstop) {
> +        qemu_system_vmstop_request_prepare();
> +        qemu_system_vmstop_request(RUN_STATE_PAUSED);
> +    }
> +}

What behaviour does this result in? If I understand correctly, this is
an indirect call of do_vm_stop(), which involves a bdrv_drain_all(). In
this case, qemu would completely block (including unresponsive monitor)
until the request can complete.

Is this what you are seeing with this patch, or why doesn't the
bdrv_drain_all() call cause such effects?

Kevin
Denis V. Lunev Sept. 8, 2015, 11:27 a.m. UTC | #4
On 09/08/2015 02:06 PM, Kevin Wolf wrote:
> Am 08.09.2015 um 10:00 hat Denis V. Lunev geschrieben:
>> [...]
>> +    disk_deadlines->expired_tree = true;
>> +    need_vmstop = !atomic_fetch_inc(&num_requests_vmstopped);
>> +    pthread_mutex_unlock(&disk_deadlines->mtx_tree);
>> +
>> +    if (need_vmstop) {
>> +        qemu_system_vmstop_request_prepare();
>> +        qemu_system_vmstop_request(RUN_STATE_PAUSED);
>> +    }
>> +}
> What behaviour does this result in? If I understand correctly, this is
> an indirect call of do_vm_stop(), which involves a bdrv_drain_all(). In
> this case, qemu would completely block (including unresponsive monitor)
> until the request can complete.
>
> Is this what you are seeing with this patch, or why doesn't the
> bdrv_drain_all() call cause such effects?
>
> Kevin
Interesting point. Yes, it flushes all requests and most likely hangs
inside, waiting for requests to complete. But fortunately this happens
after the switch to the paused state, so the guest is already paused.
That's why I had missed this fact.

This could be considered a problem, but I have no good solution at the
moment. I should think a bit on it.

Nice catch, though!

Den
Kevin Wolf Sept. 8, 2015, 1:05 p.m. UTC | #5
On 08.09.2015 at 13:27, Denis V. Lunev wrote:
> interesting point. Yes, it flushes all requests and most likely
> hangs inside waiting requests to complete. But fortunately
> this happens after the switch to paused state thus
> the guest becomes paused. That's why I have missed this
> fact.
> 
> This (could) be considered as a problem but I have no (good)
> solution at the moment. Should think a bit on.

Let me suggest a radically different design. Note that I don't say this
is necessarily how things should be done, I'm just trying to introduce
some new ideas and broaden the discussion, so that we have a larger set
of ideas from which we can pick the right solution(s).

The core of my idea would be a new filter block driver 'timeout' that
can be added on top of each BDS that could potentially fail, like a
raw-posix BDS pointing to a file on NFS. This way most pieces of the
solution are nicely modularised and don't touch the block layer core.

During normal operation the driver would just be passing through
requests to the lower layer. When it detects a timeout, however, it
completes the request it received with -ETIMEDOUT. It also completes any
new request it receives with -ETIMEDOUT without passing the request on
until the request that originally timed out returns. This is our safety
measure against anyone seeing whether or how the timed out request
modified data.

We need to make sure that bdrv_drain() doesn't wait for this request.
Possibly we need to introduce a .bdrv_drain callback that replaces the
default handling, because bdrv_requests_pending() in the default
handling considers bs->file, which would still have the timed out
request. We don't want to see this; bdrv_drain_all() should complete
even though that request is still pending internally (externally, we
returned -ETIMEDOUT, so we can consider it completed). This way the
monitor stays responsive and background jobs can go on if they don't use
the failing block device.

And then we essentially reuse the rerror/werror mechanism that we
already have to stop the VM. The device models would be extended to
always stop the VM on -ETIMEDOUT, regardless of the error policy. In
this state, the VM would even be migratable if you make sure that the
pending request can't modify the image on the destination host any more.

Do you think this could work, or did I miss something important?

Kevin
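
Kevin's proposal reduces to a small per-BDS state machine: pass requests
through until one times out, then complete that request and every later one
with -ETIMEDOUT until the stuck request finally returns from the lower layer.
A self-contained sketch of that logic follows; the types and function names
are stand-ins, not QEMU's real BlockDriver interface.

/* Sketch of the state machine behind the proposed 'timeout' filter driver. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct TimeoutFilter {
    bool timed_out;        /* a request on the underlying BDS has timed out */
    unsigned stuck;        /* timed-out requests still pending below us     */
} TimeoutFilter;

/* Called when the filter decides a pending request exceeded its deadline:
 * complete it as -ETIMEDOUT towards the guest, but remember that it is still
 * running in the lower layer. */
int timeout_expire_request(TimeoutFilter *tf)
{
    tf->timed_out = true;
    tf->stuck++;
    return -ETIMEDOUT;
}

/* New request arriving from the guest/device model. */
int timeout_submit(TimeoutFilter *tf, int (*lower_submit)(void *), void *req)
{
    if (tf->timed_out) {
        /* Don't let anyone observe whether/how the stuck request changed data. */
        return -ETIMEDOUT;
    }
    return lower_submit(req);   /* normal pass-through */
}

/* The originally timed-out request finally completed in the lower layer. */
void timeout_lower_completed(TimeoutFilter *tf)
{
    if (tf->stuck > 0 && --tf->stuck == 0) {
        tf->timed_out = false;  /* resume passing requests through */
    }
}

int lower_submit_ok(void *req) { (void)req; return 0; }

int main(void)
{
    TimeoutFilter tf = {0};

    printf("%d\n", timeout_submit(&tf, lower_submit_ok, NULL));   /* 0: passed through */
    printf("%d\n", timeout_expire_request(&tf));                  /* -ETIMEDOUT */
    printf("%d\n", timeout_submit(&tf, lower_submit_ok, NULL));   /* -ETIMEDOUT: blocked */
    timeout_lower_completed(&tf);                                 /* stuck request returned */
    printf("%d\n", timeout_submit(&tf, lower_submit_ok, NULL));   /* 0: pass-through again */
    return 0;
}

As Kevin notes above, a .bdrv_drain callback would then report no pending
requests once everything has been completed towards the guest, even while
the stuck request is still pending in bs->file.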
Denis V. Lunev Sept. 8, 2015, 2:23 p.m. UTC | #6
On 09/08/2015 04:05 PM, Kevin Wolf wrote:
> [...]
Could I propose an even more radical solution then?

My original approach was based on the assumption that this code should be
maintainable out-of-tree. If the patch gets merged, that boundary condition
can be dropped.

Why not invent a 'terror' field in BdrvOptions and handle this in the core
block layer without a filter? The RB tree entry would simply not be created
if the policy is set to 'ignore'.

Den
Kevin Wolf Sept. 8, 2015, 2:48 p.m. UTC | #7
On 08.09.2015 at 16:23, Denis V. Lunev wrote:
> On 09/08/2015 04:05 PM, Kevin Wolf wrote:
> > [...]
> could I propose even more radical solution then?
> 
> My original approach was based on the fact that
> this could should be maintainable out-of-stream.
> If the patch will be merged - this boundary condition
> could be dropped.
> 
> Why not to invent 'terror' field on BdrvOptions
> and process things in core block layer without
> a filter? RB Tree entry will just not created if
> the policy will be set to 'ignore'.

'terror' might not be the most fortunate name... ;-)

The reason why I would prefer a filter driver is so the code and the
associated data structures are cleanly modularised and we can keep the
actual block layer core small and clean. The same is true for some other
functions that I would rather move out of the core into filter drivers
than add new cases (e.g. I/O throttling, backup notifiers, etc.), but
which are a bit harder to actually move because we already have old
interfaces that we can't break (we'll probably do it anyway eventually,
even if it needs a bit more compatibility code).

However, it seems that you are mostly touching code that is maintained
by Stefan, and Stefan used to be a bit more open to adding functionality
to the core, so my opinion might not be the last word.

Kevin
Stefan Hajnoczi Sept. 10, 2015, 10:27 a.m. UTC | #8
On Tue, Sep 08, 2015 at 04:48:24PM +0200, Kevin Wolf wrote:
> > [...]
> 
> 'terror' might not be the most fortunate name... ;-)
> 
> The reason why I would prefer a filter driver is so the code and the
> associated data structures are cleanly modularised and we can keep the
> actual block layer core small and clean. The same is true for some other
> functions that I would rather move out of the core into filter drivers
> than add new cases (e.g. I/O throttling, backup notifiers, etc.), but
> which are a bit harder to actually move because we already have old
> interfaces that we can't break (we'll probably do it anyway eventually,
> even if it needs a bit more compatibility code).
> 
> However, it seems that you are mostly touching code that is maintained
> by Stefan, and Stefan used to be a bit more open to adding functionality
> to the core, so my opinion might not be the last word.

I've been thinking more about the correctness of this feature:

QEMU cannot cancel I/O because there is no Linux userspace API for doing
so.  Linux AIO's io_cancel(2) syscall is a nop since file systems don't
implement a kiocb_cancel_fn.  Sending a signal to a task blocked in
O_DIRECT preadv(2)/pwritev(2) doesn't work either because the task is in
uninterruptible sleep.

The only way to make sure a request has finished is to wait for
completion.  If we treat a request as failed/cancelled but it's actually
still pending at a layer of the storage stack:
1. Read requests may modify guest memory.
2. Write requests may modify disk sectors.

Today the guest times out and tries to do IDE/ATA recovery, for example.
This causes QEMU to eventually call the synchronous bdrv_drain_all()
function and the guest hangs.  Also, if the guest mounts the file system
read-only in response to the timeout, then game over.

The disk-deadlines feature lets QEMU detect timeouts before the guest so
we can pause the guest.  The part I have been thinking about is that the
only option is to wait until the request completes.

We cannot abandon the timed out request because we'll face #1 or #2
above.  This means it doesn't make sense to retry the request like
rerror=/werror=.  rerror=/werror= can retry safely because the original
request has failed but that is not the case for timed out requests.

This also means that live migration isn't safe, at least if a write
request is pending.  If the guest migrates, the pending write request on
the source host could still complete after live migration handover,
corrupting the disk.

Getting back to these patches: I think the implementation is correct in
that the only policy is to wait for timed out requests to complete and
then resume the guest.

However, these patches need to violate the constraint that guest memory
isn't dirtied when the guest is paused.  This is an important constraint
for the correctness of live migration, since we need to be able to track
all changes to guest memory.

Just wanted to post this in case anyone disagrees.

Stefan
Kevin Wolf Sept. 10, 2015, 11:39 a.m. UTC | #9
On 10.09.2015 at 12:27, Stefan Hajnoczi wrote:
> > [...]
> I've been thinking more about the correctness of this feature:
> 
> QEMU cannot cancel I/O because there is no Linux userspace API for doing
> so.  Linux AIO's io_cancel(2) syscall is a nop since file systems don't
> implement a kiocb_cancel_fn.  Sending a signal to a task blocked in
> O_DIRECT preadv(2)/pwritev(2) doesn't work either because the task is in
> uninterruptible sleep.
> 
> The only way to make sure a request has finished is to wait for
> completion.  If we treat a request as failed/cancelled but it's actually
> still pending at a layer of the storage stack:
> 1. Read requests may modify guest memory.
> 2. Write requests may modify disk sectors.
> 
> Today the guest times out and tries to do IDE/ATA recovery, for example.
> This causes QEMU to eventually call the synchronous bdrv_drain_all()
> function and the guest hangs.  Also, if the guest mounts the file system
> read-only in response to the timeout, then game over.
> 
> The disk-deadlines feature lets QEMU detect timeouts before the guest so
> we can pause the guest.  The part I have been thinking about is that the
> only option is to wait until the request completes.
> 
> We cannot abandon the timed out request because we'll face #1 or #2
> above.  This means it doesn't make sense to retry the request like
> rerror=/werror=.  rerror=/werror= can retry safely because the original
> request has failed but that is not the case for timed out requests.
> 
> This also means that live migration isn't safe, at least if a write
> request is pending.  If the guest migrates, the pending write request on
> the source host could still complete after live migration handover,
> corrupting the disk.
> 
> Getting back to these patches: I think the implementation is correct in
> that the only policy is to wait for timed out requests to complete and
> then resume the guest.
> 
> However, these patches need to violate the constraint that guest memory
> isn't dirtied when the guest is paused.  This is an important constraint
> for the correctness of live migration, since we need to be able to track
> all changes to guest memory.
> 
> Just wanted to post this in case anyone disagrees.

You're making a few good points here.

I thought that migration with a pending write request could be safe with
some additional knowledge: if you know that the write is hanging because
the connection to the NFS server is down, and you make sure that it remains
disconnected, that would work. However, the hanging request is already in
the kernel, so you could never bring the connection up again without
rebooting the host, which is clearly not a realistic assumption.

I hadn't thought of the constraints of live migration either, so it seems
read requests are equally problematic.

So it appears that the filter driver would have to add a migration
blocker whenever it sees any request time out, and only clear it again
when all pending requests have completed.

Kevin
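
Kevin's last point might look roughly like the following inside such a
filter. migrate_add_blocker(), migrate_del_blocker(), error_setg() and
error_free() are existing QEMU helpers, but the exact signatures used here
are an assumption (they have changed across QEMU versions), so treat this as
a sketch rather than working code.

/* Sketch: block live migration while any timed-out request is still pending
 * somewhere below the filter, and unblock once they have all completed. */
#include "qapi/error.h"
#include "migration/migration.h"

static Error *timeout_mig_blocker;
static unsigned timed_out_requests;

/* Called when a request exceeds its deadline. */
static void timeout_request_expired(void)
{
    if (timed_out_requests++ == 0) {
        error_setg(&timeout_mig_blocker,
                   "disk request timed out; it may still modify memory or disk");
        migrate_add_blocker(timeout_mig_blocker);
    }
}

/* Called when a previously timed-out request finally completes below us. */
static void timeout_request_settled(void)
{
    if (--timed_out_requests == 0) {
        migrate_del_blocker(timeout_mig_blocker);
        error_free(timeout_mig_blocker);
        timeout_mig_blocker = NULL;
    }
}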
Stefan Hajnoczi Sept. 14, 2015, 4:53 p.m. UTC | #10
On Thu, Sep 10, 2015 at 01:39:20PM +0200, Kevin Wolf wrote:
> Am 10.09.2015 um 12:27 hat Stefan Hajnoczi geschrieben:
> > [...]
> 
> You're making a few good points here.
> 
> I thought that migration with a pending write request could be safe with
> some additional knowledge because if you know that the write is hanging
> because the connection to the NFS server is down and you make sure that
> it remains disconnected, that would work. However, the hanging request
> is already in the kernel, so you could never bring the connection up
> again without rebooting the host, which is clearly not a realistic
> assumption.
> 
> Never thought of the constraints of live migration either, so it seems
> reads requests are equally problematic.
> 
> So it appears that the filter driver would have to add a migration
> blocker whenever it sees any request time out, and only clear it again
> when all pending requests have completed.

Adding new features as filters (like quorum) instead of adding them to the
core block layer is a good thing.

Kevin: Can you post an example of the syntax so it's clear what you
mean?
Dr. David Alan Gilbert Sept. 25, 2015, 12:34 p.m. UTC | #11
* Stefan Hajnoczi (stefanha@gmail.com) wrote:
> On Tue, Sep 08, 2015 at 04:48:24PM +0200, Kevin Wolf wrote:
> > [...]
> 
> I've been thinking more about the correctness of this feature:
> 
> QEMU cannot cancel I/O because there is no Linux userspace API for doing
> so.  Linux AIO's io_cancel(2) syscall is a nop since file systems don't
> implement a kiocb_cancel_fn.  Sending a signal to a task blocked in
> O_DIRECT preadv(2)/pwritev(2) doesn't work either because the task is in
> uninterruptible sleep.

There are things that work on some devices, but nothing generic.
For NBD/iSCSI/(ceph?) you should be able to issue a shutdown(2) on the socket
that connects to the server, and that should cause all existing IO to fail
quickly.  Then you could do a drain and be done.  This would
be very useful for the fault-tolerant uses (e.g. Wen Congyang's block replication).

There are even ways of killing hard NFS mounts; for example, adding
an unreachable route to the NFS server (ip route add unreachable hostname)
and then umount -f seems to cause I/O errors to tasks.  (I can't find
a way to do a remount to change the hard flag.)  This isn't pretty, but
it's a reasonable way of getting your host back to usable if one NFS
server has died.
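
For the socket-backed protocols mentioned above, the forced-failure trick
amounts to calling shutdown(2) on the transport socket. A minimal
illustration follows; how the NBD/iSCSI connection's file descriptor would be
obtained is assumed here, not something the thread specifies.

/* Sketch: force outstanding I/O on a socket-backed block connection
 * (NBD, iSCSI, ...) to fail quickly by shutting down the transport.
 * 'sockfd' is assumed to be the connection's socket descriptor. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int abort_block_connection(int sockfd)
{
    /* Shut down both directions so pending reads and writes error out
     * promptly; after that a drain can complete. */
    if (shutdown(sockfd, SHUT_RDWR) < 0) {
        fprintf(stderr, "shutdown: %s\n", strerror(errno));
        return -errno;
    }
    return 0;
}

As Stefan points out in his follow-up, this only gets the host unstuck; it
does not tell you which of the in-flight requests the server actually
completed.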

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Stefan Hajnoczi Sept. 28, 2015, 12:42 p.m. UTC | #12
On Fri, Sep 25, 2015 at 01:34:22PM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@gmail.com) wrote:
> > [...]
> 
> There are things that work on some devices, but nothing generic.
> For NBD/iSCSI/(ceph?) you should be able to issue a shutdown(2) on the socket
> that connects to the server and that should call all existing IO to fail
> quickly.  Then you could do a drain and be done.    This would
> be very useful for the fault-tolerant uses (e.g. Wen Congyang's block replication).
> 
> There are even ways of killing hard NFS mounts; for example adding
> a unreachable route to the NFS server (ip route add unreachable hostname),
> and then umount -f  seems to cause I/O errors to tasks.   (I can't find
> a way to do a remount to change the hard flag).  This isn't pretty but
> it's a reasonable way of getting your host back to useable if one NFS
> server has died.

If you just throw away a socket, you don't know the state of the disk
since some requests may have been handled by the server and others were
not handled.

So I doubt these approaches work because cleanly closing a connection
requires communication between the client and server to determine that
the connection was closed and which pending requests were completed.

The trade-off is that the client no longer has DMA buffers that might
get written to, but now you no longer know the state of the disk!

Stefan
Dr. David Alan Gilbert Sept. 28, 2015, 1:55 p.m. UTC | #13
* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Fri, Sep 25, 2015 at 01:34:22PM +0100, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@gmail.com) wrote:
> > > On Tue, Sep 08, 2015 at 04:48:24PM +0200, Kevin Wolf wrote:
> > > > Am 08.09.2015 um 16:23 hat Denis V. Lunev geschrieben:
> > > > > On 09/08/2015 04:05 PM, Kevin Wolf wrote:
> > > > > >Am 08.09.2015 um 13:27 hat Denis V. Lunev geschrieben:
> > > > > >>interesting point. Yes, it flushes all requests and most likely
> > > > > >>hangs inside waiting requests to complete. But fortunately
> > > > > >>this happens after the switch to paused state thus
> > > > > >>the guest becomes paused. That's why I have missed this
> > > > > >>fact.
> > > > > >>
> > > > > >>This (could) be considered as a problem but I have no (good)
> > > > > >>solution at the moment. Should think a bit on.
> > > > > >Let me suggest a radically different design. Note that I don't say this
> > > > > >is necessarily how things should be done, I'm just trying to introduce
> > > > > >some new ideas and broaden the discussion, so that we have a larger set
> > > > > >of ideas from which we can pick the right solution(s).
> > > > > >
> > > > > >The core of my idea would be a new filter block driver 'timeout' that
> > > > > >can be added on top of each BDS that could potentially fail, like a
> > > > > >raw-posix BDS pointing to a file on NFS. This way most pieces of the
> > > > > >solution are nicely modularised and don't touch the block layer core.
> > > > > >
> > > > > >During normal operation the driver would just be passing through
> > > > > >requests to the lower layer. When it detects a timeout, however, it
> > > > > >completes the request it received with -ETIMEDOUT. It also completes any
> > > > > >new request it receives with -ETIMEDOUT without passing the request on
> > > > > >until the request that originally timed out returns. This is our safety
> > > > > >measure against anyone seeing whether or how the timed out request
> > > > > >modified data.
> > > > > >
> > > > > >We need to make sure that bdrv_drain() doesn't wait for this request.
> > > > > >Possibly we need to introduce a .bdrv_drain callback that replaces the
> > > > > >default handling, because bdrv_requests_pending() in the default
> > > > > >handling considers bs->file, which would still have the timed out
> > > > > >request. We don't want to see this; bdrv_drain_all() should complete
> > > > > >even though that request is still pending internally (externally, we
> > > > > >returned -ETIMEDOUT, so we can consider it completed). This way the
> > > > > >monitor stays responsive and background jobs can go on if they don't use
> > > > > >the failing block device.
> > > > > >
> > > > > >And then we essentially reuse the rerror/werror mechanism that we
> > > > > >already have to stop the VM. The device models would be extended to
> > > > > >always stop the VM on -ETIMEDOUT, regardless of the error policy. In
> > > > > >this state, the VM would even be migratable if you make sure that the
> > > > > >pending request can't modify the image on the destination host any more.
> > > > > >
> > > > > >Do you think this could work, or did I miss something important?
> > > > > >
> > > > > >Kevin
> > > > > could I propose even more radical solution then?
> > > > > 
> > > > > My original approach was based on the fact that
> > > > > this could should be maintainable out-of-stream.
> > > > > If the patch will be merged - this boundary condition
> > > > > could be dropped.
> > > > > 
> > > > > Why not to invent 'terror' field on BdrvOptions
> > > > > and process things in core block layer without
> > > > > a filter? RB Tree entry will just not created if
> > > > > the policy will be set to 'ignore'.
> > > > 
> > > > 'terror' might not be the most fortunate name... ;-)
> > > > 
> > > > The reason why I would prefer a filter driver is so the code and the
> > > > associated data structures are cleanly modularised and we can keep the
> > > > actual block layer core small and clean. The same is true for some other
> > > > functions that I would rather move out of the core into filter drivers
> > > > than add new cases (e.g. I/O throttling, backup notifiers, etc.), but
> > > > which are a bit harder to actually move because we already have old
> > > > interfaces that we can't break (we'll probably do it anyway eventually,
> > > > even if it needs a bit more compatibility code).
> > > > 
> > > > However, it seems that you are mostly touching code that is maintained
> > > > by Stefan, and Stefan used to be a bit more open to adding functionality
> > > > to the core, so my opinion might not be the last word.
> > > 
> > > I've been thinking more about the correctness of this feature:
> > > 
> > > QEMU cannot cancel I/O because there is no Linux userspace API for doing
> > > so.  Linux AIO's io_cancel(2) syscall is a nop since file systems don't
> > > implement a kiocb_cancel_fn.  Sending a signal to a task blocked in
> > > O_DIRECT preadv(2)/pwritev(2) doesn't work either because the task is in
> > > uninterruptible sleep.
> > 
> > There are things that work on some devices, but nothing generic.
> > For NBD/iSCSI/(ceph?) you should be able to issue a shutdown(2) on the socket
> > that connects to the server, and that should cause all existing IO to fail
> > quickly.  Then you could do a drain and be done.  This would
> > be very useful for fault-tolerant use cases (e.g. Wen Congyang's block replication).
> > 
> > There are even ways of killing hard NFS mounts; for example, adding
> > an unreachable route to the NFS server (ip route add unreachable hostname)
> > and then umount -f seems to cause I/O errors to tasks.  (I can't find
> > a way to do a remount to change the hard flag.)  This isn't pretty, but
> > it's a reasonable way of getting your host back to usable if one NFS
> > server has died.
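A minimal sketch of the socket half of that, assuming sockfd is the descriptor that connects QEMU to the NBD/iSCSI server (how you get hold of that fd is left out here):

#include <sys/socket.h>

/*
 * Shutting the transport down in both directions makes every pending
 * and future read/write on it fail, so all in-flight requests complete
 * (with an error) and a subsequent drain can finish.
 */
static void fail_pending_io(int sockfd)
{
    shutdown(sockfd, SHUT_RDWR);
}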
> 
> If you just throw away a socket, you don't know the state of the disk
> since some requests may have been handled by the server and others were
> not handled.
> 
> So I doubt these approaches work because cleanly closing a connection
> requires communication between the client and server to determine that
> the connection was closed and which pending requests were completed.
> 
> The trade-off is that the client no longer has DMA buffers that might
> get written to, but now you no longer know the state of the disk!

Right, you don't know what the last successful IOs really were, but if
you know that the NBD/iSCSI/NFS server is dead and is going to need to
be rebooted or replaced anyway, then your current state is that you have
some QEMUs that are running fine except for one disk, but are now very
delicate because anything that tries to do a drain will hang.  There's no
way to recover the knowledge of which IOs completed, but you can
recover all your guests that don't critically depend on that device.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Patch

diff --git a/block/accounting.c b/block/accounting.c
index 01d594f..7b913fd 100644
--- a/block/accounting.c
+++ b/block/accounting.c
@@ -34,6 +34,10 @@  void block_acct_start(BlockAcctStats *stats, BlockAcctCookie *cookie,
     cookie->bytes = bytes;
     cookie->start_time_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
     cookie->type = type;
+
+    if (stats->disk_deadlines.enabled) {
+        insert_request(&stats->disk_deadlines, cookie);
+    }
 }
 
 void block_acct_done(BlockAcctStats *stats, BlockAcctCookie *cookie)
@@ -44,6 +48,10 @@  void block_acct_done(BlockAcctStats *stats, BlockAcctCookie *cookie)
     stats->nr_ops[cookie->type]++;
     stats->total_time_ns[cookie->type] +=
         qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - cookie->start_time_ns;
+
+    if (stats->disk_deadlines.enabled) {
+        remove_request(&stats->disk_deadlines, cookie);
+    }
 }
 
 
diff --git a/block/disk-deadlines.c b/block/disk-deadlines.c
index 39dec53..acb44bc 100644
--- a/block/disk-deadlines.c
+++ b/block/disk-deadlines.c
@@ -23,8 +23,175 @@ 
  */
 
 #include "block/disk-deadlines.h"
+#include "block/accounting.h"
+#include "sysemu/sysemu.h"
+#include "qemu/atomic.h"
+
+/*
+ * Number of late requests which were not completed in time
+ * (their timers have expired) and as a result caused the VM to stop
+ */
+uint64_t num_requests_vmstopped;
+
+/* Give a request 8 seconds to complete by default */
+const uint64_t EXPIRE_DEFAULT_NS = 8000000000;
+
+typedef struct RequestInfo {
+    BlockAcctCookie *cookie;
+    int64_t expire_time;
+} RequestInfo;
+
+static gint compare(gconstpointer a, gconstpointer b)
+{
+    return (int64_t)a - (int64_t)b;
+}
+
+static gboolean find_request(gpointer key, gpointer value, gpointer data)
+{
+    BlockAcctCookie *cookie = value;
+    RequestInfo *request = data;
+    if (cookie == request->cookie) {
+        request->expire_time = (int64_t)key;
+        return true;
+    }
+    return false;
+}
+
+static gint search_min_key(gpointer key, gpointer data)
+{
+    int64_t tree_key = (int64_t)key;
+    int64_t *ptr_curr_min_key = data;
+
+    if ((tree_key <= *ptr_curr_min_key) || (*ptr_curr_min_key == 0)) {
+        *ptr_curr_min_key = tree_key;
+    }
+    /*
+     * We always want to proceed searching among key/value pairs
+     * with smaller key => return -1
+     */
+    return -1;
+}
+
+static int64_t soonest_expire_time(GTree *requests_tree)
+{
+    int64_t min_timestamp = 0;
+    /*
+ * g_tree_search() will always return NULL because there is no
+ * key = 0 in the tree; we simply search for the node with the minimal key
+     */
+    g_tree_search(requests_tree, (GCompareFunc)search_min_key, &min_timestamp);
+    return min_timestamp;
+}
+
+static void disk_deadlines_callback(void *opaque)
+{
+    bool need_vmstop = false;
+    int64_t current_time, expire_time;
+    DiskDeadlines *disk_deadlines = opaque;
+
+    /*
+     * Check whether the request that triggered callback invocation
+     * is still in the tree of requests.
+     */
+    current_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+    pthread_mutex_lock(&disk_deadlines->mtx_tree);
+    if (g_tree_nnodes(disk_deadlines->requests_tree) == 0) {
+        /* There are no requests in the tree, do nothing */
+        pthread_mutex_unlock(&disk_deadlines->mtx_tree);
+        return;
+    }
+    expire_time = soonest_expire_time(disk_deadlines->requests_tree);
+
+    /*
+     * If the request was not found, then no disk deadline was detected;
+     * just update the timer with the new value
+     */
+    if (expire_time > current_time) {
+        timer_mod_ns(disk_deadlines->request_timer, expire_time);
+        pthread_mutex_unlock(&disk_deadlines->mtx_tree);
+        return;
+    }
+
+    disk_deadlines->expired_tree = true;
+    need_vmstop = !atomic_fetch_inc(&num_requests_vmstopped);
+    pthread_mutex_unlock(&disk_deadlines->mtx_tree);
+
+    if (need_vmstop) {
+        qemu_system_vmstop_request_prepare();
+        qemu_system_vmstop_request(RUN_STATE_PAUSED);
+    }
+}
 
 void disk_deadlines_init(DiskDeadlines *disk_deadlines, bool enabled)
 {
     disk_deadlines->enabled = enabled;
+    if (!disk_deadlines->enabled) {
+        return;
+    }
+
+    disk_deadlines->requests_tree = g_tree_new(compare);
+    if (disk_deadlines->requests_tree == NULL) {
+        disk_deadlines->enabled = false;
+        fprintf(stderr,
+                "disk_deadlines_init: failed to allocate requests_tree\n");
+        return;
+    }
+
+    pthread_mutex_init(&disk_deadlines->mtx_tree, NULL);
+    disk_deadlines->expired_tree = false;
+    disk_deadlines->request_timer = timer_new_ns(QEMU_CLOCK_REALTIME,
+                                                 disk_deadlines_callback,
+                                                 (void *)disk_deadlines);
+}
+
+void insert_request(DiskDeadlines *disk_deadlines, void *request)
+{
+    BlockAcctCookie *cookie = request;
+
+    int64_t expire_time = cookie->start_time_ns + EXPIRE_DEFAULT_NS;
+
+    pthread_mutex_lock(&disk_deadlines->mtx_tree);
+    /* Set up expire time for the current disk if it is not set yet */
+    if (timer_expired(disk_deadlines->request_timer,
+        qemu_clock_get_ns(QEMU_CLOCK_REALTIME))) {
+        timer_mod_ns(disk_deadlines->request_timer, expire_time);
+    }
+
+    g_tree_insert(disk_deadlines->requests_tree, (int64_t *)expire_time,
+                  cookie);
+    pthread_mutex_unlock(&disk_deadlines->mtx_tree);
+}
+
+void remove_request(DiskDeadlines *disk_deadlines, void *request)
+{
+    bool need_vmstart = false;
+    RequestInfo request_info = {
+        .cookie = request,
+        .expire_time = 0,
+    };
+
+    /* Find the request to remove */
+    pthread_mutex_lock(&disk_deadlines->mtx_tree);
+    g_tree_foreach(disk_deadlines->requests_tree, find_request, &request_info);
+    g_tree_remove(disk_deadlines->requests_tree,
+                  (int64_t *)request_info.expire_time);
+
+    /*
+     * If the tree is empty but marked as expired, unset the
+     * "expired_tree" flag and check whether the VM can be resumed
+     */
+    if (!g_tree_nnodes(disk_deadlines->requests_tree) &&
+        disk_deadlines->expired_tree) {
+        disk_deadlines->expired_tree = false;
+        /*
+         * If all requests (from all disks with the "disk-deadlines"
+         * feature enabled) are completed, resume the VM
+         */
+        need_vmstart = !atomic_dec_fetch(&num_requests_vmstopped);
+    }
+    pthread_mutex_unlock(&disk_deadlines->mtx_tree);
+
+    if (need_vmstart) {
+        qemu_system_vmstart_request();
+    }
 }
diff --git a/include/block/disk-deadlines.h b/include/block/disk-deadlines.h
index 2ea193b..9672aff 100644
--- a/include/block/disk-deadlines.h
+++ b/include/block/disk-deadlines.h
@@ -25,11 +25,22 @@ 
 #define DISK_DEADLINES_H
 
 #include <stdbool.h>
+#include <stdint.h>
+#include <glib.h>
+
+#include "qemu/typedefs.h"
+#include "qemu/timer.h"
 
 typedef struct DiskDeadlines {
     bool enabled;
+    bool expired_tree;
+    pthread_mutex_t mtx_tree;
+    GTree *requests_tree;
+    QEMUTimer *request_timer;
 } DiskDeadlines;
 
 void disk_deadlines_init(DiskDeadlines *disk_deadlines, bool enabled);
+void insert_request(DiskDeadlines *disk_deadlines, void *request);
+void remove_request(DiskDeadlines *disk_deadlines, void *request);
 
 #endif