diff mbox

[RFC,v2] block: add write threshold reporting for block devices

Message ID 1415365933-3727-2-git-send-email-fromani@redhat.com
State New
Headers show

Commit Message

Francesco Romani Nov. 7, 2014, 1:12 p.m. UTC
Managing applications, like oVirt (http://www.ovirt.org), make extensive
use of thin-provisioned disk images.
To let the guest run smoothly and be not unnecessarily paused, oVirt sets
a disk usage threshold (so called 'high water mark') based on the occupation
of the device,  and automatically extends the image once the threshold
is reached or exceeded.

In order to detect the crossing of the threshold, oVirt has no choice but
aggressively polling the QEMU monitor using the query-blockstats command.
This lead to unnecessary system load, and is made even worse under scale:
deployments with hundreds of VMs are no longer rare.

To fix this, this patch adds:
* A new monitor command to set a mark for a given block device.
* A new event to report if a block device usage exceeds the threshold.

This will allow the managing application to drop the polling
altogether and just wait for a watermark crossing event.

Signed-off-by: Francesco Romani <fromani@redhat.com>
---
 block/Makefile.objs             |   1 +
 block/qapi.c                    |   3 +
 block/usage-threshold.c         | 124 ++++++++++++++++++++++++++++++++++++++++
 include/block/block_int.h       |   4 ++
 include/block/usage-threshold.h |  39 +++++++++++++
 qapi/block-core.json            |  46 ++++++++++++++-
 qmp-commands.hx                 |  26 +++++++++
 7 files changed, 242 insertions(+), 1 deletion(-)
 create mode 100644 block/usage-threshold.c
 create mode 100644 include/block/usage-threshold.h

Comments

Stefan Hajnoczi Nov. 17, 2014, 4:49 p.m. UTC | #1
On Fri, Nov 07, 2014 at 02:12:13PM +0100, Francesco Romani wrote:

Sorry for the long review delay.  Looks pretty good, just one real issue
to think about at the bottom.

> +static void usage_threshold_disable(BlockDriverState *bs)
> +{

It would be safest to make this idempotent:

if (!usage_threshold_is_set(bs)) {
    return;
}

That way it can be called multiple times.

> +    notifier_with_return_remove(&bs->wr_usage_threshold_notifier);
> +    bs->wr_offset_threshold = 0;
> +}
> +
> +static int usage_threshold_is_set(const BlockDriverState *bs)
> +{
> +    return !!(bs->wr_offset_threshold > 0);
> +}

Please use the bool type instead of an int return value.

> +
> +static int64_t usage_threshold_exceeded(const BlockDriverState *bs,
> +                                        const BdrvTrackedRequest *req)
> +{
> +    if (usage_threshold_is_set(bs)) {
> +        int64_t amount = req->offset + req->bytes - bs->wr_offset_threshold;
> +        if (amount > 0) {
> +            return amount;
> +        }
> +    }
> +    return 0;
> +}
> +
> +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
> +                                            void *opaque)
> +{
> +    BdrvTrackedRequest *req = opaque;
> +    BlockDriverState *bs = req->bs;
> +    int64_t amount = 0;
> +
> +    assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
> +    assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);

Does the code still make these assumptions or are the asserts left over
from previous versions of the patch?

> +
> +    amount = usage_threshold_exceeded(bs, req);
> +    if (amount > 0) {
> +        qapi_event_send_block_usage_threshold(
> +            bdrv_get_device_name(bs), /* FIXME: this does not work */
> +            amount,
> +            bs->wr_offset_threshold,
> +            &error_abort);
> +
> +        /* autodisable to avoid to flood the monitor */
> +        usage_threshold_disable(bs);
> +    }
> +
> +    return 0; /* should always let other notifiers run */
> +}
> +
> +static void usage_threshold_register_notifier(BlockDriverState *bs)
> +{
> +    bs->wr_usage_threshold_notifier.notify = before_write_notify;
> +    notifier_with_return_list_add(&bs->before_write_notifiers,
> +                                  &bs->wr_usage_threshold_notifier);
> +}
> +
> +void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t threshold_bytes)
> +{
> +    BlockDriverState *target_bs = bs;
> +    if (bs->file) {
> +        target_bs = bs->file;
> +    }

Hmm...I think now I understand why you are trying to use bs->file.  This
is an attempt to make image formats work with the threshold.

Unfortunately the BlockDriverState topology can be more complicated than
just 1 level.

If we hardcode a strategy to traverse bs->file then it will work in most
cases:

  while (bs->file) {
      bs = bs->file;
  }

But there are cases like VMDK extent files where a BlockDriverState
actually has multiple children.

One way to solve this is to require that the management tool tells QEMU
which exact BlockDriverState node the threshold applies to.  Then QEMU
doesn't need any hardcoded policy.  But I'm not sure how realistic that
it at the moment (whether management tools are uses node names for each
node yet), so it may be best to hardcode the bs->file traversal that
I've suggested.

Kevin: Do you agree?
Francesco Romani Nov. 18, 2014, 8:12 a.m. UTC | #2
----- Original Message -----
> From: "Stefan Hajnoczi" <stefanha@gmail.com>
> To: "Francesco Romani" <fromani@redhat.com>
> Cc: kwolf@redhat.com, lcapitulino@redhat.com, qemu-devel@nongnu.org, stefanha@redhat.com, mdroth@linux.vnet.ibm.com
> Sent: Monday, November 17, 2014 5:49:36 PM
> Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
> 
> On Fri, Nov 07, 2014 at 02:12:13PM +0100, Francesco Romani wrote:
> 
> Sorry for the long review delay.  Looks pretty good, just one real issue
> to think about at the bottom.

Hi Stefan, thanks for the review and no problem for the delay :)

> 
> > +static void usage_threshold_disable(BlockDriverState *bs)
> > +{
> 
> It would be safest to make this idempotent:
> 
> if (!usage_threshold_is_set(bs)) {
>     return;
> }
> 
> That way it can be called multiple times.

Will change.

> 
> > +    notifier_with_return_remove(&bs->wr_usage_threshold_notifier);
> > +    bs->wr_offset_threshold = 0;
> > +}
> > +
> > +static int usage_threshold_is_set(const BlockDriverState *bs)
> > +{
> > +    return !!(bs->wr_offset_threshold > 0);
> > +}
> 
> Please use the bool type instead of an int return value.

Sure, will fix.

> > +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
> > +                                            void *opaque)
> > +{
> > +    BdrvTrackedRequest *req = opaque;
> > +    BlockDriverState *bs = req->bs;
> > +    int64_t amount = 0;
> > +
> > +    assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
> > +    assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
> 
> Does the code still make these assumptions or are the asserts left over
> from previous versions of the patch?

It's a leftover.
I understood they don't hurt and add a bit of safety, but if they are confusing
I'll remove them.

> > +void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t
> > threshold_bytes)
> > +{
> > +    BlockDriverState *target_bs = bs;
> > +    if (bs->file) {
> > +        target_bs = bs->file;
> > +    }
> 
> Hmm...I think now I understand why you are trying to use bs->file.  This
> is an attempt to make image formats work with the threshold.
> 
> Unfortunately the BlockDriverState topology can be more complicated than
> just 1 level.

I thought so but I can't reproduce yet more complex topologies.
Disclosure: I'm testing against the topology I know to be used on oVirt, lacking
immediate availability of others: suggestions welcome.

At risk of being redundant, in oVirt the usecase is QCOW2 over lvm block device,
and we'd like to be notified about the allocation of the lvm block device, which (IIUC)
is the last bs->file.

This is a simple topology (unless I'm missing something), and that's
the reason why I go down just one level.

Of course I want a general solution for this change, so...

> If we hardcode a strategy to traverse bs->file then it will work in most
> cases:
> 
>   while (bs->file) {
>       bs = bs->file;
>   }
> 
> But there are cases like VMDK extent files where a BlockDriverState
> actually has multiple children.
> 
> One way to solve this is to require that the management tool tells QEMU
> which exact BlockDriverState node the threshold applies to.  Then QEMU
> doesn't need any hardcoded policy.  But I'm not sure how realistic that
> it at the moment (whether management tools are uses node names for each
> node yet), so it may be best to hardcode the bs->file traversal that
> I've suggested.

oVirt relies on libvirt[1], and uses device node (e.g. 'vda').

BTW, I haven't found a way to inspect from the outside (e.g. monitor command) the BlockDriverState
topology, there is a way I'm missing?

Another issue I don't yet have a proper solution is related to this one.

The management app will have deal with VM with more than one block device disk, so we need
a way to map the notification with the corresponding device.

If we descend the bs->file chain, AFAIU the easiest mapping, being the device name,
is easily lost because only the outermost BlockDriverState has a device name attached, so when the
notification trigger
bdrv_get_device_name() will return NULL

I believe we don't necessarily need a device name in the notification, for example something
like an opaque cookie set together with the threshold and returned back with the notification
will suffice. Of course the device name is nicer :)

+++

[1] if that can help further to understand the usecase, these are the libvirt APIs being used
by oVirt:
http://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBlockStatsFlags
http://libvirt.org/html/libvirt-libvirt-domain.html#virDomainGetBlockInfo
both relies on the output[2] of 'query-blockstats' monitor command.

[2] AFAIU -but this is just my guesswork!- it also assumes a quite simple topology like I did

Thanks and bests,
Stefan Hajnoczi Nov. 19, 2014, 3:52 p.m. UTC | #3
On Tue, Nov 18, 2014 at 03:12:12AM -0500, Francesco Romani wrote:
> > > +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
> > > +                                            void *opaque)
> > > +{
> > > +    BdrvTrackedRequest *req = opaque;
> > > +    BlockDriverState *bs = req->bs;
> > > +    int64_t amount = 0;
> > > +
> > > +    assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
> > > +    assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
> > 
> > Does the code still make these assumptions or are the asserts left over
> > from previous versions of the patch?
> 
> It's a leftover.
> I understood they don't hurt and add a bit of safety, but if they are confusing
> I'll remove them.

Yes, it made me wonder why.  Probably best to remove them.

> > > +void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t
> > > threshold_bytes)
> > > +{
> > > +    BlockDriverState *target_bs = bs;
> > > +    if (bs->file) {
> > > +        target_bs = bs->file;
> > > +    }
> > 
> > Hmm...I think now I understand why you are trying to use bs->file.  This
> > is an attempt to make image formats work with the threshold.
> > 
> > Unfortunately the BlockDriverState topology can be more complicated than
> > just 1 level.
> 
> I thought so but I can't reproduce yet more complex topologies.
> Disclosure: I'm testing against the topology I know to be used on oVirt, lacking
> immediate availability of others: suggestions welcome.
> 
> At risk of being redundant, in oVirt the usecase is QCOW2 over lvm block device,
> and we'd like to be notified about the allocation of the lvm block device, which (IIUC)
> is the last bs->file.
> 
> This is a simple topology (unless I'm missing something), and that's
> the reason why I go down just one level.
> 
> Of course I want a general solution for this change, so...

There is a block driver for error injection called "blkdebug" (see
docs/blkdebug.txt).  Here is an example of the following topology:

  raw_bsd (drive0) -> blkdebug -> raw-posix (test.img)

qemu-system-x86_64 -drive if=virtio,format=raw,file.driver=blkdebug,file.image.filename=test.img

The blkdebug driver is interposing between the raw_bsd (drive0) root and
the raw-posix leaf node.

> > If we hardcode a strategy to traverse bs->file then it will work in most
> > cases:
> > 
> >   while (bs->file) {
> >       bs = bs->file;
> >   }
> > 
> > But there are cases like VMDK extent files where a BlockDriverState
> > actually has multiple children.
> > 
> > One way to solve this is to require that the management tool tells QEMU
> > which exact BlockDriverState node the threshold applies to.  Then QEMU
> > doesn't need any hardcoded policy.  But I'm not sure how realistic that
> > it at the moment (whether management tools are uses node names for each
> > node yet), so it may be best to hardcode the bs->file traversal that
> > I've suggested.
> 
> oVirt relies on libvirt[1], and uses device node (e.g. 'vda').
> 
> BTW, I haven't found a way to inspect from the outside (e.g. monitor command) the BlockDriverState
> topology, there is a way I'm missing?

You can get the BlockDriverState and ->backing_hd chain using the
query-block QMP command.  I'm not aware of a command that returns the
full graph of BlockDriverState nodes.

The management tool should not need to inspect the graph because the
graph can change at runtime (e.g. imagine I/O throttling is implemented
as a BlockDriverState node then it could appear/disapper when the
feature is activated/deactivated).  Instead the management tool should
name the nodes it knows about and then use those node names.

> Another issue I don't yet have a proper solution is related to this one.
> 
> The management app will have deal with VM with more than one block device disk, so we need
> a way to map the notification with the corresponding device.
> 
> If we descend the bs->file chain, AFAIU the easiest mapping, being the device name,
> is easily lost because only the outermost BlockDriverState has a device name attached, so when the
> notification trigger
> bdrv_get_device_name() will return NULL

In the worst case a name string can be passed in along with the
threshold values.

> I believe we don't necessarily need a device name in the notification, for example something
> like an opaque cookie set together with the threshold and returned back with the notification
> will suffice. Of course the device name is nicer :)

Agreed.
Francesco Romani Nov. 20, 2014, 8:23 a.m. UTC | #4
----- Original Message -----
> From: "Stefan Hajnoczi" <stefanha@redhat.com>
> To: "Francesco Romani" <fromani@redhat.com>
> Cc: kwolf@redhat.com, "Stefan Hajnoczi" <stefanha@gmail.com>, mdroth@linux.vnet.ibm.com, qemu-devel@nongnu.org,
> lcapitulino@redhat.com
> Sent: Wednesday, November 19, 2014 4:52:51 PM
> Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
> 
> On Tue, Nov 18, 2014 at 03:12:12AM -0500, Francesco Romani wrote:
> > > > +static int coroutine_fn before_write_notify(NotifierWithReturn
> > > > *notifier,
> > > > +                                            void *opaque)
> > > > +{
> > > > +    BdrvTrackedRequest *req = opaque;
> > > > +    BlockDriverState *bs = req->bs;
> > > > +    int64_t amount = 0;
> > > > +
> > > > +    assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
> > > > +    assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
> > > 
> > > Does the code still make these assumptions or are the asserts left over
> > > from previous versions of the patch?
> > 
> > It's a leftover.
> > I understood they don't hurt and add a bit of safety, but if they are
> > confusing
> > I'll remove them.
> 
> Yes, it made me wonder why.  Probably best to remove them.

Will do

[...]
> > At risk of being redundant, in oVirt the usecase is QCOW2 over lvm block
> > device,
> > and we'd like to be notified about the allocation of the lvm block device,
> > which (IIUC)
> > is the last bs->file.
> > 
> > This is a simple topology (unless I'm missing something), and that's
> > the reason why I go down just one level.
> > 
> > Of course I want a general solution for this change, so...
> 
> There is a block driver for error injection called "blkdebug" (see
> docs/blkdebug.txt).  Here is an example of the following topology:
> 
>   raw_bsd (drive0) -> blkdebug -> raw-posix (test.img)
> 
> qemu-system-x86_64 -drive
> if=virtio,format=raw,file.driver=blkdebug,file.image.filename=test.img
> 
> The blkdebug driver is interposing between the raw_bsd (drive0) root and
> the raw-posix leaf node.

Thanks, I'll have a look

[...]
> The management tool should not need to inspect the graph because the
> graph can change at runtime (e.g. imagine I/O throttling is implemented
> as a BlockDriverState node then it could appear/disapper when the
> feature is activated/deactivated).  Instead the management tool should
> name the nodes it knows about and then use those node names.

Agreed - and indeed simpler for us (oVirt), which it doesn't hurt :)

> > If we descend the bs->file chain, AFAIU the easiest mapping, being the
> > device name,
> > is easily lost because only the outermost BlockDriverState has a device
> > name attached, so when the
> > notification trigger
> > bdrv_get_device_name() will return NULL
> 
> In the worst case a name string can be passed in along with the
> threshold values.

OK, I guess to keep a copy of the string with g_strdup() could be good enough start,
at least for further discussion.

Thanks for your review and for the informations,
I'll submit a new revision of the patch in a couple of days,
to give to other reviewers some time to jump in.

Bests,
Kevin Wolf Nov. 20, 2014, 10:30 a.m. UTC | #5
Am 17.11.2014 um 17:49 hat Stefan Hajnoczi geschrieben:
> On Fri, Nov 07, 2014 at 02:12:13PM +0100, Francesco Romani wrote:
> > +void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t threshold_bytes)
> > +{
> > +    BlockDriverState *target_bs = bs;
> > +    if (bs->file) {
> > +        target_bs = bs->file;
> > +    }
> 
> Hmm...I think now I understand why you are trying to use bs->file.  This
> is an attempt to make image formats work with the threshold.
> 
> Unfortunately the BlockDriverState topology can be more complicated than
> just 1 level.
> 
> If we hardcode a strategy to traverse bs->file then it will work in most
> cases:
> 
>   while (bs->file) {
>       bs = bs->file;
>   }
> 
> But there are cases like VMDK extent files where a BlockDriverState
> actually has multiple children.
> 
> One way to solve this is to require that the management tool tells QEMU
> which exact BlockDriverState node the threshold applies to.  Then QEMU
> doesn't need any hardcoded policy.  But I'm not sure how realistic that
> it at the moment (whether management tools are uses node names for each
> node yet), so it may be best to hardcode the bs->file traversal that
> I've suggested.
> 
> Kevin: Do you agree?

I have a feeling that we would regret this in the long run because it
would allow only one special case of a general problem (watching a BDS).
This means that we'll get inconsistent APIs.

We're "only" talking about an optimisation here, even though a very
useful one, so I wouldn't easily make compromises here. We should
probably insist on using the node-name. Management tools need new code
anyway to make use of the new functionality, so they can implement
node-name support as well while they're at it.

Kevin
Stefan Hajnoczi Nov. 20, 2014, 11:04 a.m. UTC | #6
On Thu, Nov 20, 2014 at 11:30:53AM +0100, Kevin Wolf wrote:
> Am 17.11.2014 um 17:49 hat Stefan Hajnoczi geschrieben:
> > On Fri, Nov 07, 2014 at 02:12:13PM +0100, Francesco Romani wrote:
> > > +void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t threshold_bytes)
> > > +{
> > > +    BlockDriverState *target_bs = bs;
> > > +    if (bs->file) {
> > > +        target_bs = bs->file;
> > > +    }
> > 
> > Hmm...I think now I understand why you are trying to use bs->file.  This
> > is an attempt to make image formats work with the threshold.
> > 
> > Unfortunately the BlockDriverState topology can be more complicated than
> > just 1 level.
> > 
> > If we hardcode a strategy to traverse bs->file then it will work in most
> > cases:
> > 
> >   while (bs->file) {
> >       bs = bs->file;
> >   }
> > 
> > But there are cases like VMDK extent files where a BlockDriverState
> > actually has multiple children.
> > 
> > One way to solve this is to require that the management tool tells QEMU
> > which exact BlockDriverState node the threshold applies to.  Then QEMU
> > doesn't need any hardcoded policy.  But I'm not sure how realistic that
> > it at the moment (whether management tools are uses node names for each
> > node yet), so it may be best to hardcode the bs->file traversal that
> > I've suggested.
> > 
> > Kevin: Do you agree?
> 
> I have a feeling that we would regret this in the long run because it
> would allow only one special case of a general problem (watching a BDS).
> This means that we'll get inconsistent APIs.
> 
> We're "only" talking about an optimisation here, even though a very
> useful one, so I wouldn't easily make compromises here. We should
> probably insist on using the node-name. Management tools need new code
> anyway to make use of the new functionality, so they can implement
> node-name support as well while they're at it.

Using node-name is the best thing to do.

My concern is just whether libvirt and other management tools are
actually using node-name yet.

Stefan
Kevin Wolf Nov. 20, 2014, 11:34 a.m. UTC | #7
Am 20.11.2014 um 12:04 hat Stefan Hajnoczi geschrieben:
> On Thu, Nov 20, 2014 at 11:30:53AM +0100, Kevin Wolf wrote:
> > Am 17.11.2014 um 17:49 hat Stefan Hajnoczi geschrieben:
> > > On Fri, Nov 07, 2014 at 02:12:13PM +0100, Francesco Romani wrote:
> > > > +void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t threshold_bytes)
> > > > +{
> > > > +    BlockDriverState *target_bs = bs;
> > > > +    if (bs->file) {
> > > > +        target_bs = bs->file;
> > > > +    }
> > > 
> > > Hmm...I think now I understand why you are trying to use bs->file.  This
> > > is an attempt to make image formats work with the threshold.
> > > 
> > > Unfortunately the BlockDriverState topology can be more complicated than
> > > just 1 level.
> > > 
> > > If we hardcode a strategy to traverse bs->file then it will work in most
> > > cases:
> > > 
> > >   while (bs->file) {
> > >       bs = bs->file;
> > >   }
> > > 
> > > But there are cases like VMDK extent files where a BlockDriverState
> > > actually has multiple children.
> > > 
> > > One way to solve this is to require that the management tool tells QEMU
> > > which exact BlockDriverState node the threshold applies to.  Then QEMU
> > > doesn't need any hardcoded policy.  But I'm not sure how realistic that
> > > it at the moment (whether management tools are uses node names for each
> > > node yet), so it may be best to hardcode the bs->file traversal that
> > > I've suggested.
> > > 
> > > Kevin: Do you agree?
> > 
> > I have a feeling that we would regret this in the long run because it
> > would allow only one special case of a general problem (watching a BDS).
> > This means that we'll get inconsistent APIs.
> > 
> > We're "only" talking about an optimisation here, even though a very
> > useful one, so I wouldn't easily make compromises here. We should
> > probably insist on using the node-name. Management tools need new code
> > anyway to make use of the new functionality, so they can implement
> > node-name support as well while they're at it.
> 
> Using node-name is the best thing to do.
> 
> My concern is just whether libvirt and other management tools are
> actually using node-name yet.

I don't think so. They also don't use blockdev-add yet.

But that's not a reason for us to add hacks that allow libvirt and other
management tools to avoid the proper APIs even in the future. They just
need to add support for node-names if they want to use new qemu features.
New features require support for new infrastructure, I think that's fair.

If they feel that representing complete BDS graphs in their code is too
much work for now, they can still keep temporary hacks with hardcoded
assumptions in their management code (like setting file.node-name and
ignoring other setups). At least it would be temporary hacks there; if
we did them in qemu, they would be a permanent API.

Kevin
Eric Blake Nov. 21, 2014, 12:10 a.m. UTC | #8
On 11/20/2014 04:04 AM, Stefan Hajnoczi wrote:
>>
>> We're "only" talking about an optimisation here, even though a very
>> useful one, so I wouldn't easily make compromises here. We should
>> probably insist on using the node-name. Management tools need new code
>> anyway to make use of the new functionality, so they can implement
>> node-name support as well while they're at it.
> 
> Using node-name is the best thing to do.
> 
> My concern is just whether libvirt and other management tools are
> actually using node-name yet.

Libvirt is not yet using it, but the more compelling we make it, the
more libvirt will accelerate the efforts needed to start using
node-name.  I'm okay with requiring node-name here.
Francesco Romani Nov. 21, 2014, 8:43 a.m. UTC | #9
----- Original Message -----
> From: "Kevin Wolf" <kwolf@redhat.com>
> To: "Stefan Hajnoczi" <stefanha@redhat.com>
> Cc: mdroth@linux.vnet.ibm.com, "Stefan Hajnoczi" <stefanha@gmail.com>, lcapitulino@redhat.com, qemu-devel@nongnu.org,
> "Francesco Romani" <fromani@redhat.com>
> Sent: Thursday, November 20, 2014 12:34:28 PM
> Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices

> > > > One way to solve this is to require that the management tool tells QEMU
> > > > which exact BlockDriverState node the threshold applies to.  Then QEMU
> > > > doesn't need any hardcoded policy.  But I'm not sure how realistic that
> > > > it at the moment (whether management tools are uses node names for each
> > > > node yet), so it may be best to hardcode the bs->file traversal that
> > > > I've suggested.
> > > > 
> > > > Kevin: Do you agree?
> > > 
> > > I have a feeling that we would regret this in the long run because it
> > > would allow only one special case of a general problem (watching a BDS).
> > > This means that we'll get inconsistent APIs.
> > > 
> > > We're "only" talking about an optimisation here, even though a very
> > > useful one, so I wouldn't easily make compromises here. We should
> > > probably insist on using the node-name. Management tools need new code
> > > anyway to make use of the new functionality, so they can implement
> > > node-name support as well while they're at it.
> > 
> > Using node-name is the best thing to do.
> > 
> > My concern is just whether libvirt and other management tools are
> > actually using node-name yet.
> 
> I don't think so. They also don't use blockdev-add yet.
> 
> But that's not a reason for us to add hacks that allow libvirt and other
> management tools to avoid the proper APIs even in the future. They just
> need to add support for node-names if they want to use new qemu features.
> New features require support for new infrastructure, I think that's fair.
> 
> If they feel that representing complete BDS graphs in their code is too
> much work for now, they can still keep temporary hacks with hardcoded
> assumptions in their management code (like setting file.node-name and
> ignoring other setups). At least it would be temporary hacks there; if
> we did them in qemu, they would be a permanent API.

I'm fine to use node_name in my patch, it looks even much simpler and cleaner

I'd love to take this chance and learn more about the topic, becuse
I'm very near to the border of my knowledge in that area.
I joined the discussion quite later, so my sources are actually
pretty sparse. Mostly:
- staring at the sources and git history
- googling for specific bits
- presentations like http://www.linux-kvm.org/wiki/images/3/34/Kvm-forum-2013-block-dev-configuration.pdf

There are some sources I'm missing? Hopefully a nice wiki page I somehow lost :)

A couple of specific questions more, mostly to make sure I can do meaningful
tests for my next submission:

1. I'm running a simple test using the attached script -
which is a qemu command line adapted from libvirt ouput driven
by oVirt. There is a way to attach a name at this stage, using a QMP command?

2. (related to the former) it seems from a not-so-deep look that the blessed (only?)
way to set a proper node_name is using blockdev-add.
If so, I'm not sure I follow how the qemu boot flow would look like.
It will not be anymore as simple as crafting a command line and run the qemu, right?
IIUC some interaction with QMP will be needed (sorry for asking silly question,
trying to fill gaps in my knowledge).

Thanks for the great feedback!
Kevin Wolf Nov. 21, 2014, 10:11 a.m. UTC | #10
Am 21.11.2014 um 09:43 hat Francesco Romani geschrieben:
> ----- Original Message -----
> > From: "Kevin Wolf" <kwolf@redhat.com>
> > To: "Stefan Hajnoczi" <stefanha@redhat.com>
> > Cc: mdroth@linux.vnet.ibm.com, "Stefan Hajnoczi" <stefanha@gmail.com>, lcapitulino@redhat.com, qemu-devel@nongnu.org,
> > "Francesco Romani" <fromani@redhat.com>
> > Sent: Thursday, November 20, 2014 12:34:28 PM
> > Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
> 
> > > > > One way to solve this is to require that the management tool tells QEMU
> > > > > which exact BlockDriverState node the threshold applies to.  Then QEMU
> > > > > doesn't need any hardcoded policy.  But I'm not sure how realistic that
> > > > > it at the moment (whether management tools are uses node names for each
> > > > > node yet), so it may be best to hardcode the bs->file traversal that
> > > > > I've suggested.
> > > > > 
> > > > > Kevin: Do you agree?
> > > > 
> > > > I have a feeling that we would regret this in the long run because it
> > > > would allow only one special case of a general problem (watching a BDS).
> > > > This means that we'll get inconsistent APIs.
> > > > 
> > > > We're "only" talking about an optimisation here, even though a very
> > > > useful one, so I wouldn't easily make compromises here. We should
> > > > probably insist on using the node-name. Management tools need new code
> > > > anyway to make use of the new functionality, so they can implement
> > > > node-name support as well while they're at it.
> > > 
> > > Using node-name is the best thing to do.
> > > 
> > > My concern is just whether libvirt and other management tools are
> > > actually using node-name yet.
> > 
> > I don't think so. They also don't use blockdev-add yet.
> > 
> > But that's not a reason for us to add hacks that allow libvirt and other
> > management tools to avoid the proper APIs even in the future. They just
> > need to add support for node-names if they want to use new qemu features.
> > New features require support for new infrastructure, I think that's fair.
> > 
> > If they feel that representing complete BDS graphs in their code is too
> > much work for now, they can still keep temporary hacks with hardcoded
> > assumptions in their management code (like setting file.node-name and
> > ignoring other setups). At least it would be temporary hacks there; if
> > we did them in qemu, they would be a permanent API.
> 
> I'm fine to use node_name in my patch, it looks even much simpler and cleaner
> 
> I'd love to take this chance and learn more about the topic, becuse
> I'm very near to the border of my knowledge in that area.
> I joined the discussion quite later, so my sources are actually
> pretty sparse. Mostly:
> - staring at the sources and git history
> - googling for specific bits
> - presentations like http://www.linux-kvm.org/wiki/images/3/34/Kvm-forum-2013-block-dev-configuration.pdf

The video recording of that presentation wasn't really good quality,
unfortunately. But you can watch the one of this year's presentation by
Max and myself (you're probably mostly interested in Max's part), the
title was "More Block Device Configuration".

> There are some sources I'm missing? Hopefully a nice wiki page I somehow lost :)

I'm afraid that this single wiki page that has a well structured
presentation of all information doesn't exist.

> A couple of specific questions more, mostly to make sure I can do meaningful
> tests for my next submission:
> 
> 1. I'm running a simple test using the attached script -
> which is a qemu command line adapted from libvirt ouput driven
> by oVirt. There is a way to attach a name at this stage, using a QMP command?

No, node-name is assigned at the BlockDriverState (BDS) creation and
can't be changed later on.

> 2. (related to the former) it seems from a not-so-deep look that the blessed (only?)
> way to set a proper node_name is using blockdev-add.
> If so, I'm not sure I follow how the qemu boot flow would look like.
> It will not be anymore as simple as crafting a command line and run the qemu, right?
> IIUC some interaction with QMP will be needed (sorry for asking silly question,
> trying to fill gaps in my knowledge).

-drive on the command line can do everything that blockdev-add can do.
So let's assume you have a qcow2 image on a filesystem. Then you end up
with two BDSes, one for the format driver and one for accessing the
filesystem:

    BlockBackend (virtual device) -> qcow2 BDS -> file BDS (raw-posix.c)

For assigning a node name to the qcow2 BDS, you simply specify it in the
obvious way:

    -drive file=test.qcow2,node-name=foo

Now if you want to assign a node name to the file BDS as well, you would
get nested dicts in the blockdev-add call. In -drive a dot syntax is
used to represent this:

    -drive file=test.qcow2,node-name=foo,file.node-name=bar

Are things a bit clearer with this?

Kevin
Francesco Romani Nov. 21, 2014, 3:32 p.m. UTC | #11
----- Original Message -----
> From: "Kevin Wolf" <kwolf@redhat.com>
> To: "Francesco Romani" <fromani@redhat.com>
> Cc: qemu-devel@nongnu.org, "Stefan Hajnoczi" <stefanha@gmail.com>, mdroth@linux.vnet.ibm.com, "Luiz Capitulino"
> <lcapitulino@redhat.com>, "Stefan Hajnoczi" <stefanha@redhat.com>
> Sent: Friday, November 21, 2014 11:11:26 AM
> Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
[...]
> > 1. I'm running a simple test using the attached script -
> > which is a qemu command line adapted from libvirt ouput driven
> > by oVirt. There is a way to attach a name at this stage, using a QMP
> > command?
> 
> No, node-name is assigned at the BlockDriverState (BDS) creation and
> can't be changed later on.

Makes sense to me.

> > 2. (related to the former) it seems from a not-so-deep look that the
> > blessed (only?)
> > way to set a proper node_name is using blockdev-add.
> > If so, I'm not sure I follow how the qemu boot flow would look like.
> > It will not be anymore as simple as crafting a command line and run the
> > qemu, right?
> > IIUC some interaction with QMP will be needed (sorry for asking silly
> > question,
> > trying to fill gaps in my knowledge).
> 
> -drive on the command line can do everything that blockdev-add can do.
> So let's assume you have a qcow2 image on a filesystem. Then you end up
> with two BDSes, one for the format driver and one for accessing the
> filesystem:
> 
>     BlockBackend (virtual device) -> qcow2 BDS -> file BDS (raw-posix.c)
> 
> For assigning a node name to the qcow2 BDS, you simply specify it in the
> obvious way:
> 
>     -drive file=test.qcow2,node-name=foo
> 
> Now if you want to assign a node name to the file BDS as well, you would
> get nested dicts in the blockdev-add call. In -drive a dot syntax is
> used to represent this:
> 
>     -drive file=test.qcow2,node-name=foo,file.node-name=bar
> 
> Are things a bit clearer with this?

Yes, thanks a lot. I was a bit misleaded by the lack of the reference (after a very quick
look) in the man page.
Maybe the manpage is out of date, but this is a different story -and maybe a different patch :)

New revision will come in a few days.

Bests,
Eric Blake Nov. 21, 2014, 4:24 p.m. UTC | #12
On 11/21/2014 01:43 AM, Francesco Romani wrote:

> A couple of specific questions more, mostly to make sure I can do meaningful
> tests for my next submission:
> 
> 1. I'm running a simple test using the attached script -
> which is a qemu command line adapted from libvirt ouput driven
> by oVirt. There is a way to attach a name at this stage, using a QMP command?

Libvirt isn't yet attaching node names, and right now, there is no way
to retroactively attach a node name (only at creation).  Jeff Cody
proposed a patch prior to 2.1 that would give ALL nodes a generated name
if one was not supplied, but we still haven't taken that patch in, and
by now it probably needs rebasing...
diff mbox

Patch

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 04b0e43..43e381d 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -20,6 +20,7 @@  block-obj-$(CONFIG_GLUSTERFS) += gluster.o
 block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
 block-obj-$(CONFIG_LIBSSH2) += ssh.o
 block-obj-y += accounting.o
+block-obj-y += usage-threshold.o
 
 common-obj-y += stream.o
 common-obj-y += commit.o
diff --git a/block/qapi.c b/block/qapi.c
index 1301144..3bb0bc7 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -24,6 +24,7 @@ 
 
 #include "block/qapi.h"
 #include "block/block_int.h"
+#include "block/usage-threshold.h"
 #include "qmp-commands.h"
 #include "qapi-visit.h"
 #include "qapi/qmp-output-visitor.h"
@@ -315,6 +316,8 @@  static void bdrv_query_info(BlockBackend *blk, BlockInfo **p_info,
         }
     }
 
+    info->write_threshold = bdrv_get_usage_threshold(bs);
+
     *p_info = info;
     return;
 
diff --git a/block/usage-threshold.c b/block/usage-threshold.c
new file mode 100644
index 0000000..31a587d
--- /dev/null
+++ b/block/usage-threshold.c
@@ -0,0 +1,124 @@ 
+/*
+ * QEMU System Emulator block usage threshold notification
+ *
+ * Copyright Red Hat, Inc. 2014
+ *
+ * Authors:
+ *  Francesco Romani <fromani@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+
+#include "block/block_int.h"
+#include "block/coroutine.h"
+#include "block/usage-threshold.h"
+#include "qemu/notify.h"
+#include "qapi-event.h"
+#include "qmp-commands.h"
+
+
+int64_t bdrv_get_usage_threshold(const BlockDriverState *bs)
+{
+    if (bs == NULL) {
+        return 0;
+    }
+    if (bs->file) {
+        return bs->file->wr_offset_threshold;
+    }
+    return bs->wr_offset_threshold;
+}
+
+static void usage_threshold_disable(BlockDriverState *bs)
+{
+    notifier_with_return_remove(&bs->wr_usage_threshold_notifier);
+    bs->wr_offset_threshold = 0;
+}
+
+static int usage_threshold_is_set(const BlockDriverState *bs)
+{
+    return !!(bs->wr_offset_threshold > 0);
+}
+
+static int64_t usage_threshold_exceeded(const BlockDriverState *bs,
+                                        const BdrvTrackedRequest *req)
+{
+    if (usage_threshold_is_set(bs)) {
+        int64_t amount = req->offset + req->bytes - bs->wr_offset_threshold;
+        if (amount > 0) {
+            return amount;
+        }
+    }
+    return 0;
+}
+
+static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
+                                            void *opaque)
+{
+    BdrvTrackedRequest *req = opaque;
+    BlockDriverState *bs = req->bs;
+    int64_t amount = 0;
+
+    assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
+    assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
+
+    amount = usage_threshold_exceeded(bs, req);
+    if (amount > 0) {
+        qapi_event_send_block_usage_threshold(
+            bdrv_get_device_name(bs), /* FIXME: this does not work */
+            amount,
+            bs->wr_offset_threshold,
+            &error_abort);
+
+        /* autodisable to avoid to flood the monitor */
+        usage_threshold_disable(bs);
+    }
+
+    return 0; /* should always let other notifiers run */
+}
+
+static void usage_threshold_register_notifier(BlockDriverState *bs)
+{
+    bs->wr_usage_threshold_notifier.notify = before_write_notify;
+    notifier_with_return_list_add(&bs->before_write_notifiers,
+                                  &bs->wr_usage_threshold_notifier);
+}
+
+void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t threshold_bytes)
+{
+    BlockDriverState *target_bs = bs;
+    if (bs->file) {
+        target_bs = bs->file;
+    }
+
+    if (usage_threshold_is_set(target_bs)) {
+        if (threshold_bytes > 0) { /* update */
+            target_bs->wr_offset_threshold = threshold_bytes;
+        } else {
+            usage_threshold_disable(target_bs);
+        }
+    } else if (threshold_bytes > 0) { /* update only if meaningful */
+        usage_threshold_register_notifier(target_bs);
+        target_bs->wr_offset_threshold = threshold_bytes;
+    }
+}
+
+void qmp_block_set_threshold(const char *device, int64_t threshold_bytes,
+                             Error **errp)
+{
+    BlockDriverState *bs;
+    AioContext *aio_context;
+
+    bs = bdrv_find(device);
+    if (!bs) {
+        error_set(errp, QERR_DEVICE_NOT_FOUND, device);
+        return;
+    }
+
+    aio_context = bdrv_get_aio_context(bs);
+    aio_context_acquire(aio_context);
+
+    bdrv_set_usage_threshold(bs, threshold_bytes);
+
+    aio_context_release(aio_context);
+}
diff --git a/include/block/block_int.h b/include/block/block_int.h
index a1c17b9..f6b26f3 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -409,6 +409,10 @@  struct BlockDriverState {
 
     /* The error object in use for blocking operations on backing_hd */
     Error *backing_blocker;
+
+    /* threshold limit for writes, in bytes. "High water mark". */
+    int64_t wr_offset_threshold;
+    NotifierWithReturn wr_usage_threshold_notifier;
 };
 
 int get_tmp_filename(char *filename, int size);
diff --git a/include/block/usage-threshold.h b/include/block/usage-threshold.h
new file mode 100644
index 0000000..96b8274
--- /dev/null
+++ b/include/block/usage-threshold.h
@@ -0,0 +1,39 @@ 
+/*
+ * QEMU System Emulator block usage threshold notification
+ *
+ * Copyright Red Hat, Inc. 2014
+ *
+ * Authors:
+ *  Francesco Romani <fromani@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+#ifndef BLOCK_USAGE_THRESHOLD_H
+#define BLOCK_USAGE_THRESHOLD_H
+
+#include <stdint.h>
+
+#include "qemu/typedefs.h"
+
+/*
+ * bdrv_set_usage_threshold:
+ *
+ * Set the usage threshold for block devices, in bytes.
+ * Notify when a write exceeds the threshold, meaning the device
+ * is becoming full, so it can be transparently resized.
+ * To be used with thin-provisioned block devices.
+ *
+ * Use threshold_bytes == 0 to disable.
+ */
+void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t threshold_bytes);
+
+/*
+ * bdrv_get_usage_threshold
+ *
+ * Get the configured usage threshold, in bytes.
+ * Zero means no threshold configured.
+ */
+int64_t bdrv_get_usage_threshold(const BlockDriverState *bs);
+
+#endif
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 77a0cfb..3441e41 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -345,13 +345,17 @@ 
 # @inserted: #optional @BlockDeviceInfo describing the device if media is
 #            present
 #
+# @write-threshold: configured write threshold for the device.
+#                   0 if disabled. (Since 2.2)
+#
 # Since:  0.14.0
 ##
 { 'type': 'BlockInfo',
   'data': {'device': 'str', 'type': 'str', 'removable': 'bool',
            'locked': 'bool', '*inserted': 'BlockDeviceInfo',
            '*tray_open': 'bool', '*io-status': 'BlockDeviceIoStatus',
-           '*dirty-bitmaps': ['BlockDirtyInfo'] } }
+           '*dirty-bitmaps': ['BlockDirtyInfo'],
+           'write-threshold': 'int' } }
 
 ##
 # @query-block:
@@ -1827,3 +1831,43 @@ 
 ##
 { 'enum': 'PreallocMode',
   'data': [ 'off', 'metadata', 'falloc', 'full' ] }
+
+##
+# @BLOCK_USAGE_THRESHOLD
+#
+# Emitted when writes on block device reaches or exceeds the
+# configured threshold. For thin-provisioned devices, this
+# means the device should be extended to avoid pausing for
+# disk exaustion.
+#
+# @device: device name
+#
+# @amount-exceeded: amount of data which exceeded the threshold, in bytes.
+#
+# @offset-threshold: last configured threshold, in bytes.
+#
+# Since: 2.3
+##
+{ 'event': 'BLOCK_USAGE_THRESHOLD',
+  'data': { 'device': 'str', 'amount-exceeded': 'int', 'threshold': 'int' } }
+
+##
+# @block-set-threshold
+#
+# Change usage threshold for a block drive. An event will be delivered
+# if a write to this block drive crosses the configured threshold.
+# This is useful to transparently resize thin-provisioned drives without
+# the guest OS noticing.
+#
+# @device: The name of the device
+#
+# @write-threshold: configured threshold for the block device, bytes.
+#                   Use 0 to disable the threshold.
+#
+# Returns: Nothing on success
+#          If @device is not a valid block device, DeviceNotFound
+#
+# Since: 2.3
+##
+{ 'command': 'block-set-threshold',
+  'data': { 'device': 'str', 'threshold': 'int' } }
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 1abd619..93b0cb9 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -3852,3 +3852,29 @@  Move mouse pointer to absolute coordinates (20000, 400).
 <- { "return": {} }
 
 EQMP
+
+    {
+        .name       = "block-set-threshold",
+        .args_type  = "device:B,threshold:l",
+        .mhandler.cmd_new = qmp_marshal_input_block_set_threshold,
+    },
+
+SQMP
+block-set-threshold
+------------
+
+Change the write threshold for a block drive.
+
+Arguments:
+
+- "device": device name (json-string)
+- "threshold": the write threshold in bytes (json-int)
+
+Example:
+
+-> { "execute": "block-set-threshold",
+  "arguments": { "device": "drive-virtio-disk0",
+                 "threshold": 17179869184 } }
+<- { "return": {} }
+
+EQMP