diff mbox

block: add watermark event

Message ID 1404830964-10733-2-git-send-email-fromani@redhat.com
State New
Headers show

Commit Message

Francesco Romani July 8, 2014, 2:49 p.m. UTC
Managing applications, like oVirt (http://www.ovirt.org), make extensive
use of thin-provisioned disk images.
In order to let the guest run flawlessly and be not unnecessarily
paused, oVirt sets a watermark based on the percentage occupation of the
device against the advertised size, and automatically resizes the image
once the watermark is reached or exceeded.

In order to detect the mark crossing, the managing application has no
choice than aggressively polling the QEMU monitor using the
query-blockstats command. This lead to unnecessary system
load, and is made even worse under scale: scenarios
with hundreds of VM are becoming not unusual.

To fix this, this patch adds:
* A new monitor command to set a mark for a given block device.
* A new event to report if a block device usage exceeds the threshold.

This will allow the managing application to drop the polling
alltogether and just wait for a watermark crossing event.

Signed-off-by: Francesco Romani <fromani@redhat.com>
---
 block.c                   | 56 +++++++++++++++++++++++++++++++++++++++++++++++
 blockdev.c                | 21 ++++++++++++++++++
 include/block/block.h     |  2 ++
 include/block/block_int.h |  3 +++
 qapi/block-core.json      | 33 ++++++++++++++++++++++++++++
 qmp-commands.hx           | 24 ++++++++++++++++++++
 6 files changed, 139 insertions(+)

Comments

Eric Blake July 8, 2014, 3:10 p.m. UTC | #1
On 07/08/2014 08:49 AM, Francesco Romani wrote:
> Managing applications, like oVirt (http://www.ovirt.org), make extensive
> use of thin-provisioned disk images.
> In order to let the guest run flawlessly and be not unnecessarily
> paused, oVirt sets a watermark based on the percentage occupation of the
> device against the advertised size, and automatically resizes the image
> once the watermark is reached or exceeded.
> 
> In order to detect the mark crossing, the managing application has no
> choice than aggressively polling the QEMU monitor using the
> query-blockstats command. This lead to unnecessary system
> load, and is made even worse under scale: scenarios
> with hundreds of VM are becoming not unusual.
> 
> To fix this, this patch adds:
> * A new monitor command to set a mark for a given block device.
> * A new event to report if a block device usage exceeds the threshold.
> 
> This will allow the managing application to drop the polling
> alltogether and just wait for a watermark crossing event.

s/alltogether/altogether/

> 
> Signed-off-by: Francesco Romani <fromani@redhat.com>
> ---
>  block.c                   | 56 +++++++++++++++++++++++++++++++++++++++++++++++
>  blockdev.c                | 21 ++++++++++++++++++
>  include/block/block.h     |  2 ++
>  include/block/block_int.h |  3 +++
>  qapi/block-core.json      | 33 ++++++++++++++++++++++++++++
>  qmp-commands.hx           | 24 ++++++++++++++++++++
>  6 files changed, 139 insertions(+)
> 


> +void bdrv_set_watermark_perc(BlockDriverState *bs,
> +                             int watermark_perc)
> +{
> +    NotifierWithReturn before_write = {
> +        .notify = watermark_before_write_notify,
> +    };
> +
> +    if (watermark_perc <= 0) {
> +        return;
> +    }

Shouldn't you uninstall the write_notifier if someone is changing the
watermark_perc from non-zero back to zero?


> +++ b/include/block/block_int.h
> @@ -393,6 +393,9 @@ struct BlockDriverState {
>  
>      /* The error object in use for blocking operations on backing_hd */
>      Error *backing_blocker;
> +
> +    /* watermark limit for writes, percentage of sectors */
> +    int wr_watermark_perc;

I didn't see any code that ensures that this limit is between 0 and 100;
since it is under user control, your code probably misbehaves for values
such as 101.

> +
> +##
> +# @BLOCK_WATERMARK
> +#
> +# Emitted when a block device reaches or exceeds the watermark limit.
> +#
> +# @device: device name
> +#
> +# @sector-num: number of the sector exceeding the threshold
> +#
> +# @sector-highest: number of the last highest written sector
> +#
> +# Since: 2.1

2.2. You've missed hard freeze for 2.1.

> +##
> +{ 'event': 'BLOCK_WATERMARK',
> +  'data': { 'device': 'str', 'sector-num': 'int', 'sector-highest': 'int' } }
> +
> +##
> +# @block_set_watermark

s/_/-/2 - new commands should use - in their name.

> +#
> +# Change watermark percentage for a block drive.
> +#
> +# @device: The name of the device
> +#
> +# @watermark: high water mark, percentage

Is percentage the right unit?  On a 10G disk, that means you only have a
granularity down to 100M.  Should the watermark limit instead be
expressed as a byte value instead of a percentage?

> +#
> +# Returns: Nothing on success
> +#          If @device is not a valid block device, DeviceNotFound
> +#
> +# Since: 2.1

2.2.

> +##
> +{ 'command': 'block_set_watermark',
> +  'data': { 'device': 'str', 'watermark': 'int' } }

I hate write-only commands.  Which query-* command are you modifying to
output the current watermark level?
Stefan Hajnoczi Aug. 1, 2014, 11:39 a.m. UTC | #2
On Tue, Jul 08, 2014 at 04:49:24PM +0200, Francesco Romani wrote:
> @@ -5813,3 +5815,57 @@ void bdrv_flush_io_queue(BlockDriverState *bs)
>          bdrv_flush_io_queue(bs->file);
>      }
>  }
> +
> +static bool watermark_exceeded(BlockDriverState *bs,
> +                               int64_t sector_num,
> +                               int nb_sectors)
> +{
> +
> +    if (bs->wr_watermark_perc > 0) {
> +        int64_t watermark = (bs->total_sectors) / 100 * bs->wr_watermark_perc;

bs->total_sectors should not be used directly.

Have you considered making the watermark parameter take sector units
instead of a percentage?

I'm not sure whether a precentage makes sense because 25% of a 10GB
image is 2.5 GB so a 75% watermark might be reasonable.  25% of a 1 TB
image is 250 GB and that's probably not a reasonable watermark.

So let the block-set-watermark caller pass an absolute sector number
instead.  It keeps things simple for both QEMU and thin provisioning
manager.

> +        if (sector_num >= watermark) {
> +            return true;
> +        }
> +    }
> +    return false;
> +}
> +
> +static int coroutine_fn watermark_before_write_notify(NotifierWithReturn *notifier,
> +                                                      void *opaque)
> +{
> +    BdrvTrackedRequest *req = opaque;
> +    int64_t sector_num = req->offset >> BDRV_SECTOR_BITS;
> +    int nb_sectors = req->bytes >> BDRV_SECTOR_BITS;
> +
> +/*  FIXME: needed? */
> +    assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
> +    assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);

Not really needed here.  Emulated storage controllers either get
requests in block units (i.e. they are automatically aligned) or check
them (like virtio-blk).

I guess there's no harm in checking, but I would drop it.

> +
> +    if (watermark_exceeded(req->bs, sector_num, nb_sectors)) {
> +        BlockDriverState *bs = req->bs;
> +        qapi_event_send_block_watermark(
> +            bdrv_get_device_name(bs),
> +            sector_num,
> +            bs->wr_highest_sector,
> +            &error_abort);

How do you prevent flooding events if every write request exceeds the
watermark?

Perhaps the watermark should be disabled until block-set-watermark is
called again.

> +    }
> +
> +    return 0; /* should always let other notifiers run */
> +}
> +
> +void bdrv_set_watermark_perc(BlockDriverState *bs,
> +                             int watermark_perc)
> +{
> +    NotifierWithReturn before_write = {
> +        .notify = watermark_before_write_notify,
> +    };
> +
> +    if (watermark_perc <= 0) {
> +        return;
> +    }
> +
> +    if (bs->wr_watermark_perc == 0) {
> +        bdrv_add_before_write_notifier(bs, &before_write);

before_write must be a BlockDriverState field so it has the correct
lifetime.  In this patch before_write is allocated on the stack and will
cause invalid memory accesses once we leave this function.
Kevin Wolf Aug. 5, 2014, 8:47 a.m. UTC | #3
Am 01.08.2014 um 13:39 hat Stefan Hajnoczi geschrieben:
> On Tue, Jul 08, 2014 at 04:49:24PM +0200, Francesco Romani wrote:
> > @@ -5813,3 +5815,57 @@ void bdrv_flush_io_queue(BlockDriverState *bs)
> >          bdrv_flush_io_queue(bs->file);
> >      }
> >  }
> > +
> > +static bool watermark_exceeded(BlockDriverState *bs,
> > +                               int64_t sector_num,
> > +                               int nb_sectors)
> > +{
> > +
> > +    if (bs->wr_watermark_perc > 0) {
> > +        int64_t watermark = (bs->total_sectors) / 100 * bs->wr_watermark_perc;
> 
> bs->total_sectors should not be used directly.
> 
> Have you considered making the watermark parameter take sector units
> instead of a percentage?
> 
> I'm not sure whether a precentage makes sense because 25% of a 10GB
> image is 2.5 GB so a 75% watermark might be reasonable.  25% of a 1 TB
> image is 250 GB and that's probably not a reasonable watermark.
> 
> So let the block-set-watermark caller pass an absolute sector number
> instead.  It keeps things simple for both QEMU and thin provisioning
> manager.

No sector numbers in external interfaces, please. These units of 512
bytes are completely arbitrary and don't make any sense. I hope to get
rid of BDRV_SECTOR_* eventually even internally.

So for external APIs, please use bytes instead.

> > +        if (sector_num >= watermark) {
> > +            return true;
> > +        }
> > +    }
> > +    return false;
> > +}
> > +
> > +static int coroutine_fn watermark_before_write_notify(NotifierWithReturn *notifier,
> > +                                                      void *opaque)
> > +{
> > +    BdrvTrackedRequest *req = opaque;
> > +    int64_t sector_num = req->offset >> BDRV_SECTOR_BITS;
> > +    int nb_sectors = req->bytes >> BDRV_SECTOR_BITS;
> > +
> > +/*  FIXME: needed? */
> > +    assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
> > +    assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
> 
> Not really needed here.  Emulated storage controllers either get
> requests in block units (i.e. they are automatically aligned) or check
> them (like virtio-blk).

But we don't only get requests from emulated storage controllers. You
cannot reasonably assume anything here (the requests are aligned
according to the requirements of the backend, but that may just be 1).
Again, do your calculations in a byte granularity to be safe here.

> I guess there's no harm in checking, but I would drop it.
> 
> > +
> > +    if (watermark_exceeded(req->bs, sector_num, nb_sectors)) {
> > +        BlockDriverState *bs = req->bs;
> > +        qapi_event_send_block_watermark(
> > +            bdrv_get_device_name(bs),
> > +            sector_num,
> > +            bs->wr_highest_sector,
> > +            &error_abort);
> 
> How do you prevent flooding events if every write request exceeds the
> watermark?
> 
> Perhaps the watermark should be disabled until block-set-watermark is
> called again.

I agree.

Kevin
Stefan Hajnoczi Aug. 5, 2014, 1:08 p.m. UTC | #4
On Tue, Aug 05, 2014 at 10:47:57AM +0200, Kevin Wolf wrote:
> Am 01.08.2014 um 13:39 hat Stefan Hajnoczi geschrieben:
> > On Tue, Jul 08, 2014 at 04:49:24PM +0200, Francesco Romani wrote:
> > > @@ -5813,3 +5815,57 @@ void bdrv_flush_io_queue(BlockDriverState *bs)
> > >          bdrv_flush_io_queue(bs->file);
> > >      }
> > >  }
> > > +
> > > +static bool watermark_exceeded(BlockDriverState *bs,
> > > +                               int64_t sector_num,
> > > +                               int nb_sectors)
> > > +{
> > > +
> > > +    if (bs->wr_watermark_perc > 0) {
> > > +        int64_t watermark = (bs->total_sectors) / 100 * bs->wr_watermark_perc;
> > 
> > bs->total_sectors should not be used directly.
> > 
> > Have you considered making the watermark parameter take sector units
> > instead of a percentage?
> > 
> > I'm not sure whether a precentage makes sense because 25% of a 10GB
> > image is 2.5 GB so a 75% watermark might be reasonable.  25% of a 1 TB
> > image is 250 GB and that's probably not a reasonable watermark.
> > 
> > So let the block-set-watermark caller pass an absolute sector number
> > instead.  It keeps things simple for both QEMU and thin provisioning
> > manager.
> 
> No sector numbers in external interfaces, please. These units of 512
> bytes are completely arbitrary and don't make any sense. I hope to get
> rid of BDRV_SECTOR_* eventually even internally.
> 
> So for external APIs, please use bytes instead.

I agree and forgot about that.  Please use bytes instead of sectors or a
percentage.
Francesco Romani Aug. 8, 2014, 8:01 a.m. UTC | #5
----- Original Message -----
> From: "Stefan Hajnoczi" <stefanha@redhat.com>
> To: "Kevin Wolf" <kwolf@redhat.com>
> Cc: mdroth@linux.vnet.ibm.com, "Francesco Romani" <fromani@redhat.com>, qemu-devel@nongnu.org, lcapitulino@redhat.com
> Sent: Tuesday, August 5, 2014 3:08:46 PM
> Subject: Re: [Qemu-devel] [PATCH] block: add watermark event
> 
> On Tue, Aug 05, 2014 at 10:47:57AM +0200, Kevin Wolf wrote:
> > Am 01.08.2014 um 13:39 hat Stefan Hajnoczi geschrieben:
> > > On Tue, Jul 08, 2014 at 04:49:24PM +0200, Francesco Romani wrote:
> > > > @@ -5813,3 +5815,57 @@ void bdrv_flush_io_queue(BlockDriverState *bs)
> > > >          bdrv_flush_io_queue(bs->file);
> > > >      }
> > > >  }
> > > > +
> > > > +static bool watermark_exceeded(BlockDriverState *bs,
> > > > +                               int64_t sector_num,
> > > > +                               int nb_sectors)
> > > > +{
> > > > +
> > > > +    if (bs->wr_watermark_perc > 0) {
> > > > +        int64_t watermark = (bs->total_sectors) / 100 *
> > > > bs->wr_watermark_perc;
> > > 
> > > bs->total_sectors should not be used directly.
> > > 
> > > Have you considered making the watermark parameter take sector units
> > > instead of a percentage?
> > > 
> > > I'm not sure whether a precentage makes sense because 25% of a 10GB
> > > image is 2.5 GB so a 75% watermark might be reasonable.  25% of a 1 TB
> > > image is 250 GB and that's probably not a reasonable watermark.
> > > 
> > > So let the block-set-watermark caller pass an absolute sector number
> > > instead.  It keeps things simple for both QEMU and thin provisioning
> > > manager.
> > 
> > No sector numbers in external interfaces, please. These units of 512
> > bytes are completely arbitrary and don't make any sense. I hope to get
> > rid of BDRV_SECTOR_* eventually even internally.
> > 
> > So for external APIs, please use bytes instead.
> 
> I agree and forgot about that.  Please use bytes instead of sectors or a
> percentage.
> 

Thanks everyone for the great feedback received!

I'll post asap a new patch addressing all the comments.

I'll also change the name because 'watermark' may be misleading/wrong jargon :)
(http://en.wikipedia.org/wiki/Watermark)

Bests,
Eric Blake Aug. 8, 2014, 12:51 p.m. UTC | #6
On 08/08/2014 02:01 AM, Francesco Romani wrote:

>>>> So let the block-set-watermark caller pass an absolute sector number
>>>> instead.  It keeps things simple for both QEMU and thin provisioning
>>>> manager.
>>>
>>> No sector numbers in external interfaces, please. These units of 512
>>> bytes are completely arbitrary and don't make any sense. I hope to get
>>> rid of BDRV_SECTOR_* eventually even internally.
>>>
>>> So for external APIs, please use bytes instead.
>>
>> I agree and forgot about that.  Please use bytes instead of sectors or a
>> percentage.
>>
> 
> Thanks everyone for the great feedback received!
> 
> I'll post asap a new patch addressing all the comments.
> 
> I'll also change the name because 'watermark' may be misleading/wrong jargon :)
> (http://en.wikipedia.org/wiki/Watermark)

Is "threshold" a better name than "watermark"?
diff mbox

Patch

diff --git a/block.c b/block.c
index 8800a6b..cf34b7f 100644
--- a/block.c
+++ b/block.c
@@ -1983,6 +1983,8 @@  static void bdrv_move_feature_fields(BlockDriverState *bs_dest,
     bs_dest->device_list = bs_src->device_list;
     memcpy(bs_dest->op_blockers, bs_src->op_blockers,
            sizeof(bs_dest->op_blockers));
+
+    bs_dest->wr_watermark_perc  = bs_src->wr_watermark_perc;
 }
 
 /*
@@ -5813,3 +5815,57 @@  void bdrv_flush_io_queue(BlockDriverState *bs)
         bdrv_flush_io_queue(bs->file);
     }
 }
+
+static bool watermark_exceeded(BlockDriverState *bs,
+                               int64_t sector_num,
+                               int nb_sectors)
+{
+
+    if (bs->wr_watermark_perc > 0) {
+        int64_t watermark = (bs->total_sectors) / 100 * bs->wr_watermark_perc;
+        if (sector_num >= watermark) {
+            return true;
+        }
+    }
+    return false;
+}
+
+static int coroutine_fn watermark_before_write_notify(NotifierWithReturn *notifier,
+                                                      void *opaque)
+{
+    BdrvTrackedRequest *req = opaque;
+    int64_t sector_num = req->offset >> BDRV_SECTOR_BITS;
+    int nb_sectors = req->bytes >> BDRV_SECTOR_BITS;
+
+/*  FIXME: needed? */
+    assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
+    assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
+
+    if (watermark_exceeded(req->bs, sector_num, nb_sectors)) {
+        BlockDriverState *bs = req->bs;
+        qapi_event_send_block_watermark(
+            bdrv_get_device_name(bs),
+            sector_num,
+            bs->wr_highest_sector,
+            &error_abort);
+    }
+
+    return 0; /* should always let other notifiers run */
+}
+
+void bdrv_set_watermark_perc(BlockDriverState *bs,
+                             int watermark_perc)
+{
+    NotifierWithReturn before_write = {
+        .notify = watermark_before_write_notify,
+    };
+
+    if (watermark_perc <= 0) {
+        return;
+    }
+
+    if (bs->wr_watermark_perc == 0) {
+        bdrv_add_before_write_notifier(bs, &before_write);
+    }
+    bs->wr_watermark_perc = watermark_perc;
+}
diff --git a/blockdev.c b/blockdev.c
index 48bd9a3..ede21d9 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -2546,6 +2546,27 @@  BlockJobInfoList *qmp_query_block_jobs(Error **errp)
     return dummy.next;
 }
 
+void qmp_block_set_watermark(const char *device, int64_t watermark,
+                             Error **errp)
+{
+    BlockDriverState *bs;
+    AioContext *aio_context;
+
+    bs = bdrv_find(device);
+    if (!bs) {
+        error_set(errp, QERR_DEVICE_NOT_FOUND, device);
+        return;
+    }
+
+    aio_context = bdrv_get_aio_context(bs);
+    aio_context_acquire(aio_context);
+
+    bdrv_set_watermark_perc(bs, watermark);
+
+    aio_context_release(aio_context);
+}
+
+
 QemuOptsList qemu_common_drive_opts = {
     .name = "drive",
     .head = QTAILQ_HEAD_INITIALIZER(qemu_common_drive_opts.head),
diff --git a/include/block/block.h b/include/block/block.h
index 32d3676..ff92ef9 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -588,4 +588,6 @@  void bdrv_io_plug(BlockDriverState *bs);
 void bdrv_io_unplug(BlockDriverState *bs);
 void bdrv_flush_io_queue(BlockDriverState *bs);
 
+void bdrv_set_watermark_perc(BlockDriverState *bs, int watermark_perc);
+
 #endif
diff --git a/include/block/block_int.h b/include/block/block_int.h
index f6c3bef..666ea1d 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -393,6 +393,9 @@  struct BlockDriverState {
 
     /* The error object in use for blocking operations on backing_hd */
     Error *backing_blocker;
+
+    /* watermark limit for writes, percentage of sectors */
+    int wr_watermark_perc;
 };
 
 int get_tmp_filename(char *filename, int size);
diff --git a/qapi/block-core.json b/qapi/block-core.json
index e378653..58e3b05 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1643,3 +1643,36 @@ 
             'len'   : 'int',
             'offset': 'int',
             'speed' : 'int' } }
+
+##
+# @BLOCK_WATERMARK
+#
+# Emitted when a block device reaches or exceeds the watermark limit.
+#
+# @device: device name
+#
+# @sector-num: number of the sector exceeding the threshold
+#
+# @sector-highest: number of the last highest written sector
+#
+# Since: 2.1
+##
+{ 'event': 'BLOCK_WATERMARK',
+  'data': { 'device': 'str', 'sector-num': 'int', 'sector-highest': 'int' } }
+
+##
+# @block_set_watermark
+#
+# Change watermark percentage for a block drive.
+#
+# @device: The name of the device
+#
+# @watermark: high water mark, percentage
+#
+# Returns: Nothing on success
+#          If @device is not a valid block device, DeviceNotFound
+#
+# Since: 2.1
+##
+{ 'command': 'block_set_watermark',
+  'data': { 'device': 'str', 'watermark': 'int' } }
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 4be4765..89fee40 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -3755,3 +3755,27 @@  Example:
 <- { "return": {} }
 
 EQMP
+
+    {
+        .name       = "block_set_watermark",
+        .args_type  = "device:B,watermark:l",
+        .mhandler.cmd_new = qmp_marshal_input_block_set_watermark,
+    },
+
+SQMP
+block_set_watermark
+------------
+
+Change the high water mark for a block drive.
+
+Arguments:
+
+- "device": device name (json-string)
+- "watermark": the high water mark in percentage (json-int)
+
+Example:
+
+-> { "execute": "block_set_watermark", "arguments": { "device": "virtio0", "watermark": 75 } }
+<- { "return": {} }
+
+EQMP