Patchwork [1/3] FVD: Added support for 'qemu-img update'

login
register
mail settings
Submitter Chunqiang Tang
Date Jan. 21, 2011, 10:19 p.m.
Message ID <1295648355-17359-1-git-send-email-ctang@us.ibm.com>
Download mbox | patch
Permalink /patch/79951/
State New
Headers show

Comments

Chunqiang Tang - Jan. 21, 2011, 10:19 p.m.
This patch is part of the Fast Virtual Disk (FVD) proposal. See the related
discussions at
http://lists.gnu.org/archive/html/qemu-devel/2011-01/msg00426.html.

This patch adds the 'update' command to qemu-img. FVD stores various
image-specific configurable parameters in the image header. A user can use
'qemu-img update' to modify those parameters and accordingly controls FVD's
runtime behavior. This command may also be leveraged by other block device
drivers, e.g., to set the size of the in-memory metadata cache. Currently
those parameters are hard-coded in a one-size-fit-all manner.

Signed-off-by: Chunqiang Tang <ctang@us.ibm.com>
---
 block_int.h      |    1 +
 qemu-img-cmds.hx |    6 ++++++
 qemu-img.c       |   43 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 50 insertions(+), 0 deletions(-)
Stefan Hajnoczi - Jan. 28, 2011, 9:57 a.m.
On Fri, Jan 21, 2011 at 05:19:13PM -0500, Chunqiang Tang wrote:
> This patch adds the 'update' command to qemu-img. FVD stores various
> image-specific configurable parameters in the image header. A user can use
> 'qemu-img update' to modify those parameters and accordingly controls FVD's
> runtime behavior. This command may also be leveraged by other block device
> drivers, e.g., to set the size of the in-memory metadata cache. Currently
> those parameters are hard-coded in a one-size-fit-all manner.

There's a high risk that users will try this command while the VM is
running.  A safe-guard is needed here in order to avoid corrupting the
image.

Please use qemu-option.h instead of int argc, char **argv just like
qemu-img create -o does.

Finally, is this interface really necessary?  As a developer it can be
useful to tweak image values (in QED I actually have a free-standing
tool that can query and manipulate image internals).  But should users
need to micromanage every image file in order to achieve desired
functionality/performance?  What's the real need here?

> diff --git a/qemu-img.c b/qemu-img.c
> index afd9ed2..5f35c4d 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -1054,6 +1054,49 @@ static int img_info(int argc, char **argv)
>      return 0;
>  }
>  
> +static int img_update(int argc, char **argv)
> +{
> +    int c;
> +    const char *filename, *fmt;
> +    BlockDriverState *bs;
> +
> +    fmt = NULL;
> +    for(;;) {
> +        c = getopt(argc, argv, "f:h");
> +        if (c == -1)
> +            break;

{}, see CODING_STYLE and HACKING.

> +        switch(c) {
> +        case 'h':
> +            help();
> +            break;
> +        case 'f':
> +            fmt = optarg;
> +            break;
> +        }
> +    }
> +    if (optind >= argc)
> +        help();

{}

> +    filename = argv[optind++];
> +    bs = bdrv_new_open(filename, fmt, BDRV_O_FLAGS | BDRV_O_NO_BACKING
> +                       | BDRV_O_RDWR);
> +    if (!bs) {
> +        return 1;
> +    }
> +
> +    if (bs->drv->bdrv_update) {
> +        bs->drv->bdrv_update(bs, argc-optind, &argv[optind]);
> +    }
> +    else {

} else {, see CODING_STYLE

> +        char fmt_name[128];
> +        bdrv_get_format(bs, fmt_name, sizeof(fmt_name));
> +        error_report ("The 'update' command is not supported for "
> +                      "the '%s' image format.", fmt_name);

Return value should be 1?

Stefan
Chunqiang Tang - Jan. 28, 2011, 2:51 p.m.
> On Fri, Jan 21, 2011 at 05:19:13PM -0500, Chunqiang Tang wrote:
> > This patch adds the 'update' command to qemu-img. FVD stores various
> > image-specific configurable parameters in the image header. A user can 
use
> > 'qemu-img update' to modify those parameters and accordingly controls 
FVD's
> > runtime behavior. This command may also be leveraged by other block 
device
> > drivers, e.g., to set the size of the in-memory metadata cache. 
Currently
> > those parameters are hard-coded in a one-size-fit-all manner.
> 
> There's a high risk that users will try this command while the VM is
> running.  A safe-guard is needed here in order to avoid corrupting the
> image.

Good observation and this should be added. In FVD, once an image is open, 
a dirty bit is set in the image header. qemu-img update can refuse to work 
or ask for confirmation if it sees the dirty bit.
 
> Please use qemu-option.h instead of int argc, char **argv just like
> qemu-img create -o does.

Will do.
 
> Finally, is this interface really necessary?  As a developer it can be
> useful to tweak image values (in QED I actually have a free-standing
> tool that can query and manipulate image internals).  But should users
> need to micromanage every image file in order to achieve desired
> functionality/performance?  What's the real need here?

Default values of the image parameters may be suitable for most users, but 
there are cases where more control is needed, not only for performance 
tuning but also for functional control. This is especially true for 
copy-on-read and prefetching, and when many VMs are created based on the 
same image template (like that in a Cloud or the publicly shared VMware 
appliance http://www.vmware.com/appliances/). For example, the image 
template may turn on copy-on-read and prefetching, but a user wants to set 
it off. Even for prefetching, there are parameters that control when to 
start prefetching and how much bandwidth prefetching can use (as 
prefetching has to be conservative). This can be different in different 
environments, e.g., 100MB vs 10GB Ethernet, SAN disk array  vs. a local 
disk. Without control, users share the same image template may often find 
it hard to use.

Performance tuning may also be important. Because QEMU supports so many 
diverse platforms, we cannot expect one size fit all. Specifically, when 
doing benchmarking, I found QCOW2's metadata cache was way too small even 
for my 32-bit host, but I simply couldn't control that.


Regards,
ChunQiang (CQ) Tang, 
Homepage: http://www.research.ibm.com/people/c/ctang
Stefan Hajnoczi - Jan. 28, 2011, 4:16 p.m.
On Fri, Jan 28, 2011 at 2:51 PM, Chunqiang Tang <ctang@us.ibm.com> wrote:
>> On Fri, Jan 21, 2011 at 05:19:13PM -0500, Chunqiang Tang wrote:
>> > This patch adds the 'update' command to qemu-img. FVD stores various
>> > image-specific configurable parameters in the image header. A user can
> use
>> > 'qemu-img update' to modify those parameters and accordingly controls
> FVD's
>> > runtime behavior. This command may also be leveraged by other block
> device
>> > drivers, e.g., to set the size of the in-memory metadata cache.
> Currently
>> > those parameters are hard-coded in a one-size-fit-all manner.
>>
>> There's a high risk that users will try this command while the VM is
>> running.  A safe-guard is needed here in order to avoid corrupting the
>> image.
>
> Good observation and this should be added. In FVD, once an image is open,
> a dirty bit is set in the image header. qemu-img update can refuse to work
> or ask for confirmation if it sees the dirty bit.
>
>> Please use qemu-option.h instead of int argc, char **argv just like
>> qemu-img create -o does.
>
> Will do.
>
>> Finally, is this interface really necessary?  As a developer it can be
>> useful to tweak image values (in QED I actually have a free-standing
>> tool that can query and manipulate image internals).  But should users
>> need to micromanage every image file in order to achieve desired
>> functionality/performance?  What's the real need here?
>
> Default values of the image parameters may be suitable for most users, but
> there are cases where more control is needed, not only for performance
> tuning but also for functional control. This is especially true for
> copy-on-read and prefetching, and when many VMs are created based on the
> same image template (like that in a Cloud or the publicly shared VMware
> appliance http://www.vmware.com/appliances/). For example, the image
> template may turn on copy-on-read and prefetching, but a user wants to set
> it off. Even for prefetching, there are parameters that control when to
> start prefetching and how much bandwidth prefetching can use (as
> prefetching has to be conservative). This can be different in different
> environments, e.g., 100MB vs 10GB Ethernet, SAN disk array  vs. a local
> disk. Without control, users share the same image template may often find
> it hard to use.

It should be possible to change prefetching and copy-on-read while the
VM is running.  For example, having to shut down a VM in order to
pause the prefetching is not workable.  In the QED image streaming
tree there are monitor commands for this:

http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/stream-command

Stefan
Chunqiang Tang - Jan. 28, 2011, 9:26 p.m.
> It should be possible to change prefetching and copy-on-read while the
> VM is running.  For example, having to shut down a VM in order to
> pause the prefetching is not workable.  In the QED image streaming
> tree there are monitor commands for this:
> 
> http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/stream-command

I took a quick look at the code. Using a monitor command to dynamically 
control copy-on-read and prefetching is a good idea. This should be 
adopted in FVD as well. 

On another note, I saw that the code does not support (copy_on_read=off && 
stream=on). My previous benchmarking shows that copy_on_read does slow 
down other normal reads and writes, because it needs to save data to disk. 
For example, numbers in my papers show that, on ext3, FVD with 
copy_on_read=on actually boots a VM slower than QCOW2 does, even if FVD's 
copy-on-read is already heavily optimized and is not on the  critical path 
of read (i.e., callback is invoked and data is returned to the VM first, 
and then save copy-on-read data asynchronously in the background). 
Therefore, it might be possible that a user does not want to enable 
copy-on-read and only wants to do prefetching when resources are idle.
Stefan Hajnoczi - Jan. 29, 2011, 10:05 a.m.
On Fri, Jan 28, 2011 at 9:26 PM, Chunqiang Tang <ctang@us.ibm.com> wrote:
>> It should be possible to change prefetching and copy-on-read while the
>> VM is running.  For example, having to shut down a VM in order to
>> pause the prefetching is not workable.  In the QED image streaming
>> tree there are monitor commands for this:
>>
>> http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/stream-command
>
> I took a quick look at the code. Using a monitor command to dynamically
> control copy-on-read and prefetching is a good idea. This should be
> adopted in FVD as well.

After thinking about it more, qemu-img update does also serve a
purpose.  Sometimes it is necessary to set options on many images in
bulk or from provisioning scripts instead of at runtime.

I guess my main fear of qemu-img update is that it adds a new
interface that only FVD exploits so far.  If it never catches on with
other formats then we have this special feature that must be
maintained but is rarely used.  I'd hold off this patch until code
that can make use of it has been merged into qemu.git.

> On another note, I saw that the code does not support (copy_on_read=off &&
> stream=on). My previous benchmarking shows that copy_on_read does slow
> down other normal reads and writes, because it needs to save data to disk.
> For example, numbers in my papers show that, on ext3, FVD with
> copy_on_read=on actually boots a VM slower than QCOW2 does, even if FVD's
> copy-on-read is already heavily optimized and is not on the  critical path
> of read (i.e., callback is invoked and data is returned to the VM first,
> and then save copy-on-read data asynchronously in the background).
> Therefore, it might be possible that a user does not want to enable
> copy-on-read and only wants to do prefetching when resources are idle.

The current implementation basically takes advantage of copy-on-read
in order to populate the image.

There's a lot of room for studying the behavior and making
improvements.  Coming up with throttling strategies that make the
prefetch I/O an "idle task" only when there's bandwidth available is
difficult because the problem is more complex than just one greedy
QEMU process.  In a cloud environment there will be any physical
hosts, each with multiple VMs, on a shared network and no single QEMU
process has global knowledge.  It's more like TCP where you need to
try seeing how much data the connection can carry, fall back on packet
loss, and then gradually try again.  But I'm not sure we have a
feedback mechanism to say "you're doing too much prefetching".

Stefan
Chunqiang Tang - Jan. 31, 2011, 2:49 p.m.
> After thinking about it more, qemu-img update does also serve a

> purpose.  Sometimes it is necessary to set options on many images in

> bulk or from provisioning scripts instead of at runtime.

> 

> I guess my main fear of qemu-img update is that it adds a new

> interface that only FVD exploits so far.  If it never catches on with

> other formats then we have this special feature that must be

> maintained but is rarely used.  I'd hold off this patch until code

> that can make use of it has been merged into qemu.git.


I am fine with holding it off. Actually, 'qemu-img rebase' is already a 
special type of 'qemu-img update'. Without a general update interface, 
every parameter change would have to use a special interface like 
'rebase'.

> There's a lot of room for studying the behavior and making

> improvements.  Coming up with throttling strategies that make the

> prefetch I/O an "idle task" only when there's bandwidth available is

> difficult because the problem is more complex than just one greedy

> QEMU process.  In a cloud environment there will be any physical

> hosts, each with multiple VMs, on a shared network and no single QEMU

> process has global knowledge.  It's more like TCP where you need to

> try seeing how much data the connection can carry, fall back on packet

> loss, and then gradually try again.  But I'm not sure we have a

> feedback mechanism to say "you're doing too much prefetching".


Your observation on the similarity to TCP is incisive. To my knowledge, 
the two papers below are the most relevant work on congestion control for 
VM storage. I evaluated [1] below, and found that it is not most reliable. 
I copied some text from my paper below. Basically, since a storage system 
does not have packet loss, there are two ways of detecting congestion in a 
storage system: 1) making decisions based on increased latency, or 2) 
making decisions based on reduced throughput. The paper [1] below uses 
latency but I found it not always reliable. The current implementation in 
FVD is based on throughput, which tends to be more reliable. But as you 
said,this is still a problem that has a lot of room for future study.

===========
"[1] also performs adaptive prefetching. It halves the prefetch rate
if a certain “percentage” of recent requests experiencing
a high latency. Our experiments show that it is hard to set
a proper “percentage” to reliably detect contentions. Because
storage servers and disk controllers perform readahead
in large chunks for sequential reads, a very large
percentage (e.g., 90%) of a VM’s prefetching reads hit in
read-ahead caches and experience a low latency. When a
storage server becomes busy, the “percentage” of requests
that hit in read-ahead caches may change little, but the response
time of those cache-miss requests may increase
dramatically. In other words, this “percentage” does not
correlate well with the achieved disk I/O throughput."
============

The paper [2] below from VMware is informative but cannot be adopted by us 
directly, as the problem domain is different. I previously had a paper on 
general congestion control, 
https://sites.google.com/site/tangchq/papers/NCI-USENIX09.pdf?attredirects=0 
, and attended a conference together with an author of paper [2] , and had 
some discussions. It is an interesting paper.

[1] http://suif.stanford.edu/papers/nsdi05.pdf . 
[2] http://www.usenix.org/events/fast09/tech/full_papers/gulati/gulati.pdf 

 


Regards,
ChunQiang (CQ) Tang
Homepage: http://www.research.ibm.com/people/c/ctang
Stefan Hajnoczi - Feb. 1, 2011, 1:53 p.m.
On Mon, Jan 31, 2011 at 2:49 PM, Chunqiang Tang <ctang@us.ibm.com> wrote:
> The paper [2] below from VMware is informative but cannot be adopted by us
> directly, as the problem domain is different. I previously had a paper on
> general congestion control,
> https://sites.google.com/site/tangchq/papers/NCI-USENIX09.pdf?attredirects=0
> , and attended a conference together with an author of paper [2] , and had
> some discussions. It is an interesting paper.
>
> [1] http://suif.stanford.edu/papers/nsdi05.pdf .
> [2] http://www.usenix.org/events/fast09/tech/full_papers/gulati/gulati.pdf

Thanks for the links, I will take a look.  I can see both latency and
throughput being problematic.  With throughput the problem is that the
I/O pattern may be sporadic or bursty.  It's hard for a heuristic to
follow large changes in the guest's I/O pattern.

Stefan

Patch

diff --git a/block_int.h b/block_int.h
index 12663e8..e98872a 100644
--- a/block_int.h
+++ b/block_int.h
@@ -98,6 +98,7 @@  struct BlockDriver {
     int (*bdrv_snapshot_load_tmp)(BlockDriverState *bs,
                                   const char *snapshot_name);
     int (*bdrv_get_info)(BlockDriverState *bs, BlockDriverInfo *bdi);
+    int (*bdrv_update)(BlockDriverState *bs, int argc, char **argv);
 
     int (*bdrv_save_vmstate)(BlockDriverState *bs, const uint8_t *buf,
                              int64_t pos, int size);
diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index 6c7176f..1ad378b 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -39,6 +39,12 @@  STEXI
 @item info [-f @var{fmt}] @var{filename}
 ETEXI
 
+DEF("update", img_update,
+    "update [-f fmt] filename [attr1=val1 attr2=val2 ...]")
+STEXI
+@item update [-f @var{fmt}] @var{filename} [@var{attr1=val1 attr2=val2 ...}]")
+ETEXI
+
 DEF("snapshot", img_snapshot,
     "snapshot [-l | -a snapshot | -c snapshot | -d snapshot] filename")
 STEXI
diff --git a/qemu-img.c b/qemu-img.c
index afd9ed2..5f35c4d 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1054,6 +1054,49 @@  static int img_info(int argc, char **argv)
     return 0;
 }
 
+static int img_update(int argc, char **argv)
+{
+    int c;
+    const char *filename, *fmt;
+    BlockDriverState *bs;
+
+    fmt = NULL;
+    for(;;) {
+        c = getopt(argc, argv, "f:h");
+        if (c == -1)
+            break;
+        switch(c) {
+        case 'h':
+            help();
+            break;
+        case 'f':
+            fmt = optarg;
+            break;
+        }
+    }
+    if (optind >= argc)
+        help();
+    filename = argv[optind++];
+    bs = bdrv_new_open(filename, fmt, BDRV_O_FLAGS | BDRV_O_NO_BACKING
+                       | BDRV_O_RDWR);
+    if (!bs) {
+        return 1;
+    }
+
+    if (bs->drv->bdrv_update) {
+        bs->drv->bdrv_update(bs, argc-optind, &argv[optind]);
+    }
+    else {
+        char fmt_name[128];
+        bdrv_get_format(bs, fmt_name, sizeof(fmt_name));
+        error_report ("The 'update' command is not supported for "
+                      "the '%s' image format.", fmt_name);
+    }
+
+    bdrv_delete(bs);
+    return 0;
+}
+
 #define SNAPSHOT_LIST   1
 #define SNAPSHOT_CREATE 2
 #define SNAPSHOT_APPLY  3