
Introduce cache images for the QCOW2 format

Message ID 1376413436-5424-1-git-send-email-kaveh@cs.vu.nl
State New

Commit Message

Kaveh Razavi Aug. 13, 2013, 5:03 p.m. UTC
Using copy-on-write images with the base image stored remotely is common
practice in data centers. This saves significant network traffic by
avoiding the transfer of the complete base image. However, the data
blocks needed for a VM boot still need to be transferred to the node that
runs the VM. On slower networks, this will create a bottleneck when
booting many VMs simultaneously from a single VM image. Also,
simultaneously booting VMs from more than one VM image creates a
bottleneck at the storage device of the base image, if the storage
device does not fair well with the random access pattern that happens
during booting.

This patch introduces a block-level caching mechanism by introducing a
copy-on-read image that supports quota and goes in between the base
image and copy-on-write image. This cache image can either be stored on
the nodes that run VMs or on a storage device that can handle random
access well (e.g. memory, SSD, etc.). This cache image is effective
since usually only a very small part of the image is necessary for
booting a VM. We measured 100MB to be enough for default CentOS and
Debian installations.

A cache image with a quota of 100MB can be created using these commands:

$ qemu-img create -f qcow2 -o cache_img_quota=104857600,backing_file=/path/to/base /path/to/cache
$ qemu-img create -f qcow2 -o backing_file=/path/to/cache /path/to/cow
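
The VM is then started from the copy-on-write image as usual; an
illustrative invocation (not part of this patch) would be:

$ qemu-system-x86_64 -m 1024 -drive file=/path/to/cow,format=qcow2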

The first time a VM boots from the copy-on-write image, the cache gets
warm. Subsequent boots do not need to read from the base image.

The implementation is a small extension to the QCOW2 format. If you are
interested in knowing more, please read this paper:
http://cs.vu.nl/~kaveh/pubs/pdf/sc13.pdf
---
 block.c                   |   28 +++++++++-
 block/qcow2.c             |  121 +++++++++++++++++++++++++++++++++++++++++++--
 block/qcow2.h             |    6 ++
 include/block/block_int.h |    3 +
 4 files changed, 151 insertions(+), 7 deletions(-)

Comments

Eric Blake Aug. 13, 2013, 9:37 p.m. UTC | #1
On 08/13/2013 11:03 AM, Kaveh Razavi wrote:
> Using copy-on-write images with the base image stored remotely is common
> practice in data centers. This saves significant network traffic by
> avoiding the transfer of the complete base image. However, the data
> blocks needed for a VM boot still need to be transfered to the node that
> runs the VM. On slower networks, this will create a bottleneck when
> booting many VMs simultaneously from a single VM image. Also,
> simultaneously booting VMs from more than one VM image creates a
> bottleneck at the storage device of the base image, if the storage
> device does not fair well with the random access pattern that happens

s/fair/fare/

> during booting.
> 
> This patch introduces a block-level caching mechanism by introducing a
> copy-on-read image that supports quota and goes in between the base
> image and copy-on-write image. This cache image can either be stored on
> the nodes that run VMs or on a storage device that can handle random
> access well (e.g. memory, SSD, etc.). This cache image is effective
> since usually only a very small part of the image is necessary for
> booting a VM. We measured 100MB to be enough for a default CentOS and
> Debian installations.
> 
> A cache image with a quota of 100MB can be created using these commands:
> 
> $ qemu-img create -f qcow2 -o
> cache_img_quota=104857600,backing_file=/path/to/base /path/to/cache
> $ qemu-img create -f qcow2 -o backing_file=/path/to/cache /path/to/cow

What is the QMP counterpart for hot-plugging a disk with the cache
attached?  Is this something that can integrate nicely with Kevin's
planned blockdev-add for 1.7?

> 
> The first time a VM boots from the copy-on-write image, the cache gets
> warm. Subsequent boots do not need to read from the base image.
> 
> The implementation is a small extension to the QCOW2 format. If you are
> interested to know more, please read this paper:
> http://cs.vu.nl/~kaveh/pubs/pdf/sc13.pdf

Please post this as a series, with patch 1 of the series being a doc
patch to docs/specs/qcow2.txt fully documenting this extension.  No one
else can be expected to interoperate with your extension if you don't
document it upstream.  I want to make sure that there are no races where
two competing processes both open a file read-write, and where the first
process requests to modify metadata (possibly by adding the cache
designation header, but also by making some other modification), but
where before that completes, the other process sees an incomplete
picture of the metadata.  We already document that qemu-img is liable to
misbehave, even when operating in read-only mode, on an image held
read-write by one qemu process; and I want to see in the formal docs
(without having to chase a link to the pdf) how you guarantee that this
is safe.
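
For reference, the header extension added by the patch below uses magic
0x31393834 and carries two big-endian 64-bit fields, so the requested
docs/specs/qcow2.txt entry would need to describe roughly the following
layout (field names taken from the patch; exact wording is up to the series):

    Header extension 0x31393834 (cache image):
        Byte  0 -  7:   cache_img_inuse - number of bytes of data
                        currently stored in the cache image
              8 - 15:   cache_img_quota - maximum number of bytes the
                        cache image is allowed to store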
Alex Bligh Aug. 13, 2013, 10:53 p.m. UTC | #2
--On 13 August 2013 19:03:56 +0200 Kaveh Razavi <kaveh@cs.vu.nl> wrote:

> This patch introduces a block-level caching mechanism by introducing a
> copy-on-read image that supports quota and goes in between the base
> image and copy-on-write image. This cache image can either be stored on
> the nodes that run VMs or on a storage device that can handle random
> access well (e.g. memory, SSD, etc.).

What is this cache keyed on and how is it invalidated? Let's say
2 VMs on node X boot with backing file A. The first populates the cache,
and the second utilises the cache. I then stop both VMs, delete
the derived disks, and change the contents of the backing file. I then
boot a VM using the changed backing file on node X and node Y. I think
node Y is going to get the clean backing file. However, how does node
X know not to use the cache? Would it not be a good idea to check
(at least) the inode number and the mtime of the backing file correspond
with values saved in the cache, and if not the same then ignore the
cache?
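
(At the management level the check proposed above could be approximated
with something like the following sketch, paths made up; inside QEMU it
would presumably be an fstat() of the backing file instead:)

$ stat --format='%i %Y' /path/to/base > /path/to/cache.stamp     # when creating the cache
$ [ "$(stat --format='%i %Y' /path/to/base)" = "$(cat /path/to/cache.stamp)" ] \
      || rm -f /path/to/cache                                    # backing file changed: drop the cache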

> +    /* backing files always opened read-only except for cache images,
> +     * we first open the file with RDWR and check whether it is a cache
> +     * image. If so, we leave the RDWR, if not, we re-open read-only.
> +     */

This seems like the less safe way of doing it. Why not open RO and check
whether it is a cache image, and if so reopen RDWR. That would mean
you never open a backing file that isn't a cache image RDWR even for
only a second? That sounds safer to me, and is also the least change
for the existing code path.

> @@ -840,7 +886,6 @@ static coroutine_fn int qcow2_co_writev(BlockDriverState *bs,
>      qemu_co_mutex_lock(&s->lock);
>
>      while (remaining_sectors != 0) {
> -
>          l2meta = NULL;
>
>          trace_qcow2_writev_start_part(qemu_coroutine_self());

Pointless whitespace change.
Alex Bligh Aug. 13, 2013, 11:16 p.m. UTC | #3
--On 13 August 2013 19:03:56 +0200 Kaveh Razavi <kaveh@cs.vu.nl> wrote:

>  Also,
> simultaneously booting VMs from more than one VM image creates a
> bottleneck at the storage device of the base image, if the storage
> device does not fair well with the random access pattern that happens
> during booting.

Additional question (sorry for splitting them up)

The above para implies you intend one cache file to be shared by
two VMs booting from the same backing image on the same node.
If that's true, how do you protect yourself from the following:

    VM1                                         VM2

1.  Read rq for block 1234

2.  Start writing block 1234 to cache file

3.                                              Read rq for blk 1234

4.                                              Read blk 1234 from
                                                cache file

5.  Finish writing block 1234 to cache file


As far as I can see VM2 could read an incomplete write from
the cache file.

Further, unless you're opening these files O_DIRECT, how do you
know half the writes from VM1 won't be sitting dirty in the page
cache when you read using VM2?
Stefan Hajnoczi Aug. 14, 2013, 9:29 a.m. UTC | #4
On Tue, Aug 13, 2013 at 07:03:56PM +0200, Kaveh Razavi wrote:
> Using copy-on-write images with the base image stored remotely is common
> practice in data centers. This saves significant network traffic by
> avoiding the transfer of the complete base image. However, the data
> blocks needed for a VM boot still need to be transfered to the node that
> runs the VM. On slower networks, this will create a bottleneck when
> booting many VMs simultaneously from a single VM image. Also,
> simultaneously booting VMs from more than one VM image creates a
> bottleneck at the storage device of the base image, if the storage
> device does not fair well with the random access pattern that happens
> during booting.
> 
> This patch introduces a block-level caching mechanism by introducing a
> copy-on-read image that supports quota and goes in between the base
> image and copy-on-write image. This cache image can either be stored on
> the nodes that run VMs or on a storage device that can handle random
> access well (e.g. memory, SSD, etc.). This cache image is effective
> since usually only a very small part of the image is necessary for
> booting a VM. We measured 100MB to be enough for a default CentOS and
> Debian installations.
> 
> A cache image with a quota of 100MB can be created using these commands:
> 
> $ qemu-img create -f qcow2 -o
> cache_img_quota=104857600,backing_file=/path/to/base /path/to/cache
> $ qemu-img create -f qcow2 -o backing_file=/path/to/cache /path/to/cow
> 
> The first time a VM boots from the copy-on-write image, the cache gets
> warm. Subsequent boots do not need to read from the base image.

100 MB is small enough for RAM.  Did you try enabling the host kernel
page cache for the backing file?  That way all guests running on this
host share a single RAM-cached version of the backing file.

The other existing solution is to use the image streaming feature, which
was designed to speed up deployment of image files over the network.  It
copies the contents of the image from a remote server onto the host
while allowing immediate random access from the guest.  This isn't a
cache, this is a full copy of the image.

I'll share an idea of how to turn this into a cache in a second, but first
how to deploy this safely.  Since multiple QEMU processes can share a
backing file and the cache must not suffer from corruptions due to
races, you can use one qemu-nbd per backing image.  The QEMU processes
connect to the local read-only qemu-nbd server.

If you want a cache you could enable copy-on-read without the image
streaming feature (block_stream command) and evict old data using
discard commands.  No qcow2 image format changes are necessary to do
this.
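
A rough sketch of that kind of deployment (ports, paths and drive options
are illustrative; note that copy-on-read here populates each guest's own
overlay rather than a shared cache):

$ qemu-nbd --read-only --persistent --shared=16 --port 10809 /nfs/template.qcow2
$ qemu-img create -f qcow2 -o backing_file=nbd:localhost:10809 /local/vm001.qcow2
$ qemu-system-x86_64 -drive file=/local/vm001.qcow2,format=qcow2,copy-on-read=on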

> @@ -730,6 +751,31 @@ static coroutine_fn int qcow2_co_readv(BlockDriverState *bs, int64_t sector_num,
>                      if (ret < 0) {
>                          goto fail;
>                      }
> +                    /* do copy-on-read if this is a cache image */
> +                    if (bs->is_cache_img && !s->is_cache_full && 
> +                            !s->is_writing_on_cache)
> +                    {
> +                        qemu_co_mutex_unlock(&s->lock);
> +                        s->is_writing_on_cache = true;
> +                        ret = bdrv_co_writev(bs,
> +                                             sector_num,
> +                                             n1,
> +                                             &hd_qiov);
> +                        s->is_writing_on_cache = false;
> +                        qemu_co_mutex_lock(&s->lock);
> +                        if (ret < 0) {
> +                            if (ret == (-ENOSPC))
> +                            {
> +                                s->is_cache_full = true;
> +                            }
> +                            else {
> +                                /* error is other than cache space */
> +                                fprintf(stderr, "Cache write error (%d)\n", 
> +                                        ret);
> +                                goto fail;
> +                            }
> +                        }
> +                    }

This is unsafe since other QEMU processes on the host are not
synchronizing with each other.  The image file will be corrupted.

Stefan
Kaveh Razavi Aug. 14, 2013, 11:13 a.m. UTC | #5
On 08/13/2013 11:37 PM, Eric Blake wrote:
> What is the QMP counterpart for hot-plugging a disk with the cache
> attached?  Is this something that can integrate nicely with Kevin's
> planned blockdev-add for 1.7?
> 

I do not know the details of this, but as long as it has proper support
for backing files, integration should be straightforward. I can take a
look if you point me to the right path.

> Please post this as a series, with patch 1 of the series being a doc
> patch to docs/specs/qcow2.txt fully documenting this extension.  No one
> else can be expected to interoperate with your extension if you don't
> document it upstream.  I want to make sure that there are no races where
> two competing processes both open a file read-write, and where the first
> process requests to modify metadata (possibly by adding the cache
> designation header, but also by making some other modification), but
> where before that completes, the other process sees an incomplete
> picture of the metadata.  We already document that qemu-img is liable to
> misbehave, even when operating in read-only mode, on an image held
> read-write by one qemu process; and I want to see in the formal docs
> (without having to chase a link to the pdf) how you guarantee that this
> is safe.

I can repost this later on, updating the document and fixing the race.
The way we intended it, only a single qemu process at each host should
write to the cache. Once this is done, the cache should be marked as
ready. Subsequent boots on that host can reuse this read-only cache,
only if it is ready. For this, we likely need a variable in the header
extension that defines whether the cache image is ready/cold/dirty. If
the image is not ready, it should be discarded and the cache's backing
file be used instead. If the image is cold, a VM can set the variable to
dirty, start warming it, and set it back to ready on shutdown.
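
In other words, the lifecycle would look roughly like this (a sketch of the
scheme described above, not of the current patch):

    cold  --(first QEMU opens it r/w and marks it dirty)--> dirty
    dirty --(clean shutdown of that QEMU)-----------------> ready
    dirty --(found dirty on open, e.g. after a crash)-----> discard, use its backing file
    ready --(opened read-only by any number of VMs)-------> ready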

Kaveh
Kaveh Razavi Aug. 14, 2013, 11:28 a.m. UTC | #6
On 08/14/2013 12:53 AM, Alex Bligh wrote:
> What is this cache keyed on and how is it invalidated? Let's say a
> 2 VM on node X boot with backing file A. The first populates the cache,
> and the second utilises the cache. I then stop both VMs, delete
> the derived disks, and change the contents of the backing file. I then
> boot a VM using the changed backing file on node X and node Y. I think
> node Y is going to get the clean backing file. However, how does node
> X know not to use the cache? Would it not be a good idea to check
> (at least) the inode number and the mtime of the backing file correspond
> with values saved in the cache, and if not the same then ignore the
> cache?

You could argue the same for normal qcow2. Start from a cow image with a
backing image, stop the VM. Start another VM, modifying the backing
image directly. Start the VM again, this time from the cow image, and
the VM can see stale data in the stored data clusters of the cow image.

The idea is once a user registers an image to a cloud middleware, it is
assigned an image ID. As long as the middleware assigns a cache to the
backing image with the same ID, there is no possibility to read stale
data. If it is desired to have some sort of check at the qemu level, it
should be implemented in the qcow2 directly for all backing files and
this extension will benefit from it too.

> This seems like the less safe way of doing it. Why not open RO and check
> whether it is a cache image, and if so reopen RDWR. That would mean
> you never open a backing file that isn't a cache image RDWR even for
> only a second? That sounds safer to me, and is also the least change
> for the existing code path.

It should be possible to do it the other way around. I will check it out.

Kaveh
Kaveh Razavi Aug. 14, 2013, 11:42 a.m. UTC | #7
On 08/14/2013 01:16 AM, Alex Bligh wrote:
> The above para implies you intend one cache file to be shared by
> two VMs booting from the same backing image on the same node.
> If that's true, how do you protect yourself from the following:
>

Not really. I meant different backing images, and not necessarily
booting on the same host.

>    VM1                                         VM2
> 
> 1.  Read rq for block 1234
> 
> 2.  Start writing block 1234 to cache file
> 
> 3.                                              Read fq for blk 1234
> 
> 4.                                              Read blk 1234 from
>                                                cache file
> 
> 5.  Finish writing block 1234 to cache file
> 
> 
> As far as I can see VM1 could read an incomplete write from
> the cache file.
> 
> Further, unless you're opening these files O_DIRECT, how do you
> know half the writes from VM1 won't be sitting dirty in the page
> cache when you read using VM2?

This is essentially the same as what Eric mentioned in his email. As
long as only one VM writes to the image and it is not simultaneously
used by other VMs this should not happen.

There is little benefit in having a second VM reading from the cache
that is being created by another VM. If the backing file is not opened
with O_DIRECT, the reads from the first VM will likely exist in the host
page cache for the second VM.

Kaveh
Fam Zheng Aug. 14, 2013, 11:52 a.m. UTC | #8
On Wed, 08/14 13:28, Kaveh Razavi wrote:
> On 08/14/2013 12:53 AM, Alex Bligh wrote:
> > What is this cache keyed on and how is it invalidated? Let's say a
> > 2 VM on node X boot with backing file A. The first populates the cache,
> > and the second utilises the cache. I then stop both VMs, delete
> > the derived disks, and change the contents of the backing file. I then
> > boot a VM using the changed backing file on node X and node Y. I think
> > node Y is going to get the clean backing file. However, how does node
> > X know not to use the cache? Would it not be a good idea to check
> > (at least) the inode number and the mtime of the backing file correspond
> > with values saved in the cache, and if not the same then ignore the
> > cache?
> 
> You could argue the same for normal qcow2. Start from a cow image with a
> backing image, stop the VM. Start another VM, modifying the backing
> image directly. Start the VM again, this time from the cow image, and
> the VM can see stale data in the stored data clusters of the cow image.
> 
> The idea is once a user registers an image to a cloud middleware, it is
> assigned an image ID. As long as the middleware assigns a cache to the
> backing image with the same ID, there is no possibility to read stale
> data. If it is desired to have some sort of check at the qemu level, it
> should be implemented in the qcow2 directly for all backing files and
> this extension will benefit from it too.
> 
Yes, this one sounds good to have. VMDK and VHDX have this kind of
backing file status validation.

Thanks.

Fam
Alex Bligh Aug. 14, 2013, 11:57 a.m. UTC | #9
Kaveh,

On 14 Aug 2013, at 12:28, Kaveh Razavi wrote:

> On 08/14/2013 12:53 AM, Alex Bligh wrote:
>> What is this cache keyed on and how is it invalidated? Let's say a
>> 2 VM on node X boot with backing file A. The first populates the cache,
>> and the second utilises the cache. I then stop both VMs, delete
>> the derived disks, and change the contents of the backing file. I then
>> boot a VM using the changed backing file on node X and node Y. I think
>> node Y is going to get the clean backing file. However, how does node
>> X know not to use the cache? Would it not be a good idea to check
>> (at least) the inode number and the mtime of the backing file correspond
>> with values saved in the cache, and if not the same then ignore the
>> cache?
> 
> You could argue the same for normal qcow2. Start from a cow image with a
> backing image, stop the VM. Start another VM, modifying the backing
> image directly. Start the VM again, this time from the cow image, and
> the VM can see stale data in the stored data clusters of the cow image.

That's if the VM retains a qcow2 based on the backing file. I meant reboot
the two VMs with a fresh qcow2 read/write file.

> The idea is once a user registers an image to a cloud middleware, it is
> assigned an image ID. As long as the middleware assigns a cache to the
> backing image with the same ID, there is no possibility to read stale
> data. If it is desired to have some sort of check at the qemu level, it
> should be implemented in the qcow2 directly for all backing files and
> this extension will benefit from it too.

I don't agree. The penalty for a qcow2 suffering a false positive on
a change to a backing file is that the VM can no longer boot. The
penalty for your cache suffering a false positive is that the
VM boots marginally slower. Moreover, it is expected behaviour that
you CAN change a backing file if there are no r/w images based on
it. Your cache changes that assumption.
Alex Bligh Aug. 14, 2013, 12:02 p.m. UTC | #10
On 14 Aug 2013, at 12:42, Kaveh Razavi wrote:

> On 08/14/2013 01:16 AM, Alex Bligh wrote:
>> The above para implies you intend one cache file to be shared by
>> two VMs booting from the same backing image on the same node.
>> If that's true, how do you protect yourself from the following
> 
> Not really. I meant different backing images, and not necessarily
> booting on the same host.

So how does your cache solve the problem you mentioned in that
para?

>> As far as I can see VM1 could read an incomplete write from
>> the cache file.
>> 
>> Further, unless you're opening these files O_DIRECT, how do you
>> know half the writes from VM1 won't be sitting dirty in the page
>> cache when you read using VM2?
> 
> This is essentially the same as what Eric mentioned in his email. As
> long as only one VM writes to the image and it is not simultaneously
> used by other VMs this should not happen.

Correct. So you need one cache file per VM, not one per image.

> There is little benefit in having a second VM reading from the cache
> that is being created by another VM. If the backing file is not opened
> with O_DIRECT, the reads from the first VM will likely exist in the host
> page cache for the second VM.


Correct. So the patch would only speed up the first reboot of
a VM that has already been booted on that node (as you have one
cache file per VM). However, when the second VM boots with the
same image, rather than the pages being hot in the page cache,
it will be loading them from whatever device the cache is on
(as per the above, we need one cache per VM), and whatever that
is, it won't be faster than RAM.

So I fail to see how this speeds things up. Do you have measured
numbers?

I ask because I'm genuinely interested in caching strategies here.
I think the problem is on the write side and not the read side,
particularly where the host page cache is already used (e.g.
cache=writeback) and the image uses ureadahead.
Alex Bligh Aug. 14, 2013, 12:03 p.m. UTC | #11
On 14 Aug 2013, at 12:52, Fam Zheng wrote:

> Yes, this one sounds good to have. VMDK and VHDX have this kind of
> backing file status validation.

... though I'd prefer something safer than looking at mtime, for
instance a sequence number that is incremented prior to any
bdrv_close if a write has been done since bdrv_open.
Kaveh Razavi Aug. 14, 2013, 1:37 p.m. UTC | #12
On 08/14/2013 01:57 PM, Alex Bligh wrote:
> I don't agree. The penalty for a qcow2 suffering a false positive on
> a change to a backing file is that the VM can no longer boot. The
> penalty for your cache suffering a false positive is that the
> VM boots marginally slower. Moreover, it is expected behaviour that
> you CAN change a backing file if there are no r/w images based on
> it. Your cache changes that assumption.

That is right. So when there is a change in the backing file, either the
cache should be invalidated or the changes should be propagated to the
cache image. If changes to the backing image are not frequent, then
invalidation is the simpler approach. In any case, there should be a
mechanism to detect this. I assume it is also undesirable for a VM to
see stale data when booting from a cow image twice.

Kaveh
Kaveh Razavi Aug. 14, 2013, 1:43 p.m. UTC | #13
On 08/14/2013 02:02 PM, Alex Bligh wrote:
>> > Not really. I meant different backing images, and not necessarily
>> > booting on the same host.
> So how does your cache solve the problem you mentioned in that
> para?
> 

If you have a fast network (think 10GbE), then qcow2 can easily boot
many VMs over many hosts from a single backing file without any
additional delay. If you don't, then you should do some sort of caching,
so that subsequent boots over many hosts do not hit the network again.

Regardless of the network, in a multi-user scenario, booting many VMs
with different backing files (possibly on different hosts) easily
creates a bottleneck at the storage device (i.e. disk) that hosts the
backing images due to random reads. Considering this, you would like to
keep cache images either on the VM hosts (in case of a slow network), or
on a tmpfs (in case of a fast network). In both cases, the small size of
the cache images helps.
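
(As a concrete illustration of the tmpfs case, with made-up paths and a
1 GB pool holding several 100 MB cache images:)

$ mount -t tmpfs -o size=1g tmpfs /srv/boot-caches
$ qemu-img create -f qcow2 \
      -o cache_img_quota=104857600,backing_file=/nfs/base.qcow2 \
      /srv/boot-caches/base.cache.qcow2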

We have done a number of benchmarks when scaling to many hosts and I
just told you some of the results. If you want to know more, please
follow the link I sent earlier to a paper that we recently published on
the topic.

>> > This is essentially the same as what Eric mentioned in his email. As
>> > long as only one VM writes to the image and it is not simultaneously
>> > used by other VMs this should not happen.
> Correct. So you need one cache file per VM, not one per image.
> 

No, once the read-only cache is created, it can be used by different VMs
on the same host. But yes, it first needs to be created.

> Correct. So the patch would only speed up the first reboot of
> a VM that has already been booted on that node (as you have one
> cache file per VM). However, when the second VM boots with the
> same image, rather than the pages being hot in the page cache,
> it will be loading them from whatever device the cache is on
> (as per the above, we need one cache per VM), and whatever that
> is, it won't be faster than RAM.
> 
> So I fail to see how this speeds things up. Do you have measured
> numbers?

We did measure boot times with the cache either in the host's memory/disk
or accessed over NFS (rwsize 64KB) and stored in remote
memory/disk. With a default CentOS installation, they all booted
within 1% of each other. The idea behind this patch is not to make
booting faster, but to provide a mechanism to avoid scalability
bottlenecks that booting VMs (over many hosts) can create.

Kaveh
Alex Bligh Aug. 14, 2013, 1:50 p.m. UTC | #14
On 14 Aug 2013, at 14:43, Kaveh Razavi wrote:

> No, once the read-only cache is created, it can be used by different VMs
> on the same host. But yes, it first needs to be created.

OK - this was the point I had missed.

Assuming the cache quota is not exhausted, how do you know that
a VM has finished 'creating' the cache? At any point it might
read a bit more from the backing image.

I'm wondering whether you could just use POSIX mandatory locking for
this, i.e. open it exclusive and r/w until the 'finish point', then
reopen RO, which would allow other VMs to share it. Any other VMs
starting before the cache was populated simply fail to get the
exclusive lock and go direct to the backing file.
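
(A management-script sketch of that idea, with made-up helper names; note
that flock(1) gives advisory rather than mandatory locking, and QEMU itself
would presumably use fcntl() instead:)

(
    if flock -n 9; then
        warm_cache_with_first_vm      # cache opened r/w by exactly one QEMU
    else
        boot_vm_bypassing_cache       # lock busy: read the backing file directly
    fi
) 9>/local/cache.qcow2.lock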
Kaveh Razavi Aug. 14, 2013, 2:20 p.m. UTC | #15
Hi,

On 08/14/2013 11:29 AM, Stefan Hajnoczi wrote:
> 100 MB is small enough for RAM.  Did you try enabling the host kernel
> page cache for the backing file?  That way all guests running on this
> host share a single RAM-cached version of the backing file.
>

Yes, indeed. That is why we think it makes sense to store many of these
cache images in memory, but at the storage node, to avoid hot-spotting
its disk(s). Relying on the page cache at the storage node may not be
enough, since there is no guarantee about what stays there.

The VM host page cache can be evicted at any time, requiring it to go to
the network again to read from the backing file. Since these cache
images are small, it is possible to store many of them at the hosts,
instead of caching many complete backing images that are usually in GB
order.

> The other existing solution is to use the image streaming feature, which
> was designed to speed up deployment of image files over the network.  It
> copies the contents of the image from a remote server onto the host
> while allowing immediate random access from the guest.  This isn't a
> cache, this is a full copy of the image.
> 

Streaming the complete image may work well for some cases, but streaming
at scale to many hosts at the same time can easily create a bottleneck
at the network. In most scenarios, only a fraction of the backing file
is needed during the lifetime of a VM.

> I share an idea of how to turn this into a cache in a second, but first
> how to deploy this safely.  Since multiple QEMU processes can share a
> backing file and the cache must not suffer from corruptions due to
> races, you can use one qemu-nbd per backing image.  The QEMU processes
> connect to the local read-only qemu-nbd server.
> 
> If you want a cache you could enable copy-on-read without the image
> streaming feature (block_stream command) and evict old data using
> discard commands.  No qcow2 image format changes are necessary to do
> this.

This is an interesting alternative. I may be wrong, but I think there
are two limitations with this: 1) it is not persistent and 2) you can
not enforce quota.

(1) is important if you would like to have a pool of these cache images
that survives a reboot. (2) is important, if the caching medium is a
scarce resource such as memory and also if you want to make sure that
only important data blocks get cached (i.e. data blocks needed for booting).

> This is unsafe since other QEMU processes on the host are not
> synchronizing with each other.  The image file will be corrupted.

That is true. My solution earlier is allowing only a single qemu process
to write to the cache at a time. Other qemu processes can only read from
it once it is ready and no longer modified.

Kaveh
Kaveh Razavi Aug. 14, 2013, 2:26 p.m. UTC | #16
On 08/14/2013 03:50 PM, Alex Bligh wrote:
> Assuming the cache quota is not exhausted, how do you know how that
> a VM has finished 'creating' the cache? At any point it might
> read a bit more from the backing image.

I was assuming on shutdown.

> I'm wondering whether you could just use POSIX mandatory locking for
> this, i.e. open it exclusive and r/w until the 'finish point', then
> reopen RO, which would allow other VMs to share it. Any other VMs
> starting before the cache was populated simply fail to get the
> exclusive lock and go direct to the backing file.

This is a good idea, since it relaxes the requirement for releasing the
cache only on shutdown. I am not sure how the 'finish point' can be
recognized. Full cache quota is one obvious scenario, but I imagine most
VMs do/should not really read till that point (unless they are doing
something that should not be cached anyway - e.g. file-system check).
Another possibility is registering some sort of event to be executed
periodically (e.g. every 30 seconds), and if the cache is not modified
in a period, then that is the 'finish point'. I do not know how feasible
that is with the facilities that qemu provides.

Kaveh
Alex Bligh Aug. 14, 2013, 3:02 p.m. UTC | #17
On 14 Aug 2013, at 15:26, Kaveh Razavi wrote:

> This is a good idea, since it relaxes the requirement for releasing the
> cache only on shutdown. I am not sure how the 'finish point' can be
> recognized. Full cache quota is one obvious scenario, but I imagine most
> VMs do/should not really read till that point (unless they are doing
> something that should not be cached anyway - e.g. file-system check).
> Another possibility is registering some sort of event to be executed
> periodically (e.g. every 30 seconds), and if the cache is not modified
> in a period, then that is the 'finish point'. I do not know how feasible
> that is with the facilities that qemu provides.

You can set timers and modify them - that's easy enough. I suspect the
heuristic may need some tuning, but a simple one such as '60 seconds
after boot' might be sufficient. After all, if a VM takes more than
60 seconds to boot and does not fill the cache quota in that time,
I think we can assume the cache isn't going to do it much good!
Kevin Wolf Aug. 14, 2013, 3:32 p.m. UTC | #18
Am 14.08.2013 um 16:26 hat Kaveh Razavi geschrieben:
> On 08/14/2013 03:50 PM, Alex Bligh wrote:
> > Assuming the cache quota is not exhausted, how do you know how that
> > a VM has finished 'creating' the cache? At any point it might
> > read a bit more from the backing image.
> 
> I was assuming on shutdown.

Wait, so you're not really changing the cache while it's used, but you
only create it once and then use it like a regular backing file?  If so,
the only thing we need to talk about is the creation, because there's no
difference for using it.

Creation can use the existing copy-on-read functionality, and the only
thing you need additionally is a way to turn copy-on-read off at the
right point.

Or do I misunderstand what you're doing?

Kevin
Richard W.M. Jones Aug. 14, 2013, 3:58 p.m. UTC | #19
On Wed, Aug 14, 2013 at 01:03:48PM +0100, Alex Bligh wrote:
> 
> On 14 Aug 2013, at 12:52, Fam Zheng wrote:
> 
> > Yes, this one sounds good to have. VMDK and VHDX have this kind of
> > backing file status validation.
> 
> ... though I'd prefer something safer than looking at mtime, for
> instance a sequence number that is incremented prior to any
> bdrv_close if a write has been done since bdrv_open.

Yes, please not mtime.  User-mode Linux COW files use mtime to check
this, and it causes no end of problems (eg. if the backing file is
copied from one place to another without using the magic incantations
to preserve file times).

Rich.
Fam Zheng Aug. 15, 2013, 12:53 a.m. UTC | #20
On Wed, 08/14 13:03, Alex Bligh wrote:
> 
> On 14 Aug 2013, at 12:52, Fam Zheng wrote:
> 
> > Yes, this one sounds good to have. VMDK and VHDX have this kind of
> > backing file status validation.
> 
> ... though I'd prefer something safer than looking at mtime, for
> instance a sequence number that is incremented prior to any
> bdrv_close if a write has been done since bdrv_open.
> 
It should be incremented prior to any write for each r/w open, in case
program crashes after write but before close.

Fam
Alex Bligh Aug. 15, 2013, 5:51 a.m. UTC | #21
On 15 Aug 2013, at 01:53, Fam Zheng wrote:

> On Wed, 08/14 13:03, Alex Bligh wrote:
>> 
>> On 14 Aug 2013, at 12:52, Fam Zheng wrote:
>> 
>>> Yes, this one sounds good to have. VMDK and VHDX have this kind of
>>> backing file status validation.
>> 
>> ... though I'd prefer something safer than looking at mtime, for
>> instance a sequence number that is incremented prior to any
>> bdrv_close if a write has been done since bdrv_open.
>> 
> It should be incremented prior to any write for each r/w open, in case
> program crashes after write but before close.

Yup - well prior to the first write anyway.
Wayne Xia Aug. 15, 2013, 7:50 a.m. UTC | #22
On 2013-08-14 23:32, Kevin Wolf wrote:
> Am 14.08.2013 um 16:26 hat Kaveh Razavi geschrieben:
>> On 08/14/2013 03:50 PM, Alex Bligh wrote:
>>> Assuming the cache quota is not exhausted, how do you know how that
>>> a VM has finished 'creating' the cache? At any point it might
>>> read a bit more from the backing image.
>>
>> I was assuming on shutdown.
>
> Wait, so you're not really changing the cache while it's used, but you
> only create it once and then use it like a regular backing file?  If so,
> the only thing we need to talk about is the creation, because there's no
> difference for using it.
>
> Creation can use the existing copy-on-read functionality, and the only
> thing you need additionally is a way to turn copy-on-read off at the
> right point.
>
> Or do I misunderstand what you're doing?
>
> Kevin
>
   This cache capability seems to have little to do with qcow2; it is rather a
general block-layer function: start/stop copy-on-read for one BDS in a backing
chain. If so, I suggest:
1. refine the existing general copy-on-read code, not in qcow2.c but in the
generic block code, making it able to start/stop copy-on-read for a BDS in the
chain.
2. add a QMP interface for it.

Then the work flow will be:
step 1: prepare image, not related to qcow2 format.
qemu-img create cache.img -b base.img
qemu-img create vm1.img -b cache.img

step 2: boot vm1, vm2:
qemu -hda vm1.img -COR cache.img
qemu -hda vm2.img
Stefan Hajnoczi Aug. 15, 2013, 8:11 a.m. UTC | #23
On Wed, Aug 14, 2013 at 05:32:16PM +0200, Kevin Wolf wrote:
> Am 14.08.2013 um 16:26 hat Kaveh Razavi geschrieben:
> > On 08/14/2013 03:50 PM, Alex Bligh wrote:
> > > Assuming the cache quota is not exhausted, how do you know how that
> > > a VM has finished 'creating' the cache? At any point it might
> > > read a bit more from the backing image.
> > 
> > I was assuming on shutdown.
> 
> Wait, so you're not really changing the cache while it's used, but you
> only create it once and then use it like a regular backing file?  If so,
> the only thing we need to talk about is the creation, because there's no
> difference for using it.
> 
> Creation can use the existing copy-on-read functionality, and the only
> thing you need additionally is a way to turn copy-on-read off at the
> right point.
> 
> Or do I misunderstand what you're doing?

Yes, it seems we're talking about placing an intermediate backing file
on the host:

 /nfs/template.qcow2 <- /local/cache.qcow2 <- /local/vm001.qcow2

On first boot the image runs in "record" mode which populates the cache
via copy-on-read.

Once "record" mode is disabled the cache image can be shared with other
VMs on the host.  They all open it read-only and no longer modify the
cache.

At that point you have a normal qcow2 backing file chain:

 /nfs/template.qcow2 <- /local/cache.qcow2 <- /local/vm001.qcow2
                                           <- /local/vm002.qcow2

With this approach you don't need a qemu-nbd process that arbitrates
writes to the cache.qcow2 image file.  The disadvantage is that it only
caches the "record" mode read requests and does not adapt to changes if
the workload begins reading other data.  Also, it requires more
complicated management tools to handle the "record"/"playback" states of
the cache:

1. Launching vm002 while vm001 is still recording the cache.  (Bypass
   the cache temporarily for vm002.)
2. Switching the cache from "record" to "playback" after vm001 has
   finished its first run.
3. Switching vm002 to use the cache once it transitions to "playback"
   mode.

Stefan
Stefan Hajnoczi Aug. 15, 2013, 8:32 a.m. UTC | #24
On Wed, Aug 14, 2013 at 04:20:27PM +0200, Kaveh Razavi wrote:
> Hi,
> 
> On 08/14/2013 11:29 AM, Stefan Hajnoczi wrote:
> > 100 MB is small enough for RAM.  Did you try enabling the host kernel
> > page cache for the backing file?  That way all guests running on this
> > host share a single RAM-cached version of the backing file.
> >
> 
> Yes, indeed. That is why we think it makes sense to store many of these
> cache images on memory, but at the storage node to avoid hot-spotting
> its disk(s). Relying on the page-cache at the storage node may not be
> enough, since there is no guarantee on what stays there.
> 
> The VM host page cache can be evicted at any time, requiring it to go to
> the network again to read from the backing file. Since these cache
> images are small, it is possible to store many of them at the hosts,
> instead of caching many complete backing images that are usually in GB
> order.

I don't buy the argument about the page cache being evicted at any time:

At the scale where caching is important, provisioning a measly 100 MB
of RAM per guest should not be a challenge.

cgroups can be used to isolate page cache between VMs if you want
guaranteed caches.

But it could be more interesting not to isolate so that the page cache
acts host-wide to reduce the overall I/O instead of narrowly focussing
on caching 100 MB for a specific image even if it is rarely accessed.

The real downside I see is that the page cache is volatile, so you could
see heavy I/O if multiple hosts reboot at the same time.

> > The other existing solution is to use the image streaming feature, which
> > was designed to speed up deployment of image files over the network.  It
> > copies the contents of the image from a remote server onto the host
> > while allowing immediate random access from the guest.  This isn't a
> > cache, this is a full copy of the image.
> > 
> 
> Streaming the complete image may work well for some cases, but streaming
> at scale to many hosts at the same time can easily create a bottleneck
> at the network. In most scenarios, only a fraction of the backing file
> is needed during the lifetime of a VM.

Streaming offers a rate limiting parameter so you can tune it to the
network conditions.

Copying the full image doesn't just reduce load on the NFS server, it
also means guests can continue to run if the NFS server becomes
unreachable.  That's an important property for reliability.

> > I share an idea of how to turn this into a cache in a second, but first
> > how to deploy this safely.  Since multiple QEMU processes can share a
> > backing file and the cache must not suffer from corruptions due to
> > races, you can use one qemu-nbd per backing image.  The QEMU processes
> > connect to the local read-only qemu-nbd server.
> > 
> > If you want a cache you could enable copy-on-read without the image
> > streaming feature (block_stream command) and evict old data using
> > discard commands.  No qcow2 image format changes are necessary to do
> > this.
> 
> This is an interesting alternative. I may be wrong, but I think there
> are two limitations with this: 1) it is not persistent and 2) you can
> not enforce quota.
> 
> (1) is important if you would like to have a pool of these cache images
> that survives a reboot. (2) is important, if the caching medium is a
> scarce resource such as memory and also if you want to make sure that
> only important data blocks get cached (i.e. data blocks needed for booting).

1)
It is persistent.  The backing file chain looks like this:

  /nfs/template.qcow2 <- /local/cache.qcow2 <- /local/vm001.qcow2

The cache is a regular qcow2 image file that is persistent.  The discard
command is used to evict data from the file.  Copy-on-read accesses are
used to populate the cache when the guest submits a read request.

2)
You can set cache size or other parameters as a qemu-nbd option (this
doesn't exist but could be implemented):

  $ qemu-img create -f qcow2 -o backing_file=/nfs/template.qcow2 cache.qcow2
  $ qemu-nbd --options cache-size=100MB,evict=lru cache.qcow2

So it's the qemu-nbd process that performs the cache housekeeping work.
The cache.qcow2 file itself just persists data and isn't aware of cache
settings.

Stefan
Kaveh Razavi Aug. 15, 2013, 12:25 p.m. UTC | #25
On 08/15/2013 10:32 AM, Stefan Hajnoczi wrote:
> I don't buy the argument about the page cache being evicted at any time:
>
> At the scale where caching is important, provisioning a measily 100 MB
> of RAM per guest should not be a challenge.
>
> cgroups can be used to isolate page cache between VMs if you want to
> guaranteed caches.
>
> But it could be more interesting not to isolate so that the page cache
> acts host-wide to reduce the overall I/O instead of narrowly focussing
> on caching 100 MB for a specific image even if it is rarely accessed.
>
> The real downside I see is that the page cache is volatile, so you could
> see heavy I/O if multiple hosts reboot at the same time.
>

At the VM hosts, the memory is mostly allocated to VMs. Without 
persisted caches, starting another VM from any of the possible backing 
VM images may or may not result in network traffic (depending on the 
page cache). Regardless of the page cache, the existing cache images
persisted on the hosts' disks can eliminate this traffic, at least during VM boot.

At the storage site however, I think it makes sense to dedicate memory 
for popular backing images (via tmpfs rather than page cache). The data 
blocks of the popular images used for booting will be accessed by all 
VMs starting from these "template" images.

> Streaming offers a rate limiting parameter so you can tune it to the
> network conditions.
>
> Copying the full image doesn't just reduce load on the NFS server, it
> also means guests can continue to run if the NFS server becomes
> unreachable.  That's an important property for reliability.

I am not really sure whether copying the entire image reduces the load
on the NFS server, especially at scale. If copying the entire image at
scale is desired/necessary, peer-to-peer approaches are documented to 
perform better. They are mostly implemented at the host file-system 
layer though (search for e.g. VMTorrent). I agree on the reliability 
consideration if you deal with an unreliable (remote) file-system.

> 1)
> It is persistent.  The backing file chain looks like this:
>
>    /nfs/template.qcow2 <- /local/cache.qcow2 <- /local/vm001.qcow2
>
> The cache is a regular qcow2 image file that is persistent.  The discard
> command is used to evict data from the file.  Copy-on-read accesses are
> used to populate the cache when the guest submits a read request.
>
> 2)
> You can set cache size or other parameters as a qemu-nbd option (this
> doesn't exist but could be implemented):
>
>    $ qemu-img create -f qcow2 -o backing_file=/nfs/template.qcow2 cache.qcow2
>    $ qemu-nbd --options cache-size=100MB,evict=lru cache.qcow2
>
> So it's the qemu-nbd process that performs the cache housekeeping work.
> The cache.qcow2 file itself just persists data and isn't aware of cache
> settings.

OK, this is better, since the user can also define a policy _and_ the 
cache can be shared by different VMs at the creation time without races. 
With an eviction policy 'none' in combination with cache_size, only the 
first accessed data blocks get cached, essentially providing the same 
functionality as this patch.

Kaveh

Patch

diff --git a/block.c b/block.c
index 01b66d8..52a92b4 100644
--- a/block.c
+++ b/block.c
@@ -920,18 +920,40 @@  int bdrv_open_backing_file(BlockDriverState *bs, QDict *options)
         back_drv = bdrv_find_format(bs->backing_format);
     }
 
-    /* backing files always opened read-only */
-    back_flags = bs->open_flags & ~(BDRV_O_RDWR | BDRV_O_SNAPSHOT);
-
+    /* backing files always opened read-only except for cache images,
+     * we first open the file with RDWR and check whether it is a cache 
+     * image. If so, we leave the RDWR, if not, we re-open read-only.
+     */
+    back_flags = (bs->open_flags & ~(BDRV_O_SNAPSHOT)) | BDRV_O_RDWR;
+    
     ret = bdrv_open(bs->backing_hd,
                     *backing_filename ? backing_filename : NULL, options,
                     back_flags, back_drv);
     if (ret < 0) {
+        goto out;
+    }
+    /* was not a cache image? */
+    if(bs->backing_hd->is_cache_img == false)
+    {
+        /* re-open read-only */
+        back_flags = bs->open_flags & ~(BDRV_O_SNAPSHOT | BDRV_O_RDWR);
+        bdrv_delete(bs->backing_hd);
+        ret = bdrv_open(bs->backing_hd,
+                *backing_filename ? backing_filename : NULL, options,
+                back_flags, back_drv);
+        if (ret < 0) {
+            goto out;
+        }
+    }
+out:
+    if (ret < 0)
+    {
         bdrv_delete(bs->backing_hd);
         bs->backing_hd = NULL;
         bs->open_flags |= BDRV_O_NO_BACKING;
         return ret;
     }
+
     return 0;
 }
 
diff --git a/block/qcow2.c b/block/qcow2.c
index 3376901..3b0706a 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -57,6 +57,7 @@  typedef struct {
 #define  QCOW2_EXT_MAGIC_END 0
 #define  QCOW2_EXT_MAGIC_BACKING_FORMAT 0xE2792ACA
 #define  QCOW2_EXT_MAGIC_FEATURE_TABLE 0x6803f857
+#define  QCOW2_EXT_MAGIC_CACHE_IMG 0x31393834
 
 static int qcow2_probe(const uint8_t *buf, int buf_size, const char *filename)
 {
@@ -148,6 +149,27 @@  static int qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset,
                 *p_feature_table = feature_table;
             }
             break;
+        
+        case QCOW2_EXT_MAGIC_CACHE_IMG:
+            bs->is_cache_img = true;
+            if(ext.len != 2 * sizeof(uint64_t)) {
+                fprintf(stderr, "ERROR: cache_img_extension is not %zd"
+                        "bytes (%"PRIu32")", 2 * sizeof(uint64_t), ext.len);
+                return 4;
+            }
+            if ((ret = bdrv_pread(bs->file, offset, &(s->cache_img_inuse),
+                sizeof(uint64_t))) != sizeof(uint64_t)) {
+                return ret;
+            }
+            be64_to_cpus(&(s->cache_img_inuse));
+            s->cache_img_cur_inuse = s->cache_img_inuse;
+            if ((ret = bdrv_pread(bs->file, offset + sizeof(uint64_t), 
+                            &(s->cache_img_quota), sizeof(uint64_t))) != 
+                    sizeof(uint64_t)) {
+                return ret;
+            }
+            be64_to_cpus(&(s->cache_img_quota));
+            break;
 
         default:
             /* unknown magic - save it in case we need to rewrite the header */
@@ -694,7 +716,6 @@  static coroutine_fn int qcow2_co_readv(BlockDriverState *bs, int64_t sector_num,
     qemu_co_mutex_lock(&s->lock);
 
     while (remaining_sectors != 0) {
-
         /* prepare next request */
         cur_nr_sectors = remaining_sectors;
         if (s->crypt_method) {
@@ -730,6 +751,31 @@  static coroutine_fn int qcow2_co_readv(BlockDriverState *bs, int64_t sector_num,
                     if (ret < 0) {
                         goto fail;
                     }
+                    /* do copy-on-read if this is a cache image */
+                    if (bs->is_cache_img && !s->is_cache_full && 
+                            !s->is_writing_on_cache)
+                    {
+                        qemu_co_mutex_unlock(&s->lock);
+                        s->is_writing_on_cache = true;
+                        ret = bdrv_co_writev(bs,
+                                             sector_num,
+                                             n1,
+                                             &hd_qiov);
+                        s->is_writing_on_cache = false;
+                        qemu_co_mutex_lock(&s->lock);
+                        if (ret < 0) {
+                            if (ret == (-ENOSPC))
+                            {
+                                s->is_cache_full = true;
+                            }
+                            else {
+                                /* error is other than cache space */
+                                fprintf(stderr, "Cache write error (%d)\n", 
+                                        ret);
+                                goto fail;
+                            }
+                        }
+                    }
                 }
             } else {
                 /* Note: in this case, no need to wait */
@@ -840,7 +886,6 @@  static coroutine_fn int qcow2_co_writev(BlockDriverState *bs,
     qemu_co_mutex_lock(&s->lock);
 
     while (remaining_sectors != 0) {
-
         l2meta = NULL;
 
         trace_qcow2_writev_start_part(qemu_coroutine_self());
@@ -859,6 +904,20 @@  static coroutine_fn int qcow2_co_writev(BlockDriverState *bs,
 
         assert((cluster_offset & 511) == 0);
 
+        if(bs->is_cache_img)
+        {
+            if(s->cache_img_cur_inuse + (cur_nr_sectors * 512) > 
+               s->cache_img_quota)
+            {
+                ret = -ENOSPC;
+                goto fail;
+            }
+            else
+            {
+                s->cache_img_cur_inuse += (cur_nr_sectors * 512);
+            }
+        }
+
         qemu_iovec_reset(&hd_qiov);
         qemu_iovec_concat(&hd_qiov, qiov, bytes_done,
             cur_nr_sectors * 512);
@@ -946,6 +1005,13 @@  fail:
 static void qcow2_close(BlockDriverState *bs)
 {
     BDRVQcowState *s = bs->opaque;
+
+    if (bs->is_cache_img && (s->cache_img_cur_inuse != s->cache_img_inuse))
+    {
+        s->cache_img_inuse = s->cache_img_cur_inuse;
+        qcow2_update_header(bs);
+    }
+
     g_free(s->l1_table);
 
     qcow2_cache_flush(bs, s->l2_table_cache);
@@ -1041,6 +1107,7 @@  int qcow2_update_header(BlockDriverState *bs)
     uint32_t refcount_table_clusters;
     size_t header_length;
     Qcow2UnknownHeaderExtension *uext;
+    char cache_img_ext[2 * sizeof(uint64_t)];
 
     buf = qemu_blockalign(bs, buflen);
 
@@ -1122,6 +1189,21 @@  int qcow2_update_header(BlockDriverState *bs)
         buflen -= ret;
     }
 
+    if (s->cache_img_quota)
+    {
+        cpu_to_be64s(&s->cache_img_inuse);
+        cpu_to_be64s(&s->cache_img_quota);
+        mempcpy(mempcpy(cache_img_ext, &s->cache_img_inuse, sizeof(uint64_t)),
+                &s->cache_img_quota, sizeof(uint64_t));
+        ret = header_ext_add(buf, QCOW2_EXT_MAGIC_CACHE_IMG, &cache_img_ext, 
+                sizeof(cache_img_ext), buflen);
+        be64_to_cpus(&s->cache_img_inuse);
+        be64_to_cpus(&s->cache_img_quota);
+
+        buf += ret;
+        buflen -= ret;
+    }
+
     /* Feature table */
     Qcow2Feature features[] = {
         {
@@ -1201,6 +1283,16 @@  static int qcow2_change_backing_file(BlockDriverState *bs,
     return qcow2_update_header(bs);
 }
 
+static int qcow2_update_cache_img_fields(BlockDriverState *bs,
+        uint64_t cache_img_inuse, uint64_t cache_img_quota)
+{
+    BDRVQcowState *s = bs->opaque;
+    s->cache_img_inuse = cache_img_inuse;
+    s->cache_img_quota = cache_img_quota;
+
+    return qcow2_update_header(bs);
+}
+
 static int preallocate(BlockDriverState *bs)
 {
     uint64_t nb_sectors;
@@ -1260,7 +1352,8 @@  static int preallocate(BlockDriverState *bs)
 static int qcow2_create2(const char *filename, int64_t total_size,
                          const char *backing_file, const char *backing_format,
                          int flags, size_t cluster_size, int prealloc,
-                         QEMUOptionParameter *options, int version)
+                         QEMUOptionParameter *options, int version, 
+                         uint64_t cache_img_quota)
 {
     /* Calculate cluster_bits */
     int cluster_bits;
@@ -1377,6 +1470,15 @@  static int qcow2_create2(const char *filename, int64_t total_size,
         }
     }
 
+    /* Is this a cache image? */
+    if (cache_img_quota) {
+        ret = qcow2_update_cache_img_fields(bs, 0, cache_img_quota);
+
+        if (ret < 0) {
+            goto out;
+        }
+    }
+
     /* And if we're supposed to preallocate metadata, do that now */
     if (prealloc) {
         BDRVQcowState *s = bs->opaque;
@@ -1403,6 +1505,7 @@  static int qcow2_create(const char *filename, QEMUOptionParameter *options)
     size_t cluster_size = DEFAULT_CLUSTER_SIZE;
     int prealloc = 0;
     int version = 2;
+    uint64_t cache_img_quota = 0;
 
     /* Read out options */
     while (options && options->name) {
@@ -1440,6 +1543,10 @@  static int qcow2_create(const char *filename, QEMUOptionParameter *options)
             }
         } else if (!strcmp(options->name, BLOCK_OPT_LAZY_REFCOUNTS)) {
             flags |= options->value.n ? BLOCK_FLAG_LAZY_REFCOUNTS : 0;
+        } else if (!strcmp(options->name, BLOCK_OPT_CACHE_IMG_QUOTA)) {
+            if (options->value.n) {
+                cache_img_quota = (uint64_t)(options->value.n);
+            }
         }
         options++;
     }
@@ -1457,7 +1564,8 @@  static int qcow2_create(const char *filename, QEMUOptionParameter *options)
     }
 
     return qcow2_create2(filename, sectors, backing_file, backing_fmt, flags,
-                         cluster_size, prealloc, options, version);
+                         cluster_size, prealloc, options, version, 
+                         cache_img_quota);
 }
 
 static int qcow2_make_empty(BlockDriverState *bs)
@@ -1774,6 +1882,11 @@  static QEMUOptionParameter qcow2_create_options[] = {
         .type = OPT_FLAG,
         .help = "Postpone refcount updates",
     },
+    {
+        .name = BLOCK_OPT_CACHE_IMG_QUOTA,
+        .type = OPT_SIZE,
+        .help = "Quota of the cache image"
+    },
     { NULL }
 };
 
diff --git a/block/qcow2.h b/block/qcow2.h
index dba9771..36922da 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -203,6 +203,12 @@  typedef struct BDRVQcowState {
     uint64_t compatible_features;
     uint64_t autoclear_features;
 
+    uint64_t cache_img_cur_inuse; /* current data size in the cache */
+    uint64_t cache_img_inuse; /* data size in the cache on open */
+    uint64_t cache_img_quota; /* max size allowed for cache image */
+    bool is_cache_full; /* whether cache is full */
+    bool is_writing_on_cache; /* currently writing to the cache */
+
     size_t unknown_header_fields_size;
     void* unknown_header_fields;
     QLIST_HEAD(, Qcow2UnknownHeaderExtension) unknown_header_ext;
diff --git a/include/block/block_int.h b/include/block/block_int.h
index e45f2a0..0e4f21f 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -58,6 +58,7 @@ 
 #define BLOCK_OPT_COMPAT_LEVEL      "compat"
 #define BLOCK_OPT_LAZY_REFCOUNTS    "lazy_refcounts"
 #define BLOCK_OPT_ADAPTER_TYPE      "adapter_type"
+#define BLOCK_OPT_CACHE_IMG_QUOTA   "cache_img_quota"
 
 typedef struct BdrvTrackedRequest {
     BlockDriverState *bs;
@@ -255,6 +256,8 @@  struct BlockDriverState {
     BlockDriverState *backing_hd;
     BlockDriverState *file;
 
+    bool is_cache_img; /* if set, the image is a cache */
+
     NotifierList close_notifiers;
 
     /* Callback before write request is processed */