Patchwork [2/2] qcow2: Implement bdrv_truncate() for growing images

login
register
mail settings
Submitter Stefan Hajnoczi
Date April 24, 2010, 8:12 a.m.
Message ID <1272096733-6070-2-git-send-email-stefanha@linux.vnet.ibm.com>
Download mbox | patch
Permalink /patch/50882/
State New
Headers show

Comments

Stefan Hajnoczi - April 24, 2010, 8:12 a.m.
This patch adds the ability to grow qcow2 images in-place using
bdrv_truncate().  This enables qemu-img resize command support for
qcow2.

Snapshots are not supported and bdrv_truncate() will return -ENOTSUP.
The notion of resizing an image with snapshots could lead to confusion:
users may expect snapshots to remain unchanged, but this is not possible
with the current qcow2 on-disk format where the header.size field is
global instead of per-snapshot.  Others may expect snapshots to change
size along with the current image data.  I think it is safest to not
support snapshots and perhaps add behavior later if there is a
consensus.

Backing images continue to work.  If the image is now larger than its
backing image, zeroes are read when accessing beyond the end of the
backing image.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
This applies to kevin/block.

 block/qcow2-cluster.c  |   64 ++++++++++++++++++++++++++++++++++++++---------
 block/qcow2-snapshot.c |    2 +-
 block/qcow2.c          |   43 +++++++++++++++++++++++++++++---
 block/qcow2.h          |    9 ++++++-
 4 files changed, 99 insertions(+), 19 deletions(-)
Kevin Wolf - April 26, 2010, 11:46 a.m.
Am 24.04.2010 10:12, schrieb Stefan Hajnoczi:
> This patch adds the ability to grow qcow2 images in-place using
> bdrv_truncate().  This enables qemu-img resize command support for
> qcow2.
> 
> Snapshots are not supported and bdrv_truncate() will return -ENOTSUP.
> The notion of resizing an image with snapshots could lead to confusion:
> users may expect snapshots to remain unchanged, but this is not possible
> with the current qcow2 on-disk format where the header.size field is
> global instead of per-snapshot.  Others may expect snapshots to change
> size along with the current image data.  I think it is safest to not
> support snapshots and perhaps add behavior later if there is a
> consensus.

I think it would make most sense if we kept the disk size per snapshot
(so your first option). The snapshot header is extensible, so it should
be possible.

Just disabling it is okay with me, too, but in that case this patch is
much more complicated than it needs to be: If there are no snapshots,
there is no vmstate area.

> Backing images continue to work.  If the image is now larger than its
> backing image, zeroes are read when accessing beyond the end of the
> backing image.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
> This applies to kevin/block.
> 
>  block/qcow2-cluster.c  |   64 ++++++++++++++++++++++++++++++++++++++---------
>  block/qcow2-snapshot.c |    2 +-
>  block/qcow2.c          |   43 +++++++++++++++++++++++++++++---
>  block/qcow2.h          |    9 ++++++-
>  4 files changed, 99 insertions(+), 19 deletions(-)
> 
> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index c11680d..20c8426 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -28,30 +28,39 @@
>  #include "block_int.h"
>  #include "block/qcow2.h"
>  
> -int qcow2_grow_l1_table(BlockDriverState *bs, int min_size)
> +/*
> + * qcow2_grow_l1_table_common
> + *
> + * Grows the L1 table and updates the header on disk.
> + *
> + * Setting new_l1_vm_state_index to s->l1_vm_state_index grows the vm state
> + * area.
> + *
> + * Setting new_l1_vm_state_index to the new end of image grows the image data
> + * area.

I don't understand this distinction. I'm sure you describe something
trivial here, but it sounds like magic.

There's really not much difference between data and vmstate clusters and
callers don't know on which of both they are working.

> + *
> + * Returns 0 on success, -errno in failure case.
> + */
> +static int qcow2_grow_l1_table_common(BlockDriverState *bs,
> +                                      int new_l1_vm_state_index,
> +                                      int new_l1_size)
>  {
>      BDRVQcowState *s = bs->opaque;
> -    int new_l1_size, new_l1_size2, ret, i;
> +    int new_l1_size2, ret, i;
>      uint64_t *new_l1_table;
>      int64_t new_l1_table_offset;
>      uint8_t data[12];
>  
> -    new_l1_size = s->l1_size;
> -    if (min_size <= new_l1_size)
> -        return 0;
> -    if (new_l1_size == 0) {
> -        new_l1_size = 1;
> -    }
> -    while (min_size > new_l1_size) {
> -        new_l1_size = (new_l1_size * 3 + 1) / 2;
> -    }
>  #ifdef DEBUG_ALLOC2
>      printf("grow l1_table from %d to %d\n", s->l1_size, new_l1_size);
>  #endif
>  
>      new_l1_size2 = sizeof(uint64_t) * new_l1_size;
>      new_l1_table = qemu_mallocz(align_offset(new_l1_size2, 512));
> -    memcpy(new_l1_table, s->l1_table, s->l1_size * sizeof(uint64_t));
> +    memcpy(new_l1_table, s->l1_table, s->l1_vm_state_index * sizeof(uint64_t));

This change smells like a buffer overflow. It both can read beyond the
end of the old L1 table and can overflow the new one if the L1 table
hasn't yet reached its full size.

I think this is the main problem of this patch: You seem to assume that
the L1 table is created with the full size needed to cover all of the
data area. Which seems to be true at least for images created by
qemu-img, but I don't think we can rely on it (nor do I think we should,
because it introduces a special case where there is none).

I rather consider this preallocation a performance optimization. If we
stop treating it as such, we would need to refuse opening any images
that don't fulfill this requirement.

> +    memcpy(&new_l1_table[new_l1_vm_state_index],
> +           &s->l1_table[s->l1_vm_state_index],
> +           (s->l1_size - s->l1_vm_state_index) * sizeof(uint64_t));
>  
>      /* write new table (align to cluster) */
>      BLKDBG_EVENT(bs->file, BLKDBG_L1_GROW_ALLOC_TABLE);
> @@ -83,6 +92,7 @@ int qcow2_grow_l1_table(BlockDriverState *bs, int min_size)
>      s->l1_table_offset = new_l1_table_offset;
>      s->l1_table = new_l1_table;
>      s->l1_size = new_l1_size;
> +    s->l1_vm_state_index = new_l1_vm_state_index;
>      return 0;
>   fail:
>      qemu_free(new_l1_table);
> @@ -90,6 +100,34 @@ int qcow2_grow_l1_table(BlockDriverState *bs, int min_size)
>      return ret < 0 ? ret : -EIO;
>  }
>  
> +int qcow2_grow_l1_vm_state(BlockDriverState *bs, int min_size)
> +{
> +    BDRVQcowState *s = bs->opaque;
> +    int new_l1_size;
> +
> +    new_l1_size = s->l1_size;
> +    if (min_size <= new_l1_size)
> +        return 0;
> +    if (new_l1_size == 0) {
> +        new_l1_size = 1;
> +    }
> +    while (min_size > new_l1_size) {
> +        new_l1_size = (new_l1_size * 3 + 1) / 2;
> +    }
> +
> +    return qcow2_grow_l1_table_common(bs, s->l1_vm_state_index, new_l1_size);
> +}
> +
> +int qcow2_grow_l1_image_data(BlockDriverState *bs, int new_l1_size)
> +{
> +    BDRVQcowState *s = bs->opaque;
> +    int new_l1_vm_state_index;
> +
> +    new_l1_vm_state_index = new_l1_size;
> +    new_l1_size += s->l1_size - s->l1_vm_state_index;

Same assumption as above. This might turn into a negative number added
to new_l1_size.

> +    return qcow2_grow_l1_table_common(bs, new_l1_vm_state_index, new_l1_size);
> +}
> +
>  void qcow2_l2_cache_reset(BlockDriverState *bs)
>  {
>      BDRVQcowState *s = bs->opaque;
> @@ -524,7 +562,7 @@ static int get_cluster_table(BlockDriverState *bs, uint64_t offset,
>  
>      l1_index = offset >> (s->l2_bits + s->cluster_bits);
>      if (l1_index >= s->l1_size) {qcow2_grow_l1_vm_state(
> -        ret = qcow2_grow_l1_table(bs, l1_index + 1);
> +        ret = qcow2_grow_l1_vm_state(bs, l1_index + 1);

Once you agree that get_cluster_table can't know if it's working on data
or vmstate, qcow2_grow_l1_vm_state is a misnomer at least.

>          if (ret < 0) {
>              return ret;
>          }
> diff --git a/block/qcow2-snapshot.c b/block/qcow2-snapshot.c
> index 2a21c17..7f0d810 100644
> --- a/block/qcow2-snapshot.c
> +++ b/block/qcow2-snapshot.c
> @@ -326,7 +326,7 @@ int qcow2_snapshot_goto(BlockDriverState *bs, const char *snapshot_id)
>      if (qcow2_update_snapshot_refcount(bs, s->l1_table_offset, s->l1_size, -1) < 0)
>          goto fail;
>  
> -    if (qcow2_grow_l1_table(bs, sn->l1_size) < 0)
> +    if (qcow2_grow_l1_vm_state(bs, sn->l1_size) < 0)
>          goto fail;
>  
>      s->l1_size = sn->l1_size;
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 4a7ab66..ab622a2 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -140,7 +140,7 @@ static int qcow_read_extensions(BlockDriverState *bs, uint64_t start_offset,
>  static int qcow_open(BlockDriverState *bs, int flags)
>  {
>      BDRVQcowState *s = bs->opaque;
> -    int len, i, shift;
> +    int len, i;
>      QCowHeader header;
>      uint64_t ext_end;
>  
> @@ -188,8 +188,7 @@ static int qcow_open(BlockDriverState *bs, int flags)
>  
>      /* read the level 1 table */
>      s->l1_size = header.l1_size;
> -    shift = s->cluster_bits + s->l2_bits;
> -    s->l1_vm_state_index = (header.size + (1LL << shift) - 1) >> shift;
> +    s->l1_vm_state_index = size_to_l1(s, header.size);
>      /* the L1 table must contain at least enough entries to put
>         header.size bytes */
>      if (s->l1_size < s->l1_vm_state_index)
> @@ -1095,6 +1094,40 @@ static int qcow_make_empty(BlockDriverState *bs)
>      return 0;
>  }
>  
> +static int qcow2_truncate(BlockDriverState *bs, int64_t offset)
> +{
> +    BDRVQcowState *s = bs->opaque;
> +    int ret, new_l1_size;
> +
> +    if (offset & 511) {
> +        return -EINVAL;
> +    }
> +
> +    /* cannot proceed if image has snapshots */
> +    if (s->nb_snapshots) {
> +        return -ENOTSUP;
> +    }
> +
> +    /* shrinking is currently not supported */
> +    if (offset < bs->total_sectors << BDRV_SECTOR_BITS) {
> +        return -ENOTSUP;
> +    }
> +
> +    new_l1_size = size_to_l1(s, offset);
> +    ret = qcow2_grow_l1_image_data(bs, new_l1_size);
> +    if (ret < 0) {
> +        return ret;
> +    }

Instead of the confusing changes above you could just increase the L1
table size using the old function and keep the data/vmstate thing local
to qcow2_truncate (or maybe better a qcow2_move_vmstate(bs, offset)
which internally grows the L1 table as needed)

Actually, I think this is not that different from your patch (you called
the almost same function qcow2_grow_l1_image_data and avoided the normal
calculation of the next L1 table size for some reason), but probably a
lot less confusing.

> +
> +    /* write updated header.size */
> +    offset = cpu_to_be64(offset);
> +    if (bdrv_pwrite(bs->file, offsetof(QCowHeader, size), &offset,
> +                    sizeof(uint64_t)) != sizeof(uint64_t)) {
> +        return -EIO;
> +    }

The vmstate_offset field needs to be updated somewhere, too. In my
suggestion this would be in qcow2_move_vmstate.

Kevin
Stefan Hajnoczi - April 26, 2010, 2:01 p.m.
On Mon, Apr 26, 2010 at 12:46 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 24.04.2010 10:12, schrieb Stefan Hajnoczi:
>> This patch adds the ability to grow qcow2 images in-place using
>> bdrv_truncate().  This enables qemu-img resize command support for
>> qcow2.
>>
>> Snapshots are not supported and bdrv_truncate() will return -ENOTSUP.
>> The notion of resizing an image with snapshots could lead to confusion:
>> users may expect snapshots to remain unchanged, but this is not possible
>> with the current qcow2 on-disk format where the header.size field is
>> global instead of per-snapshot.  Others may expect snapshots to change
>> size along with the current image data.  I think it is safest to not
>> support snapshots and perhaps add behavior later if there is a
>> consensus.
>
> I think it would make most sense if we kept the disk size per snapshot
> (so your first option). The snapshot header is extensible, so it should
> be possible.
>
> Just disabling it is okay with me, too, but in that case this patch is
> much more complicated than it needs to be: If there are no snapshots,
> there is no vmstate area.

Good point, things are simplified greatly if there is no vm state
data.  I haven't read the save/load vm yet and failed to make the
connection here ;).

I will eliminate most of the code as you explained and send a v2
patch.  I have responded below because I hope to learn more by
discussing the comments with you, but most of the issues will not
apply to a slimmed down v2 so feel free to ignore if you are busy.

>> Backing images continue to work.  If the image is now larger than its
>> backing image, zeroes are read when accessing beyond the end of the
>> backing image.
>>
>> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
>> ---
>> This applies to kevin/block.
>>
>>  block/qcow2-cluster.c  |   64 ++++++++++++++++++++++++++++++++++++++---------
>>  block/qcow2-snapshot.c |    2 +-
>>  block/qcow2.c          |   43 +++++++++++++++++++++++++++++---
>>  block/qcow2.h          |    9 ++++++-
>>  4 files changed, 99 insertions(+), 19 deletions(-)
>>
>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>> index c11680d..20c8426 100644
>> --- a/block/qcow2-cluster.c
>> +++ b/block/qcow2-cluster.c
>> @@ -28,30 +28,39 @@
>>  #include "block_int.h"
>>  #include "block/qcow2.h"
>>
>> -int qcow2_grow_l1_table(BlockDriverState *bs, int min_size)
>> +/*
>> + * qcow2_grow_l1_table_common
>> + *
>> + * Grows the L1 table and updates the header on disk.
>> + *
>> + * Setting new_l1_vm_state_index to s->l1_vm_state_index grows the vm state
>> + * area.
>> + *
>> + * Setting new_l1_vm_state_index to the new end of image grows the image data
>> + * area.
>
> I don't understand this distinction. I'm sure you describe something
> trivial here, but it sounds like magic.
>
> There's really not much difference between data and vmstate clusters and
> callers don't know on which of both they are working.

The new_l1_vm_state_index argument is used to position the vm state
data in the new L1 table.  If we're growing the L1 table to make room
for more vm state data, then new_l1_vm_state_index should keep its old
value (s->l1_vm_state_index).  If we're growing the image data and
need to move the vm state out of the way, then new_l1_vm_state_index
should be recalculated for the new image size.

The function allocates the new L1 table and performs two separate
copies: one of the image data L1 entries and one for the vm state L1
entries.  The new_l1_vm_state_index argument controls where the vm
state data entries are copied; it provides the ability the move the vm
state data to make room for new image data entries.

>> + *
>> + * Returns 0 on success, -errno in failure case.
>> + */
>> +static int qcow2_grow_l1_table_common(BlockDriverState *bs,
>> +                                      int new_l1_vm_state_index,
>> +                                      int new_l1_size)
>>  {
>>      BDRVQcowState *s = bs->opaque;
>> -    int new_l1_size, new_l1_size2, ret, i;
>> +    int new_l1_size2, ret, i;
>>      uint64_t *new_l1_table;
>>      int64_t new_l1_table_offset;
>>      uint8_t data[12];
>>
>> -    new_l1_size = s->l1_size;
>> -    if (min_size <= new_l1_size)
>> -        return 0;
>> -    if (new_l1_size == 0) {
>> -        new_l1_size = 1;
>> -    }
>> -    while (min_size > new_l1_size) {
>> -        new_l1_size = (new_l1_size * 3 + 1) / 2;
>> -    }
>>  #ifdef DEBUG_ALLOC2
>>      printf("grow l1_table from %d to %d\n", s->l1_size, new_l1_size);
>>  #endif
>>
>>      new_l1_size2 = sizeof(uint64_t) * new_l1_size;
>>      new_l1_table = qemu_mallocz(align_offset(new_l1_size2, 512));
>> -    memcpy(new_l1_table, s->l1_table, s->l1_size * sizeof(uint64_t));
>> +    memcpy(new_l1_table, s->l1_table, s->l1_vm_state_index * sizeof(uint64_t));
>
> This change smells like a buffer overflow. It both can read beyond the
> end of the old L1 table and can overflow the new one if the L1 table
> hasn't yet reached its full size.
>
> I think this is the main problem of this patch: You seem to assume that
> the L1 table is created with the full size needed to cover all of the
> data area. Which seems to be true at least for images created by
> qemu-img, but I don't think we can rely on it (nor do I think we should,
> because it introduces a special case where there is none).
>
> I rather consider this preallocation a performance optimization. If we
> stop treating it as such, we would need to refuse opening any images
> that don't fulfill this requirement.

From qcow_open():

/* read the level 1 table */
s->l1_size = header.l1_size;
shift = s->cluster_bits + s->l2_bits;
s->l1_vm_state_index = (header.size + (1LL << shift) - 1) >> shift;
/* the L1 table must contain at least enough entries to put
   header.size bytes */
if (s->l1_size < s->l1_vm_state_index)
    goto fail;

Images where L1 size is not at least l1_vm_state_index cannot be
opened by QEMU.  Right now the special case exists and perhaps some
qcow2 code already relies on this assumption.

>> +    memcpy(&new_l1_table[new_l1_vm_state_index],
>> +           &s->l1_table[s->l1_vm_state_index],
>> +           (s->l1_size - s->l1_vm_state_index) * sizeof(uint64_t));
>>
>>      /* write new table (align to cluster) */
>>      BLKDBG_EVENT(bs->file, BLKDBG_L1_GROW_ALLOC_TABLE);
>> @@ -83,6 +92,7 @@ int qcow2_grow_l1_table(BlockDriverState *bs, int min_size)
>>      s->l1_table_offset = new_l1_table_offset;
>>      s->l1_table = new_l1_table;
>>      s->l1_size = new_l1_size;
>> +    s->l1_vm_state_index = new_l1_vm_state_index;
>>      return 0;
>>   fail:
>>      qemu_free(new_l1_table);
>> @@ -90,6 +100,34 @@ int qcow2_grow_l1_table(BlockDriverState *bs, int min_size)
>>      return ret < 0 ? ret : -EIO;
>>  }
>>
>> +int qcow2_grow_l1_vm_state(BlockDriverState *bs, int min_size)
>> +{
>> +    BDRVQcowState *s = bs->opaque;
>> +    int new_l1_size;
>> +
>> +    new_l1_size = s->l1_size;
>> +    if (min_size <= new_l1_size)
>> +        return 0;
>> +    if (new_l1_size == 0) {
>> +        new_l1_size = 1;
>> +    }
>> +    while (min_size > new_l1_size) {
>> +        new_l1_size = (new_l1_size * 3 + 1) / 2;
>> +    }
>> +
>> +    return qcow2_grow_l1_table_common(bs, s->l1_vm_state_index, new_l1_size);
>> +}
>> +
>> +int qcow2_grow_l1_image_data(BlockDriverState *bs, int new_l1_size)
>> +{
>> +    BDRVQcowState *s = bs->opaque;
>> +    int new_l1_vm_state_index;
>> +
>> +    new_l1_vm_state_index = new_l1_size;
>> +    new_l1_size += s->l1_size - s->l1_vm_state_index;
>
> Same assumption as above. This might turn into a negative number added
> to new_l1_size.
>
>> +    return qcow2_grow_l1_table_common(bs, new_l1_vm_state_index, new_l1_size);
>> +}
>> +
>>  void qcow2_l2_cache_reset(BlockDriverState *bs)
>>  {
>>      BDRVQcowState *s = bs->opaque;
>> @@ -524,7 +562,7 @@ static int get_cluster_table(BlockDriverState *bs, uint64_t offset,
>>
>>      l1_index = offset >> (s->l2_bits + s->cluster_bits);
>>      if (l1_index >= s->l1_size) {qcow2_grow_l1_vm_state(
>> -        ret = qcow2_grow_l1_table(bs, l1_index + 1);
>> +        ret = qcow2_grow_l1_vm_state(bs, l1_index + 1);
>
> Once you agree that get_cluster_table can't know if it's working on data
> or vmstate, qcow2_grow_l1_vm_state is a misnomer at least.
>
>>          if (ret < 0) {
>>              return ret;
>>          }
>> diff --git a/block/qcow2-snapshot.c b/block/qcow2-snapshot.c
>> index 2a21c17..7f0d810 100644
>> --- a/block/qcow2-snapshot.c
>> +++ b/block/qcow2-snapshot.c
>> @@ -326,7 +326,7 @@ int qcow2_snapshot_goto(BlockDriverState *bs, const char *snapshot_id)
>>      if (qcow2_update_snapshot_refcount(bs, s->l1_table_offset, s->l1_size, -1) < 0)
>>          goto fail;
>>
>> -    if (qcow2_grow_l1_table(bs, sn->l1_size) < 0)
>> +    if (qcow2_grow_l1_vm_state(bs, sn->l1_size) < 0)
>>          goto fail;
>>
>>      s->l1_size = sn->l1_size;
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index 4a7ab66..ab622a2 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -140,7 +140,7 @@ static int qcow_read_extensions(BlockDriverState *bs, uint64_t start_offset,
>>  static int qcow_open(BlockDriverState *bs, int flags)
>>  {
>>      BDRVQcowState *s = bs->opaque;
>> -    int len, i, shift;
>> +    int len, i;
>>      QCowHeader header;
>>      uint64_t ext_end;
>>
>> @@ -188,8 +188,7 @@ static int qcow_open(BlockDriverState *bs, int flags)
>>
>>      /* read the level 1 table */
>>      s->l1_size = header.l1_size;
>> -    shift = s->cluster_bits + s->l2_bits;
>> -    s->l1_vm_state_index = (header.size + (1LL << shift) - 1) >> shift;
>> +    s->l1_vm_state_index = size_to_l1(s, header.size);
>>      /* the L1 table must contain at least enough entries to put
>>         header.size bytes */
>>      if (s->l1_size < s->l1_vm_state_index)
>> @@ -1095,6 +1094,40 @@ static int qcow_make_empty(BlockDriverState *bs)
>>      return 0;
>>  }
>>
>> +static int qcow2_truncate(BlockDriverState *bs, int64_t offset)
>> +{
>> +    BDRVQcowState *s = bs->opaque;
>> +    int ret, new_l1_size;
>> +
>> +    if (offset & 511) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    /* cannot proceed if image has snapshots */
>> +    if (s->nb_snapshots) {
>> +        return -ENOTSUP;
>> +    }
>> +
>> +    /* shrinking is currently not supported */
>> +    if (offset < bs->total_sectors << BDRV_SECTOR_BITS) {
>> +        return -ENOTSUP;
>> +    }
>> +
>> +    new_l1_size = size_to_l1(s, offset);
>> +    ret = qcow2_grow_l1_image_data(bs, new_l1_size);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>
> Instead of the confusing changes above you could just increase the L1
> table size using the old function and keep the data/vmstate thing local
> to qcow2_truncate (or maybe better a qcow2_move_vmstate(bs, offset)
> which internally grows the L1 table as needed)
>
> Actually, I think this is not that different from your patch (you called
> the almost same function qcow2_grow_l1_image_data and avoided the normal
> calculation of the next L1 table size for some reason), but probably a
> lot less confusing.

I like the qcow2_move_vmstate() name.  It is clearer than
qcow2_grow_l1_image_data().

If I understand correctly, the next L1 table size calculation tries to
grow the table in steps large enough so that the grow operation does
not need to be performed frequently.  This makes sense when appending
arbitrary vm state data, but the truncate operation knows the exact
new size of the image and doesn't need to speculatively allocate more
L1 table space.

>> +
>> +    /* write updated header.size */
>> +    offset = cpu_to_be64(offset);
>> +    if (bdrv_pwrite(bs->file, offsetof(QCowHeader, size), &offset,
>> +                    sizeof(uint64_t)) != sizeof(uint64_t)) {
>> +        return -EIO;
>> +    }
>
> The vmstate_offset field needs to be updated somewhere, too. In my
> suggestion this would be in qcow2_move_vmstate.

Which field (I can't find a vmstate_offset field)?  The patch updates
s->l1_vm_state_index in qcow2_grow_l1_table_common(), is that the one
you mean?

Stefan
Kevin Wolf - April 26, 2010, 3:01 p.m.
Am 26.04.2010 16:01, schrieb Stefan Hajnoczi:
> From qcow_open():
> 
> /* read the level 1 table */
> s->l1_size = header.l1_size;
> shift = s->cluster_bits + s->l2_bits;
> s->l1_vm_state_index = (header.size + (1LL << shift) - 1) >> shift;
> /* the L1 table must contain at least enough entries to put
>    header.size bytes */
> if (s->l1_size < s->l1_vm_state_index)
>     goto fail;
> 
> Images where L1 size is not at least l1_vm_state_index cannot be
> opened by QEMU.  Right now the special case exists and perhaps some
> qcow2 code already relies on this assumption.

Hm, yes. You win.

>>> @@ -1095,6 +1094,40 @@ static int qcow_make_empty(BlockDriverState *bs)
>>>      return 0;
>>>  }
>>>
>>> +static int qcow2_truncate(BlockDriverState *bs, int64_t offset)
>>> +{
>>> +    BDRVQcowState *s = bs->opaque;
>>> +    int ret, new_l1_size;
>>> +
>>> +    if (offset & 511) {
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    /* cannot proceed if image has snapshots */
>>> +    if (s->nb_snapshots) {
>>> +        return -ENOTSUP;
>>> +    }
>>> +
>>> +    /* shrinking is currently not supported */
>>> +    if (offset < bs->total_sectors << BDRV_SECTOR_BITS) {
>>> +        return -ENOTSUP;
>>> +    }
>>> +
>>> +    new_l1_size = size_to_l1(s, offset);
>>> +    ret = qcow2_grow_l1_image_data(bs, new_l1_size);
>>> +    if (ret < 0) {
>>> +        return ret;
>>> +    }
>>
>> Instead of the confusing changes above you could just increase the L1
>> table size using the old function and keep the data/vmstate thing local
>> to qcow2_truncate (or maybe better a qcow2_move_vmstate(bs, offset)
>> which internally grows the L1 table as needed)
>>
>> Actually, I think this is not that different from your patch (you called
>> the almost same function qcow2_grow_l1_image_data and avoided the normal
>> calculation of the next L1 table size for some reason), but probably a
>> lot less confusing.
> 
> I like the qcow2_move_vmstate() name.  It is clearer than
> qcow2_grow_l1_image_data().
> 
> If I understand correctly, the next L1 table size calculation tries to
> grow the table in steps large enough so that the grow operation does
> not need to be performed frequently.  This makes sense when appending
> arbitrary vm state data, but the truncate operation knows the exact
> new size of the image and doesn't need to speculatively allocate more
> L1 table space.

Agreed. I just doubt that saving a few bytes is worth splitting up
qcow2_grow_l1_table when rounding up doesn't really hurt.

Maybe most of my real problem was really just naming, though. It made
everything look more complicated than it actually is. And the second
thing is probably that grow_l1_table started to do more than just
growing the L1 table.

>>> +
>>> +    /* write updated header.size */
>>> +    offset = cpu_to_be64(offset);
>>> +    if (bdrv_pwrite(bs->file, offsetof(QCowHeader, size), &offset,
>>> +                    sizeof(uint64_t)) != sizeof(uint64_t)) {
>>> +        return -EIO;
>>> +    }
>>
>> The vmstate_offset field needs to be updated somewhere, too. In my
>> suggestion this would be in qcow2_move_vmstate.
> 
> Which field (I can't find a vmstate_offset field)?  The patch updates
> s->l1_vm_state_index in qcow2_grow_l1_table_common(), is that the one
> you mean?

I meant in the header. And you can't find it because it doesn't exist,
s->l1_vm_state_index is calculated in qcow_open. Never let me review
patches when I'm tired. :-)

Kevin

Patch

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index c11680d..20c8426 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -28,30 +28,39 @@ 
 #include "block_int.h"
 #include "block/qcow2.h"
 
-int qcow2_grow_l1_table(BlockDriverState *bs, int min_size)
+/*
+ * qcow2_grow_l1_table_common
+ *
+ * Grows the L1 table and updates the header on disk.
+ *
+ * Setting new_l1_vm_state_index to s->l1_vm_state_index grows the vm state
+ * area.
+ *
+ * Setting new_l1_vm_state_index to the new end of image grows the image data
+ * area.
+ *
+ * Returns 0 on success, -errno in failure case.
+ */
+static int qcow2_grow_l1_table_common(BlockDriverState *bs,
+                                      int new_l1_vm_state_index,
+                                      int new_l1_size)
 {
     BDRVQcowState *s = bs->opaque;
-    int new_l1_size, new_l1_size2, ret, i;
+    int new_l1_size2, ret, i;
     uint64_t *new_l1_table;
     int64_t new_l1_table_offset;
     uint8_t data[12];
 
-    new_l1_size = s->l1_size;
-    if (min_size <= new_l1_size)
-        return 0;
-    if (new_l1_size == 0) {
-        new_l1_size = 1;
-    }
-    while (min_size > new_l1_size) {
-        new_l1_size = (new_l1_size * 3 + 1) / 2;
-    }
 #ifdef DEBUG_ALLOC2
     printf("grow l1_table from %d to %d\n", s->l1_size, new_l1_size);
 #endif
 
     new_l1_size2 = sizeof(uint64_t) * new_l1_size;
     new_l1_table = qemu_mallocz(align_offset(new_l1_size2, 512));
-    memcpy(new_l1_table, s->l1_table, s->l1_size * sizeof(uint64_t));
+    memcpy(new_l1_table, s->l1_table, s->l1_vm_state_index * sizeof(uint64_t));
+    memcpy(&new_l1_table[new_l1_vm_state_index],
+           &s->l1_table[s->l1_vm_state_index],
+           (s->l1_size - s->l1_vm_state_index) * sizeof(uint64_t));
 
     /* write new table (align to cluster) */
     BLKDBG_EVENT(bs->file, BLKDBG_L1_GROW_ALLOC_TABLE);
@@ -83,6 +92,7 @@  int qcow2_grow_l1_table(BlockDriverState *bs, int min_size)
     s->l1_table_offset = new_l1_table_offset;
     s->l1_table = new_l1_table;
     s->l1_size = new_l1_size;
+    s->l1_vm_state_index = new_l1_vm_state_index;
     return 0;
  fail:
     qemu_free(new_l1_table);
@@ -90,6 +100,34 @@  int qcow2_grow_l1_table(BlockDriverState *bs, int min_size)
     return ret < 0 ? ret : -EIO;
 }
 
+int qcow2_grow_l1_vm_state(BlockDriverState *bs, int min_size)
+{
+    BDRVQcowState *s = bs->opaque;
+    int new_l1_size;
+
+    new_l1_size = s->l1_size;
+    if (min_size <= new_l1_size)
+        return 0;
+    if (new_l1_size == 0) {
+        new_l1_size = 1;
+    }
+    while (min_size > new_l1_size) {
+        new_l1_size = (new_l1_size * 3 + 1) / 2;
+    }
+
+    return qcow2_grow_l1_table_common(bs, s->l1_vm_state_index, new_l1_size);
+}
+
+int qcow2_grow_l1_image_data(BlockDriverState *bs, int new_l1_size)
+{
+    BDRVQcowState *s = bs->opaque;
+    int new_l1_vm_state_index;
+
+    new_l1_vm_state_index = new_l1_size;
+    new_l1_size += s->l1_size - s->l1_vm_state_index;
+    return qcow2_grow_l1_table_common(bs, new_l1_vm_state_index, new_l1_size);
+}
+
 void qcow2_l2_cache_reset(BlockDriverState *bs)
 {
     BDRVQcowState *s = bs->opaque;
@@ -524,7 +562,7 @@  static int get_cluster_table(BlockDriverState *bs, uint64_t offset,
 
     l1_index = offset >> (s->l2_bits + s->cluster_bits);
     if (l1_index >= s->l1_size) {
-        ret = qcow2_grow_l1_table(bs, l1_index + 1);
+        ret = qcow2_grow_l1_vm_state(bs, l1_index + 1);
         if (ret < 0) {
             return ret;
         }
diff --git a/block/qcow2-snapshot.c b/block/qcow2-snapshot.c
index 2a21c17..7f0d810 100644
--- a/block/qcow2-snapshot.c
+++ b/block/qcow2-snapshot.c
@@ -326,7 +326,7 @@  int qcow2_snapshot_goto(BlockDriverState *bs, const char *snapshot_id)
     if (qcow2_update_snapshot_refcount(bs, s->l1_table_offset, s->l1_size, -1) < 0)
         goto fail;
 
-    if (qcow2_grow_l1_table(bs, sn->l1_size) < 0)
+    if (qcow2_grow_l1_vm_state(bs, sn->l1_size) < 0)
         goto fail;
 
     s->l1_size = sn->l1_size;
diff --git a/block/qcow2.c b/block/qcow2.c
index 4a7ab66..ab622a2 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -140,7 +140,7 @@  static int qcow_read_extensions(BlockDriverState *bs, uint64_t start_offset,
 static int qcow_open(BlockDriverState *bs, int flags)
 {
     BDRVQcowState *s = bs->opaque;
-    int len, i, shift;
+    int len, i;
     QCowHeader header;
     uint64_t ext_end;
 
@@ -188,8 +188,7 @@  static int qcow_open(BlockDriverState *bs, int flags)
 
     /* read the level 1 table */
     s->l1_size = header.l1_size;
-    shift = s->cluster_bits + s->l2_bits;
-    s->l1_vm_state_index = (header.size + (1LL << shift) - 1) >> shift;
+    s->l1_vm_state_index = size_to_l1(s, header.size);
     /* the L1 table must contain at least enough entries to put
        header.size bytes */
     if (s->l1_size < s->l1_vm_state_index)
@@ -1095,6 +1094,40 @@  static int qcow_make_empty(BlockDriverState *bs)
     return 0;
 }
 
+static int qcow2_truncate(BlockDriverState *bs, int64_t offset)
+{
+    BDRVQcowState *s = bs->opaque;
+    int ret, new_l1_size;
+
+    if (offset & 511) {
+        return -EINVAL;
+    }
+
+    /* cannot proceed if image has snapshots */
+    if (s->nb_snapshots) {
+        return -ENOTSUP;
+    }
+
+    /* shrinking is currently not supported */
+    if (offset < bs->total_sectors << BDRV_SECTOR_BITS) {
+        return -ENOTSUP;
+    }
+
+    new_l1_size = size_to_l1(s, offset);
+    ret = qcow2_grow_l1_image_data(bs, new_l1_size);
+    if (ret < 0) {
+        return ret;
+    }
+
+    /* write updated header.size */
+    offset = cpu_to_be64(offset);
+    if (bdrv_pwrite(bs->file, offsetof(QCowHeader, size), &offset,
+                    sizeof(uint64_t)) != sizeof(uint64_t)) {
+        return -EIO;
+    }
+    return 0;
+}
+
 /* XXX: put compressed sectors first, then all the cluster aligned
    tables to avoid losing bytes in alignment */
 static int qcow_write_compressed(BlockDriverState *bs, int64_t sector_num,
@@ -1294,7 +1327,9 @@  static BlockDriver bdrv_qcow2 = {
     .bdrv_aio_readv	= qcow_aio_readv,
     .bdrv_aio_writev	= qcow_aio_writev,
     .bdrv_aio_flush	= qcow_aio_flush,
-    .bdrv_write_compressed = qcow_write_compressed,
+
+    .bdrv_truncate          = qcow2_truncate,
+    .bdrv_write_compressed  = qcow_write_compressed,
 
     .bdrv_snapshot_create   = qcow2_snapshot_create,
     .bdrv_snapshot_goto     = qcow2_snapshot_goto,
diff --git a/block/qcow2.h b/block/qcow2.h
index 5bd08db..c328248 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -150,6 +150,12 @@  static inline int size_to_clusters(BDRVQcowState *s, int64_t size)
     return (size + (s->cluster_size - 1)) >> s->cluster_bits;
 }
 
+static inline int size_to_l1(BDRVQcowState *s, int64_t size)
+{
+    int shift = s->cluster_bits + s->l2_bits;
+    return (size + (1ULL << shift) - 1) >> shift;
+}
+
 static inline int64_t align_offset(int64_t offset, int n)
 {
     offset = (offset + n - 1) & ~(n - 1);
@@ -182,7 +188,8 @@  int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 int qcow2_check_refcounts(BlockDriverState *bs);
 
 /* qcow2-cluster.c functions */
-int qcow2_grow_l1_table(BlockDriverState *bs, int min_size);
+int qcow2_grow_l1_vm_state(BlockDriverState *bs, int min_size);
+int qcow2_grow_l1_image_data(BlockDriverState *bs, int new_l1_size);
 void qcow2_l2_cache_reset(BlockDriverState *bs);
 int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset);
 void qcow2_encrypt_sectors(BDRVQcowState *s, int64_t sector_num,