diff mbox

[1/2] qemu-img convert: Rewrite copying logic

Message ID 1424272761-27554-2-git-send-email-kwolf@redhat.com
State New
Headers show

Commit Message

Kevin Wolf Feb. 18, 2015, 3:19 p.m. UTC
The implementation of qemu-img convert is (a) messy, (b) buggy, and
(c) less efficient than possible. The changes required to beat some
sense into it are massive enough that incremental changes would only
make my and the reviewers' life harder. So throw it away and reimplement
it from scratch.

Let me give some examples what I mean by messy, buggy and inefficient:

(a) The copying logic of qemu-img convert has two separate branches for
    compressed and normal target images, which roughly do the same -
    except for a little code that handles actual differences between
    compressed and uncompressed images, and much more code that
    implements just a different set of optimisations and bugs. This is
    unnecessary code duplication, and makes the code for compressed
    output (unsurprisingly) suffer from bitrot.

    The code for uncompressed ouput is run twice to count the the total
    length for the progress bar. In the first run it just takes a
    shortcut and runs only half the loop, and when it's done, it toggles
    a boolean, jumps out of the loop with a backwards goto and starts
    over. Works, but pretty is something different.

(b) Converting while keeping a backing file (-B option) is broken in
    several ways. This includes not writing to the image file if the
    input has zero clusters or data filled with zeros (ignoring that the
    backing file will be visible instead).

    It also doesn't correctly limit every iteration of the copy loop to
    sectors of the same status so that too many sectors may be copied to
    in the target image. For -B this gives an unexpected result, for
    other images it just does more work than necessary.

    Conversion with a compressed target completely ignores any target
    backing file.

(c) qemu-img convert skips reading and writing an area if it knows from
    metadata that copying isn't needed (except for the bug mentioned
    above that ignores a status change in some cases). It does, however,
    read from the source even if it knows that it will read zeros, and
    then search for non-zero bytes in the read buffer, if it's possible
    that a write might be needed.

This reimplementation of the copying core reorganises the code to remove
the duplication and have a much more obvious code flow, by essentially
splitting the copy iteration loop into three parts:

1. Find the number of contiguous sectors of the same status at the
   current offset (This can also be called in a separate loop for the
   copying loop in order to determine the total sectors for the progress
   bar.)

2. Read sectors. If the status implies that there is no data there to
   read (zero or unallocated cluster), don't do anything.

3. Write sectors depending on the status. If it's data, write it. If
   we want the backing file to be visible (with -B), don't write it. If
   it's zeroed, skip it if you can, otherwise use bdrv_write_zeroes() to
   optimise the write at least where possible.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 qemu-img.c | 511 ++++++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 305 insertions(+), 206 deletions(-)

Comments

Max Reitz Feb. 18, 2015, 5:18 p.m. UTC | #1
On 2015-02-18 at 10:19, Kevin Wolf wrote:
> The implementation of qemu-img convert is (a) messy, (b) buggy, and
> (c) less efficient than possible. The changes required to beat some
> sense into it are massive enough that incremental changes would only
> make my and the reviewers' life harder. So throw it away and reimplement
> it from scratch.
>
> Let me give some examples what I mean by messy, buggy and inefficient:
>
> (a) The copying logic of qemu-img convert has two separate branches for
>      compressed and normal target images, which roughly do the same -
>      except for a little code that handles actual differences between
>      compressed and uncompressed images, and much more code that
>      implements just a different set of optimisations and bugs. This is
>      unnecessary code duplication, and makes the code for compressed
>      output (unsurprisingly) suffer from bitrot.
>
>      The code for uncompressed ouput is run twice to count the the total
>      length for the progress bar. In the first run it just takes a
>      shortcut and runs only half the loop, and when it's done, it toggles
>      a boolean, jumps out of the loop with a backwards goto and starts
>      over. Works, but pretty is something different.
>
> (b) Converting while keeping a backing file (-B option) is broken in
>      several ways. This includes not writing to the image file if the
>      input has zero clusters or data filled with zeros (ignoring that the
>      backing file will be visible instead).
>
>      It also doesn't correctly limit every iteration of the copy loop to
>      sectors of the same status so that too many sectors may be copied to
>      in the target image. For -B this gives an unexpected result, for
>      other images it just does more work than necessary.
>
>      Conversion with a compressed target completely ignores any target
>      backing file.
>
> (c) qemu-img convert skips reading and writing an area if it knows from
>      metadata that copying isn't needed (except for the bug mentioned
>      above that ignores a status change in some cases). It does, however,
>      read from the source even if it knows that it will read zeros, and
>      then search for non-zero bytes in the read buffer, if it's possible
>      that a write might be needed.
>
> This reimplementation of the copying core reorganises the code to remove
> the duplication and have a much more obvious code flow, by essentially
> splitting the copy iteration loop into three parts:
>
> 1. Find the number of contiguous sectors of the same status at the
>     current offset (This can also be called in a separate loop for the
>     copying loop in order to determine the total sectors for the progress
>     bar.)
>
> 2. Read sectors. If the status implies that there is no data there to
>     read (zero or unallocated cluster), don't do anything.
>
> 3. Write sectors depending on the status. If it's data, write it. If
>     we want the backing file to be visible (with -B), don't write it. If
>     it's zeroed, skip it if you can, otherwise use bdrv_write_zeroes() to
>     optimise the write at least where possible.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   qemu-img.c | 511 ++++++++++++++++++++++++++++++++++++-------------------------
>   1 file changed, 305 insertions(+), 206 deletions(-)

You have been very careful not to write "Rewrite img_convert()" or 
something like that in the subject, so I can't really complain that 
there are still bugs in the function (which are not related to the 
copying logic), for instance:

$ ./qemu-img create -f qcow2 i1.qcow2 64M
Formatting 'i1.qcow2', fmt=qcow2 size=67108864 encryption=off 
cluster_size=65536 lazy_refcounts=off
$ ./qemu-img create -f qcow2 i2.qcow2 64M
Formatting 'i2.qcow2', fmt=qcow2 size=67108864 encryption=off 
cluster_size=65536 lazy_refcounts=off
$ ./qemu-img snapshot -c foo i1.qcow2
$ ./qemu-img snapshot -c foo i2.qcow2
$ ./qemu-io -c 'write 0 64M' i2.qcow2
wrote 67108864/67108864 bytes at offset 0
64 MiB, 1 ops; 0:00:01.32 (48.152 MiB/sec and 0.7524 ops/sec)
$ ./qemu-img convert -l snapshot.name=foo -O qcow2 i{1,2}.qcow2 o.qcow2
qemu-img: error while reading sector 131072: Input/output error

("No support for concatenating multiple snapshot" should be enforced for 
sn_opts != NULL)

> diff --git a/qemu-img.c b/qemu-img.c
> index e148af8..5c8386e 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -1306,20 +1306,307 @@ out3:
>       return ret;
>   }
>   
> +enum ImgConvertBlockStatus {
> +    BLK_DATA,
> +    BLK_ZERO,
> +    BLK_BACKING_FILE,
> +};
> +
> +typedef struct ImgConvertState {
> +    BlockDriverState **src;
> +    int64_t *src_sectors;
> +    int src_cur, src_num;
> +    int64_t src_cur_offset;
> +    int64_t total_sectors;
> +    int64_t allocated_sectors;
> +    enum ImgConvertBlockStatus status;
> +    int64_t sector_next_status;
> +    BlockDriverState *target;
> +    bool has_zero_init;
> +    bool compressed;
> +    bool out_backing;

Maybe "out_backed" (or "out_is_backed" or "out_has_backing") would be 
better; "out_backing" to me sounds like this should describe the backing 
file.

> +    int min_sparse;
> +    size_t cluster_sectors;
> +    size_t buf_sectors;
> +} ImgConvertState;
> +
> +static void convert_select_part(ImgConvertState *s, int64_t sector_num)
> +{
> +    while (sector_num - s->src_cur_offset >= s->src_sectors[s->src_cur]) {
> +        s->src_cur_offset += s->src_sectors[s->src_cur];
> +        s->src_cur++;
> +        assert(s->src_cur < s->src_num);
> +    }
> +}
> +
> +static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
> +{
> +    int64_t ret;
> +    int n;
> +
> +    convert_select_part(s, sector_num);
> +
> +    assert(s->total_sectors > sector_num);
> +    n = MIN(s->total_sectors - sector_num, INT_MAX);

Maybe it would be better to use INT_MAX / BDRV_SECTOR_SIZE here (and in 
other places, too)... In practice, this would only be relevant to reads 
and writes, though, and those are limited by s->buf_sectors.

> +
> +    if (s->sector_next_status <= sector_num) {
> +        ret = bdrv_get_block_status(s->src[s->src_cur],
> +                                    sector_num - s->src_cur_offset,
> +                                    n, &n);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +
> +        if (ret & BDRV_BLOCK_ZERO) {
> +            s->status = BLK_ZERO;
> +        } else if (ret & BDRV_BLOCK_DATA) {
> +            s->status = BLK_DATA;
> +        } else if (!s->out_backing) {
> +            /* Without a target backing file we must copy over the contents of
> +             * the backing file as well. */
> +            /* TODO Check block status of the backing file chain to avoid
> +             * needlessly reading zeroes and limiting the iteration to the
> +             * buffer size */
> +            s->status = BLK_DATA;
> +        } else {
> +            s->status = BLK_BACKING_FILE;
> +        }
> +
> +        s->sector_next_status = sector_num + n;
> +    }
> +
> +    n = MIN(n, s->sector_next_status - sector_num);
> +    if (s->status == BLK_DATA) {
> +        n = MIN(n, s->buf_sectors);
> +    }
> +
> +    /* We need to write complete clusters for compressed images, so if an
> +     * unallocated area is shorter than that, we must consider the whole
> +     * cluster allocated. */
> +    if (s->compressed) {
> +        if (n < s->cluster_sectors) {
> +            n = MIN(s->cluster_sectors, s->total_sectors - sector_num);
> +            s->status = BLK_DATA;
> +        } else {
> +            n = QEMU_ALIGN_DOWN(n, s->cluster_sectors);
> +        }
> +    }
> +
> +    return n;
> +}
> +
> +static int convert_read(ImgConvertState *s, int64_t sector_num, int nb_sectors,
> +                        uint8_t *buf)
> +{
> +    int n;
> +    int ret;
> +
> +    if (s->status == BLK_ZERO || s->status == BLK_BACKING_FILE) {
> +        return 0;
> +    }
> +
> +    assert(nb_sectors <= s->buf_sectors);
> +    while (nb_sectors > 0) {
> +        BlockDriverState *bs;
> +        int64_t bs_sectors;
> +
> +        /* In the case of compression with multiple source files, we can get a
> +         * nb_sectors that spreads into the next part. So we must be able to
> +         * read across multiple BDSes for one convert_read() call. */
> +        convert_select_part(s, sector_num);
> +        bs = s->src[s->src_cur];
> +        bs_sectors = s->src_sectors[s->src_cur];
> +
> +        n = MIN(nb_sectors, bs_sectors - (sector_num - s->src_cur_offset));
> +        ret = bdrv_read(bs, sector_num, buf, n);

Shouldn't this be sector_num - s->src_cur_offset?

> +        if (ret < 0) {
> +            return ret;
> +        }
> +
> +        sector_num += n;
> +        nb_sectors -= n;
> +        buf += n * BDRV_SECTOR_SIZE;
> +    }
> +
> +    return 0;
> +}
> +
> +static int convert_write(ImgConvertState *s, int64_t sector_num, int nb_sectors,
> +                         const uint8_t *buf)
> +{
> +    int ret;
> +
> +    while (nb_sectors > 0) {
> +

This empty line looks a bit strange to me...

> +        int n = nb_sectors;
> +
> +        switch (s->status) {
> +        case BLK_BACKING_FILE:
> +            /* If we have a backing file, leave clusters unallocated that are
> +             * unallocated in the source image, so that the backing file is
> +             * visible at the respective offset. */
> +            assert(s->out_backing);
> +            break;
> +
> +        case BLK_DATA:
> +            /* We must always write compressed clusters as a whole, so don't
> +             * try to find zeroed parts in the buffer. We can only save the
> +             * write if the buffer is completely zeroed. */

But you shouldn't be doing that if n < s->min_sparse, I think (if people 
specify -S 0, they don't want you to zero anything, and that's 
guaranteed by the man page). Considering that is_allocated_sectors_min() 
basically has a min = MIN(min, n) part, too, I'd be fine with making -S 
0 a special case here ("if (s->has_zero_init && s->min_sparse && 
buffer_is_zero(...))").

Actually, now that I'm looking at is_allocated_sectors_min(), if the 
first sectors is not allocated, it will return false and thus both the 
current qemu-img convert and the new one after this patch won't write 
data. That's a bug, I think (because it is guaranteed by the man page).

> +            if (s->compressed) {
> +                if (s->has_zero_init &&
> +                    buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))
> +                {
> +                    assert(!s->out_backing);
> +                    break;
> +                }
> +
> +                ret = bdrv_write_compressed(s->target, sector_num, buf, n);
> +                if (ret < 0) {
> +                    return ret;
> +                }
> +                break;
> +            }
> +
> +            /* If there is real non-zero data, we must write it. Otherwise we
> +             * can treat it as zero sectors. */
> +            if (is_allocated_sectors_min(buf, n, &n, s->min_sparse)) {

So I think this should be "if (!s->min_sparse || ...)".

> +                ret = bdrv_write(s->target, sector_num, buf, n);
> +                if (ret < 0) {
> +                    return ret;
> +                }
> +                break;
> +            }
> +            /* fall-through */
> +
> +        case BLK_ZERO:
> +            if (s->has_zero_init) {

And I don't really know what to do about this. Should we write zeros if 
!s->min_sparse? Or is this a special case and the user can't expect 
qemu-img convert to really write zeros if the source image contained 
unallocated/zero clusters? That would make sense, but once again, the 
man page clearly states that "If sparse_size is 0, […] the destination 
image will always be fully allocated." Maybe we could argue that "fully" 
in this case means "as fully as the source image was allocated".

> +                break;
> +            }
> +            /* TODO Should this use BDRV_REQ_MAY_UNMAP? */

Can't we remove this comment? If BDRV_REQ_MAP_UNMAP results in zeros 
read, bdrv_make_zero() below will have been invoked and thus 
s->has_zero_init = true (unless you take my suggestion and leave 
s->has_zero_init false if bdrv_make_zero() failed, but I wouldn't really 
mind that case here).

> +            ret = bdrv_write_zeroes(s->target, sector_num, n, 0);
> +            if (ret < 0) {
> +                return ret;
> +            }
> +            break;
> +        }
> +
> +        sector_num += n;
> +        nb_sectors -= n;
> +        buf += n * BDRV_SECTOR_SIZE;
> +    }
> +
> +    return 0;
> +}
> +
> +static int convert_do_copy(ImgConvertState *s)
> +{
> +    uint8_t *buf = NULL;
> +    int64_t sector_num, allocated_done;
> +    int ret;
> +    int n;
> +
> +    /* Check whether we have zero initialisation or can get it efficiently */
> +    s->has_zero_init = s->min_sparse && !s->out_backing
> +                     ? bdrv_has_zero_init(s->target)
> +                     : false;
> +
> +    if (!s->has_zero_init && !s->out_backing &&
> +        bdrv_can_write_zeroes_with_unmap(s->target))
> +    {
> +        ret = bdrv_make_zero(s->target, BDRV_REQ_MAY_UNMAP);
> +        if (ret < 0) {
> +            goto fail;

I think just leaving s->has_zero_init false should be fine, too (instead 
of failing).

> +        }
> +        s->has_zero_init = 1;

s/1/true/

> +    }
> +
> +    /* Allocate buffer for copied data. For compressed images, only one cluster
> +     * can be copied at a time. */
> +    if (s->compressed) {
> +        if (s->cluster_sectors <= 0 || s->cluster_sectors > s->buf_sectors) {
> +            error_report("invalid cluster size");
> +            ret = -EINVAL;
> +            goto fail;
> +        }
> +        s->buf_sectors = s->cluster_sectors;
> +    }
> +    buf = qemu_blockalign(s->target, s->buf_sectors * BDRV_SECTOR_SIZE);
> +
> +    /* Calculate allocated sectors for progress */
> +    s->allocated_sectors = 0;
> +    sector_num = 0;
> +    while (sector_num < s->total_sectors) {
> +        n = convert_iteration_sectors(s, sector_num);
> +        if (n < 0) {
> +            ret = n;
> +            goto fail;
> +        }
> +        if (s->status == BLK_DATA) {
> +            s->allocated_sectors += n;
> +        }
> +        sector_num += n;
> +    }
> +
> +    /* Do the copy */
> +    s->src_cur = 0;
> +    s->src_cur_offset = 0;
> +    s->sector_next_status = 0;
> +
> +    sector_num = 0;
> +    allocated_done = 0;
> +
> +    while (sector_num < s->total_sectors) {
> +        n = convert_iteration_sectors(s, sector_num);
> +        if (n < 0) {
> +            ret = n;
> +            goto fail;
> +        }
> +        if (s->status == BLK_DATA) {
> +            allocated_done += n;
> +            qemu_progress_print(100.0 * allocated_done / s->allocated_sectors,
> +                                0);
> +        }
> +
> +        ret = convert_read(s, sector_num, n, buf);
> +        if (ret < 0) {
> +            error_report("error while reading sector %" PRId64
> +                         ": %s", sector_num, strerror(-ret));
> +            goto fail;
> +        }
> +
> +        ret = convert_write(s, sector_num, n, buf);
> +        if (ret < 0) {
> +            error_report("error while writing sector %" PRId64
> +                         ": %s", sector_num, strerror(-ret));
> +            goto fail;
> +        }
> +
> +        sector_num += n;
> +    }
> +
> +    if (s->compressed) {
> +        /* signal EOF to align */
> +        bdrv_write_compressed(s->target, 0, NULL, 0);

Is there a reason for ignoring the return value other than "That's how 
img_convert() used to do it"?


All in all, this patch looks pretty good to me and certainly makes the 
function much more clear. But as the bdrv_read() call probably needs to 
be fixed, and because I think the -S 0 behavior is buggy, I cannot give 
an R-b at this point (but I will if those are fixed, or if you tell me 
why they don't need to be).

Max

> +    }
> +
> +    ret = 0;
> +fail:
> +    qemu_vfree(buf);
> +    return ret;
> +}
> +
>   static int img_convert(int argc, char **argv)
>   {
> -    int c, n, n1, bs_n, bs_i, compress, cluster_sectors, skip_create;
> +    int c, bs_n, bs_i, compress, cluster_sectors, skip_create;
>       int64_t ret = 0;
>       int progress = 0, flags, src_flags;
>       const char *fmt, *out_fmt, *cache, *src_cache, *out_baseimg, *out_filename;
>       BlockDriver *drv, *proto_drv;
>       BlockBackend **blk = NULL, *out_blk = NULL;
>       BlockDriverState **bs = NULL, *out_bs = NULL;
> -    int64_t total_sectors, nb_sectors, sector_num, bs_offset;
> +    int64_t total_sectors;
>       int64_t *bs_sectors = NULL;
> -    uint8_t * buf = NULL;
>       size_t bufsectors = IO_BUF_SIZE / BDRV_SECTOR_SIZE;
> -    const uint8_t *buf1;
>       BlockDriverInfo bdi;
>       QemuOpts *opts = NULL;
>       QemuOptsList *create_opts = NULL;
> @@ -1330,6 +1617,7 @@ static int img_convert(int argc, char **argv)
>       bool quiet = false;
>       Error *local_err = NULL;
>       QemuOpts *sn_opts = NULL;
> +    ImgConvertState state;
>   
>       fmt = NULL;
>       out_fmt = "raw";
> @@ -1622,9 +1910,6 @@ static int img_convert(int argc, char **argv)
>       }
>       out_bs = blk_bs(out_blk);
>   
> -    bs_i = 0;
> -    bs_offset = 0;
> -
>       /* increase bufsectors from the default 4096 (2M) if opt_transfer_length
>        * or discard_alignment of the out_bs is greater. Limit to 32768 (16MB)
>        * as maximum. */
> @@ -1633,8 +1918,6 @@ static int img_convert(int argc, char **argv)
>                                            out_bs->bl.discard_alignment))
>                       );
>   
> -    buf = qemu_blockalign(out_bs, bufsectors * BDRV_SECTOR_SIZE);
> -
>       if (skip_create) {
>           int64_t output_sectors = bdrv_nb_sectors(out_bs);
>           if (output_sectors < 0) {
> @@ -1661,203 +1944,20 @@ static int img_convert(int argc, char **argv)
>           cluster_sectors = bdi.cluster_size / BDRV_SECTOR_SIZE;
>       }
>   
> -    if (compress) {
> -        if (cluster_sectors <= 0 || cluster_sectors > bufsectors) {
> -            error_report("invalid cluster size");
> -            ret = -1;
> -            goto out;
> -        }
> -        sector_num = 0;
> -
> -        nb_sectors = total_sectors;
> -
> -        for(;;) {
> -            int64_t bs_num;
> -            int remainder;
> -            uint8_t *buf2;
> -
> -            nb_sectors = total_sectors - sector_num;
> -            if (nb_sectors <= 0)
> -                break;
> -            if (nb_sectors >= cluster_sectors)
> -                n = cluster_sectors;
> -            else
> -                n = nb_sectors;
> -
> -            bs_num = sector_num - bs_offset;
> -            assert (bs_num >= 0);
> -            remainder = n;
> -            buf2 = buf;
> -            while (remainder > 0) {
> -                int nlow;
> -                while (bs_num == bs_sectors[bs_i]) {
> -                    bs_offset += bs_sectors[bs_i];
> -                    bs_i++;
> -                    assert (bs_i < bs_n);
> -                    bs_num = 0;
> -                    /* printf("changing part: sector_num=%" PRId64 ", "
> -                       "bs_i=%d, bs_offset=%" PRId64 ", bs_sectors=%" PRId64
> -                       "\n", sector_num, bs_i, bs_offset, bs_sectors[bs_i]); */
> -                }
> -                assert (bs_num < bs_sectors[bs_i]);
> -
> -                nlow = remainder > bs_sectors[bs_i] - bs_num
> -                    ? bs_sectors[bs_i] - bs_num : remainder;
> -
> -                ret = bdrv_read(bs[bs_i], bs_num, buf2, nlow);
> -                if (ret < 0) {
> -                    error_report("error while reading sector %" PRId64 ": %s",
> -                                 bs_num, strerror(-ret));
> -                    goto out;
> -                }
> -
> -                buf2 += nlow * 512;
> -                bs_num += nlow;
> -
> -                remainder -= nlow;
> -            }
> -            assert (remainder == 0);
> -
> -            if (!buffer_is_zero(buf, n * BDRV_SECTOR_SIZE)) {
> -                ret = bdrv_write_compressed(out_bs, sector_num, buf, n);
> -                if (ret != 0) {
> -                    error_report("error while compressing sector %" PRId64
> -                                 ": %s", sector_num, strerror(-ret));
> -                    goto out;
> -                }
> -            }
> -            sector_num += n;
> -            qemu_progress_print(100.0 * sector_num / total_sectors, 0);
> -        }
> -        /* signal EOF to align */
> -        bdrv_write_compressed(out_bs, 0, NULL, 0);
> -    } else {
> -        int64_t sectors_to_read, sectors_read, sector_num_next_status;
> -        bool count_allocated_sectors;
> -        int has_zero_init = min_sparse ? bdrv_has_zero_init(out_bs) : 0;
> -
> -        if (!has_zero_init && bdrv_can_write_zeroes_with_unmap(out_bs)) {
> -            ret = bdrv_make_zero(out_bs, BDRV_REQ_MAY_UNMAP);
> -            if (ret < 0) {
> -                goto out;
> -            }
> -            has_zero_init = 1;
> -        }
> -
> -        sectors_to_read = total_sectors;
> -        count_allocated_sectors = progress && (out_baseimg || has_zero_init);
> -restart:
> -        sector_num = 0; // total number of sectors converted so far
> -        sectors_read = 0;
> -        sector_num_next_status = 0;
> -
> -        for(;;) {
> -            nb_sectors = total_sectors - sector_num;
> -            if (nb_sectors <= 0) {
> -                if (count_allocated_sectors) {
> -                    sectors_to_read = sectors_read;
> -                    count_allocated_sectors = false;
> -                    goto restart;
> -                }
> -                ret = 0;
> -                break;
> -            }
> -
> -            while (sector_num - bs_offset >= bs_sectors[bs_i]) {
> -                bs_offset += bs_sectors[bs_i];
> -                bs_i ++;
> -                assert (bs_i < bs_n);
> -                /* printf("changing part: sector_num=%" PRId64 ", bs_i=%d, "
> -                  "bs_offset=%" PRId64 ", bs_sectors=%" PRId64 "\n",
> -                   sector_num, bs_i, bs_offset, bs_sectors[bs_i]); */
> -            }
> -
> -            if ((out_baseimg || has_zero_init) &&
> -                sector_num >= sector_num_next_status) {
> -                n = nb_sectors > INT_MAX ? INT_MAX : nb_sectors;
> -                ret = bdrv_get_block_status(bs[bs_i], sector_num - bs_offset,
> -                                            n, &n1);
> -                if (ret < 0) {
> -                    error_report("error while reading block status of sector %"
> -                                 PRId64 ": %s", sector_num - bs_offset,
> -                                 strerror(-ret));
> -                    goto out;
> -                }
> -                /* If the output image is zero initialized, we are not working
> -                 * on a shared base and the input is zero we can skip the next
> -                 * n1 sectors */
> -                if (has_zero_init && !out_baseimg && (ret & BDRV_BLOCK_ZERO)) {
> -                    sector_num += n1;
> -                    continue;
> -                }
> -                /* If the output image is being created as a copy on write
> -                 * image, assume that sectors which are unallocated in the
> -                 * input image are present in both the output's and input's
> -                 * base images (no need to copy them). */
> -                if (out_baseimg) {
> -                    if (!(ret & BDRV_BLOCK_DATA)) {
> -                        sector_num += n1;
> -                        continue;
> -                    }
> -                    /* The next 'n1' sectors are allocated in the input image.
> -                     * Copy only those as they may be followed by unallocated
> -                     * sectors. */
> -                    nb_sectors = n1;
> -                }
> -                /* avoid redundant callouts to get_block_status */
> -                sector_num_next_status = sector_num + n1;
> -            }
> -
> -            n = MIN(nb_sectors, bufsectors);
> -
> -            /* round down request length to an aligned sector, but
> -             * do not bother doing this on short requests. They happen
> -             * when we found an all-zero area, and the next sector to
> -             * write will not be sector_num + n. */
> -            if (cluster_sectors > 0 && n >= cluster_sectors) {
> -                int64_t next_aligned_sector = (sector_num + n);
> -                next_aligned_sector -= next_aligned_sector % cluster_sectors;
> -                if (sector_num + n > next_aligned_sector) {
> -                    n = next_aligned_sector - sector_num;
> -                }
> -            }
> -
> -            n = MIN(n, bs_sectors[bs_i] - (sector_num - bs_offset));
> -
> -            sectors_read += n;
> -            if (count_allocated_sectors) {
> -                sector_num += n;
> -                continue;
> -            }
> +    state = (ImgConvertState) {
> +        .src                = bs,
> +        .src_sectors        = bs_sectors,
> +        .src_num            = bs_n,
> +        .total_sectors      = total_sectors,
> +        .target             = out_bs,
> +        .compressed         = compress,
> +        .out_backing        = (bool) out_baseimg,
> +        .min_sparse         = min_sparse,
> +        .cluster_sectors    = cluster_sectors,
> +        .buf_sectors        = bufsectors,
> +    };
> +    ret = convert_do_copy(&state);
>   
> -            n1 = n;
> -            ret = bdrv_read(bs[bs_i], sector_num - bs_offset, buf, n);
> -            if (ret < 0) {
> -                error_report("error while reading sector %" PRId64 ": %s",
> -                             sector_num - bs_offset, strerror(-ret));
> -                goto out;
> -            }
> -            /* NOTE: at the same time we convert, we do not write zero
> -               sectors to have a chance to compress the image. Ideally, we
> -               should add a specific call to have the info to go faster */
> -            buf1 = buf;
> -            while (n > 0) {
> -                if (!has_zero_init ||
> -                    is_allocated_sectors_min(buf1, n, &n1, min_sparse)) {
> -                    ret = bdrv_write(out_bs, sector_num, buf1, n1);
> -                    if (ret < 0) {
> -                        error_report("error while writing sector %" PRId64
> -                                     ": %s", sector_num, strerror(-ret));
> -                        goto out;
> -                    }
> -                }
> -                sector_num += n1;
> -                n -= n1;
> -                buf1 += n1 * 512;
> -            }
> -            qemu_progress_print(100.0 * sectors_read / sectors_to_read, 0);
> -        }
> -    }
>   out:
>       if (!ret) {
>           qemu_progress_print(100, 0);
> @@ -1865,7 +1965,6 @@ out:
>       qemu_progress_end();
>       qemu_opts_del(opts);
>       qemu_opts_free(create_opts);
> -    qemu_vfree(buf);
>       qemu_opts_del(sn_opts);
>       blk_unref(out_blk);
>       g_free(bs);
Kevin Wolf Feb. 19, 2015, 3:47 p.m. UTC | #2
Am 18.02.2015 um 18:18 hat Max Reitz geschrieben:
> On 2015-02-18 at 10:19, Kevin Wolf wrote:
> >The implementation of qemu-img convert is (a) messy, (b) buggy, and
> >(c) less efficient than possible. The changes required to beat some
> >sense into it are massive enough that incremental changes would only
> >make my and the reviewers' life harder. So throw it away and reimplement
> >it from scratch.
> >
> >Let me give some examples what I mean by messy, buggy and inefficient:
> >
> >(a) The copying logic of qemu-img convert has two separate branches for
> >     compressed and normal target images, which roughly do the same -
> >     except for a little code that handles actual differences between
> >     compressed and uncompressed images, and much more code that
> >     implements just a different set of optimisations and bugs. This is
> >     unnecessary code duplication, and makes the code for compressed
> >     output (unsurprisingly) suffer from bitrot.
> >
> >     The code for uncompressed ouput is run twice to count the the total
> >     length for the progress bar. In the first run it just takes a
> >     shortcut and runs only half the loop, and when it's done, it toggles
> >     a boolean, jumps out of the loop with a backwards goto and starts
> >     over. Works, but pretty is something different.
> >
> >(b) Converting while keeping a backing file (-B option) is broken in
> >     several ways. This includes not writing to the image file if the
> >     input has zero clusters or data filled with zeros (ignoring that the
> >     backing file will be visible instead).
> >
> >     It also doesn't correctly limit every iteration of the copy loop to
> >     sectors of the same status so that too many sectors may be copied to
> >     in the target image. For -B this gives an unexpected result, for
> >     other images it just does more work than necessary.
> >
> >     Conversion with a compressed target completely ignores any target
> >     backing file.
> >
> >(c) qemu-img convert skips reading and writing an area if it knows from
> >     metadata that copying isn't needed (except for the bug mentioned
> >     above that ignores a status change in some cases). It does, however,
> >     read from the source even if it knows that it will read zeros, and
> >     then search for non-zero bytes in the read buffer, if it's possible
> >     that a write might be needed.
> >
> >This reimplementation of the copying core reorganises the code to remove
> >the duplication and have a much more obvious code flow, by essentially
> >splitting the copy iteration loop into three parts:
> >
> >1. Find the number of contiguous sectors of the same status at the
> >    current offset (This can also be called in a separate loop for the
> >    copying loop in order to determine the total sectors for the progress
> >    bar.)
> >
> >2. Read sectors. If the status implies that there is no data there to
> >    read (zero or unallocated cluster), don't do anything.
> >
> >3. Write sectors depending on the status. If it's data, write it. If
> >    we want the backing file to be visible (with -B), don't write it. If
> >    it's zeroed, skip it if you can, otherwise use bdrv_write_zeroes() to
> >    optimise the write at least where possible.
> >
> >Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> >---
> >  qemu-img.c | 511 ++++++++++++++++++++++++++++++++++++-------------------------
> >  1 file changed, 305 insertions(+), 206 deletions(-)
> 
> You have been very careful not to write "Rewrite img_convert()" or
> something like that in the subject, so I can't really complain that
> there are still bugs in the function (which are not related to the
> copying logic), for instance:
> 
> $ ./qemu-img create -f qcow2 i1.qcow2 64M
> Formatting 'i1.qcow2', fmt=qcow2 size=67108864 encryption=off
> cluster_size=65536 lazy_refcounts=off
> $ ./qemu-img create -f qcow2 i2.qcow2 64M
> Formatting 'i2.qcow2', fmt=qcow2 size=67108864 encryption=off
> cluster_size=65536 lazy_refcounts=off
> $ ./qemu-img snapshot -c foo i1.qcow2
> $ ./qemu-img snapshot -c foo i2.qcow2
> $ ./qemu-io -c 'write 0 64M' i2.qcow2
> wrote 67108864/67108864 bytes at offset 0
> 64 MiB, 1 ops; 0:00:01.32 (48.152 MiB/sec and 0.7524 ops/sec)
> $ ./qemu-img convert -l snapshot.name=foo -O qcow2 i{1,2}.qcow2 o.qcow2
> qemu-img: error while reading sector 131072: Input/output error
> 
> ("No support for concatenating multiple snapshot" should be enforced
> for sn_opts != NULL)

Probably worth addressing, though not in this patch.

> >diff --git a/qemu-img.c b/qemu-img.c
> >index e148af8..5c8386e 100644
> >--- a/qemu-img.c
> >+++ b/qemu-img.c
> >@@ -1306,20 +1306,307 @@ out3:
> >      return ret;
> >  }
> >+enum ImgConvertBlockStatus {
> >+    BLK_DATA,
> >+    BLK_ZERO,
> >+    BLK_BACKING_FILE,
> >+};
> >+
> >+typedef struct ImgConvertState {
> >+    BlockDriverState **src;
> >+    int64_t *src_sectors;
> >+    int src_cur, src_num;
> >+    int64_t src_cur_offset;
> >+    int64_t total_sectors;
> >+    int64_t allocated_sectors;
> >+    enum ImgConvertBlockStatus status;
> >+    int64_t sector_next_status;
> >+    BlockDriverState *target;
> >+    bool has_zero_init;
> >+    bool compressed;
> >+    bool out_backing;
> 
> Maybe "out_backed" (or "out_is_backed" or "out_has_backing") would
> be better; "out_backing" to me sounds like this should describe the
> backing file.

Okay, will rename as target_has_backing.

> >+    int min_sparse;
> >+    size_t cluster_sectors;
> >+    size_t buf_sectors;
> >+} ImgConvertState;
> >+
> >+static void convert_select_part(ImgConvertState *s, int64_t sector_num)
> >+{
> >+    while (sector_num - s->src_cur_offset >= s->src_sectors[s->src_cur]) {
> >+        s->src_cur_offset += s->src_sectors[s->src_cur];
> >+        s->src_cur++;
> >+        assert(s->src_cur < s->src_num);
> >+    }
> >+}
> >+
> >+static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
> >+{
> >+    int64_t ret;
> >+    int n;
> >+
> >+    convert_select_part(s, sector_num);
> >+
> >+    assert(s->total_sectors > sector_num);
> >+    n = MIN(s->total_sectors - sector_num, INT_MAX);
> 
> Maybe it would be better to use INT_MAX / BDRV_SECTOR_SIZE here (and
> in other places, too)... In practice, this would only be relevant to
> reads and writes, though, and those are limited by s->buf_sectors.

I'll make that BDRV_REQUEST_MAX_SECTORS.

> >+
> >+    if (s->sector_next_status <= sector_num) {
> >+        ret = bdrv_get_block_status(s->src[s->src_cur],
> >+                                    sector_num - s->src_cur_offset,
> >+                                    n, &n);
> >+        if (ret < 0) {
> >+            return ret;
> >+        }
> >+
> >+        if (ret & BDRV_BLOCK_ZERO) {
> >+            s->status = BLK_ZERO;
> >+        } else if (ret & BDRV_BLOCK_DATA) {
> >+            s->status = BLK_DATA;
> >+        } else if (!s->out_backing) {
> >+            /* Without a target backing file we must copy over the contents of
> >+             * the backing file as well. */
> >+            /* TODO Check block status of the backing file chain to avoid
> >+             * needlessly reading zeroes and limiting the iteration to the
> >+             * buffer size */
> >+            s->status = BLK_DATA;
> >+        } else {
> >+            s->status = BLK_BACKING_FILE;
> >+        }
> >+
> >+        s->sector_next_status = sector_num + n;
> >+    }
> >+
> >+    n = MIN(n, s->sector_next_status - sector_num);
> >+    if (s->status == BLK_DATA) {
> >+        n = MIN(n, s->buf_sectors);
> >+    }
> >+
> >+    /* We need to write complete clusters for compressed images, so if an
> >+     * unallocated area is shorter than that, we must consider the whole
> >+     * cluster allocated. */
> >+    if (s->compressed) {
> >+        if (n < s->cluster_sectors) {
> >+            n = MIN(s->cluster_sectors, s->total_sectors - sector_num);
> >+            s->status = BLK_DATA;
> >+        } else {
> >+            n = QEMU_ALIGN_DOWN(n, s->cluster_sectors);
> >+        }
> >+    }
> >+
> >+    return n;
> >+}
> >+
> >+static int convert_read(ImgConvertState *s, int64_t sector_num, int nb_sectors,
> >+                        uint8_t *buf)
> >+{
> >+    int n;
> >+    int ret;
> >+
> >+    if (s->status == BLK_ZERO || s->status == BLK_BACKING_FILE) {
> >+        return 0;
> >+    }
> >+
> >+    assert(nb_sectors <= s->buf_sectors);
> >+    while (nb_sectors > 0) {
> >+        BlockDriverState *bs;
> >+        int64_t bs_sectors;
> >+
> >+        /* In the case of compression with multiple source files, we can get a
> >+         * nb_sectors that spreads into the next part. So we must be able to
> >+         * read across multiple BDSes for one convert_read() call. */
> >+        convert_select_part(s, sector_num);
> >+        bs = s->src[s->src_cur];
> >+        bs_sectors = s->src_sectors[s->src_cur];
> >+
> >+        n = MIN(nb_sectors, bs_sectors - (sector_num - s->src_cur_offset));
> >+        ret = bdrv_read(bs, sector_num, buf, n);
> 
> Shouldn't this be sector_num - s->src_cur_offset?

I suppose it should be, thanks for catching this. Not a good sign with
respect to our qemu-iotests coverage.

> >+        if (ret < 0) {
> >+            return ret;
> >+        }
> >+
> >+        sector_num += n;
> >+        nb_sectors -= n;
> >+        buf += n * BDRV_SECTOR_SIZE;
> >+    }
> >+
> >+    return 0;
> >+}
> >+
> >+static int convert_write(ImgConvertState *s, int64_t sector_num, int nb_sectors,
> >+                         const uint8_t *buf)
> >+{
> >+    int ret;
> >+
> >+    while (nb_sectors > 0) {
> >+
> 
> This empty line looks a bit strange to me...

Removed.

> >+        int n = nb_sectors;
> >+
> >+        switch (s->status) {
> >+        case BLK_BACKING_FILE:
> >+            /* If we have a backing file, leave clusters unallocated that are
> >+             * unallocated in the source image, so that the backing file is
> >+             * visible at the respective offset. */
> >+            assert(s->out_backing);
> >+            break;
> >+
> >+        case BLK_DATA:
> >+            /* We must always write compressed clusters as a whole, so don't
> >+             * try to find zeroed parts in the buffer. We can only save the
> >+             * write if the buffer is completely zeroed. */
> 
> But you shouldn't be doing that if n < s->min_sparse, I think (if
> people specify -S 0, they don't want you to zero anything, and
> that's guaranteed by the man page). Considering that
> is_allocated_sectors_min() basically has a min = MIN(min, n) part,
> too, I'd be fine with making -S 0 a special case here ("if
> (s->has_zero_init && s->min_sparse && buffer_is_zero(...))").

Hm, okay. With a compressed target n isn't variable, so there's only
always or never, depending on min_sparse < cluster_size. But I can do
it.

I'm actually not really sure what the use case for -S with a compressed
target would be.

> Actually, now that I'm looking at is_allocated_sectors_min(), if the
> first sectors is not allocated, it will return false and thus both
> the current qemu-img convert and the new one after this patch won't
> write data. That's a bug, I think (because it is guaranteed by the
> man page).

Or the description in the man page is wrong.

The intention with -S was that we avoid splitting up writes into too
many small chunks because that costs a lot of time. If you look at it
from that angle, it's doing exactly the right thing because skipping
zeros at the start doesn't increase the number of write requests.

> >+            if (s->compressed) {
> >+                if (s->has_zero_init &&
> >+                    buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))
> >+                {
> >+                    assert(!s->out_backing);
> >+                    break;
> >+                }

I wonder whether !s->has_zero_init shouldn't actually treat it as
BLK_ZERO. That should "compress" even better than writing a gzip
compressed cluster of zeros.

> >+                ret = bdrv_write_compressed(s->target, sector_num, buf, n);
> >+                if (ret < 0) {
> >+                    return ret;
> >+                }
> >+                break;
> >+            }
> >+
> >+            /* If there is real non-zero data, we must write it. Otherwise we
> >+             * can treat it as zero sectors. */
> >+            if (is_allocated_sectors_min(buf, n, &n, s->min_sparse)) {
> 
> So I think this should be "if (!s->min_sparse || ...)".

I'd rather change is_allocated_sectors_min(). But only if there is a use
case, otherwise we should fix the description in the man page.

> >+                ret = bdrv_write(s->target, sector_num, buf, n);
> >+                if (ret < 0) {
> >+                    return ret;
> >+                }
> >+                break;
> >+            }
> >+            /* fall-through */
> >+
> >+        case BLK_ZERO:
> >+            if (s->has_zero_init) {
> 
> And I don't really know what to do about this. Should we write zeros
> if !s->min_sparse? Or is this a special case and the user can't
> expect qemu-img convert to really write zeros if the source image
> contained unallocated/zero clusters? That would make sense, but once
> again, the man page clearly states that "If sparse_size is 0, […]
> the destination image will always be fully allocated." Maybe we
> could argue that "fully" in this case means "as fully as the source
> image was allocated".

FWIW, bdrv_is_allocated() considers zero clusters allocated. ;-)

I don't think that -S 0 should mean full preallocation. If you want
that, I think you can get it with -o (at the cost of writing some parts
of the image twice).

> >+                break;
> >+            }
> >+            /* TODO Should this use BDRV_REQ_MAY_UNMAP? */
> 
> Can't we remove this comment? If BDRV_REQ_MAP_UNMAP results in zeros
> read, bdrv_make_zero() below will have been invoked and thus
> s->has_zero_init = true (unless you take my suggestion and leave
> s->has_zero_init false if bdrv_make_zero() failed, but I wouldn't
> really mind that case here).

Good point.

(I wanted to remove the comment either way, it was just there to get
input during the review.)

> >+            ret = bdrv_write_zeroes(s->target, sector_num, n, 0);
> >+            if (ret < 0) {
> >+                return ret;
> >+            }
> >+            break;
> >+        }
> >+
> >+        sector_num += n;
> >+        nb_sectors -= n;
> >+        buf += n * BDRV_SECTOR_SIZE;
> >+    }
> >+
> >+    return 0;
> >+}
> >+
> >+static int convert_do_copy(ImgConvertState *s)
> >+{
> >+    uint8_t *buf = NULL;
> >+    int64_t sector_num, allocated_done;
> >+    int ret;
> >+    int n;
> >+
> >+    /* Check whether we have zero initialisation or can get it efficiently */
> >+    s->has_zero_init = s->min_sparse && !s->out_backing
> >+                     ? bdrv_has_zero_init(s->target)
> >+                     : false;
> >+
> >+    if (!s->has_zero_init && !s->out_backing &&
> >+        bdrv_can_write_zeroes_with_unmap(s->target))
> >+    {
> >+        ret = bdrv_make_zero(s->target, BDRV_REQ_MAY_UNMAP);
> >+        if (ret < 0) {
> >+            goto fail;
> 
> I think just leaving s->has_zero_init false should be fine, too
> (instead of failing).
> 
> >+        }
> >+        s->has_zero_init = 1;
> 
> s/1/true/

Yes.

> >+    }
> >+
> >+    /* Allocate buffer for copied data. For compressed images, only one cluster
> >+     * can be copied at a time. */
> >+    if (s->compressed) {
> >+        if (s->cluster_sectors <= 0 || s->cluster_sectors > s->buf_sectors) {
> >+            error_report("invalid cluster size");
> >+            ret = -EINVAL;
> >+            goto fail;
> >+        }
> >+        s->buf_sectors = s->cluster_sectors;
> >+    }
> >+    buf = qemu_blockalign(s->target, s->buf_sectors * BDRV_SECTOR_SIZE);
> >+
> >+    /* Calculate allocated sectors for progress */
> >+    s->allocated_sectors = 0;
> >+    sector_num = 0;
> >+    while (sector_num < s->total_sectors) {
> >+        n = convert_iteration_sectors(s, sector_num);
> >+        if (n < 0) {
> >+            ret = n;
> >+            goto fail;
> >+        }
> >+        if (s->status == BLK_DATA) {
> >+            s->allocated_sectors += n;
> >+        }
> >+        sector_num += n;
> >+    }
> >+
> >+    /* Do the copy */
> >+    s->src_cur = 0;
> >+    s->src_cur_offset = 0;
> >+    s->sector_next_status = 0;
> >+
> >+    sector_num = 0;
> >+    allocated_done = 0;
> >+
> >+    while (sector_num < s->total_sectors) {
> >+        n = convert_iteration_sectors(s, sector_num);
> >+        if (n < 0) {
> >+            ret = n;
> >+            goto fail;
> >+        }
> >+        if (s->status == BLK_DATA) {
> >+            allocated_done += n;
> >+            qemu_progress_print(100.0 * allocated_done / s->allocated_sectors,
> >+                                0);
> >+        }
> >+
> >+        ret = convert_read(s, sector_num, n, buf);
> >+        if (ret < 0) {
> >+            error_report("error while reading sector %" PRId64
> >+                         ": %s", sector_num, strerror(-ret));
> >+            goto fail;
> >+        }
> >+
> >+        ret = convert_write(s, sector_num, n, buf);
> >+        if (ret < 0) {
> >+            error_report("error while writing sector %" PRId64
> >+                         ": %s", sector_num, strerror(-ret));
> >+            goto fail;
> >+        }
> >+
> >+        sector_num += n;
> >+    }
> >+
> >+    if (s->compressed) {
> >+        /* signal EOF to align */
> >+        bdrv_write_compressed(s->target, 0, NULL, 0);
> 
> Is there a reason for ignoring the return value other than "That's
> how img_convert() used to do it"?

No. Isn't that one good enough? ;-)

So the code in qcow2 says this:

    if (nb_sectors == 0) {
        /* align end of file to a sector boundary to ease reading with
           sector based I/Os */
        cluster_offset = bdrv_getlength(bs->file);
        return bdrv_truncate(bs->file, cluster_offset);
    }

I don't think we have any such restrictions any more, so it's mostly
useless. Perhaps ancient qemu versions would fail to read such an image,
but recent ones shouldn't.

In fact, our bdrv_pwrite() currently maps to sector-aligned functions in
the protocol driver, so I think at the moment we already get the
alignment automatically. This might change again once we convert block
drivers to byte offsets.

Kevin
Max Reitz Feb. 19, 2015, 4:01 p.m. UTC | #3
On 2015-02-19 at 10:47, Kevin Wolf wrote:
> Am 18.02.2015 um 18:18 hat Max Reitz geschrieben:

[snip]

>> Actually, now that I'm looking at is_allocated_sectors_min(), if the
>> first sectors is not allocated, it will return false and thus both
>> the current qemu-img convert and the new one after this patch won't
>> write data. That's a bug, I think (because it is guaranteed by the
>> man page).
> Or the description in the man page is wrong.
>
> The intention with -S was that we avoid splitting up writes into too
> many small chunks because that costs a lot of time. If you look at it
> from that angle, it's doing exactly the right thing because skipping
> zeros at the start doesn't increase the number of write requests.

Feel free to fix it. But I remember someone recently asking about 
preallocation for qemu-img convert in the #qemu IRC channel, and "-S 0" 
was one of the valid answers.

The nice thing about -S 0 is that it works with all image formats; I 
know that qcow2 is always our main concern (especially when it comes to 
the target of qemu-img convert), but I think using -S 0 for 
preallocation is justified.

Considering that in this case the man page was lying (because the code 
did not allocate everything), I'd be fine with fixing up the man page 
and leaving the behavior as-is, though, too.


[snip]

>>> +
>>> +    if (s->compressed) {
>>> +        /* signal EOF to align */
>>> +        bdrv_write_compressed(s->target, 0, NULL, 0);
>> Is there a reason for ignoring the return value other than "That's
>> how img_convert() used to do it"?
> No. Isn't that one good enough? ;-)

I don't know, your commit message states that the old implementation is 
buggy, so I don't think that's enough. :-)

> So the code in qcow2 says this:
>
>      if (nb_sectors == 0) {
>          /* align end of file to a sector boundary to ease reading with
>             sector based I/Os */
>          cluster_offset = bdrv_getlength(bs->file);
>          return bdrv_truncate(bs->file, cluster_offset);
>      }
>
> I don't think we have any such restrictions any more, so it's mostly
> useless. Perhaps ancient qemu versions would fail to read such an image,
> but recent ones shouldn't.
>
> In fact, our bdrv_pwrite() currently maps to sector-aligned functions in
> the protocol driver, so I think at the moment we already get the
> alignment automatically. This might change again once we convert block
> drivers to byte offsets.

So you're intending to drop the bdrv_write_compressed() completely? I'd 
be fine with that as well.

Max
diff mbox

Patch

diff --git a/qemu-img.c b/qemu-img.c
index e148af8..5c8386e 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1306,20 +1306,307 @@  out3:
     return ret;
 }
 
+enum ImgConvertBlockStatus {
+    BLK_DATA,
+    BLK_ZERO,
+    BLK_BACKING_FILE,
+};
+
+typedef struct ImgConvertState {
+    BlockDriverState **src;
+    int64_t *src_sectors;
+    int src_cur, src_num;
+    int64_t src_cur_offset;
+    int64_t total_sectors;
+    int64_t allocated_sectors;
+    enum ImgConvertBlockStatus status;
+    int64_t sector_next_status;
+    BlockDriverState *target;
+    bool has_zero_init;
+    bool compressed;
+    bool out_backing;
+    int min_sparse;
+    size_t cluster_sectors;
+    size_t buf_sectors;
+} ImgConvertState;
+
+static void convert_select_part(ImgConvertState *s, int64_t sector_num)
+{
+    while (sector_num - s->src_cur_offset >= s->src_sectors[s->src_cur]) {
+        s->src_cur_offset += s->src_sectors[s->src_cur];
+        s->src_cur++;
+        assert(s->src_cur < s->src_num);
+    }
+}
+
+static int convert_iteration_sectors(ImgConvertState *s, int64_t sector_num)
+{
+    int64_t ret;
+    int n;
+
+    convert_select_part(s, sector_num);
+
+    assert(s->total_sectors > sector_num);
+    n = MIN(s->total_sectors - sector_num, INT_MAX);
+
+    if (s->sector_next_status <= sector_num) {
+        ret = bdrv_get_block_status(s->src[s->src_cur],
+                                    sector_num - s->src_cur_offset,
+                                    n, &n);
+        if (ret < 0) {
+            return ret;
+        }
+
+        if (ret & BDRV_BLOCK_ZERO) {
+            s->status = BLK_ZERO;
+        } else if (ret & BDRV_BLOCK_DATA) {
+            s->status = BLK_DATA;
+        } else if (!s->out_backing) {
+            /* Without a target backing file we must copy over the contents of
+             * the backing file as well. */
+            /* TODO Check block status of the backing file chain to avoid
+             * needlessly reading zeroes and limiting the iteration to the
+             * buffer size */
+            s->status = BLK_DATA;
+        } else {
+            s->status = BLK_BACKING_FILE;
+        }
+
+        s->sector_next_status = sector_num + n;
+    }
+
+    n = MIN(n, s->sector_next_status - sector_num);
+    if (s->status == BLK_DATA) {
+        n = MIN(n, s->buf_sectors);
+    }
+
+    /* We need to write complete clusters for compressed images, so if an
+     * unallocated area is shorter than that, we must consider the whole
+     * cluster allocated. */
+    if (s->compressed) {
+        if (n < s->cluster_sectors) {
+            n = MIN(s->cluster_sectors, s->total_sectors - sector_num);
+            s->status = BLK_DATA;
+        } else {
+            n = QEMU_ALIGN_DOWN(n, s->cluster_sectors);
+        }
+    }
+
+    return n;
+}
+
+static int convert_read(ImgConvertState *s, int64_t sector_num, int nb_sectors,
+                        uint8_t *buf)
+{
+    int n;
+    int ret;
+
+    if (s->status == BLK_ZERO || s->status == BLK_BACKING_FILE) {
+        return 0;
+    }
+
+    assert(nb_sectors <= s->buf_sectors);
+    while (nb_sectors > 0) {
+        BlockDriverState *bs;
+        int64_t bs_sectors;
+
+        /* In the case of compression with multiple source files, we can get a
+         * nb_sectors that spreads into the next part. So we must be able to
+         * read across multiple BDSes for one convert_read() call. */
+        convert_select_part(s, sector_num);
+        bs = s->src[s->src_cur];
+        bs_sectors = s->src_sectors[s->src_cur];
+
+        n = MIN(nb_sectors, bs_sectors - (sector_num - s->src_cur_offset));
+        ret = bdrv_read(bs, sector_num, buf, n);
+        if (ret < 0) {
+            return ret;
+        }
+
+        sector_num += n;
+        nb_sectors -= n;
+        buf += n * BDRV_SECTOR_SIZE;
+    }
+
+    return 0;
+}
+
+static int convert_write(ImgConvertState *s, int64_t sector_num, int nb_sectors,
+                         const uint8_t *buf)
+{
+    int ret;
+
+    while (nb_sectors > 0) {
+
+        int n = nb_sectors;
+
+        switch (s->status) {
+        case BLK_BACKING_FILE:
+            /* If we have a backing file, leave clusters unallocated that are
+             * unallocated in the source image, so that the backing file is
+             * visible at the respective offset. */
+            assert(s->out_backing);
+            break;
+
+        case BLK_DATA:
+            /* We must always write compressed clusters as a whole, so don't
+             * try to find zeroed parts in the buffer. We can only save the
+             * write if the buffer is completely zeroed. */
+            if (s->compressed) {
+                if (s->has_zero_init &&
+                    buffer_is_zero(buf, n * BDRV_SECTOR_SIZE))
+                {
+                    assert(!s->out_backing);
+                    break;
+                }
+
+                ret = bdrv_write_compressed(s->target, sector_num, buf, n);
+                if (ret < 0) {
+                    return ret;
+                }
+                break;
+            }
+
+            /* If there is real non-zero data, we must write it. Otherwise we
+             * can treat it as zero sectors. */
+            if (is_allocated_sectors_min(buf, n, &n, s->min_sparse)) {
+                ret = bdrv_write(s->target, sector_num, buf, n);
+                if (ret < 0) {
+                    return ret;
+                }
+                break;
+            }
+            /* fall-through */
+
+        case BLK_ZERO:
+            if (s->has_zero_init) {
+                break;
+            }
+            /* TODO Should this use BDRV_REQ_MAY_UNMAP? */
+            ret = bdrv_write_zeroes(s->target, sector_num, n, 0);
+            if (ret < 0) {
+                return ret;
+            }
+            break;
+        }
+
+        sector_num += n;
+        nb_sectors -= n;
+        buf += n * BDRV_SECTOR_SIZE;
+    }
+
+    return 0;
+}
+
+static int convert_do_copy(ImgConvertState *s)
+{
+    uint8_t *buf = NULL;
+    int64_t sector_num, allocated_done;
+    int ret;
+    int n;
+
+    /* Check whether we have zero initialisation or can get it efficiently */
+    s->has_zero_init = s->min_sparse && !s->out_backing
+                     ? bdrv_has_zero_init(s->target)
+                     : false;
+
+    if (!s->has_zero_init && !s->out_backing &&
+        bdrv_can_write_zeroes_with_unmap(s->target))
+    {
+        ret = bdrv_make_zero(s->target, BDRV_REQ_MAY_UNMAP);
+        if (ret < 0) {
+            goto fail;
+        }
+        s->has_zero_init = 1;
+    }
+
+    /* Allocate buffer for copied data. For compressed images, only one cluster
+     * can be copied at a time. */
+    if (s->compressed) {
+        if (s->cluster_sectors <= 0 || s->cluster_sectors > s->buf_sectors) {
+            error_report("invalid cluster size");
+            ret = -EINVAL;
+            goto fail;
+        }
+        s->buf_sectors = s->cluster_sectors;
+    }
+    buf = qemu_blockalign(s->target, s->buf_sectors * BDRV_SECTOR_SIZE);
+
+    /* Calculate allocated sectors for progress */
+    s->allocated_sectors = 0;
+    sector_num = 0;
+    while (sector_num < s->total_sectors) {
+        n = convert_iteration_sectors(s, sector_num);
+        if (n < 0) {
+            ret = n;
+            goto fail;
+        }
+        if (s->status == BLK_DATA) {
+            s->allocated_sectors += n;
+        }
+        sector_num += n;
+    }
+
+    /* Do the copy */
+    s->src_cur = 0;
+    s->src_cur_offset = 0;
+    s->sector_next_status = 0;
+
+    sector_num = 0;
+    allocated_done = 0;
+
+    while (sector_num < s->total_sectors) {
+        n = convert_iteration_sectors(s, sector_num);
+        if (n < 0) {
+            ret = n;
+            goto fail;
+        }
+        if (s->status == BLK_DATA) {
+            allocated_done += n;
+            qemu_progress_print(100.0 * allocated_done / s->allocated_sectors,
+                                0);
+        }
+
+        ret = convert_read(s, sector_num, n, buf);
+        if (ret < 0) {
+            error_report("error while reading sector %" PRId64
+                         ": %s", sector_num, strerror(-ret));
+            goto fail;
+        }
+
+        ret = convert_write(s, sector_num, n, buf);
+        if (ret < 0) {
+            error_report("error while writing sector %" PRId64
+                         ": %s", sector_num, strerror(-ret));
+            goto fail;
+        }
+
+        sector_num += n;
+    }
+
+    if (s->compressed) {
+        /* signal EOF to align */
+        bdrv_write_compressed(s->target, 0, NULL, 0);
+    }
+
+    ret = 0;
+fail:
+    qemu_vfree(buf);
+    return ret;
+}
+
 static int img_convert(int argc, char **argv)
 {
-    int c, n, n1, bs_n, bs_i, compress, cluster_sectors, skip_create;
+    int c, bs_n, bs_i, compress, cluster_sectors, skip_create;
     int64_t ret = 0;
     int progress = 0, flags, src_flags;
     const char *fmt, *out_fmt, *cache, *src_cache, *out_baseimg, *out_filename;
     BlockDriver *drv, *proto_drv;
     BlockBackend **blk = NULL, *out_blk = NULL;
     BlockDriverState **bs = NULL, *out_bs = NULL;
-    int64_t total_sectors, nb_sectors, sector_num, bs_offset;
+    int64_t total_sectors;
     int64_t *bs_sectors = NULL;
-    uint8_t * buf = NULL;
     size_t bufsectors = IO_BUF_SIZE / BDRV_SECTOR_SIZE;
-    const uint8_t *buf1;
     BlockDriverInfo bdi;
     QemuOpts *opts = NULL;
     QemuOptsList *create_opts = NULL;
@@ -1330,6 +1617,7 @@  static int img_convert(int argc, char **argv)
     bool quiet = false;
     Error *local_err = NULL;
     QemuOpts *sn_opts = NULL;
+    ImgConvertState state;
 
     fmt = NULL;
     out_fmt = "raw";
@@ -1622,9 +1910,6 @@  static int img_convert(int argc, char **argv)
     }
     out_bs = blk_bs(out_blk);
 
-    bs_i = 0;
-    bs_offset = 0;
-
     /* increase bufsectors from the default 4096 (2M) if opt_transfer_length
      * or discard_alignment of the out_bs is greater. Limit to 32768 (16MB)
      * as maximum. */
@@ -1633,8 +1918,6 @@  static int img_convert(int argc, char **argv)
                                          out_bs->bl.discard_alignment))
                     );
 
-    buf = qemu_blockalign(out_bs, bufsectors * BDRV_SECTOR_SIZE);
-
     if (skip_create) {
         int64_t output_sectors = bdrv_nb_sectors(out_bs);
         if (output_sectors < 0) {
@@ -1661,203 +1944,20 @@  static int img_convert(int argc, char **argv)
         cluster_sectors = bdi.cluster_size / BDRV_SECTOR_SIZE;
     }
 
-    if (compress) {
-        if (cluster_sectors <= 0 || cluster_sectors > bufsectors) {
-            error_report("invalid cluster size");
-            ret = -1;
-            goto out;
-        }
-        sector_num = 0;
-
-        nb_sectors = total_sectors;
-
-        for(;;) {
-            int64_t bs_num;
-            int remainder;
-            uint8_t *buf2;
-
-            nb_sectors = total_sectors - sector_num;
-            if (nb_sectors <= 0)
-                break;
-            if (nb_sectors >= cluster_sectors)
-                n = cluster_sectors;
-            else
-                n = nb_sectors;
-
-            bs_num = sector_num - bs_offset;
-            assert (bs_num >= 0);
-            remainder = n;
-            buf2 = buf;
-            while (remainder > 0) {
-                int nlow;
-                while (bs_num == bs_sectors[bs_i]) {
-                    bs_offset += bs_sectors[bs_i];
-                    bs_i++;
-                    assert (bs_i < bs_n);
-                    bs_num = 0;
-                    /* printf("changing part: sector_num=%" PRId64 ", "
-                       "bs_i=%d, bs_offset=%" PRId64 ", bs_sectors=%" PRId64
-                       "\n", sector_num, bs_i, bs_offset, bs_sectors[bs_i]); */
-                }
-                assert (bs_num < bs_sectors[bs_i]);
-
-                nlow = remainder > bs_sectors[bs_i] - bs_num
-                    ? bs_sectors[bs_i] - bs_num : remainder;
-
-                ret = bdrv_read(bs[bs_i], bs_num, buf2, nlow);
-                if (ret < 0) {
-                    error_report("error while reading sector %" PRId64 ": %s",
-                                 bs_num, strerror(-ret));
-                    goto out;
-                }
-
-                buf2 += nlow * 512;
-                bs_num += nlow;
-
-                remainder -= nlow;
-            }
-            assert (remainder == 0);
-
-            if (!buffer_is_zero(buf, n * BDRV_SECTOR_SIZE)) {
-                ret = bdrv_write_compressed(out_bs, sector_num, buf, n);
-                if (ret != 0) {
-                    error_report("error while compressing sector %" PRId64
-                                 ": %s", sector_num, strerror(-ret));
-                    goto out;
-                }
-            }
-            sector_num += n;
-            qemu_progress_print(100.0 * sector_num / total_sectors, 0);
-        }
-        /* signal EOF to align */
-        bdrv_write_compressed(out_bs, 0, NULL, 0);
-    } else {
-        int64_t sectors_to_read, sectors_read, sector_num_next_status;
-        bool count_allocated_sectors;
-        int has_zero_init = min_sparse ? bdrv_has_zero_init(out_bs) : 0;
-
-        if (!has_zero_init && bdrv_can_write_zeroes_with_unmap(out_bs)) {
-            ret = bdrv_make_zero(out_bs, BDRV_REQ_MAY_UNMAP);
-            if (ret < 0) {
-                goto out;
-            }
-            has_zero_init = 1;
-        }
-
-        sectors_to_read = total_sectors;
-        count_allocated_sectors = progress && (out_baseimg || has_zero_init);
-restart:
-        sector_num = 0; // total number of sectors converted so far
-        sectors_read = 0;
-        sector_num_next_status = 0;
-
-        for(;;) {
-            nb_sectors = total_sectors - sector_num;
-            if (nb_sectors <= 0) {
-                if (count_allocated_sectors) {
-                    sectors_to_read = sectors_read;
-                    count_allocated_sectors = false;
-                    goto restart;
-                }
-                ret = 0;
-                break;
-            }
-
-            while (sector_num - bs_offset >= bs_sectors[bs_i]) {
-                bs_offset += bs_sectors[bs_i];
-                bs_i ++;
-                assert (bs_i < bs_n);
-                /* printf("changing part: sector_num=%" PRId64 ", bs_i=%d, "
-                  "bs_offset=%" PRId64 ", bs_sectors=%" PRId64 "\n",
-                   sector_num, bs_i, bs_offset, bs_sectors[bs_i]); */
-            }
-
-            if ((out_baseimg || has_zero_init) &&
-                sector_num >= sector_num_next_status) {
-                n = nb_sectors > INT_MAX ? INT_MAX : nb_sectors;
-                ret = bdrv_get_block_status(bs[bs_i], sector_num - bs_offset,
-                                            n, &n1);
-                if (ret < 0) {
-                    error_report("error while reading block status of sector %"
-                                 PRId64 ": %s", sector_num - bs_offset,
-                                 strerror(-ret));
-                    goto out;
-                }
-                /* If the output image is zero initialized, we are not working
-                 * on a shared base and the input is zero we can skip the next
-                 * n1 sectors */
-                if (has_zero_init && !out_baseimg && (ret & BDRV_BLOCK_ZERO)) {
-                    sector_num += n1;
-                    continue;
-                }
-                /* If the output image is being created as a copy on write
-                 * image, assume that sectors which are unallocated in the
-                 * input image are present in both the output's and input's
-                 * base images (no need to copy them). */
-                if (out_baseimg) {
-                    if (!(ret & BDRV_BLOCK_DATA)) {
-                        sector_num += n1;
-                        continue;
-                    }
-                    /* The next 'n1' sectors are allocated in the input image.
-                     * Copy only those as they may be followed by unallocated
-                     * sectors. */
-                    nb_sectors = n1;
-                }
-                /* avoid redundant callouts to get_block_status */
-                sector_num_next_status = sector_num + n1;
-            }
-
-            n = MIN(nb_sectors, bufsectors);
-
-            /* round down request length to an aligned sector, but
-             * do not bother doing this on short requests. They happen
-             * when we found an all-zero area, and the next sector to
-             * write will not be sector_num + n. */
-            if (cluster_sectors > 0 && n >= cluster_sectors) {
-                int64_t next_aligned_sector = (sector_num + n);
-                next_aligned_sector -= next_aligned_sector % cluster_sectors;
-                if (sector_num + n > next_aligned_sector) {
-                    n = next_aligned_sector - sector_num;
-                }
-            }
-
-            n = MIN(n, bs_sectors[bs_i] - (sector_num - bs_offset));
-
-            sectors_read += n;
-            if (count_allocated_sectors) {
-                sector_num += n;
-                continue;
-            }
+    state = (ImgConvertState) {
+        .src                = bs,
+        .src_sectors        = bs_sectors,
+        .src_num            = bs_n,
+        .total_sectors      = total_sectors,
+        .target             = out_bs,
+        .compressed         = compress,
+        .out_backing        = (bool) out_baseimg,
+        .min_sparse         = min_sparse,
+        .cluster_sectors    = cluster_sectors,
+        .buf_sectors        = bufsectors,
+    };
+    ret = convert_do_copy(&state);
 
-            n1 = n;
-            ret = bdrv_read(bs[bs_i], sector_num - bs_offset, buf, n);
-            if (ret < 0) {
-                error_report("error while reading sector %" PRId64 ": %s",
-                             sector_num - bs_offset, strerror(-ret));
-                goto out;
-            }
-            /* NOTE: at the same time we convert, we do not write zero
-               sectors to have a chance to compress the image. Ideally, we
-               should add a specific call to have the info to go faster */
-            buf1 = buf;
-            while (n > 0) {
-                if (!has_zero_init ||
-                    is_allocated_sectors_min(buf1, n, &n1, min_sparse)) {
-                    ret = bdrv_write(out_bs, sector_num, buf1, n1);
-                    if (ret < 0) {
-                        error_report("error while writing sector %" PRId64
-                                     ": %s", sector_num, strerror(-ret));
-                        goto out;
-                    }
-                }
-                sector_num += n1;
-                n -= n1;
-                buf1 += n1 * 512;
-            }
-            qemu_progress_print(100.0 * sectors_read / sectors_to_read, 0);
-        }
-    }
 out:
     if (!ret) {
         qemu_progress_print(100, 0);
@@ -1865,7 +1965,6 @@  out:
     qemu_progress_end();
     qemu_opts_del(opts);
     qemu_opts_free(create_opts);
-    qemu_vfree(buf);
     qemu_opts_del(sn_opts);
     blk_unref(out_blk);
     g_free(bs);