Patchwork block: Produce zeros when protocols reading beyond end of file

login
register
mail settings
Submitter Asias He
Date Aug. 5, 2013, 8:11 a.m.
Message ID <1375690310-9051-1-git-send-email-asias@redhat.com>
Download mbox | patch
Permalink /patch/264594/
State New
Headers show

Comments

Asias He - Aug. 5, 2013, 8:11 a.m.
From: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>

While Asias is debugging an issue creating qcow2 images on top of
non-file protocols.  It boils down to this example using NBD:

$ qemu-io -c 'open -g nbd+unix:///?socket=/tmp/nbd.sock' -c 'read -v 0 512'

Notice the open -g option to set bs->growable.  This means you can
read/write beyond end of file.  Reading beyond end of file is supposed
to produce zeroes.

We rely on this behavior in qcow2_create2() during qcow2 image
creation.  We create a new file and then write the qcow2 header
structure using bdrv_pwrite().  Since QCowHeader is not a multiple of
sector size, block.c first uses bdrv_read() on the empty file to fetch
the first sector (should be all zeroes).

Here is the output from the qemu-io NBD example above:

$ qemu-io -c 'open -g nbd+unix:///?socket=/tmp/nbd.sock' -c 'read -v 0 512'
00000000:  ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab  ................
00000010:  ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab  ................
00000020:  ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab  ................
...

We are not zeroing the buffer!  As a result qcow2 image creation on top
of protocols is not guaranteed to work even when file creation is
supported by the protocol.

Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Asias He <asias@redhat.com>
---
 block.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)
Kevin Wolf - Aug. 5, 2013, 12:41 p.m.
Am 05.08.2013 um 10:11 hat Asias He geschrieben:
> From: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
> 
> While Asias is debugging an issue creating qcow2 images on top of
> non-file protocols.  It boils down to this example using NBD:
> 
> $ qemu-io -c 'open -g nbd+unix:///?socket=/tmp/nbd.sock' -c 'read -v 0 512'
> 
> Notice the open -g option to set bs->growable.  This means you can
> read/write beyond end of file.  Reading beyond end of file is supposed
> to produce zeroes.
> 
> We rely on this behavior in qcow2_create2() during qcow2 image
> creation.  We create a new file and then write the qcow2 header
> structure using bdrv_pwrite().  Since QCowHeader is not a multiple of
> sector size, block.c first uses bdrv_read() on the empty file to fetch
> the first sector (should be all zeroes).
> 
> Here is the output from the qemu-io NBD example above:
> 
> $ qemu-io -c 'open -g nbd+unix:///?socket=/tmp/nbd.sock' -c 'read -v 0 512'
> 00000000:  ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab  ................
> 00000010:  ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab  ................
> 00000020:  ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab  ................
> ...
> 
> We are not zeroing the buffer!  As a result qcow2 image creation on top
> of protocols is not guaranteed to work even when file creation is
> supported by the protocol.
> 
> Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
> Signed-off-by: Asias He <asias@redhat.com>
> ---
>  block.c | 30 +++++++++++++++++++++++++++++-
>  1 file changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/block.c b/block.c
> index 01b66d8..deaf0a0 100644
> --- a/block.c
> +++ b/block.c
> @@ -2544,7 +2544,35 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
>          }
>      }
>  
> -    ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
> +    if (!bs->drv->protocol_name) {

I think !bs->growable is the right check.

Checking for the protocol name is always a hack and most times wrong.

> +        ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
> +    } else {
> +        /* NBD doesn't support reading beyond end of file. */

This is not only for NBD, make it a neutral comment like:

/* Read zeros after EOF of growable BDSes */

> +        int64_t len, total_sectors, max_nb_sectors;
> +
> +        len = bdrv_getlength(bs);
> +        if (len < 0) {
> +            ret = len;
> +            goto out;
> +        }
> +
> +        total_sectors = len >> BDRV_SECTOR_BITS;
> +        max_nb_sectors = MAX(0, total_sectors - sector_num);
> +        if (max_nb_sectors > 0) {
> +            ret = drv->bdrv_co_readv(bs, sector_num,
> +                                     MIN(nb_sectors, max_nb_sectors), qiov);
> +        } else {
> +            ret = 0;
> +        }
> +
> +        /* Reading beyond end of file is supposed to produce zeroes */
> +        if (ret == 0 && total_sectors < sector_num + nb_sectors) {
> +            size_t offset = MAX(0, total_sectors - sector_num);
> +            size_t bytes = (sector_num + nb_sectors - offset) *
> +                            BDRV_SECTOR_SIZE;

uint64_t for both offset and bytes, size_t can be 32 bits.

> +            qemu_iovec_memset(qiov, offset * BDRV_SECTOR_SIZE, 0, bytes);
> +        }
> +    }
>  
>  out:
>      tracked_request_end(&req);

Kevin
Stefan Hajnoczi - Aug. 16, 2013, 8:41 a.m.
On Mon, Aug 5, 2013 at 10:11 AM, Asias He <asias@redhat.com> wrote:
> From: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
>
> While Asias is debugging an issue creating qcow2 images on top of
> non-file protocols.  It boils down to this example using NBD:
>
> $ qemu-io -c 'open -g nbd+unix:///?socket=/tmp/nbd.sock' -c 'read -v 0 512'
>
> Notice the open -g option to set bs->growable.  This means you can
> read/write beyond end of file.  Reading beyond end of file is supposed
> to produce zeroes.
>
> We rely on this behavior in qcow2_create2() during qcow2 image
> creation.  We create a new file and then write the qcow2 header
> structure using bdrv_pwrite().  Since QCowHeader is not a multiple of
> sector size, block.c first uses bdrv_read() on the empty file to fetch
> the first sector (should be all zeroes).
>
> Here is the output from the qemu-io NBD example above:
>
> $ qemu-io -c 'open -g nbd+unix:///?socket=/tmp/nbd.sock' -c 'read -v 0 512'
> 00000000:  ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab  ................
> 00000010:  ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab  ................
> 00000020:  ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab  ................
> ...
>
> We are not zeroing the buffer!  As a result qcow2 image creation on top
> of protocols is not guaranteed to work even when file creation is
> supported by the protocol.
>
> Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
> Signed-off-by: Asias He <asias@redhat.com>
> ---
>  block.c | 30 +++++++++++++++++++++++++++++-
>  1 file changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/block.c b/block.c
> index 01b66d8..deaf0a0 100644
> --- a/block.c
> +++ b/block.c
> @@ -2544,7 +2544,35 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
>          }
>      }
>
> -    ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
> +    if (!bs->drv->protocol_name) {
> +        ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
> +    } else {
> +        /* NBD doesn't support reading beyond end of file. */
> +        int64_t len, total_sectors, max_nb_sectors;
> +
> +        len = bdrv_getlength(bs);
> +        if (len < 0) {
> +            ret = len;
> +            goto out;
> +        }
> +
> +        total_sectors = len >> BDRV_SECTOR_BITS;
> +        max_nb_sectors = MAX(0, total_sectors - sector_num);
> +        if (max_nb_sectors > 0) {
> +            ret = drv->bdrv_co_readv(bs, sector_num,
> +                                     MIN(nb_sectors, max_nb_sectors), qiov);
> +        } else {
> +            ret = 0;
> +        }
> +
> +        /* Reading beyond end of file is supposed to produce zeroes */
> +        if (ret == 0 && total_sectors < sector_num + nb_sectors) {
> +            size_t offset = MAX(0, total_sectors - sector_num);
> +            size_t bytes = (sector_num + nb_sectors - offset) *
> +                            BDRV_SECTOR_SIZE;
> +            qemu_iovec_memset(qiov, offset * BDRV_SECTOR_SIZE, 0, bytes);
> +        }
> +    }

This patch breaks qemu-iotests ./check -qcow2 022.  This happens
because qcow2 temporarily sets ->growable = 1 for vmstate accesses
(which are stored beyond the end of regular image data).

static int qcow2_load_vmstate(BlockDriverState *bs, uint8_t *buf,
                              int64_t pos, int size)
{
    BDRVQcowState *s = bs->opaque;
    int growable = bs->growable;
    int ret;

    BLKDBG_EVENT(bs->file, BLKDBG_VMSTATE_LOAD);
    bs->growable = 1;
    ret = bdrv_pread(bs, qcow2_vm_state_offset(s) + pos, buf, size);
    bs->growable = growable;

    return ret;
}

Please *always* run qemu-iotests before submitting block patches:
http://qemu-project.org/Documentation/QemuIoTests

A simple but ugly way to fix this is for block.c to also have a
->zero_beyond_eof flag which enables the behavior you are adding.
qcow2_load_vmstate() would disable ->zero_beyond_eof temporarily in
addition to enabling ->growable.

It's not easy to call the internal qcow2_co_readv() from
qcow2_load_vmstate() because the vmstate functions are byte
granularity (like bdrv_pread()) while .bdrv_co_readv() is
sector-granularity.

Stefan
Asias He - Aug. 19, 2013, 6:36 a.m.
On Fri, Aug 16, 2013 at 10:41:36AM +0200, Stefan Hajnoczi wrote:
> On Mon, Aug 5, 2013 at 10:11 AM, Asias He <asias@redhat.com> wrote:
> > From: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
> >
> > While Asias is debugging an issue creating qcow2 images on top of
> > non-file protocols.  It boils down to this example using NBD:
> >
> > $ qemu-io -c 'open -g nbd+unix:///?socket=/tmp/nbd.sock' -c 'read -v 0 512'
> >
> > Notice the open -g option to set bs->growable.  This means you can
> > read/write beyond end of file.  Reading beyond end of file is supposed
> > to produce zeroes.
> >
> > We rely on this behavior in qcow2_create2() during qcow2 image
> > creation.  We create a new file and then write the qcow2 header
> > structure using bdrv_pwrite().  Since QCowHeader is not a multiple of
> > sector size, block.c first uses bdrv_read() on the empty file to fetch
> > the first sector (should be all zeroes).
> >
> > Here is the output from the qemu-io NBD example above:
> >
> > $ qemu-io -c 'open -g nbd+unix:///?socket=/tmp/nbd.sock' -c 'read -v 0 512'
> > 00000000:  ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab  ................
> > 00000010:  ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab  ................
> > 00000020:  ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab  ................
> > ...
> >
> > We are not zeroing the buffer!  As a result qcow2 image creation on top
> > of protocols is not guaranteed to work even when file creation is
> > supported by the protocol.
> >
> > Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
> > Signed-off-by: Asias He <asias@redhat.com>
> > ---
> >  block.c | 30 +++++++++++++++++++++++++++++-
> >  1 file changed, 29 insertions(+), 1 deletion(-)
> >
> > diff --git a/block.c b/block.c
> > index 01b66d8..deaf0a0 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -2544,7 +2544,35 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
> >          }
> >      }
> >
> > -    ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
> > +    if (!bs->drv->protocol_name) {
> > +        ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
> > +    } else {
> > +        /* NBD doesn't support reading beyond end of file. */
> > +        int64_t len, total_sectors, max_nb_sectors;
> > +
> > +        len = bdrv_getlength(bs);
> > +        if (len < 0) {
> > +            ret = len;
> > +            goto out;
> > +        }
> > +
> > +        total_sectors = len >> BDRV_SECTOR_BITS;
> > +        max_nb_sectors = MAX(0, total_sectors - sector_num);
> > +        if (max_nb_sectors > 0) {
> > +            ret = drv->bdrv_co_readv(bs, sector_num,
> > +                                     MIN(nb_sectors, max_nb_sectors), qiov);
> > +        } else {
> > +            ret = 0;
> > +        }
> > +
> > +        /* Reading beyond end of file is supposed to produce zeroes */
> > +        if (ret == 0 && total_sectors < sector_num + nb_sectors) {
> > +            size_t offset = MAX(0, total_sectors - sector_num);
> > +            size_t bytes = (sector_num + nb_sectors - offset) *
> > +                            BDRV_SECTOR_SIZE;
> > +            qemu_iovec_memset(qiov, offset * BDRV_SECTOR_SIZE, 0, bytes);
> > +        }
> > +    }
> 
> This patch breaks qemu-iotests ./check -qcow2 022.  This happens
> because qcow2 temporarily sets ->growable = 1 for vmstate accesses
> (which are stored beyond the end of regular image data).

I am a bit confused. This is from the other mail:

"""
> > I think it would break qcow2_load_vmstate(), which is basically a                                                                   
> > bdrv_pread() after the end of the image.                                                                                            
>                                                                                                                                       
> I see, then only protocols have to zeroing the buffer?  In case of                                                                    
> protocols, I think bdrv_getlength() returns the underlying file                                                                       
> length, so qcow2_load_vmstate() would be a bdrv_pread() within the                                                                    
> result of bdrv_getlength().                                                                                                           

Limiting it to protocols solves the problem, I think.
"""

And in v1 of this patch, Kevin wanted bs->growable check instad of the
protocol_name one.

"""
> -    ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);                                                                      
> +    if (!bs->drv->protocol_name) {                                                                                                   

I think !bs->growable is the right check.

Checking for the protocol name is always a hack and most times wrong.
"""

Switching back to the protocol_name check, ./check -qcow2 022 test passes.

> static int qcow2_load_vmstate(BlockDriverState *bs, uint8_t *buf,
>                               int64_t pos, int size)
> {
>     BDRVQcowState *s = bs->opaque;
>     int growable = bs->growable;
>     int ret;
> 
>     BLKDBG_EVENT(bs->file, BLKDBG_VMSTATE_LOAD);
>     bs->growable = 1;
>     ret = bdrv_pread(bs, qcow2_vm_state_offset(s) + pos, buf, size);
>     bs->growable = growable;
> 
>     return ret;
> }
> 
> Please *always* run qemu-iotests before submitting block patches:
> http://qemu-project.org/Documentation/QemuIoTests

OK.

> A simple but ugly way to fix this is for block.c to also have a
> ->zero_beyond_eof flag which enables the behavior you are adding.
> qcow2_load_vmstate() would disable ->zero_beyond_eof temporarily in
> addition to enabling ->growable.

I am wondering why the ->growable logic is introduced in the first
place.  Adding yet another this kind of flag looks realy ugly ;(

> It's not easy to call the internal qcow2_co_readv() from
> qcow2_load_vmstate() because the vmstate functions are byte
> granularity (like bdrv_pread()) while .bdrv_co_readv() is
> sector-granularity.
> 
> Stefan
Stefan Hajnoczi - Aug. 19, 2013, 12:09 p.m.
On Mon, Aug 19, 2013 at 02:36:14PM +0800, Asias He wrote:
> On Fri, Aug 16, 2013 at 10:41:36AM +0200, Stefan Hajnoczi wrote:
> > On Mon, Aug 5, 2013 at 10:11 AM, Asias He <asias@redhat.com> wrote:
> > A simple but ugly way to fix this is for block.c to also have a
> > ->zero_beyond_eof flag which enables the behavior you are adding.
> > qcow2_load_vmstate() would disable ->zero_beyond_eof temporarily in
> > addition to enabling ->growable.
> 
> I am wondering why the ->growable logic is introduced in the first
> place.  Adding yet another this kind of flag looks realy ugly ;(

bs->growable serves two functions:

1. It means you can read/write beyond the end of the file, for example,
   when creating a new image file.  Normally BlockDriverState rejects
   requests beyond the EOF.

2. qcow2 uses it as a hack to implement the vmstate area after the end
   of regular guest data.  This is the ugly part but it always worked up
   until now.

#1 is a legitimate use case.  You don't always know how big the file
will end up so it's much more convenient to grow on demand, instead of
having to bdrv_truncate() all the time.

#2 is hacky but hard to solve without duplicating the bounce buffer and
coroutine wrapping logic in bdrv_pread() (Kevin has suggested calling
bdrv_co_readv() internally instead of bdrv_pread()).

Stefan

Patch

diff --git a/block.c b/block.c
index 01b66d8..deaf0a0 100644
--- a/block.c
+++ b/block.c
@@ -2544,7 +2544,35 @@  static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
         }
     }
 
-    ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
+    if (!bs->drv->protocol_name) {
+        ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
+    } else {
+        /* NBD doesn't support reading beyond end of file. */
+        int64_t len, total_sectors, max_nb_sectors;
+
+        len = bdrv_getlength(bs);
+        if (len < 0) {
+            ret = len;
+            goto out;
+        }
+
+        total_sectors = len >> BDRV_SECTOR_BITS;
+        max_nb_sectors = MAX(0, total_sectors - sector_num);
+        if (max_nb_sectors > 0) {
+            ret = drv->bdrv_co_readv(bs, sector_num,
+                                     MIN(nb_sectors, max_nb_sectors), qiov);
+        } else {
+            ret = 0;
+        }
+
+        /* Reading beyond end of file is supposed to produce zeroes */
+        if (ret == 0 && total_sectors < sector_num + nb_sectors) {
+            size_t offset = MAX(0, total_sectors - sector_num);
+            size_t bytes = (sector_num + nb_sectors - offset) *
+                            BDRV_SECTOR_SIZE;
+            qemu_iovec_memset(qiov, offset * BDRV_SECTOR_SIZE, 0, bytes);
+        }
+    }
 
 out:
     tracked_request_end(&req);