
qcow2: Bring synchronous read/write back to life

Message ID: 1255006928-7600-1-git-send-email-kwolf@redhat.com
State: Under Review

Commit Message

Kevin Wolf Oct. 8, 2009, 1:02 p.m. UTC
When the synchronous read and write functions were dropped, they were replaced
by generic emulation functions. Unfortunately, these emulation functions don't
provide the same semantics as the original functions did.

The original bdrv_read would mean that we read some data synchronously and that
we won't be interrupted during this read. The latter assumption is no longer
true with the emulation function which needs to use qemu_aio_poll and therefore
allows the callback of any other concurrent AIO request to be run during the
read. Which in turn means that (meta)data read earlier could have changed and
be invalid now. qcow2 is not prepared to work in this way and it's just scary
how many places there are where other requests could run.

I'm not sure yet where exactly it breaks, but you'll see breakage with virtio
on qcow2 with a backing file. Providing synchronous functions again fixes the
problem for me.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2-cluster.c |    6 ++--
 block/qcow2.c         |   51 +++++++++++++++++++++++++++++++++++++++++++++++-
 block/qcow2.h         |    3 ++
 3 files changed, 55 insertions(+), 5 deletions(-)
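
To make the problem concrete, here is a self-contained toy model of the hazard
described above (hypothetical names, not QEMU's actual block.c code): the
emulated synchronous read submits an AIO request and then spins the event loop
until it completes, and while spinning, the completion callback of an unrelated
request can modify state that the caller read beforehand.

#include <stdio.h>

#define MAX_REQS 8

struct toy_req {
    void (*cb)(void *opaque);
    void *opaque;
};

static struct toy_req *queue[MAX_REQS];
static int head, tail;
static int metadata = 1;                /* shared state, e.g. a cached L2 table entry */

static void toy_submit(struct toy_req *r)
{
    queue[tail++] = r;
}

/* Completes the oldest pending request and runs its callback -- this is the
 * role the AIO poll/wait call plays inside the emulated synchronous read. */
static void toy_aio_wait(void)
{
    if (head < tail) {
        struct toy_req *r = queue[head++];
        r->cb(r->opaque);
    }
}

static void other_request_cb(void *opaque)
{
    (void)opaque;
    metadata = 42;                      /* a concurrent request updates the "cache" */
}

static void my_read_cb(void *opaque)
{
    *(int *)opaque = 1;                 /* our own read finished */
}

/* "Synchronous" read built on top of the AIO machinery. */
static int toy_bdrv_read_em(void)
{
    int snapshot = metadata;            /* metadata the caller read beforehand */
    int done = 0;
    struct toy_req mine = { my_read_cb, &done };

    toy_submit(&mine);
    while (!done) {
        toy_aio_wait();                 /* may run *other* callbacks first */
    }
    /* qcow2 assumes nothing changed during the "synchronous" read: */
    return snapshot == metadata ? 0 : -1;
}

int main(void)
{
    struct toy_req other = { other_request_cb, NULL };

    toy_submit(&other);                 /* another AIO request is already in flight */
    printf("%s\n", toy_bdrv_read_em() == 0 ? "metadata unchanged"
                                           : "metadata changed under our feet");
    return 0;
}

Compiled and run, the toy reports that the metadata changed during the
"synchronous" read, which is exactly the situation qcow2 is not prepared for.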

Comments

Anthony Liguori Oct. 8, 2009, 2:30 p.m. UTC | #1
Kevin Wolf wrote:
> When the synchronous read and write functions were dropped, they were replaced
> by generic emulation functions. Unfortunately, these emulation functions don't
> provide the same semantics as the original functions did.
>
> The original bdrv_read would mean that we read some data synchronously and that
> we won't be interrupted during this read. The latter assumption is no longer
> true with the emulation function which needs to use qemu_aio_poll and therefore
> allows the callback of any other concurrent AIO request to be run during the
> read.

Perhaps you could create a mechanism to freeze the qcow2 image by 
queuing all completions within qcow2 until the image was unfrozen.  This 
would have the same effect as switching to synchronous read/write.

You may also have to queue new read/write requests...

Introducing sync read/write seems like a major step backwards to me.

Regards,

Anthony Liguori
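
A rough sketch of the freeze/queue idea described above (the names
qcow2_freeze()/qcow2_unfreeze() and the surrounding structure are invented for
illustration; this is not an actual QEMU patch): while the image is frozen,
completions are put on a list instead of being delivered, and they are flushed
again when the image is unfrozen.

#include <stdlib.h>

typedef void CompletionFunc(void *opaque, int ret);

struct queued_completion {
    CompletionFunc *cb;
    void *opaque;
    int ret;
    struct queued_completion *next;
};

struct freeze_state {
    int frozen;                          /* completions are held back while set */
    struct queued_completion *pending;
};

/* Called wherever a completion for this image would normally be delivered. */
void deliver_completion(struct freeze_state *s,
                        CompletionFunc *cb, void *opaque, int ret)
{
    if (s->frozen) {
        struct queued_completion *c = malloc(sizeof(*c));  /* error handling omitted */
        c->cb = cb;
        c->opaque = opaque;
        c->ret = ret;
        c->next = s->pending;            /* LIFO here; a real version would keep order */
        s->pending = c;
    } else {
        cb(opaque, ret);
    }
}

void qcow2_freeze(struct freeze_state *s)
{
    s->frozen = 1;
}

void qcow2_unfreeze(struct freeze_state *s)
{
    s->frozen = 0;
    while (s->pending) {                 /* flush everything that was held back */
        struct queued_completion *c = s->pending;
        s->pending = c->next;
        c->cb(c->opaque, c->ret);
        free(c);
    }
}

As noted above, new read/write requests arriving while the image is frozen
would need similar queuing.
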
Kevin Wolf Oct. 8, 2009, 2:47 p.m. UTC | #2
Am 08.10.2009 16:30, schrieb Anthony Liguori:
> Kevin Wolf wrote:
>> When the synchronous read and write functions were dropped, they were replaced
>> by generic emulation functions. Unfortunately, these emulation functions don't
>> provide the same semantics as the original functions did.
>>
>> The original bdrv_read would mean that we read some data synchronously and that
>> we won't be interrupted during this read. The latter assumption is no longer
>> true with the emulation function which needs to use qemu_aio_poll and therefore
>> allows the callback of any other concurrent AIO request to be run during the
>> read.
> 
> Perhaps you could create a mechanism to freeze the qcow2 image by 
> queuing all completions within qcow2 until the image was unfrozen.  This 
> would have the same effect as switching to synchronous read/write.
> 
> You may also have to queue new read/write requests...
> 
> Introducing sync read/write seems like a major step backwards to me.

Right, I was expecting your reaction. ;-) I do even agree that it's not
nice to have the synchronous functions back. But removing them caused a
regression, so the removal should be reverted until it is done right.

I just want to make clear that we're talking about data corruption here.
This is not just something that we can care about when we are bored some
time in the future.

For stable, I think taking this patch is the only reasonable thing to
do. Any other solution would be way too invasive. For master we can
discuss other solutions. However, I really don't feel like hacking
around it with a quick fix and breaking other stuff, so this takes a bit
more time. For the meantime I would prefer committing this to master, too.

By the way, I don't think queuing things in qcow2 is the right thing to
do. It probably wouldn't even help if, say, VMDK implemented AIO and I
used a VMDK image as a backing file for qcow2. The real solution is to
fix the generic bdrv_read/write emulation. Now we just need to find the
best way to do this.

Kevin
Anthony Liguori Oct. 8, 2009, 3:01 p.m. UTC | #3
Kevin Wolf wrote:
> Am 08.10.2009 16:30, schrieb Anthony Liguori:
>   
>> Kevin Wolf wrote:
>>     
>>> When the synchronous read and write functions were dropped, they were replaced
>>> by generic emulation functions. Unfortunately, these emulation functions don't
>>> provide the same semantics as the original functions did.
>>>
>>> The original bdrv_read would mean that we read some data synchronously and that
>>> we won't be interrupted during this read. The latter assumption is no longer
>>> true with the emulation function which needs to use qemu_aio_poll and therefore
>>> allows the callback of any other concurrent AIO request to be run during the
>>> read.
>>>       
>> Perhaps you could create a mechanism to freeze the qcow2 image by 
>> queuing all completions within qcow2 until the image was unfrozen.  This 
>> would have the same effect as switching to synchronous read/write.
>>
>> You may also have to queue new read/write requests...
>>
>> Introducing sync read/write seems like a major step backwards to me.
>>     
>
> Right, I was expecting your reaction. ;-) I do even agree that it's not
> nice to have the synchronous functions back. But removing them caused a
> regression, so the removal should be reverted until it is done right.
>
> I just want to make clear that we're talking about data corruption here.
> This is not just something that we can care about when we are bored some
> time in the future.
>   

Yeah, okay.  Can we do a more direct revert though so that it's clearer 
in the commit log?

> By the way, I don't think queuing things in qcow2 is the right thing to
> do. It probably wouldn't even help if, say, VMDK implemented AIO and I
> used a VMDK image as a backing file for qcow2. The real solution is to
> fix the generic bdrv_read/write emulation. Now we just need to find the
> best way to do this.
>   

Ideally, you could say qemu_aio_wait(aiocb) and it would only wait for 
that particular request.  I'm not convinced, though, that there aren't 
dependency issues, since we can generate synthetic aiocbs.

Regards,

Anthony Liguori
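
A toy sketch of what "wait only for that particular request" could look like
(hypothetical names, not the real qemu_aio_wait()): completions that do not
belong to the awaited request are deferred instead of dispatched. The catch is
exactly the dependency problem mentioned above: if completing the awaited
request itself requires other callbacks to run first, this naive loop never
makes progress.

#include <stdio.h>

#define MAX_REQS 16

struct toy_aiocb {
    int id;
    int done;
};

static struct toy_aiocb *completed[MAX_REQS];   /* completions as they arrive */
static int chead, ctail;
static struct toy_aiocb *deferred[MAX_REQS];    /* completions held back */
static int dtail;

static void completion_arrives(struct toy_aiocb *acb)
{
    completed[ctail++] = acb;
}

static void dispatch(struct toy_aiocb *acb)
{
    acb->done = 1;
    printf("ran completion of request %d\n", acb->id);
}

/* Wait for one specific request, deferring everything else. */
static void toy_aio_wait_for(struct toy_aiocb *acb)
{
    while (!acb->done && chead < ctail) {
        struct toy_aiocb *c = completed[chead++];

        if (c == acb) {
            dispatch(c);                 /* ours: safe to run */
        } else {
            deferred[dtail++] = c;       /* foreign: hold back for now */
        }
    }
    while (dtail > 0) {                  /* release deferred completions afterwards */
        dispatch(deferred[--dtail]);
    }
}

int main(void)
{
    struct toy_aiocb other = { 1, 0 };
    struct toy_aiocb mine  = { 2, 0 };

    completion_arrives(&other);          /* unrelated request finishes first */
    completion_arrives(&mine);
    toy_aio_wait_for(&mine);             /* prints request 2, then request 1 */
    return 0;
}
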
Kevin Wolf Oct. 8, 2009, 3:23 p.m. UTC | #4
Am 08.10.2009 17:01, schrieb Anthony Liguori:
> Kevin Wolf wrote:
>> Am 08.10.2009 16:30, schrieb Anthony Liguori:
>>   
>>> Kevin Wolf wrote:
>>>     
>>>> When the synchronous read and write functions were dropped, they were replaced
>>>> by generic emulation functions. Unfortunately, these emulation functions don't
>>>> provide the same semantics as the original functions did.
>>>>
>>>> The original bdrv_read would mean that we read some data synchronously and that
>>>> we won't be interrupted during this read. The latter assumption is no longer
>>>> true with the emulation function which needs to use qemu_aio_poll and therefore
>>>> allows the callback of any other concurrent AIO request to be run during the
>>>> read.
>>>>       
>>> Perhaps you could create a mechanism to freeze the qcow2 image by 
>>> queuing all completions within qcow2 until the image was unfrozen.  This 
>>> would have the same effect as switching to synchronous read/write.
>>>
>>> You may also have to queue new read/write requests...
>>>
>>> Introducing sync read/write seems like a major step backwards to me.
>>>     
>>
>> Right, I was expecting your reaction. ;-) I do even agree that it's not
>> nice to have the synchronous functions back. But removing them caused a
>> regression, so the removal should be reverted until it is done right.
>>
>> I just want to make clear that we're talking about data corruption here.
>> This is not just something that we can care about when we are bored some
>> time in the future.
>>   
> 
> Yeah, okay.  Can we do a more direct revert though so that it's clearer 
> in the commit log?

Hm, it's not a single commit to revert. The main part is that we need
qcow_write back, which was removed in ade40677. The removal of the
synchronous functions from the BlockDriver must have happened even longer ago.

And then I needed to put one or two changes on top to make it work with
the changes made since it was dropped.

>> By the way, I don't think queuing things in qcow2 is the right thing to
>> do. It probably wouldn't even help if, say, VMDK implemented AIO and I
>> used a VMDK image as a backing file for qcow2. The real solution is to
>> fix the generic bdrv_read/write emulation. Now we just need to find the
>> best way to do this.
> 
> Ideally, you could say qemu_aio_wait(aiocb) and it would only wait for 
> that particular request.  I'm not convinced, though, that there aren't 
> dependency issues, since we can generate synthetic aiocbs.

You can't wait for only this single request; you also need to allow
waiting for all requests that were submitted while processing the
request you're waiting for. I think you need to create something like a
new context for all requests that are newly started.

Kevin
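
One way to picture the "new context" idea (purely hypothetical names, for
illustration only): the emulation pushes a context before waiting, every
request submitted from then on is tagged with it, and the wait loop only
dispatches completions from the current or a deeper context, so completions of
outer requests are held back until the emulation pops the context again.

/* Sketch of per-wait contexts (hypothetical, not QEMU code). */

struct toy_aiocb {
    int context;                         /* context the request was started in */
    void (*cb)(void *opaque);
    void *opaque;
};

static int current_context;              /* 0 = outermost, incremented per nested wait */

void toy_context_push(void)
{
    current_context++;
}

void toy_context_pop(void)
{
    current_context--;
}

void toy_submit(struct toy_aiocb *acb)
{
    acb->context = current_context;      /* tag with the active context */
    /* ... hand the request to the backend here ... */
}

/* Decide whether a completion may run right now: only requests started in the
 * current context, or in a context nested inside it, are dispatched;
 * completions of outer requests stay queued until we pop back to them. */
int toy_may_dispatch(const struct toy_aiocb *acb)
{
    return acb->context >= current_context;
}

A synchronous emulation would then push a context, submit its request, spin
the event loop dispatching only completions for which toy_may_dispatch()
returns true, and pop the context when it is done.
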
Jamie Lokier Oct. 8, 2009, 3:28 p.m. UTC | #5
Kevin Wolf wrote:
> The original bdrv_read would mean that we read some data
> synchronously and that we won't be interrupted during this read. The
> latter assumption is no longer true with the emulation function
> which needs to use qemu_aio_poll and therefore allows the callback
> of any other concurrent AIO request to be run during the read. Which
> in turn means that (meta)data read earlier could have changed and be
> invalid now.

I'm not sure if I understand your description, specifically
"(meta)data read earlier could have changed and be invalid now".

Do you mean:

                                  Async call into qcow2 #2
                                  ------------------------

                                  issues a request with bdrv_read/write

Async call into qcow2 #1
------------------------
reads some metadata from memory (**)
does some calculations
issues a request with bdrv_read/write

                                  the request completes
                                  updates some metadata in memory
                                  async call finished with result

the request completes
updates some metadata in memory
.... ERROR, memory isn't what it was at point (**)

Thanks,
-- Jamie
Kevin Wolf Oct. 8, 2009, 3:47 p.m. UTC | #6
Am 08.10.2009 17:28, schrieb Jamie Lokier:
> Kevin Wolf wrote:
>> The original bdrv_read would mean that we read some data
>> synchronously and that we won't be interrupted during this read. The
>> latter assumption is no longer true with the emulation function
>> which needs to use qemu_aio_poll and therefore allows the callback
>> of any other concurrent AIO request to be run during the read. Which
>> in turn means that (meta)data read earlier could have changed and be
>> invalid now.
> 
> I'm not sure if I understand your description, specifically
> "(meta)data read earlier could have changed and be invalid now".
> 
> Do you mean:

No, it's not exactly what I meant. But you're right, your scenario could
happen, too. If global state is changed in an AIO callback called during
bdrv_read/write (e.g. a new L2 table is loaded into the cache), bad
things are almost guaranteed to happen.

What I was thinking of is:

> 
>                                   Async call into qcow2 #2
>                                   ------------------------
> 
>                                   issues a request with bdrv_read/write
> 
> Async call into qcow2 #1
> ------------------------
> reads some metadata from memory (**)

#1 reads some metadata from disk (from the image file)

> does some calculations
> issues a request with bdrv_read/write
> 
>                                   the request completes
>                                   updates some metadata in memory

#2 updates the metadata in the file

>                                   async call finished with result
> 
> the request completes
> updates some metadata in memory
> .... ERROR, memory isn't what it was at point (**)

#1 still uses the old metadata which it had loaded into memory
(specifically those parts on the stack).

Also, I was thinking of #2 as being a regular AIO write (no bdrv_write
involved), but again your version could work as well. As I said I don't
know yet which of them really happens.

Kevin
Mark McLoughlin Oct. 8, 2009, 6:47 p.m. UTC | #7
On Thu, 2009-10-08 at 10:01 -0500, Anthony Liguori wrote:
> Kevin Wolf wrote:
> > Am 08.10.2009 16:30, schrieb Anthony Liguori:
> >   
> >> Kevin Wolf wrote:
> >>     
> >>> When the synchronous read and write functions were dropped, they were replaced
> >>> by generic emulation functions. Unfortunately, these emulation functions don't
> >>> provide the same semantics as the original functions did.
> >>>
> >>> The original bdrv_read would mean that we read some data synchronously and that
> >>> we won't be interrupted during this read. The latter assumption is no longer
> >>> true with the emulation function which needs to use qemu_aio_poll and therefore
> >>> allows the callback of any other concurrent AIO request to be run during the
> >>> read.
> >>>       
> >> Perhaps you could create a mechanism to freeze the qcow2 image by 
> >> queuing all completions within qcow2 until the image was unfrozen.  This 
> >> would have the same effect as switching to synchronous read/write.
> >>
> >> You may also have to queue new read/write requests...
> >>
> >> Introducing sync read/write seems like a major step backwards to me.
> >>     
> >
> > Right, I was expecting your reaction. ;-) I do even agree that it's not
> > nice to have the synchronous functions back. But removing them caused a
> > regression, so the removal should be reverted until it is done right.
> >
> > I just want to make clear that we're talking about data corruption here.
> > This is not just something that we can care about when we are bored some
> > time in the future.
> >   
> 
> Yeah, okay.  Can we do a more direct revert though so that it's clearer 
> in the commit log?

FWIW, here's the Fedora 12 (qemu-kvm-0.11.0) report on this:

  https://bugzilla.redhat.com/524734

Cheers,
Mark.

Patch

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index e444e53..a7de820 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -306,8 +306,8 @@  void qcow2_encrypt_sectors(BDRVQcowState *s, int64_t sector_num,
 }
 
 
-static int qcow_read(BlockDriverState *bs, int64_t sector_num,
-                     uint8_t *buf, int nb_sectors)
+int qcow2_read(BlockDriverState *bs, int64_t sector_num, uint8_t *buf,
+    int nb_sectors)
 {
     BDRVQcowState *s = bs->opaque;
     int ret, index_in_cluster, n, n1;
@@ -358,7 +358,7 @@  static int copy_sectors(BlockDriverState *bs, uint64_t start_sect,
     n = n_end - n_start;
     if (n <= 0)
         return 0;
-    ret = qcow_read(bs, start_sect + n_start, s->cluster_data, n);
+    ret = qcow2_read(bs, start_sect + n_start, s->cluster_data, n);
     if (ret < 0)
         return ret;
     if (s->crypt_method) {
diff --git a/block/qcow2.c b/block/qcow2.c
index a9e7682..52584ed 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -934,6 +934,51 @@  static int qcow_make_empty(BlockDriverState *bs)
     return 0;
 }
 
+static int qcow2_write(BlockDriverState *bs, int64_t sector_num,
+                     const uint8_t *buf, int nb_sectors)
+{
+    BDRVQcowState *s = bs->opaque;
+    int ret, index_in_cluster, n;
+    uint64_t cluster_offset;
+    int n_end;
+    QCowL2Meta l2meta;
+
+    while (nb_sectors > 0) {
+        memset(&l2meta, 0, sizeof(l2meta));
+
+        index_in_cluster = sector_num & (s->cluster_sectors - 1);
+        n_end = index_in_cluster + nb_sectors;
+        if (s->crypt_method &&
+            n_end > QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors)
+            n_end = QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors;
+        cluster_offset = qcow2_alloc_cluster_offset(bs, sector_num << 9,
+                                              index_in_cluster,
+                                              n_end, &n, &l2meta);
+        if (!cluster_offset)
+            return -1;
+        if (s->crypt_method) {
+            qcow2_encrypt_sectors(s, sector_num, s->cluster_data, buf, n, 1,
+                            &s->aes_encrypt_key);
+            ret = bdrv_pwrite(s->hd, cluster_offset + index_in_cluster * 512,
+                              s->cluster_data, n * 512);
+        } else {
+            ret = bdrv_pwrite(s->hd, cluster_offset + index_in_cluster * 512, buf, n * 512);
+        }
+        if (ret != n * 512 || qcow2_alloc_cluster_link_l2(bs, cluster_offset, &l2meta) < 0) {
+            qcow2_free_any_clusters(bs, cluster_offset, l2meta.nb_clusters);
+            return -1;
+        }
+        nb_sectors -= n;
+        sector_num += n;
+        buf += n * 512;
+        if (l2meta.nb_clusters != 0) {
+            QLIST_REMOVE(&l2meta, next_in_flight);
+        }
+    }
+    s->cluster_cache_offset = -1; /* disable compressed cache */
+    return 0;
+}
+
 /* XXX: put compressed sectors first, then all the cluster aligned
    tables to avoid losing bytes in alignment */
 static int qcow_write_compressed(BlockDriverState *bs, int64_t sector_num,
@@ -1121,8 +1166,10 @@  static BlockDriver bdrv_qcow2 = {
     .bdrv_set_key	= qcow_set_key,
     .bdrv_make_empty	= qcow_make_empty,
 
-    .bdrv_aio_readv	= qcow_aio_readv,
-    .bdrv_aio_writev	= qcow_aio_writev,
+    .bdrv_read          = qcow2_read,
+    .bdrv_write         = qcow2_write,
+    .bdrv_aio_readv     = qcow_aio_readv,
+    .bdrv_aio_writev    = qcow_aio_writev,
     .bdrv_write_compressed = qcow_write_compressed,
 
     .bdrv_snapshot_create   = qcow2_snapshot_create,
diff --git a/block/qcow2.h b/block/qcow2.h
index 26ab5d9..d3f690a 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -202,6 +202,9 @@  uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
 int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, uint64_t cluster_offset,
     QCowL2Meta *m);
 
+int qcow2_read(BlockDriverState *bs, int64_t sector_num, uint8_t *buf,
+    int nb_sectors);
+
 /* qcow2-snapshot.c functions */
 int qcow2_snapshot_create(BlockDriverState *bs, QEMUSnapshotInfo *sn_info);
 int qcow2_snapshot_goto(BlockDriverState *bs, const char *snapshot_id);