[RFC] qed: add support for Copy-on-Read

Message ID 1301533714-28997-1-git-send-email-aliguori@us.ibm.com
State New

Commit Message

Anthony Liguori March 31, 2011, 1:08 a.m. UTC
When creating an image using qemu-img, just pass '-o copy_on_read' and then
whenever QED reads from a backing file, it will write the block to the QED
file after the read completes, ensuring that you only fetch from the backing
device once.

This is very useful for streaming images over a slow connection.

This isn't ready for merge yet as it's not playing nice with synchronous I/O.

I think it's fairly easy to do the same thing in qcow2 by just adding some
logic after bdrv_aio_write() to call back into qcow2 with a synchronous I/O
write in the backing file case.  Thoughts on whether that would actually
work?

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
---
 block/qed.c |   53 +++++++++++++++++++++++++++++++++++++++++++++++++----
 block/qed.h |    4 +++-
 2 files changed, 52 insertions(+), 5 deletions(-)

Comments

Stefan Hajnoczi April 1, 2011, 9:42 a.m. UTC | #1
On Wed, Mar 30, 2011 at 08:08:34PM -0500, Anthony Liguori wrote:
> When creating an image using qemu-img, just pass '-o copy_on_read' and then
> whenever QED reads from a backing file, it will write the block to the QED
> file after the read completes, ensuring that you only fetch from the backing
> device once.
> 
> This is very useful for streaming images over a slow connection.
> 
> This isn't ready for merge yet as it's not playing nice with synchronous I/O.

What is the issue here?  Streaming had issues with aio contexts and
synchronous I/O emulation, but copy-on-read by itself looks safe to me.

> I think it's fairly easy to do the same thing in qcow2 by just adding some
> logic after bdrv_aio_write() to call back into qcow2 with a synchronous I/O
> write in the backing file case.  Thoughts on whether that would actually
> work?

Why not do the follow-up .bdrv_aio_writev() for qcow2 too?  I don't see
a reason to do it synchronously.

Stefan
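
For concreteness, here is a hedged sketch of the asynchronous follow-up
write Stefan suggests, modeled on qed_copy_on_read_cb() in the patch below.
CORRequest, cor_write_done() and the callback name are hypothetical;
bdrv_aio_writev() is the same call the patch itself uses.

/* Hedged sketch, not from the patch: the generic shape of an asynchronous
 * copy-on-read completion for a format driver such as qcow2. */
typedef struct CORRequest {
    BlockDriverState *bs;   /* image to copy the freshly read data into */
    int64_t sector_num;     /* first sector of the backing-file read */
    int nb_sectors;         /* length of the backing-file read in sectors */
    QEMUIOVector qiov;      /* buffer now holding the backing-file data */
} CORRequest;

static void cor_write_done(void *opaque, int ret);  /* resumes guest request */

static void cor_backing_read_cb(void *opaque, int ret)
{
    CORRequest *req = opaque;

    if (ret < 0) {
        cor_write_done(req, ret);   /* backing read failed, nothing to copy */
        return;
    }

    /* Re-issue the data just read from the backing file as an AIO write so
     * it lands in the image; the guest request continues from
     * cor_write_done(). */
    if (!bdrv_aio_writev(req->bs, req->sector_num, &req->qiov,
                         req->nb_sectors, cor_write_done, req)) {
        cor_write_done(req, -EIO);
    }
}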
Kevin Wolf April 1, 2011, 11:11 a.m. UTC | #2
Am 31.03.2011 03:08, schrieb Anthony Liguori:
> When creating an image using qemu-img, just pass '-o copy_on_read' and then
> whenever QED reads from a backing file, it will write the block to the QED
> file after the read completes, ensuring that you only fetch from the backing
> device once.

As you probably know, I don't agree with the interface. Copy on read
should be first and foremost a runtime option. It's okay to fetch the
default for this option from the image file, but it's not the right
primary interface.

> This is very useful for streaming images over a slow connection.
> 
> This isn't ready for merge yet as it's not playing nice with synchronous I/O.

Like Stefan I'm not sure what you mean here.

One problem that I see is that if you have a concurrent write request
from the guest, the COR write request may overwrite the guest's request,
which is obviously wrong. Is this what you mean?

> I think it's fairly easy to do the same thing in qcow2 by just adding some
> logic after bdrv_aio_write() to call back into qcow2 with a synchronous I/O
> write in the backing file case.  Thoughts on whether that would actually
> work?

I can't see a reason why it wouldn't work, except that it has the same
problem that I described above. A synchronous write isn't going to solve
this; you'd additionally need a qemu_aio_flush() if you want to avoid
proper locking. But that's really the same for QED and qcow2.

Kevin
Anthony Liguori April 1, 2011, 12:28 p.m. UTC | #3
On 04/01/2011 04:42 AM, Stefan Hajnoczi wrote:
> On Wed, Mar 30, 2011 at 08:08:34PM -0500, Anthony Liguori wrote:
>> When creating an image using qemu-img, just pass '-o copy_on_read' and then
>> whenever QED reads from a backing file, it will write the block to the QED
>> file after the read completes, ensuring that you only fetch from the backing
>> device once.
>>
>> This is very useful for streaming images over a slow connection.
>>
>> This isn't ready for merge yet as it's not playing nice with synchronous I/O.
> What is the issue here?  Streaming had issues with aio contexts and
> synchronous I/O emulation but copy-on-read by itself looks safe to me.

Here's the scenario that fails for me, although I'm starting to suspect
block/curl as the real culprit.

qemu-img create -f qed -o copy_on_read \
    -b http://linux.nssl.noaa.gov/fedora/fedora/linux/releases/14/Fedora/x86_64/iso/Fedora-14-x86_64-DVD.iso \
    cached_iso.img

qemu -cdrom cached_iso.img -boot d

And I have a patch that does a bunch of synchronous reads of the disk.

>> I think it's fairly easy to do the same thing in qcow2 by just adding some
>> logic after bdrv_aio_write() to call back into qcow2 with a synchronous I/O
>> write in the backing file case.  Thoughts on whether that would actually
>> work?
> Why not do the follow-up .bdrv_aio_writev() for qcow2 too?  I don't see
> a reason to do it synchronously.

Oh, I assumed that with coroutines, synchronous code would be the
preference in qcow2.  If not, AIO is just as good for me :-)

Regards,

Anthony Liguori

> Stefan
Anthony Liguori April 1, 2011, 12:36 p.m. UTC | #4
On 04/01/2011 06:11 AM, Kevin Wolf wrote:
> Am 31.03.2011 03:08, schrieb Anthony Liguori:
>> When creating an image using qemu-img, just pass '-o copy_on_read' and then
>> whenever QED reads from a backing file, it will write the block to the QED
>> file after the read completes, ensuring that you only fetch from the backing
>> device once.
> As you probably know, I don't agree with the interface. Copy on read
> should be first and foremost a runtime option. It's okay to fetch the
> default for this option from the image file, but it's not the right
> primary interface.

That's the main reason I posted this.  I wanted to revisit that 
discussion and see if we're any closer to having a primary interface for
this.

I think blockdev is still a ways off.  Would a -drive 
file=image.img,cor=on make sense as an intermediate mechanism?
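
Purely as a hypothetical sketch of that intermediate mechanism (the "cor"
option name and BDRV_O_COPY_ON_READ are illustrative, not part of this
patch; qemu_opt_get_bool() is the stock QemuOpts accessor), the wiring in
drive_init() might look like:

/* Hypothetical only: wiring for -drive file=image.img,cor=on. */
if (qemu_opt_get_bool(opts, "cor", 0)) {
    bdrv_flags |= BDRV_O_COPY_ON_READ;   /* illustrative flag name */
}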

>> This is very useful for streaming images over a slow connection.
>>
>> This isn't ready for merge yet as it's not playing nice with synchronous I/O.
> Like Stefan I'm not sure what you mean here.
>
> One problem that I see is that if you have a concurrent write request
> from the guest, the COR write request may overwrite the guest's request,
> which is obviously wrong. Is this what you mean?

Yes, I think you're right here, but I don't think this is the issue.
Surely the semantics of a simultaneous read/write are undefined, at least
on the read side.  I guess having the write be undefined is unexpected.

Regards,

Anthony Liguori

> Kevin
Kevin Wolf April 1, 2011, 12:44 p.m. UTC | #5
Am 01.04.2011 14:36, schrieb Anthony Liguori:
> On 04/01/2011 06:11 AM, Kevin Wolf wrote:
>> Am 31.03.2011 03:08, schrieb Anthony Liguori:
>>> When creating an image using qemu-img, just pass '-o copy_on_read' and then
>>> whenever QED reads from a backing file, it will write the block to the QED
>>> file after the read completes, ensuring that you only fetch from the backing
>>> device once.
>> As you probably know, I don't agree with the interface. Copy on read
>> should be first and foremost a runtime option. It's okay to fetch the
>> default for this option from the image file, but it's not the right
>> primary interface.
> 
> That's the main reason I posted this.  I wanted to revisit that 
> discussion and see if we're any close to having a primary interface for 
> this.
> 
> I think blockdev is still a ways off.  Would a -drive 
> file=image.img,cor=on make sense as an intermediate mechanism?

I don't like it much (and I think neither do you), but I agree that
blockdev won't be there tomorrow, so I wouldn't object to it.

>>> This is very useful for streaming images over a slow connection.
>>>
>>> This isn't ready for merge yet as it's not playing nice with synchronous I/O.
>> Like Stefan I'm not sure what you mean here.
>>
>> One problem that I see is that if you have a concurrent write request
>> from the guest, the COR write request may overwrite the guest's request,
>> which is obviously wrong. Is this what you mean?
> 
> Yes, I think you're right here, but I don't think this is the issue.
> Surely the semantics of a simultaneous read/write are undefined, at least
> on the read side.  I guess having the write be undefined is unexpected.

Yes and no. If you have the read and the write to the same sector, then it
would be undefined whether the read returns the old or the new data. But
if the write completes successfully, the sector must have been updated.
I doubt that this is something that guests usually do.

However, we're not operating on a sector level, but on a cluster level
here. So the guest could actually access two different sectors and still
get its write request overwritten by the COR.

Kevin
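
To make the cluster-granularity point concrete, here is a standalone
illustration (not QEMU code; all names are mine): two requests can touch
disjoint sectors and still collide, because the COR write always covers a
whole cluster.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CLUSTER_SIZE 65536ULL      /* QED's default cluster size */
#define SECTOR_SIZE  512ULL

/* True if two byte ranges touch at least one common cluster. */
static bool clusters_overlap(uint64_t off_a, uint64_t len_a,
                             uint64_t off_b, uint64_t len_b)
{
    uint64_t first_a = off_a / CLUSTER_SIZE;
    uint64_t last_a  = (off_a + len_a - 1) / CLUSTER_SIZE;
    uint64_t first_b = off_b / CLUSTER_SIZE;
    uint64_t last_b  = (off_b + len_b - 1) / CLUSTER_SIZE;

    return first_a <= last_b && first_b <= last_a;
}

int main(void)
{
    /* Guest writes sector 0; the COR request covers sector 8.  Different
     * sectors, same cluster: the full-cluster COR write can clobber the
     * guest's already-completed write. */
    printf("conflict: %s\n",
           clusters_overlap(0 * SECTOR_SIZE, SECTOR_SIZE,
                            8 * SECTOR_SIZE, SECTOR_SIZE) ? "yes" : "no");
    return 0;
}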

Patch

diff --git a/block/qed.c b/block/qed.c
index 75ae244..292833b 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -448,7 +448,8 @@  static int bdrv_qed_flush(BlockDriverState *bs)
 
 static int qed_create(const char *filename, uint32_t cluster_size,
                       uint64_t image_size, uint32_t table_size,
-                      const char *backing_file, const char *backing_fmt)
+                      const char *backing_file, const char *backing_fmt,
+                      bool copy_on_read)
 {
     QEDHeader header = {
         .magic = QED_MAGIC,
@@ -490,6 +491,9 @@  static int qed_create(const char *filename, uint32_t cluster_size,
         if (qed_fmt_is_raw(backing_fmt)) {
             header.features |= QED_F_BACKING_FORMAT_NO_PROBE;
         }
+        if (copy_on_read) {
+            header.compat_features |= QED_CF_COPY_ON_READ;
+        }
     }
 
     qed_header_cpu_to_le(&header, &le_header);
@@ -523,6 +527,7 @@  static int bdrv_qed_create(const char *filename, QEMUOptionParameter *options)
     uint32_t table_size = QED_DEFAULT_TABLE_SIZE;
     const char *backing_file = NULL;
     const char *backing_fmt = NULL;
+    bool copy_on_read = false;
 
     while (options && options->name) {
         if (!strcmp(options->name, BLOCK_OPT_SIZE)) {
@@ -539,6 +544,10 @@  static int bdrv_qed_create(const char *filename, QEMUOptionParameter *options)
             if (options->value.n) {
                 table_size = options->value.n;
             }
+        } else if (!strcmp(options->name, "copy_on_read")) {
+            if (options->value.n) {
+                copy_on_read = true;
+            }
         }
         options++;
     }
@@ -559,9 +568,14 @@  static int bdrv_qed_create(const char *filename, QEMUOptionParameter *options)
                 qed_max_image_size(cluster_size, table_size));
         return -EINVAL;
     }
+    if (copy_on_read && !backing_file) {
+        fprintf(stderr,
+                "QED only supports Copy-on-Read with a backing file\n");
+        return -EINVAL;
+    }
 
     return qed_create(filename, cluster_size, image_size, table_size,
-                      backing_file, backing_fmt);
+                      backing_file, backing_fmt, copy_on_read);
 }
 
 typedef struct {
@@ -1085,6 +1099,27 @@  static void qed_aio_write_data(void *opaque, int ret,
 }
 
 /**
+ * Copy on read callback
+ *
+ * Write data from backing file to QED that's been read if CoR is enabled.
+ */
+static void qed_copy_on_read_cb(void *opaque, int ret)
+{
+    QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
+    BlockDriverAIOCB *cor_acb;
+
+    cor_acb = bdrv_aio_writev(s->bs,
+                              acb->cur_pos / BDRV_SECTOR_SIZE,
+                              &acb->cur_qiov,
+                              acb->cur_qiov.size / BDRV_SECTOR_SIZE,
+                              qed_aio_next_io, acb);
+    if (!cor_acb) {
+        qed_aio_complete(acb, -EIO);
+    }
+}
+
+/**
  * Read data cluster
  *
  * @opaque:     Read request
@@ -1102,6 +1137,7 @@  static void qed_aio_read_data(void *opaque, int ret,
     BDRVQEDState *s = acb_to_s(acb);
     BlockDriverState *bs = acb->common.bs;
     BlockDriverAIOCB *file_acb;
+    BlockDriverCompletionFunc *cb;
 
     /* Adjust offset into cluster */
     offset += qed_offset_into_cluster(s, acb->cur_pos);
@@ -1114,10 +1150,15 @@  static void qed_aio_read_data(void *opaque, int ret,
 
     qemu_iovec_copy(&acb->cur_qiov, acb->qiov, acb->qiov_offset, len);
 
+    cb = qed_aio_next_io;
+
     /* Handle backing file and unallocated sparse hole reads */
     if (ret != QED_CLUSTER_FOUND) {
+        if ((s->header.compat_features & QED_CF_COPY_ON_READ)) {
+            cb = qed_copy_on_read_cb;
+        }
         qed_read_backing_file(s, acb->cur_pos, &acb->cur_qiov,
-                              qed_aio_next_io, acb);
+                              cb, acb);
         return;
     }
 
@@ -1125,7 +1166,7 @@  static void qed_aio_read_data(void *opaque, int ret,
     file_acb = bdrv_aio_readv(bs->file, offset / BDRV_SECTOR_SIZE,
                               &acb->cur_qiov,
                               acb->cur_qiov.size / BDRV_SECTOR_SIZE,
-                              qed_aio_next_io, acb);
+                              cb, acb);
     if (!file_acb) {
         ret = -EIO;
         goto err;
@@ -1338,6 +1379,10 @@  static QEMUOptionParameter qed_create_options[] = {
         .name = BLOCK_OPT_TABLE_SIZE,
         .type = OPT_SIZE,
         .help = "L1/L2 table size (in clusters)"
+    }, {
+        .name = "copy_on_read",
+        .type = OPT_FLAG,
+        .help = "Copy blocks from base image on read"
     },
     { /* end of list */ }
 };
diff --git a/block/qed.h b/block/qed.h
index 2925e37..ec958af 100644
--- a/block/qed.h
+++ b/block/qed.h
@@ -53,6 +53,8 @@  enum {
     /* The image needs a consistency check before use */
     QED_F_NEED_CHECK = 0x02,
 
+    QED_CF_COPY_ON_READ = 0x01,
+
     /* The backing file format must not be probed, treat as raw image */
     QED_F_BACKING_FORMAT_NO_PROBE = 0x04,
 
@@ -60,7 +62,7 @@  enum {
     QED_FEATURE_MASK = QED_F_BACKING_FILE | /* supported feature bits */
                        QED_F_NEED_CHECK |
                        QED_F_BACKING_FORMAT_NO_PROBE,
-    QED_COMPAT_FEATURE_MASK = 0,            /* supported compat feature bits */
+    QED_COMPAT_FEATURE_MASK = QED_CF_COPY_ON_READ, /* supported compat feature bits */
     QED_AUTOCLEAR_FEATURE_MASK = 0,         /* supported autoclear feature bits */
 
     /* Data is stored in groups of sectors called clusters.  Cluster size must