[RFC,v2,20/23] qcow2: Cancel COW when overwritten

Message ID 1360761733-25347-21-git-send-email-kwolf@redhat.com
State New

Commit Message

Kevin Wolf Feb. 13, 2013, 1:22 p.m. UTC
This is the first part of an optimisation to improve the performance of
sequential cluster allocations.

Typically, write requests aren't aligned to cluster boundaries, so
sequential allocation means that every other request has to wait for the
COW of the previous request to complete. We can do better: instead of
waiting for the COW to finish and then overwriting the same area with the
second write request, just cancel the COW.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2-cluster.c |  150 ++++++++++++++++++++++++++++++++++++++++++++----
 block/qcow2.c         |   16 +++++
 block/qcow2.h         |   37 ++++++++++++
 trace-events          |    2 +
 4 files changed, 192 insertions(+), 13 deletions(-)

Comments

Stefan Hajnoczi Feb. 18, 2013, 3:03 p.m. UTC | #1
On Wed, Feb 13, 2013 at 02:22:10PM +0100, Kevin Wolf wrote:
> @@ -689,6 +691,7 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)
>      int i, j = 0, l2_index, ret;
>      uint64_t *old_cluster, *l2_table;
>      uint64_t cluster_offset = m->alloc_offset;
> +    bool has_wr_lock = false;
>  
>      trace_qcow2_cluster_link_l2(qemu_coroutine_self(), m->nb_clusters);
>      assert(m->nb_clusters > 0);
> @@ -707,6 +710,16 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)
>      }
>  
>      /* Update L2 table. */
> +    qemu_co_mutex_unlock(&s->lock);
> +    qemu_co_rwlock_wrlock(&m->l2_writeback_lock);
> +    has_wr_lock = true;

This variable is only used in the err path.  IMO it's simpler to have an
err_l2_writeback_locked label which unlocks l2_writeback_lock.  That way
has_wr_lock is avoided and I need to keep fewer local variables in mind
when reading the code.
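
For illustration, the suggested shape could look like this (a sketch
only, abbreviated from the hunks below; the label name is the one
proposed above):

    /* Update L2 table. */
    qemu_co_mutex_unlock(&s->lock);
    qemu_co_rwlock_wrlock(&m->l2_writeback_lock);
    qemu_co_mutex_lock(&s->lock);

    if (m->no_l2_update) {
        ret = 0;
        goto err_l2_writeback_locked;
    }

    /* ... actual L2 update ... */

    ret = 0;
err_l2_writeback_locked:
    qemu_co_rwlock_unlock(&m->l2_writeback_lock);
err:
    g_free(old_cluster);
    return ret;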
Stefan Hajnoczi Feb. 18, 2013, 3:46 p.m. UTC | #2
On Wed, Feb 13, 2013 at 02:22:10PM +0100, Kevin Wolf wrote:
> @@ -707,6 +710,16 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)
>      }
>  
>      /* Update L2 table. */
> +    qemu_co_mutex_unlock(&s->lock);
> +    qemu_co_rwlock_wrlock(&m->l2_writeback_lock);
> +    has_wr_lock = true;
> +    qemu_co_mutex_lock(&s->lock);
> +
> +    if (m->no_l2_update) {

QcowL2Meta now has a no_l2_update field.  A sign that we're abusing the
QcowL2Meta struct and really need a different abstraction?
Kevin Wolf Feb. 21, 2013, 9:47 a.m. UTC | #3
On Mon, Feb 18, 2013 at 04:46:49PM +0100, Stefan Hajnoczi wrote:
> On Wed, Feb 13, 2013 at 02:22:10PM +0100, Kevin Wolf wrote:
> > @@ -707,6 +710,16 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)
> >      }
> >  
> >      /* Update L2 table. */
> > +    qemu_co_mutex_unlock(&s->lock);
> > +    qemu_co_rwlock_wrlock(&m->l2_writeback_lock);
> > +    has_wr_lock = true;
> > +    qemu_co_mutex_lock(&s->lock);
> > +
> > +    if (m->no_l2_update) {
> 
> QcowL2Meta now has a no_l2_update field.  A sign that we're abusing the
> QcowL2Meta struct and really need a different abstraction?

I think the abstraction is the right one, even though maybe it could use
a rename. Maybe QcowL2Meta -> QcowCOWRequest or something. The comment
in qcow2.h correctly describes what it's meant for.

/**
 * Describes an in-flight (part of a) write request that writes to clusters
 * that are not referenced in their L2 table yet.
 */
typedef struct QCowL2Meta

However, I think m->no_l2_update is actually redundant: The goal is that
only one request that touches a cluster should be responsible for
updating the L2 table, and it makes sense to choose the first one to do
that. We already have this information in m->parent: The first one is
the root of the tree described by these parent links. So if I'm not
mistaken, m->no_l2_update == (m->parent != NULL).

Kevin
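
If the no_l2_update field is dropped in favour of the parent link, a
hypothetical helper (not part of this patch) could make the equivalence
explicit:

static bool l2meta_updates_l2(const QCowL2Meta *m)
{
    /* Only the root of the dependency tree, i.e. the request that
     * actually allocated the cluster, updates the L2 table. */
    return m->parent == NULL;
}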
Stefan Hajnoczi Feb. 21, 2013, 12:34 p.m. UTC | #4
On Thu, Feb 21, 2013 at 10:47:05AM +0100, Kevin Wolf wrote:
> On Mon, Feb 18, 2013 at 04:46:49PM +0100, Stefan Hajnoczi wrote:
> > On Wed, Feb 13, 2013 at 02:22:10PM +0100, Kevin Wolf wrote:
> > > @@ -707,6 +710,16 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)
> > >      }
> > >  
> > >      /* Update L2 table. */
> > > +    qemu_co_mutex_unlock(&s->lock);
> > > +    qemu_co_rwlock_wrlock(&m->l2_writeback_lock);
> > > +    has_wr_lock = true;
> > > +    qemu_co_mutex_lock(&s->lock);
> > > +
> > > +    if (m->no_l2_update) {
> > 
> > QcowL2Meta now has a no_l2_update field.  A sign that we're abusing the
> > QcowL2Meta struct and really need a different abstraction?
> 
> I think the abstraction is the right one, even though maybe it could use
> a rename. Maybe QcowL2Meta -> QcowCOWRequest or something. The comment
> in qcow2.h correctly describes what it's meant for.
> 
> /**
>  * Describes an in-flight (part of a) write request that writes to clusters
>  * that are not referenced in their L2 table yet.
>  */
> typedef struct QCowL2Meta
> 
> However, I think m->no_l2_update is actually redundant: The goal is that
> only one request that touches a cluster should be responsible for
> updating the L2 table, and it makes sense to choose the first one to do
> that. We already have this information in m->parent: The first one is
> the root of the tree described by these parent links. So if I'm not
> mistaken, m->no_l2_update == (m->parent != NULL).

Okay, that makes sense.

Stefan

Patch

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 33e595a..3ef1cff 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -659,6 +659,8 @@  static int perform_cow(BlockDriverState *bs, QCowL2Meta *m, Qcow2COWRegion *r)
     BDRVQcowState *s = bs->opaque;
     int ret;
 
+    r->final = true;
+
     if (r->nb_sectors == 0) {
         return 0;
     }
@@ -689,6 +691,7 @@  int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)
     int i, j = 0, l2_index, ret;
     uint64_t *old_cluster, *l2_table;
     uint64_t cluster_offset = m->alloc_offset;
+    bool has_wr_lock = false;
 
     trace_qcow2_cluster_link_l2(qemu_coroutine_self(), m->nb_clusters);
     assert(m->nb_clusters > 0);
@@ -707,6 +710,16 @@  int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)
     }
 
     /* Update L2 table. */
+    qemu_co_mutex_unlock(&s->lock);
+    qemu_co_rwlock_wrlock(&m->l2_writeback_lock);
+    has_wr_lock = true;
+    qemu_co_mutex_lock(&s->lock);
+
+    if (m->no_l2_update) {
+        ret = 0;
+        goto err;
+    }
+
     if (s->compatible_features & QCOW2_COMPAT_LAZY_REFCOUNTS) {
         qcow2_mark_dirty(bs);
     }
@@ -753,6 +766,9 @@  int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m)
 
     ret = 0;
 err:
+    if (has_wr_lock) {
+        qemu_co_rwlock_unlock(&m->l2_writeback_lock);
+    }
     g_free(old_cluster);
     return ret;
  }
@@ -847,17 +863,24 @@  void qcow2_delete_kick_l2meta_bh(void *opaque)
  *           bytes from guest_offset that can be read before the next
  *           dependency must be processed (or the request is complete)
  *
+ *   1       if we're reusing an in-flight cluster allocation. *cur_bytes
+ *           indicates the number of bytes from guest_offset that can be
+ *           used for writing to this cluster.
+ *
  *   -EAGAIN if we had to wait for another request, previously gathered
  *           information on cluster allocation may be invalid now. The caller
  *           must start over anyway, so consider *cur_bytes undefined.
  */
 static int handle_dependencies(BlockDriverState *bs, uint64_t guest_offset,
-    uint64_t *cur_bytes)
+    uint64_t *host_offset, uint64_t *cur_bytes, QCowL2Meta **m)
 {
     BDRVQcowState *s = bs->opaque;
     QCowL2Meta *old_alloc;
     uint64_t bytes = *cur_bytes;
 
+    trace_qcow2_handle_dep(qemu_coroutine_self(), guest_offset, *host_offset,
+                           *cur_bytes);
+
     QLIST_FOREACH(old_alloc, &s->cluster_allocs, next_in_flight) {
 
         uint64_t start = guest_offset;
@@ -868,22 +891,111 @@  static int handle_dependencies(BlockDriverState *bs, uint64_t guest_offset,
         if (end <= old_start || start >= old_end) {
             /* No intersection */
         } else {
+            uint64_t new_bytes;
+            uint64_t old_cow_end;
+
+            /*
+             * Shorten the request to stop at the start of a running
+             * allocation.
+             */
             if (start < old_start) {
-                /* Stop at the start of a running allocation */
-                bytes = old_start - start;
+                new_bytes = old_start - start;
             } else {
-                bytes = 0;
+                new_bytes = 0;
+            }
+
+            if (new_bytes > 0) {
+                bytes = new_bytes;
+                continue;
             }
 
-            if (bytes == 0) {
-                /* Wait for the dependency to complete. We need to recheck
-                 * the free/allocated clusters when we continue. */
-                qemu_co_mutex_unlock(&s->lock);
+            /* Stop if an l2meta already exists. We would have to consider
+             * locks held by it (l2_writeback_lock) and set m->next etc. */
+            if (*m) {
+                *cur_bytes = 0;
+                return 0;
+            }
+
+
+            /*
+             * Check if we're just overwriting some COW of the old allocation
+             * that is safe to be replaced by the data of this request.
+             */
+            old_cow_end = old_alloc->offset + old_alloc->cow_end.offset;
+
+            if ((old_end & (s->cluster_size - 1)) == 0
+                && start >= old_cow_end
+                && !old_alloc->cow_end.final
+                && !*host_offset)
+            {
+                uint64_t subcluster_offset;
+                int nb_sectors;
+
+                trace_qcow2_overwrite_cow(qemu_coroutine_self(), start, end,
+                                          old_start, old_end);
+
+                subcluster_offset = offset_into_cluster(s, guest_offset);
+                nb_sectors = (subcluster_offset + bytes) >> BDRV_SECTOR_BITS;
+
+                /* Move forward cluster by cluster when overwriting COW areas,
+                 * or we'd have to deal with multiple overlapping requests and
+                 * things would become complicated. */
+                nb_sectors = MIN(s->cluster_sectors, nb_sectors);
+
+                /* Shorten the COW area at the end of the old request */
+                old_alloc->cow_end.nb_sectors =
+                    (guest_offset - old_cow_end) >> BDRV_SECTOR_BITS;
+
+                /* The new data region starts in the same cluster where the COW
+                 * region at the end of the old request starts. */
+                *host_offset = start_of_cluster(s,
+                    old_alloc->alloc_offset + old_alloc->cow_end.offset)
+                    + subcluster_offset;
+
+                /* Create new l2meta that doesn't actually allocate new L2
+                 * entries, but describes the new data area so that reads
+                 * access the right cluster */
+                *m = g_malloc0(sizeof(**m));
+                **m = (QCowL2Meta) {
+                    .parent         = old_alloc,
+                    .alloc_offset   = start_of_cluster(s, *host_offset),
+                    .offset         = guest_offset & ~(s->cluster_size - 1),
+                    .nb_clusters    = 1,
+                    .nb_available   = nb_sectors,
+                    .no_l2_update   = true,
+
+                    .cow_start = {
+                        .offset     = subcluster_offset,
+                        .nb_sectors = 0,
+                    },
+                    .cow_end = {
+                        .offset     = nb_sectors << BDRV_SECTOR_BITS,
+                        .nb_sectors = s->cluster_sectors - nb_sectors,
+                    },
+                };
+                qemu_co_queue_init(&(*m)->dependent_requests);
+                qemu_co_rwlock_init(&(*m)->l2_writeback_lock);
+                QLIST_INSERT_HEAD(&s->cluster_allocs, *m, next_in_flight);
+
+                /* Ensure that the L2 isn't written before COW has completed */
+                assert(!old_alloc->l2_writeback_lock.writer);
+                qemu_co_rwlock_rdlock(&old_alloc->l2_writeback_lock);
+
+                /* Stop at next cluster boundary */
+                *cur_bytes = MIN(*cur_bytes, s->cluster_size - subcluster_offset);
+
+                return 1;
+            }
+
+            /* Wait for the dependency to complete. We need to recheck
+             * the free/allocated clusters when we continue. */
+            qemu_co_mutex_unlock(&s->lock);
+            if (old_alloc->sleeping) {
                 kick_l2meta(old_alloc);
-                qemu_co_queue_wait(&old_alloc->dependent_requests);
-                qemu_co_mutex_lock(&s->lock);
-                return -EAGAIN;
             }
+            qemu_co_queue_wait(&old_alloc->dependent_requests);
+            qemu_co_mutex_lock(&s->lock);
+            return -EAGAIN;
         }
     }
 
@@ -1167,6 +1279,7 @@  static int handle_alloc(BlockDriverState *bs, uint64_t guest_offset,
         },
     };
     qemu_co_queue_init(&(*m)->dependent_requests);
+    qemu_co_rwlock_init(&(*m)->l2_writeback_lock);
     QLIST_INSERT_HEAD(&s->cluster_allocs, *m, next_in_flight);
 
     *bytes = MIN(*bytes, (nb_sectors * BDRV_SECTOR_SIZE)
@@ -1221,6 +1334,7 @@  again:
     remaining = (n_end - n_start) << BDRV_SECTOR_BITS;
     cluster_offset = 0;
     *host_offset = 0;
+    *m = NULL;
 
     /*
      * Now start gathering as many contiguous clusters as possible:
@@ -1234,17 +1348,27 @@  again:
      *         contiguous clusters (the situation could have changed while we
      *         were sleeping)
      *
-     *      c) TODO: Request starts in the same cluster as the in-flight
+     *      c) Request starts in the same cluster as the in-flight
      *         allocation ends. Shorten the COW of the in-flight allocation, set
      *         cluster_offset to write to the same cluster and set up the right
      *         synchronisation between the in-flight request and the new one.
      */
     cur_bytes = remaining;
-    ret = handle_dependencies(bs, start, &cur_bytes);
+    ret = handle_dependencies(bs, start, &cluster_offset, &cur_bytes, m);
     if (ret == -EAGAIN) {
         goto again;
     } else if (ret < 0) {
         return ret;
+    } else if (ret) {
+        *host_offset = start_of_cluster(s, cluster_offset);
+
+        start           += cur_bytes;
+        remaining       -= cur_bytes;
+        cluster_offset  += cur_bytes;
+
+        goto done;
+    } else if (cur_bytes == 0) {
+        goto done;
     } else {
         /* handle_dependencies() may have decreased cur_bytes (shortened
          * the allocations below) so that the next dependency is processed
diff --git a/block/qcow2.c b/block/qcow2.c
index 2819336..23818cb 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -805,6 +805,22 @@  again:
 
     qemu_co_mutex_unlock(&s->lock);
 
+    if (m->parent) {
+        /*
+         * Allow the parent to write the L2 table entry now that both guest
+         * data write and COW have completed. Don't remove the request from
+         * the in-flight list until the L2 table is updated; we still need it
+         * for overriding the cluster_offset of requests to the same area.
+         *
+         * The release of the l2_writeback_lock and the queuing must be
+         * performed atomically to avoid a race. This code implements this
+         * correctly because unlocking only schedules a BH to wake the parent,
+         * but doesn't yield yet.
+         */
+        qemu_co_rwlock_unlock(&m->parent->l2_writeback_lock);
+        qemu_co_queue_wait(&m->parent->dependent_requests);
+    }
+
     /* Take the request off the list of running requests */
     if (m->nb_clusters != 0) {
         QLIST_REMOVE(m, next_in_flight);
diff --git a/block/qcow2.h b/block/qcow2.h
index 504f10f..9596715 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -216,6 +216,12 @@  typedef struct Qcow2COWRegion {
 
     /** Number of sectors to copy */
     int         nb_sectors;
+
+    /**
+     * Becomes true when offset and nb_sectors cannot be modified any more
+     * because the COW is already being processed.
+     */
+    bool        final;
 } Qcow2COWRegion;
 
 /**
@@ -254,6 +260,12 @@  typedef struct QCowL2Meta
     bool sleeping;
     bool error;
 
+    /**
+     * true indicates that this is not an actual cluster allocation, but reuses
+     * the allocation of an earlier request, so don't update the L2 table.
+     */
+    bool no_l2_update;
+
     /** Coroutine that handles delayed COW and updates L2 entry */
     Coroutine *co;
 
@@ -275,6 +287,31 @@  typedef struct QCowL2Meta
      */
     Qcow2COWRegion cow_end;
 
+    /**
+     * The L2 table must only be updated when all COWs to the same sector have
+     * completed. A request that shortens part of the COW takes a reader lock
+     * until its data and COW are written.
+     */
+    CoRwlock l2_writeback_lock;
+
+    /**
+     * QCowL2Metas can form a tree. The root of this tree is a request that
+     * actually allocated the host cluster in the image file. Children are
+     * requests that overwrite (part of) a COW region of their parent, whose
+     * COW has therefore been cancelled.
+     *
+     * Important rules to bear in mind:
+     *
+     * - The parent must not write its L2 entry before all children have
+     *   written both their guest data and COW regions. If it did, the image
+     *   on disk would still have an incomplete COW and data would be corrupted.
+     *
+     * - The children must stay in s->cluster_allocs until the parent has
+     *   written the L2 entry, so that read requests correctly return the new
+     *   data even though the L2 table is not updated yet.
+     */
+    struct QCowL2Meta *parent;
+
     /** Pointer to next L2Meta of the same write request */
     struct QCowL2Meta *next;
 
diff --git a/trace-events b/trace-events
index 3531717..4cc3511 100644
--- a/trace-events
+++ b/trace-events
@@ -481,6 +481,8 @@  qcow2_writev_done_part(void *co, int cur_nr_sectors) "co %p cur_nr_sectors %d"
 qcow2_writev_data(void *co, uint64_t offset) "co %p offset %" PRIx64
 
 qcow2_alloc_clusters_offset(void *co, uint64_t offset, int n_start, int n_end) "co %p offet %" PRIx64 " n_start %d n_end %d"
+qcow2_handle_dep(void *co, uint64_t guest_offset, uint64_t host_offset, uint64_t bytes) "co %p guest_offset %" PRIx64 " host_offset %" PRIx64 " bytes %" PRIx64
+qcow2_overwrite_cow(void *co, uint64_t start, uint64_t end, uint64_t old_start, uint64_t old_end) "co %p start %" PRIx64 " end %" PRIx64 " old_start %" PRIx64 " old_end %" PRIx64
 qcow2_handle_copied(void *co, uint64_t guest_offset, uint64_t host_offset, uint64_t bytes) "co %p guest_offet %" PRIx64 " host_offset %" PRIx64 " bytes %" PRIx64
 qcow2_handle_alloc(void *co, uint64_t guest_offset, uint64_t host_offset, uint64_t bytes) "co %p guest_offet %" PRIx64 " host_offset %" PRIx64 " bytes %" PRIx64
 qcow2_do_alloc_clusters_offset(void *co, uint64_t guest_offset, uint64_t host_offset, int nb_clusters) "co %p guest_offet %" PRIx64 " host_offset %" PRIx64 " nb_clusters %d"
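
For reference, the l2_writeback_lock handshake documented in the qcow2.h
hunk above can be modelled as a self-contained toy program. The sketch
below substitutes POSIX threads and pthread_rwlock for QEMU's coroutines
and CoRwlock; all names and timings are illustrative only:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t l2_writeback_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Child request: overwrites part of the parent's COW, so it holds a
 * read lock until its guest data and remaining COW are written. */
static void *child_request(void *arg)
{
    (void)arg;
    pthread_rwlock_rdlock(&l2_writeback_lock);
    printf("child: writing guest data and shortened COW\n");
    usleep(100 * 1000); /* stand-in for the actual I/O */
    pthread_rwlock_unlock(&l2_writeback_lock);
    return NULL;
}

int main(void)
{
    pthread_t child;

    pthread_create(&child, NULL, child_request, NULL);
    usleep(10 * 1000); /* let the child take the read lock first */

    /* Parent request: taking the write lock blocks until every child
     * has finished; only then is the L2 table entry written. */
    pthread_rwlock_wrlock(&l2_writeback_lock);
    printf("parent: all COW complete, updating L2 entry\n");
    pthread_rwlock_unlock(&l2_writeback_lock);

    pthread_join(child, NULL);
    return 0;
}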