From patchwork Wed Feb 13 13:22:10 2013 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kevin Wolf X-Patchwork-Id: 220171 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from lists.gnu.org (lists.gnu.org [208.118.235.17]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (Client did not present a certificate) by ozlabs.org (Postfix) with ESMTPS id C54772C0292 for ; Thu, 14 Feb 2013 02:40:33 +1100 (EST) Received: from localhost ([::1]:34858 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1U5cJQ-0007m3-Eb for incoming@patchwork.ozlabs.org; Wed, 13 Feb 2013 08:23:48 -0500 Received: from eggs.gnu.org ([208.118.235.92]:51870) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1U5cIm-0007NX-HH for qemu-devel@nongnu.org; Wed, 13 Feb 2013 08:23:13 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1U5cId-00030k-NE for qemu-devel@nongnu.org; Wed, 13 Feb 2013 08:23:08 -0500 Received: from mx1.redhat.com ([209.132.183.28]:35945) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1U5cId-00030J-Cz for qemu-devel@nongnu.org; Wed, 13 Feb 2013 08:22:59 -0500 Received: from int-mx11.intmail.prod.int.phx2.redhat.com (int-mx11.intmail.prod.int.phx2.redhat.com [10.5.11.24]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id r1DDMwYZ004802 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Wed, 13 Feb 2013 08:22:58 -0500 Received: from dhcp-5-188.str.redhat.com (dhcp-192-240.str.redhat.com [10.33.192.240]) by int-mx11.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id r1DDMGRr011418; Wed, 13 Feb 2013 08:22:57 -0500 From: Kevin Wolf To: qemu-devel@nongnu.org Date: Wed, 13 Feb 2013 14:22:10 +0100 Message-Id: <1360761733-25347-21-git-send-email-kwolf@redhat.com> In-Reply-To: <1360761733-25347-1-git-send-email-kwolf@redhat.com> References: <1360761733-25347-1-git-send-email-kwolf@redhat.com> X-Scanned-By: MIMEDefang 2.68 on 10.5.11.24 X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x X-Received-From: 209.132.183.28 Cc: kwolf@redhat.com, pbonzini@redhat.com, lersek@redhat.com, stefanha@redhat.com Subject: [Qemu-devel] [RFC PATCH v2 20/23] qcow2: Cancel COW when overwritten X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org This is the first part of an optimisation to improve the performance of sequential cluster allocations. Typically, write requests aren't aligned to cluster boundaries, so sequential allocation means that every other request has to wait for the COW of the previous request to complete. We can do better: Just cancel the COW instead of waiting for it and then overwriting the same area with the second write request. Signed-off-by: Kevin Wolf --- block/qcow2-cluster.c | 150 ++++++++++++++++++++++++++++++++++++++++++++---- block/qcow2.c | 16 +++++ block/qcow2.h | 37 ++++++++++++ trace-events | 2 + 4 files changed, 192 insertions(+), 13 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 33e595a..3ef1cff 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -659,6 +659,8 @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m, Qcow2COWRegion *r) BDRVQcowState *s = bs->opaque; int ret; + r->final = true; + if (r->nb_sectors == 0) { return 0; } @@ -689,6 +691,7 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m) int i, j = 0, l2_index, ret; uint64_t *old_cluster, *l2_table; uint64_t cluster_offset = m->alloc_offset; + bool has_wr_lock = false; trace_qcow2_cluster_link_l2(qemu_coroutine_self(), m->nb_clusters); assert(m->nb_clusters > 0); @@ -707,6 +710,16 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m) } /* Update L2 table. */ + qemu_co_mutex_unlock(&s->lock); + qemu_co_rwlock_wrlock(&m->l2_writeback_lock); + has_wr_lock = true; + qemu_co_mutex_lock(&s->lock); + + if (m->no_l2_update) { + ret = 0; + goto err; + } + if (s->compatible_features & QCOW2_COMPAT_LAZY_REFCOUNTS) { qcow2_mark_dirty(bs); } @@ -753,6 +766,9 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m) ret = 0; err: + if (has_wr_lock) { + qemu_co_rwlock_unlock(&m->l2_writeback_lock); + } g_free(old_cluster); return ret; } @@ -847,17 +863,24 @@ void qcow2_delete_kick_l2meta_bh(void *opaque) * bytes from guest_offset that can be read before the next * dependency must be processed (or the request is complete) * + * 1 if we're reusing an in-flight cluster allocation. *cur_bytes + * indicates the number of bytes from guest_offset that can be + * used for writing to this cluster. + * * -EAGAIN if we had to wait for another request, previously gathered * information on cluster allocation may be invalid now. The caller * must start over anyway, so consider *cur_bytes undefined. */ static int handle_dependencies(BlockDriverState *bs, uint64_t guest_offset, - uint64_t *cur_bytes) + uint64_t *host_offset, uint64_t *cur_bytes, QCowL2Meta **m) { BDRVQcowState *s = bs->opaque; QCowL2Meta *old_alloc; uint64_t bytes = *cur_bytes; + trace_qcow2_handle_dep(qemu_coroutine_self(), guest_offset, *host_offset, + *cur_bytes); + QLIST_FOREACH(old_alloc, &s->cluster_allocs, next_in_flight) { uint64_t start = guest_offset; @@ -868,22 +891,111 @@ static int handle_dependencies(BlockDriverState *bs, uint64_t guest_offset, if (end <= old_start || start >= old_end) { /* No intersection */ } else { + uint64_t new_bytes; + uint64_t old_cow_end; + + /* + * Shorten the request to stop at the start of a running + * allocation. + */ if (start < old_start) { - /* Stop at the start of a running allocation */ - bytes = old_start - start; + new_bytes = old_start - start; } else { - bytes = 0; + new_bytes = 0; + } + + if (new_bytes > 0) { + bytes = new_bytes; + continue; } - if (bytes == 0) { - /* Wait for the dependency to complete. We need to recheck - * the free/allocated clusters when we continue. */ - qemu_co_mutex_unlock(&s->lock); + /* Stop if already an l2meta exists. We would have to consider + * locks held by it (l2_writeback_lock) and set m->next etc. */ + if (*m) { + *cur_bytes = 0; + return 0; + } + + + /* + * Check if we're just overwriting some COW of the old allocation + * that is safe to be replaced by the data of this request. + */ + old_cow_end = old_alloc->offset + old_alloc->cow_end.offset; + + if ((old_end & (s->cluster_size - 1)) == 0 + && start >= old_cow_end + && !old_alloc->cow_end.final + && !*host_offset) + { + uint64_t subcluster_offset; + int nb_sectors; + + trace_qcow2_overwrite_cow(qemu_coroutine_self(), start, end, + old_start, old_end); + + subcluster_offset = offset_into_cluster(s, guest_offset); + nb_sectors = (subcluster_offset + bytes) >> BDRV_SECTOR_BITS; + + /* Move forward cluster by cluster when overwriting COW areas, + * or we'd have to deal with multiple overlapping requests and + * things would become complicated. */ + nb_sectors = MIN(s->cluster_sectors, nb_sectors); + + /* Shorten the COW area at the end of the old request */ + old_alloc->cow_end.nb_sectors = + (guest_offset - old_cow_end) >> BDRV_SECTOR_BITS; + + /* The new data region starts in the same cluster where the COW + * region at the end of the old request starts. */ + *host_offset = start_of_cluster(s, + old_alloc->alloc_offset + old_alloc->cow_end.offset) + + subcluster_offset; + + /* Create new l2meta that doesn't actually allocate new L2 + * entries, but describes the new data area so that reads + * access the right cluster */ + *m = g_malloc0(sizeof(**m)); + **m = (QCowL2Meta) { + .parent = old_alloc, + .alloc_offset = start_of_cluster(s, *host_offset), + .offset = guest_offset & ~(s->cluster_size - 1), + .nb_clusters = 1, + .nb_available = nb_sectors, + .no_l2_update = true, + + .cow_start = { + .offset = subcluster_offset, + .nb_sectors = 0, + }, + .cow_end = { + .offset = nb_sectors << BDRV_SECTOR_BITS, + .nb_sectors = s->cluster_sectors - nb_sectors, + }, + }; + qemu_co_queue_init(&(*m)->dependent_requests); + qemu_co_rwlock_init(&(*m)->l2_writeback_lock); + QLIST_INSERT_HEAD(&s->cluster_allocs, *m, next_in_flight); + + /* Ensure that the L2 isn't written before COW has completed */ + assert(!old_alloc->l2_writeback_lock.writer); + qemu_co_rwlock_rdlock(&old_alloc->l2_writeback_lock); + + /* Stop at next cluster boundary */ + *cur_bytes = MIN(*cur_bytes, s->cluster_size - subcluster_offset); + + return 1; + } + + /* Wait for the dependency to complete. We need to recheck + * the free/allocated clusters when we continue. */ + qemu_co_mutex_unlock(&s->lock); + if (old_alloc->sleeping) { kick_l2meta(old_alloc); - qemu_co_queue_wait(&old_alloc->dependent_requests); - qemu_co_mutex_lock(&s->lock); - return -EAGAIN; } + qemu_co_queue_wait(&old_alloc->dependent_requests); + qemu_co_mutex_lock(&s->lock); + return -EAGAIN; } } @@ -1167,6 +1279,7 @@ static int handle_alloc(BlockDriverState *bs, uint64_t guest_offset, }, }; qemu_co_queue_init(&(*m)->dependent_requests); + qemu_co_rwlock_init(&(*m)->l2_writeback_lock); QLIST_INSERT_HEAD(&s->cluster_allocs, *m, next_in_flight); *bytes = MIN(*bytes, (nb_sectors * BDRV_SECTOR_SIZE) @@ -1221,6 +1334,7 @@ again: remaining = (n_end - n_start) << BDRV_SECTOR_BITS; cluster_offset = 0; *host_offset = 0; + *m = NULL; /* * Now start gathering as many contiguous clusters as possible: @@ -1234,17 +1348,27 @@ again: * contiguous clusters (the situation could have changed while we * were sleeping) * - * c) TODO: Request starts in the same cluster as the in-flight + * c) Request starts in the same cluster as the in-flight * allocation ends. Shorten the COW of the in-fight allocation, set * cluster_offset to write to the same cluster and set up the right * synchronisation between the in-flight request and the new one. */ cur_bytes = remaining; - ret = handle_dependencies(bs, start, &cur_bytes); + ret = handle_dependencies(bs, start, &cluster_offset, &cur_bytes, m); if (ret == -EAGAIN) { goto again; } else if (ret < 0) { return ret; + } else if (ret) { + *host_offset = start_of_cluster(s, cluster_offset); + + start += cur_bytes; + remaining -= cur_bytes; + cluster_offset += cur_bytes; + + goto done; + } else if (cur_bytes == 0) { + goto done; } else { /* handle_dependencies() may have decreased cur_bytes (shortened * the allocations below) so that the next dependency is processed diff --git a/block/qcow2.c b/block/qcow2.c index 2819336..23818cb 100644 --- a/block/qcow2.c +++ b/block/qcow2.c @@ -805,6 +805,22 @@ again: qemu_co_mutex_unlock(&s->lock); + if (m->parent) { + /* + * Allow the parent to write the L2 table entry now that both guest + * data write and COW have completed. Don't remove the request from the + * queue yet until the L2 is updated, we still need it for overriding + * the cluster_offset of requests to the same area. + * + * The release of the l2_writeback_lock and the queuing must be + * performed atomically to avoid a race. This code implements this + * correctly because unlocking only schedules a BH to wake the parent, + * but doesn't yield yet. + */ + qemu_co_rwlock_unlock(&m->parent->l2_writeback_lock); + qemu_co_queue_wait(&m->parent->dependent_requests); + } + /* Take the request off the list of running requests */ if (m->nb_clusters != 0) { QLIST_REMOVE(m, next_in_flight); diff --git a/block/qcow2.h b/block/qcow2.h index 504f10f..9596715 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -216,6 +216,12 @@ typedef struct Qcow2COWRegion { /** Number of sectors to copy */ int nb_sectors; + + /** + * Becomes true when offset and nb_sectors cannot be modified any more + * because the COW is already being processed. + */ + bool final; } Qcow2COWRegion; /** @@ -254,6 +260,12 @@ typedef struct QCowL2Meta bool sleeping; bool error; + /** + * true indicates that this is not an actual cluster allocation, but reuses + * the allocation of an earlier request, so don't update the L2 table. + */ + bool no_l2_update; + /** Coroutine that handles delayed COW and updates L2 entry */ Coroutine *co; @@ -275,6 +287,31 @@ typedef struct QCowL2Meta */ Qcow2COWRegion cow_end; + /** + * The L2 table must only be updated when all COWs to the same sector have + * completed. A request that shortens part of the COW, takes a reader lock + * until its data and COW is written. + */ + CoRwlock l2_writeback_lock; + + /** + * QCowL2Metas can form a tree. The root of this tree is a request that + * actually allocated the host cluster in the image file. Children are + * requests that overwrite (part of) a COW region of their parents, so that + * the COW of their parent was cancelled. + * + * Important rules to bear in mind: + * + * - The parent must not write its L2 entry before all children have + * written both their guest data and COW regions. If it did, the image on + * disk would yet have incomplete COW and data would be corrupted. + * + * - The chilren must stay in s->cluster_allocs until the parent has + * written the L2 entry, so that read requests correctly return the new + * data even though the L2 table is not updated yet. + */ + struct QCowL2Meta *parent; + /** Pointer to next L2Meta of the same write request */ struct QCowL2Meta *next; diff --git a/trace-events b/trace-events index 3531717..4cc3511 100644 --- a/trace-events +++ b/trace-events @@ -481,6 +481,8 @@ qcow2_writev_done_part(void *co, int cur_nr_sectors) "co %p cur_nr_sectors %d" qcow2_writev_data(void *co, uint64_t offset) "co %p offset %" PRIx64 qcow2_alloc_clusters_offset(void *co, uint64_t offset, int n_start, int n_end) "co %p offet %" PRIx64 " n_start %d n_end %d" +qcow2_handle_dep(void *co, uint64_t guest_offset, uint64_t host_offset, uint64_t bytes) "co %p guest_offet %" PRIx64 " host_offset %" PRIx64 " bytes %" PRIx64 +qcow2_overwrite_cow(void *co, uint64_t start, uint64_t end, uint64_t old_start, uint64_t old_end) "co %p start %" PRIx64 " end %" PRIx64 " old_start %" PRIx64 " old_end %" PRIx64 qcow2_handle_copied(void *co, uint64_t guest_offset, uint64_t host_offset, uint64_t bytes) "co %p guest_offet %" PRIx64 " host_offset %" PRIx64 " bytes %" PRIx64 qcow2_handle_alloc(void *co, uint64_t guest_offset, uint64_t host_offset, uint64_t bytes) "co %p guest_offet %" PRIx64 " host_offset %" PRIx64 " bytes %" PRIx64 qcow2_do_alloc_clusters_offset(void *co, uint64_t guest_offset, uint64_t host_offset, int nb_clusters) "co %p guest_offet %" PRIx64 " host_offset %" PRIx64 " nb_clusters %d"