From patchwork Mon Sep 6 10:04:38 2010
X-Patchwork-Submitter: Stefan Hajnoczi
X-Patchwork-Id: 63911
From: Stefan Hajnoczi
To: qemu-devel@nongnu.org
Date: Mon, 6 Sep 2010 11:04:38 +0100
Message-Id: <1283767478-16740-1-git-send-email-stefanha@linux.vnet.ibm.com>
Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi
Subject: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format

QEMU Enhanced Disk format is a disk image format that forgoes features found in qcow2 in favor of better performance and data integrity. Due to its simpler on-disk layout, it is possible to safely perform metadata updates more efficiently.

Installations, suspend-to-disk, and other allocation-heavy I/O workloads will see increased performance due to fewer I/Os and syncs. Workloads that do not cause new clusters to be allocated will perform similarly to raw images due to in-memory metadata caching.

The format supports sparse disk images. It does not rely on the host filesystem's holes feature, making it a good choice for sparse disk images that need to be transferred over channels where holes are not supported.

Backing files are supported, so only deltas against a base image need to be stored.
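To make the lookup path concrete, here is a small standalone sketch (not part of the patch) of how a guest byte position is split across the two-level table. It mirrors the shift/mask derivation in bdrv_qed_open() and the qed_l1_index()/qed_l2_index()/qed_offset_into_cluster() helpers further down, and assumes the default 64 KB cluster size and table_size of 4:

#include <inttypes.h>
#include <stdio.h>

/* simplified log2 for powers of two, cf. get_bits_from_size() in the patch */
static int bits_from_size(uint64_t size)
{
    int res = 0;
    while (size > 1) {
        size >>= 1;
        res++;
    }
    return res;
}

int main(void)
{
    uint32_t cluster_size = 64 * 1024;            /* QED_DEFAULT_CLUSTER_SIZE */
    uint32_t table_size   = 4;                    /* QED_DEFAULT_TABLE_SIZE, in clusters */

    /* same derivation as bdrv_qed_open() */
    uint32_t table_nelems = cluster_size * table_size / sizeof(uint64_t);
    uint32_t l2_shift     = bits_from_size(cluster_size);
    uint32_t l2_mask      = table_nelems - 1;
    uint32_t l1_shift     = l2_shift + bits_from_size(table_nelems);

    uint64_t pos = 5ULL * 1024 * 1024 * 1024 + 12345;      /* arbitrary guest offset */

    unsigned int l1_index = pos >> l1_shift;               /* qed_l1_index() */
    unsigned int l2_index = (pos >> l2_shift) & l2_mask;   /* qed_l2_index() */
    uint64_t into_cluster = pos & (cluster_size - 1);      /* qed_offset_into_cluster() */

    printf("L1[%u] -> L2[%u] -> +%" PRIu64 " bytes into the data cluster\n",
           l1_index, l2_index, into_cluster);
    return 0;
}

A read of an already-allocated (or unallocated) cluster only needs these two table lookups, served from the in-memory L1 table and the L2 cache, which is why non-allocating workloads stay close to raw.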
The file format is extensible so that additional features can be added later with graceful compatibility handling. Internal snapshots are not supported. This eliminates the need for additional metadata to track copy-on-write clusters. Compression and encryption are not supported. They add complexity and can be implemented at other layers in the stack (i.e. inside the guest or on the host). The format is currently functional with the following features missing: * Resizing the disk image. The capability has been designed in but the code has not been written yet. * Resetting the image after backing file commit completes. * Changing the backing filename. * Consistency check (fsck). This is simple due to the on-disk layout. Signed-off-by: Anthony Liguori Signed-off-by: Stefan Hajnoczi --- This code is also available from git (for development and testing the tracing and blkverify features are pulled in, whereas this single squashed patch applies to mainline qemu.git): http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/qed Numbers for RHEL6 install, cache=none disk image on ext3. This is an interactive install on my laptop, so not a proper benchmark but I want to show there is real difference today: * raw: 4m4s * qed: 4m21s (107%) * qcow2: 4m46s (117%) Makefile.objs | 1 + block/qcow2.c | 22 - block/qed-cluster.c | 136 +++++++ block/qed-gencb.c | 32 ++ block/qed-l2-cache.c | 131 ++++++ block/qed-table.c | 242 +++++++++++ block/qed.c | 1103 ++++++++++++++++++++++++++++++++++++++++++++++++++ block/qed.h | 212 ++++++++++ cutils.c | 53 +++ qemu-common.h | 3 + 10 files changed, 1913 insertions(+), 22 deletions(-) create mode 100644 block/qed-cluster.c create mode 100644 block/qed-gencb.c create mode 100644 block/qed-l2-cache.c create mode 100644 block/qed-table.c create mode 100644 block/qed.c create mode 100644 block/qed.h diff --git a/Makefile.objs b/Makefile.objs index 4a1eaa1..a5acb32 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -14,6 +14,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o +block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o block-nested-$(CONFIG_WIN32) += raw-win32.o block-nested-$(CONFIG_POSIX) += raw-posix.o diff --git a/block/qcow2.c b/block/qcow2.c index a53014d..72c923a 100644 --- a/block/qcow2.c +++ b/block/qcow2.c @@ -767,28 +767,6 @@ static int qcow2_change_backing_file(BlockDriverState *bs, return qcow2_update_ext_header(bs, backing_file, backing_fmt); } -static int get_bits_from_size(size_t size) -{ - int res = 0; - - if (size == 0) { - return -1; - } - - while (size != 1) { - /* Not a power of two */ - if (size & 1) { - return -1; - } - - size >>= 1; - res++; - } - - return res; -} - - static int preallocate(BlockDriverState *bs) { uint64_t nb_sectors; diff --git a/block/qed-cluster.c b/block/qed-cluster.c new file mode 100644 index 0000000..6deea27 --- /dev/null +++ b/block/qed-cluster.c @@ -0,0 +1,136 @@ +/* + * QEMU Enhanced Disk Format Cluster functions + * + * Copyright IBM, Corp. 2010 + * + * Authors: + * Stefan Hajnoczi + * Anthony Liguori + * + * This work is licensed under the terms of the GNU LGPL, version 2 or later. + * See the COPYING file in the top-level directory. 
+ * + */ + +#include "qed.h" + +/** + * Count the number of contiguous data clusters + * + * @s: QED state + * @table: L2 table + * @index: First cluster index + * @n: Maximum number of clusters + * @offset: Set to first cluster offset + * + * This function scans tables for contiguous allocated or free clusters. + */ +static unsigned int qed_count_contiguous_clusters(BDRVQEDState *s, + QEDTable *table, + unsigned int index, + unsigned int n, + uint64_t *offset) +{ + unsigned int end = MIN(index + n, s->table_nelems); + uint64_t last = table->offsets[index]; + unsigned int i; + + *offset = last; + + for (i = index + 1; i < end; i++) { + if (last == 0) { + /* Counting free clusters */ + if (table->offsets[i] != 0) { + break; + } + } else { + /* Counting allocated clusters */ + if (table->offsets[i] != last + s->header.cluster_size) { + break; + } + last = table->offsets[i]; + } + } + return i - index; +} + +typedef struct { + BDRVQEDState *s; + uint64_t pos; + size_t len; + + QEDRequest *request; + + /* User callback */ + QEDFindClusterFunc *cb; + void *opaque; +} QEDFindClusterCB; + +static void qed_find_cluster_cb(void *opaque, int ret) +{ + QEDFindClusterCB *find_cluster_cb = opaque; + BDRVQEDState *s = find_cluster_cb->s; + QEDRequest *request = find_cluster_cb->request; + uint64_t offset = 0; + size_t len = 0; + unsigned int index; + unsigned int n; + + if (ret) { + ret = QED_CLUSTER_ERROR; + goto out; + } + + index = qed_l2_index(s, find_cluster_cb->pos); + n = qed_bytes_to_clusters(s, + qed_offset_into_cluster(s, find_cluster_cb->pos) + + find_cluster_cb->len); + n = qed_count_contiguous_clusters(s, request->l2_table->table, + index, n, &offset); + + ret = offset ? QED_CLUSTER_FOUND : QED_CLUSTER_L2; + len = MIN(find_cluster_cb->len, n * s->header.cluster_size - + qed_offset_into_cluster(s, find_cluster_cb->pos)); + +out: + find_cluster_cb->cb(find_cluster_cb->opaque, ret, offset, len); + qemu_free(find_cluster_cb); +} + +/** + * Find the offset of a data cluster + * + * @s: QED state + * @pos: Byte position in device + * @len: Number of bytes + * @cb: Completion function + * @opaque: User data for completion function + */ +void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos, + size_t len, QEDFindClusterFunc *cb, void *opaque) +{ + QEDFindClusterCB *find_cluster_cb; + uint64_t l2_offset; + + /* Limit length to L2 boundary. Requests are broken up at the L2 boundary + * so that a request acts on one L2 table at a time. + */ + len = MIN(len, (((pos >> s->l1_shift) + 1) << s->l1_shift) - pos); + + l2_offset = s->l1_table->offsets[qed_l1_index(s, pos)]; + if (!l2_offset) { + cb(opaque, QED_CLUSTER_L1, 0, len); + return; + } + + find_cluster_cb = qemu_malloc(sizeof(*find_cluster_cb)); + find_cluster_cb->s = s; + find_cluster_cb->pos = pos; + find_cluster_cb->len = len; + find_cluster_cb->cb = cb; + find_cluster_cb->opaque = opaque; + find_cluster_cb->request = request; + + qed_read_l2_table(s, request, l2_offset, + qed_find_cluster_cb, find_cluster_cb); +} diff --git a/block/qed-gencb.c b/block/qed-gencb.c new file mode 100644 index 0000000..d389e12 --- /dev/null +++ b/block/qed-gencb.c @@ -0,0 +1,32 @@ +/* + * QEMU Enhanced Disk Format + * + * Copyright IBM, Corp. 2010 + * + * Authors: + * Stefan Hajnoczi + * + * This work is licensed under the terms of the GNU LGPL, version 2 or later. + * See the COPYING file in the top-level directory. 
+ * + */ + +#include "qed.h" + +void *gencb_alloc(size_t len, BlockDriverCompletionFunc *cb, void *opaque) +{ + GenericCB *gencb = qemu_malloc(len); + gencb->cb = cb; + gencb->opaque = opaque; + return gencb; +} + +void gencb_complete(void *opaque, int ret) +{ + GenericCB *gencb = opaque; + BlockDriverCompletionFunc *cb = gencb->cb; + void *user_opaque = gencb->opaque; + + qemu_free(gencb); + cb(user_opaque, ret); +} diff --git a/block/qed-l2-cache.c b/block/qed-l2-cache.c new file mode 100644 index 0000000..747a629 --- /dev/null +++ b/block/qed-l2-cache.c @@ -0,0 +1,131 @@ +/* + * QEMU Enhanced Disk Format L2 Cache + * + * Copyright IBM, Corp. 2010 + * + * Authors: + * Anthony Liguori + * + * This work is licensed under the terms of the GNU LGPL, version 2 or later. + * See the COPYING file in the top-level directory. + * + */ + +#include "qed.h" + +/* Each L2 holds 2GB so this let's us fully cache a 100GB disk */ +#define MAX_L2_CACHE_SIZE 50 + +/** + * Initialize the L2 cache + */ +void qed_init_l2_cache(L2TableCache *l2_cache, + L2TableAllocFunc *alloc_l2_table, + void *alloc_l2_table_opaque) +{ + QTAILQ_INIT(&l2_cache->entries); + l2_cache->n_entries = 0; + l2_cache->alloc_l2_table = alloc_l2_table; + l2_cache->alloc_l2_table_opaque = alloc_l2_table_opaque; +} + +/** + * Free the L2 cache + */ +void qed_free_l2_cache(L2TableCache *l2_cache) +{ + CachedL2Table *entry, *next_entry; + + QTAILQ_FOREACH_SAFE(entry, &l2_cache->entries, node, next_entry) { + qemu_free(entry->table); + qemu_free(entry); + } +} + +/** + * Allocate an uninitialized entry from the cache + * + * The returned entry has a reference count of 1 and is owned by the caller. + */ +CachedL2Table *qed_alloc_l2_cache_entry(L2TableCache *l2_cache) +{ + CachedL2Table *entry; + + entry = qemu_mallocz(sizeof(*entry)); + entry->table = l2_cache->alloc_l2_table(l2_cache->alloc_l2_table_opaque); + entry->ref++; + + return entry; +} + +/** + * Decrease an entry's reference count and free if necessary when the reference + * count drops to zero. + */ +void qed_unref_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *entry) +{ + if (!entry) { + return; + } + + entry->ref--; + if (entry->ref == 0) { + qemu_free(entry->table); + qemu_free(entry); + } +} + +/** + * Find an entry in the L2 cache. This may return NULL and it's up to the + * caller to satisfy the cache miss. + * + * For a cached entry, this function increases the reference count and returns + * the entry. + */ +CachedL2Table *qed_find_l2_cache_entry(L2TableCache *l2_cache, uint64_t offset) +{ + CachedL2Table *entry; + + QTAILQ_FOREACH(entry, &l2_cache->entries, node) { + if (entry->offset == offset) { + entry->ref++; + return entry; + } + } + return NULL; +} + +/** + * Commit an L2 cache entry into the cache. This is meant to be used as part of + * the process to satisfy a cache miss. A caller would allocate an entry which + * is not actually in the L2 cache and then once the entry was valid and + * present on disk, the entry can be committed into the cache. + * + * Since the cache is write-through, it's important that this function is not + * called until the entry is present on disk and the L1 has been updated to + * point to the entry. + * + * This function will take a reference to the entry so the caller is still + * responsible for unreferencing the entry. 
+ */ +void qed_commit_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *l2_table) +{ + CachedL2Table *entry; + + entry = qed_find_l2_cache_entry(l2_cache, l2_table->offset); + if (entry) { + qed_unref_l2_cache_entry(l2_cache, entry); + return; + } + + if (l2_cache->n_entries >= MAX_L2_CACHE_SIZE) { + entry = QTAILQ_FIRST(&l2_cache->entries); + QTAILQ_REMOVE(&l2_cache->entries, entry, node); + l2_cache->n_entries--; + qed_unref_l2_cache_entry(l2_cache, entry); + } + + l2_table->ref++; + l2_cache->n_entries++; + QTAILQ_INSERT_TAIL(&l2_cache->entries, l2_table, node); +} diff --git a/block/qed-table.c b/block/qed-table.c new file mode 100644 index 0000000..9a72582 --- /dev/null +++ b/block/qed-table.c @@ -0,0 +1,242 @@ +/* + * QEMU Enhanced Disk Format Table I/O + * + * Copyright IBM, Corp. 2010 + * + * Authors: + * Stefan Hajnoczi + * Anthony Liguori + * + * This work is licensed under the terms of the GNU LGPL, version 2 or later. + * See the COPYING file in the top-level directory. + * + */ + +#include "qed.h" + +typedef struct { + GenericCB gencb; + BDRVQEDState *s; + QEDTable *table; + + struct iovec iov; + QEMUIOVector qiov; +} QEDReadTableCB; + +static void qed_read_table_cb(void *opaque, int ret) +{ + QEDReadTableCB *read_table_cb = opaque; + QEDTable *table = read_table_cb->table; + int noffsets = read_table_cb->iov.iov_len / sizeof(uint64_t); + int i; + + /* Handle I/O error */ + if (ret) { + goto out; + } + + /* Byteswap and verify offsets */ + for (i = 0; i < noffsets; i++) { + table->offsets[i] = le64_to_cpu(table->offsets[i]); + } + +out: + /* Completion */ + gencb_complete(&read_table_cb->gencb, ret); +} + +static void qed_read_table(BDRVQEDState *s, uint64_t offset, QEDTable *table, + BlockDriverCompletionFunc *cb, void *opaque) +{ + QEDReadTableCB *read_table_cb = gencb_alloc(sizeof(*read_table_cb), + cb, opaque); + QEMUIOVector *qiov = &read_table_cb->qiov; + BlockDriverAIOCB *aiocb; + + read_table_cb->s = s; + read_table_cb->table = table; + read_table_cb->iov.iov_base = table->offsets, + read_table_cb->iov.iov_len = s->header.cluster_size * s->header.table_size, + + qemu_iovec_init_external(qiov, &read_table_cb->iov, 1); + aiocb = bdrv_aio_readv(s->bs->file, offset / BDRV_SECTOR_SIZE, qiov, + read_table_cb->iov.iov_len / BDRV_SECTOR_SIZE, + qed_read_table_cb, read_table_cb); + if (!aiocb) { + qed_read_table_cb(read_table_cb, -EIO); + } +} + +typedef struct { + GenericCB gencb; + BDRVQEDState *s; + QEDTable *orig_table; + bool flush; /* flush after write? 
*/ + + struct iovec iov; + QEMUIOVector qiov; + + QEDTable table; +} QEDWriteTableCB; + +static void qed_write_table_cb(void *opaque, int ret) +{ + QEDWriteTableCB *write_table_cb = opaque; + + if (ret) { + goto out; + } + + if (write_table_cb->flush) { + /* We still need to flush first */ + write_table_cb->flush = false; + bdrv_aio_flush(write_table_cb->s->bs, qed_write_table_cb, + write_table_cb); + return; + } + +out: + gencb_complete(&write_table_cb->gencb, ret); + return; +} + +/** + * Write out an updated part or all of a table + * + * @s: QED state + * @offset: Offset of table in image file, in bytes + * @table: Table + * @index: Index of first element + * @n: Number of elements + * @flush: Whether or not to sync to disk + * @cb: Completion function + * @opaque: Argument for completion function + */ +static void qed_write_table(BDRVQEDState *s, uint64_t offset, QEDTable *table, + unsigned int index, unsigned int n, bool flush, + BlockDriverCompletionFunc *cb, void *opaque) +{ + QEDWriteTableCB *write_table_cb; + BlockDriverAIOCB *aiocb; + unsigned int sector_mask = BDRV_SECTOR_SIZE / sizeof(uint64_t) - 1; + unsigned int start, end, i; + size_t len_bytes; + + /* Calculate indices of the first and one after last elements */ + start = index & ~sector_mask; + end = (index + n + sector_mask) & ~sector_mask; + + len_bytes = (end - start) * sizeof(uint64_t); + + write_table_cb = gencb_alloc(sizeof(*write_table_cb) + len_bytes, + cb, opaque); + write_table_cb->s = s; + write_table_cb->orig_table = table; + write_table_cb->flush = flush; + write_table_cb->iov.iov_base = write_table_cb->table.offsets; + write_table_cb->iov.iov_len = len_bytes; + qemu_iovec_init_external(&write_table_cb->qiov, &write_table_cb->iov, 1); + + /* Byteswap table */ + for (i = start; i < end; i++) { + write_table_cb->table.offsets[i - start] = cpu_to_le64(table->offsets[i]); + } + + /* Adjust for offset into table */ + offset += start * sizeof(uint64_t); + + aiocb = bdrv_aio_writev(s->bs->file, offset / BDRV_SECTOR_SIZE, + &write_table_cb->qiov, + write_table_cb->iov.iov_len / BDRV_SECTOR_SIZE, + qed_write_table_cb, write_table_cb); + if (!aiocb) { + qed_write_table_cb(write_table_cb, -EIO); + } +} + +static void qed_read_l1_table_cb(void *opaque, int ret) +{ + *(int *)opaque = ret; +} + +/** + * Read the L1 table synchronously + */ +int qed_read_l1_table(BDRVQEDState *s) +{ + int ret = -EINPROGRESS; + + /* TODO push/pop async context? 
*/ + + qed_read_table(s, s->header.l1_table_offset, + s->l1_table, qed_read_l1_table_cb, &ret); + while (ret == -EINPROGRESS) { + qemu_aio_wait(); + } + return ret; +} + +void qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n, + BlockDriverCompletionFunc *cb, void *opaque) +{ + qed_write_table(s, s->header.l1_table_offset, + s->l1_table, index, n, false, cb, opaque); +} + +typedef struct { + GenericCB gencb; + BDRVQEDState *s; + uint64_t l2_offset; + QEDRequest *request; +} QEDReadL2TableCB; + +static void qed_read_l2_table_cb(void *opaque, int ret) +{ + QEDReadL2TableCB *read_l2_table_cb = opaque; + QEDRequest *request = read_l2_table_cb->request; + BDRVQEDState *s = read_l2_table_cb->s; + + if (ret) { + /* can't trust loaded L2 table anymore */ + qed_unref_l2_cache_entry(&s->l2_cache, request->l2_table); + request->l2_table = NULL; + } else { + request->l2_table->offset = read_l2_table_cb->l2_offset; + qed_commit_l2_cache_entry(&s->l2_cache, request->l2_table); + } + + gencb_complete(&read_l2_table_cb->gencb, ret); +} + +void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset, + BlockDriverCompletionFunc *cb, void *opaque) +{ + QEDReadL2TableCB *read_l2_table_cb; + + qed_unref_l2_cache_entry(&s->l2_cache, request->l2_table); + + /* Check for cached L2 entry */ + request->l2_table = qed_find_l2_cache_entry(&s->l2_cache, offset); + if (request->l2_table) { + cb(opaque, 0); + return; + } + + request->l2_table = qed_alloc_l2_cache_entry(&s->l2_cache); + + read_l2_table_cb = gencb_alloc(sizeof(*read_l2_table_cb), cb, opaque); + read_l2_table_cb->s = s; + read_l2_table_cb->l2_offset = offset; + read_l2_table_cb->request = request; + + qed_read_table(s, offset, request->l2_table->table, + qed_read_l2_table_cb, read_l2_table_cb); +} + +void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request, + unsigned int index, unsigned int n, bool flush, + BlockDriverCompletionFunc *cb, void *opaque) +{ + qed_write_table(s, request->l2_table->offset, + request->l2_table->table, index, n, flush, cb, opaque); +} diff --git a/block/qed.c b/block/qed.c new file mode 100644 index 0000000..cf64418 --- /dev/null +++ b/block/qed.c @@ -0,0 +1,1103 @@ +/* + * QEMU Enhanced Disk Format + * + * Copyright IBM, Corp. 2010 + * + * Authors: + * Stefan Hajnoczi + * Anthony Liguori + * + * This work is licensed under the terms of the GNU LGPL, version 2 or later. + * See the COPYING file in the top-level directory. + * + */ + +#include "qed.h" + +/* TODO blkdebug support */ +/* TODO BlockDriverState::buffer_alignment */ +/* TODO check L2 table sizes before accessing them? */ +/* TODO skip zero prefill since the filesystem should zero the sectors anyway */ +/* TODO if a table element's offset is invalid then the image is broken. If + * there was a power failure and the table update reached storage but the data + * being pointed to did not, forget about the lost data by clearing the offset. + * However, need to be careful to detect invalid offsets for tables that are + * read *after* more clusters have been allocated. 
*/ + +enum { + QED_MAGIC = 'Q' | 'E' << 8 | 'D' << 16 | '\0' << 24, + + /* The image supports a backing file */ + QED_F_BACKING_FILE = 0x01, + + /* The image has the backing file format */ + QED_CF_BACKING_FORMAT = 0x01, + + /* Feature bits must be used when the on-disk format changes */ + QED_FEATURE_MASK = QED_F_BACKING_FILE, /* supported feature bits */ + QED_COMPAT_FEATURE_MASK = QED_CF_BACKING_FORMAT, /* supported compat feature bits */ + + /* Data is stored in groups of sectors called clusters. Cluster size must + * be large to avoid keeping too much metadata. I/O requests that have + * sub-cluster size will require read-modify-write. + */ + QED_MIN_CLUSTER_SIZE = 4 * 1024, /* in bytes */ + QED_MAX_CLUSTER_SIZE = 64 * 1024 * 1024, + QED_DEFAULT_CLUSTER_SIZE = 64 * 1024, + + /* Allocated clusters are tracked using a 2-level pagetable. Table size is + * a multiple of clusters so large maximum image sizes can be supported + * without jacking up the cluster size too much. + */ + QED_MIN_TABLE_SIZE = 1, /* in clusters */ + QED_MAX_TABLE_SIZE = 16, + QED_DEFAULT_TABLE_SIZE = 4, +}; + +static void qed_aio_cancel(BlockDriverAIOCB *acb) +{ + qemu_aio_release(acb); +} + +static AIOPool qed_aio_pool = { + .aiocb_size = sizeof(QEDAIOCB), + .cancel = qed_aio_cancel, +}; + +/** + * Allocate memory that satisfies image file and backing file alignment requirements + * + * TODO make this common and consider propagating max buffer_alignment to the root image + */ +static void *qed_memalign(BDRVQEDState *s, size_t len) +{ + size_t align = s->bs->file->buffer_alignment; + BlockDriverState *backing_hd = s->bs->backing_hd; + + if (backing_hd && backing_hd->buffer_alignment > align) { + align = backing_hd->buffer_alignment; + } + + return qemu_memalign(align, len); +} + +static int bdrv_qed_probe(const uint8_t *buf, int buf_size, + const char *filename) +{ + const QEDHeader *header = (const void *)buf; + + if (buf_size < sizeof(*header)) { + return 0; + } + if (le32_to_cpu(header->magic) != QED_MAGIC) { + return 0; + } + return 100; +} + +static void qed_header_le_to_cpu(const QEDHeader *le, QEDHeader *cpu) +{ + cpu->magic = le32_to_cpu(le->magic); + cpu->cluster_size = le32_to_cpu(le->cluster_size); + cpu->table_size = le32_to_cpu(le->table_size); + cpu->first_cluster = le32_to_cpu(le->first_cluster); + cpu->features = le64_to_cpu(le->features); + cpu->compat_features = le64_to_cpu(le->compat_features); + cpu->l1_table_offset = le64_to_cpu(le->l1_table_offset); + cpu->image_size = le64_to_cpu(le->image_size); + cpu->backing_file_offset = le32_to_cpu(le->backing_file_offset); + cpu->backing_file_size = le32_to_cpu(le->backing_file_size); + cpu->backing_fmt_offset = le32_to_cpu(le->backing_fmt_offset); + cpu->backing_fmt_size = le32_to_cpu(le->backing_fmt_size); +} + +static void qed_header_cpu_to_le(const QEDHeader *cpu, QEDHeader *le) +{ + le->magic = cpu_to_le32(cpu->magic); + le->cluster_size = cpu_to_le32(cpu->cluster_size); + le->table_size = cpu_to_le32(cpu->table_size); + le->first_cluster = cpu_to_le32(cpu->first_cluster); + le->features = cpu_to_le64(cpu->features); + le->compat_features = cpu_to_le64(cpu->compat_features); + le->l1_table_offset = cpu_to_le64(cpu->l1_table_offset); + le->image_size = cpu_to_le64(cpu->image_size); + le->backing_file_offset = cpu_to_le32(cpu->backing_file_offset); + le->backing_file_size = cpu_to_le32(cpu->backing_file_size); + le->backing_fmt_offset = cpu_to_le32(cpu->backing_fmt_offset); + le->backing_fmt_size = cpu_to_le32(cpu->backing_fmt_size); +} + +static 
uint64_t qed_max_image_size(uint32_t cluster_size, uint32_t table_size) +{ + uint64_t table_entries; + uint64_t l2_size; + + table_entries = (table_size * cluster_size) / 8; + l2_size = table_entries * cluster_size; + + return l2_size * table_entries; +} + +static bool qed_is_cluster_size_valid(uint32_t cluster_size) +{ + if (cluster_size < QED_MIN_CLUSTER_SIZE || + cluster_size > QED_MAX_CLUSTER_SIZE) { + return false; + } + if (cluster_size & (cluster_size - 1)) { + return false; /* not power of 2 */ + } + return true; +} + +static bool qed_is_table_size_valid(uint32_t table_size) +{ + if (table_size < QED_MIN_TABLE_SIZE || + table_size > QED_MAX_TABLE_SIZE) { + return false; + } + if (table_size & (table_size - 1)) { + return false; /* not power of 2 */ + } + return true; +} + +static bool qed_is_image_size_valid(uint64_t image_size, uint32_t cluster_size, + uint32_t table_size) +{ + if (image_size == 0) { + /* Supporting zero size images makes life harder because even the L1 + * table is not needed. Make life simple and forbid zero size images. + */ + return false; + } + if (image_size & (cluster_size - 1)) { + return false; /* not multiple of cluster size */ + } + if (image_size > qed_max_image_size(cluster_size, table_size)) { + return false; /* image is too large */ + } + return true; +} + +/** + * Test if a byte offset is cluster aligned and within the image file + */ +static bool qed_check_byte_offset(BDRVQEDState *s, uint64_t offset) +{ + if (offset & (s->header.cluster_size - 1)) { + return false; + } + if (offset == 0) { + return false; /* first cluster contains the header and is not valid */ + } + return offset < s->file_size; +} + +/** + * Read a string of known length from the image file + * + * @file: Image file + * @offset: File offset to start of string, in bytes + * @n: String length in bytes + * @buf: Destination buffer + * @buflen: Destination buffer length in bytes + * + * The string is NUL-terminated. 
+ */ +static int qed_read_string(BlockDriverState *file, uint64_t offset, size_t n, + char *buf, size_t buflen) +{ + int ret; + if (n >= buflen) { + return -EINVAL; + } + ret = bdrv_pread(file, offset, buf, n); + if (ret != n) { + return ret; + } + buf[n] = '\0'; + return 0; +} + +/** + * Allocate new clusters + * + * @s: QED state + * @n: Number of contiguous clusters to allocate + * @offset: Offset of first allocated cluster, filled in on success + */ +static int qed_alloc_clusters(BDRVQEDState *s, unsigned int n, uint64_t *offset) +{ + *offset = s->file_size; + s->file_size += n * s->header.cluster_size; + return 0; +} + +static QEDTable *qed_alloc_table(void *opaque) +{ + BDRVQEDState *s = opaque; + + /* Honor O_DIRECT memory alignment requirements */ + return qed_memalign(s, s->header.cluster_size * s->header.table_size); +} + +/** + * Allocate a new zeroed L2 table + */ +static CachedL2Table *qed_new_l2_table(BDRVQEDState *s) +{ + uint64_t offset; + int ret; + CachedL2Table *l2_table; + + ret = qed_alloc_clusters(s, s->header.table_size, &offset); + if (ret) { + return NULL; + } + + l2_table = qed_alloc_l2_cache_entry(&s->l2_cache); + l2_table->offset = offset; + + memset(l2_table->table->offsets, 0, + s->header.cluster_size * s->header.table_size); + return l2_table; +} + +static int bdrv_qed_open(BlockDriverState *bs, int flags) +{ + BDRVQEDState *s = bs->opaque; + QEDHeader le_header; + int64_t file_size; + int ret; + + s->bs = bs; + QSIMPLEQ_INIT(&s->allocating_write_reqs); + + ret = bdrv_pread(bs->file, 0, &le_header, sizeof(le_header)); + if (ret != sizeof(le_header)) { + return ret; + } + qed_header_le_to_cpu(&le_header, &s->header); + + if (s->header.magic != QED_MAGIC) { + return -ENOENT; + } + if (s->header.features & ~QED_FEATURE_MASK) { + return -ENOTSUP; /* image uses unsupported feature bits */ + } + if (!qed_is_cluster_size_valid(s->header.cluster_size)) { + return -EINVAL; + } + + /* Round up file size to the next cluster */ + file_size = bdrv_getlength(bs->file); + if (file_size < 0) { + return file_size; + } + s->file_size = qed_start_of_cluster(s, file_size + s->header.cluster_size - 1); + + if (!qed_is_table_size_valid(s->header.table_size)) { + return -EINVAL; + } + if (!qed_is_image_size_valid(s->header.image_size, + s->header.cluster_size, + s->header.table_size)) { + return -EINVAL; + } + if (!qed_check_byte_offset(s, s->header.l1_table_offset)) { + return -EINVAL; + } + + s->table_nelems = (s->header.cluster_size * s->header.table_size) / + sizeof(s->l1_table->offsets[0]); + s->l2_shift = get_bits_from_size(s->header.cluster_size); + s->l2_mask = s->table_nelems - 1; + s->l1_shift = s->l2_shift + get_bits_from_size(s->l2_mask + 1); + + if ((s->header.features & QED_F_BACKING_FILE)) { + ret = qed_read_string(bs->file, s->header.backing_file_offset, + s->header.backing_file_size, bs->backing_file, + sizeof(bs->backing_file)); + if (ret < 0) { + return ret; + } + + if ((s->header.compat_features & QED_CF_BACKING_FORMAT)) { + ret = qed_read_string(bs->file, s->header.backing_fmt_offset, + s->header.backing_fmt_size, + bs->backing_format, + sizeof(bs->backing_format)); + if (ret < 0) { + return ret; + } + } + } + + s->l1_table = qed_alloc_table(s); + qed_init_l2_cache(&s->l2_cache, qed_alloc_table, s); + + ret = qed_read_l1_table(s); + if (ret) { + qed_free_l2_cache(&s->l2_cache); + qemu_free(s->l1_table); + } + return ret; +} + +static void bdrv_qed_close(BlockDriverState *bs) +{ + BDRVQEDState *s = bs->opaque; + + qed_free_l2_cache(&s->l2_cache); + 
qemu_free(s->l1_table); +} + +static void bdrv_qed_flush(BlockDriverState *bs) +{ + bdrv_flush(bs->file); +} + +static int qed_create(const char *filename, uint32_t cluster_size, + uint64_t image_size, uint32_t table_size, + const char *backing_file, const char *backing_fmt) +{ + QEDHeader header = { + .magic = QED_MAGIC, + .cluster_size = cluster_size, + .table_size = table_size, + .first_cluster = 1, + .features = 0, + .compat_features = 0, + .l1_table_offset = cluster_size, + .image_size = image_size, + }; + QEDHeader le_header; + uint8_t *l1_table = NULL; + size_t l1_size = header.cluster_size * header.table_size; + int ret = 0; + int fd; + + fd = open(filename, O_WRONLY | O_CREAT | O_TRUNC | O_BINARY, 0644); + if (fd < 0) { + return -errno; + } + + if (backing_file) { + header.features |= QED_F_BACKING_FILE; + header.backing_file_offset = sizeof(le_header); + header.backing_file_size = strlen(backing_file); + if (backing_fmt) { + header.compat_features |= QED_CF_BACKING_FORMAT; + header.backing_fmt_offset = header.backing_file_offset + + header.backing_file_size; + header.backing_fmt_size = strlen(backing_fmt); + } + } + + qed_header_cpu_to_le(&header, &le_header); + if (qemu_write_full(fd, &le_header, sizeof(le_header)) != sizeof(le_header)) { + ret = -errno; + goto out; + } + if (qemu_write_full(fd, backing_file, header.backing_file_size) != header.backing_file_size) { + ret = -errno; + goto out; + } + if (qemu_write_full(fd, backing_fmt, header.backing_fmt_size) != header.backing_fmt_size) { + ret = -errno; + goto out; + } + + l1_table = qemu_mallocz(l1_size); + lseek(fd, header.l1_table_offset, SEEK_SET); + if (qemu_write_full(fd, l1_table, l1_size) != l1_size) { + ret = -errno; + goto out; + } + +out: + qemu_free(l1_table); + close(fd); + return ret; +} + +static int bdrv_qed_create(const char *filename, QEMUOptionParameter *options) +{ + uint64_t image_size = 0; + uint32_t cluster_size = QED_DEFAULT_CLUSTER_SIZE; + uint32_t table_size = QED_DEFAULT_TABLE_SIZE; + const char *backing_file = NULL; + const char *backing_fmt = NULL; + + while (options && options->name) { + if (!strcmp(options->name, BLOCK_OPT_SIZE)) { + image_size = options->value.n; + } else if (!strcmp(options->name, BLOCK_OPT_BACKING_FILE)) { + backing_file = options->value.s; + } else if (!strcmp(options->name, BLOCK_OPT_BACKING_FMT)) { + backing_fmt = options->value.s; + } else if (!strcmp(options->name, BLOCK_OPT_CLUSTER_SIZE)) { + if (options->value.n) { + cluster_size = options->value.n; + } + } else if (!strcmp(options->name, "table_size")) { + if (options->value.n) { + table_size = options->value.n; + } + } + options++; + } + + if (!qed_is_cluster_size_valid(cluster_size)) { + fprintf(stderr, "QED cluster size must be within range [%u, %u] and power of 2\n", + QED_MIN_CLUSTER_SIZE, QED_MAX_CLUSTER_SIZE); + return -EINVAL; + } + if (!qed_is_table_size_valid(table_size)) { + fprintf(stderr, "QED table size must be within range [%u, %u] and power of 2\n", + QED_MIN_TABLE_SIZE, QED_MAX_TABLE_SIZE); + return -EINVAL; + } + if (!qed_is_image_size_valid(image_size, cluster_size, table_size)) { + fprintf(stderr, + "QED image size must be a non-zero multiple of cluster size and less than %s\n", + bytes_to_str(qed_max_image_size(cluster_size, table_size))); + return -EINVAL; + } + + return qed_create(filename, cluster_size, image_size, table_size, + backing_file, backing_fmt); +} + +typedef struct { + int is_allocated; + int *pnum; +} QEDIsAllocatedCB; + +static void qed_is_allocated_cb(void *opaque, int ret, 
uint64_t offset, size_t len) +{ + QEDIsAllocatedCB *cb = opaque; + *cb->pnum = len / BDRV_SECTOR_SIZE; + cb->is_allocated = ret == QED_CLUSTER_FOUND; +} + +static int bdrv_qed_is_allocated(BlockDriverState *bs, int64_t sector_num, + int nb_sectors, int *pnum) +{ + BDRVQEDState *s = bs->opaque; + uint64_t pos = (uint64_t)sector_num * BDRV_SECTOR_SIZE; + size_t len = (size_t)nb_sectors * BDRV_SECTOR_SIZE; + QEDIsAllocatedCB cb = { + .is_allocated = -1, + .pnum = pnum, + }; + QEDRequest request = { .l2_table = NULL }; + + /* TODO push/pop async context? */ + + qed_find_cluster(s, &request, pos, len, qed_is_allocated_cb, &cb); + + while (cb.is_allocated == -1) { + qemu_aio_wait(); + } + + qed_unref_l2_cache_entry(&s->l2_cache, request.l2_table); + + return cb.is_allocated; +} + +static int bdrv_qed_make_empty(BlockDriverState *bs) +{ + return -ENOTSUP; /* TODO */ +} + +static BDRVQEDState *acb_to_s(QEDAIOCB *acb) +{ + return acb->common.bs->opaque; +} + +typedef struct { + GenericCB gencb; + BDRVQEDState *s; + QEMUIOVector qiov; + struct iovec iov; + uint64_t offset; +} CopyFromBackingFileCB; + +static void qed_copy_from_backing_file_cb(void *opaque, int ret) +{ + CopyFromBackingFileCB *copy_cb = opaque; + qemu_vfree(copy_cb->iov.iov_base); + gencb_complete(©_cb->gencb, ret); +} + +static void qed_copy_from_backing_file_write(void *opaque, int ret) +{ + CopyFromBackingFileCB *copy_cb = opaque; + BDRVQEDState *s = copy_cb->s; + BlockDriverAIOCB *aiocb; + + if (ret) { + qed_copy_from_backing_file_cb(copy_cb, ret); + return; + } + + aiocb = bdrv_aio_writev(s->bs->file, copy_cb->offset / BDRV_SECTOR_SIZE, + ©_cb->qiov, + copy_cb->qiov.size / BDRV_SECTOR_SIZE, + qed_copy_from_backing_file_cb, copy_cb); + if (!aiocb) { + qed_copy_from_backing_file_cb(copy_cb, -EIO); + } +} + +/** + * Copy data from backing file into the image + * + * @s: QED state + * @pos: Byte position in device + * @len: Number of bytes + * @offset: Byte offset in image file + * @cb: Completion function + * @opaque: User data for completion function + */ +static void qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos, + uint64_t len, uint64_t offset, + BlockDriverCompletionFunc *cb, + void *opaque) +{ + CopyFromBackingFileCB *copy_cb; + BlockDriverAIOCB *aiocb; + + /* Skip copy entirely if there is no work to do */ + if (len == 0) { + cb(opaque, 0); + return; + } + + copy_cb = gencb_alloc(sizeof(*copy_cb), cb, opaque); + copy_cb->s = s; + copy_cb->offset = offset; + copy_cb->iov.iov_base = qed_memalign(s, len); + copy_cb->iov.iov_len = len; + qemu_iovec_init_external(©_cb->qiov, ©_cb->iov, 1); + + /* Zero sectors if there is no backing file */ + if (!s->bs->backing_hd) { + memset(copy_cb->iov.iov_base, 0, len); + qed_copy_from_backing_file_write(copy_cb, 0); + return; + } + + aiocb = bdrv_aio_readv(s->bs->backing_hd, pos / BDRV_SECTOR_SIZE, + ©_cb->qiov, len / BDRV_SECTOR_SIZE, + qed_copy_from_backing_file_write, copy_cb); + if (!aiocb) { + qed_copy_from_backing_file_cb(copy_cb, -EIO); + } +} + +/** + * Link one or more contiguous clusters into a table + * + * @s: QED state + * @table: L2 table + * @index: First cluster index + * @n: Number of contiguous clusters + * @cluster: First cluster byte offset in image file + */ +static void qed_update_l2_table(BDRVQEDState *s, QEDTable *table, int index, + unsigned int n, uint64_t cluster) +{ + int i; + for (i = index; i < index + n; i++) { + table->offsets[i] = cluster; + cluster += s->header.cluster_size; + } +} + +static void qed_aio_next_io(void *opaque, int ret); + +static 
void qed_aio_complete_bh(void *opaque) +{ + QEDAIOCB *acb = opaque; + BlockDriverCompletionFunc *cb = acb->common.cb; + void *user_opaque = acb->common.opaque; + int ret = acb->bh_ret; + + qemu_bh_delete(acb->bh); + qemu_aio_release(acb); + + /* Invoke callback */ + cb(user_opaque, ret); +} + +static void qed_aio_complete(QEDAIOCB *acb, int ret) +{ + BDRVQEDState *s = acb_to_s(acb); + + /* Free resources */ + qemu_iovec_destroy(&acb->cur_qiov); + qed_unref_l2_cache_entry(&s->l2_cache, acb->request.l2_table); + + /* Arrange for a bh to invoke the completion function */ + acb->bh_ret = ret; + acb->bh = qemu_bh_new(qed_aio_complete_bh, acb); + qemu_bh_schedule(acb->bh); + + /* Start next allocating write request waiting behind this one. Note that + * requests enqueue themselves when they first hit an unallocated cluster + * but they wait until the entire request is finished before waking up the + * next request in the queue. This ensures that we don't cycle through + * requests multiple times but rather finish one at a time completely. + */ + if (acb == QSIMPLEQ_FIRST(&s->allocating_write_reqs)) { + QSIMPLEQ_REMOVE_HEAD(&s->allocating_write_reqs, next); + acb = QSIMPLEQ_FIRST(&s->allocating_write_reqs); + if (acb) { + qed_aio_next_io(acb, 0); + } + } +} + +/** + * Construct an iovec array for the current cluster + * + * @acb: I/O request + * @len: Maximum number of bytes + */ +static void qed_acb_build_qiov(QEDAIOCB *acb, size_t len) +{ + struct iovec *iov_end = &acb->qiov->iov[acb->qiov->niov]; + size_t iov_offset = acb->cur_iov_offset; + struct iovec *iov = acb->cur_iov; + + /* Fill in one cluster's worth of iovecs */ + while (iov != iov_end && len > 0) { + size_t nbytes = MIN(iov->iov_len - iov_offset, len); + + qemu_iovec_add(&acb->cur_qiov, iov->iov_base + iov_offset, nbytes); + iov_offset += nbytes; + len -= nbytes; + + if (iov_offset >= iov->iov_len) { + iov_offset = 0; + iov++; + } + } + + /* Stash state for next time */ + acb->cur_iov = iov; + acb->cur_iov_offset = iov_offset; +} + +/** + * Commit the current L2 table to the cache + */ +static void qed_commit_l2_update(void *opaque, int ret) +{ + QEDAIOCB *acb = opaque; + BDRVQEDState *s = acb_to_s(acb); + + qed_commit_l2_cache_entry(&s->l2_cache, acb->request.l2_table); + qed_aio_next_io(opaque, ret); +} + +/** + * Update L1 table with new L2 table offset and write it out + */ +static void qed_aio_write_l1_update(void *opaque, int ret) +{ + QEDAIOCB *acb = opaque; + BDRVQEDState *s = acb_to_s(acb); + int index; + + if (ret) { + qed_aio_complete(acb, ret); + return; + } + + index = qed_l1_index(s, acb->cur_pos); + s->l1_table->offsets[index] = acb->request.l2_table->offset; + + qed_write_l1_table(s, index, 1, qed_commit_l2_update, acb); +} + +/** + * Update L2 table with new cluster offsets and write them out + */ +static void qed_aio_write_l2_update(void *opaque, int ret) +{ + QEDAIOCB *acb = opaque; + BDRVQEDState *s = acb_to_s(acb); + bool need_alloc = acb->find_cluster_ret == QED_CLUSTER_L1; + int index; + + if (ret) { + goto err; + } + + if (need_alloc) { + qed_unref_l2_cache_entry(&s->l2_cache, acb->request.l2_table); + acb->request.l2_table = qed_new_l2_table(s); + if (!acb->request.l2_table) { + ret = -EIO; + goto err; + } + } + + index = qed_l2_index(s, acb->cur_pos); + qed_update_l2_table(s, acb->request.l2_table->table, index, acb->cur_nclusters, + acb->cur_cluster); + + if (need_alloc) { + /* Write out the whole new L2 table */ + qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true, + qed_aio_write_l1_update, 
acb); + } else { + /* Write out only the updated part of the L2 table */ + qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters, false, + qed_aio_next_io, acb); + } + return; + +err: + qed_aio_complete(acb, ret); +} + +/** + * Write data to the image file + */ +static void qed_aio_write_main(void *opaque, int ret) +{ + QEDAIOCB *acb = opaque; + BDRVQEDState *s = acb_to_s(acb); + bool need_alloc = acb->find_cluster_ret != QED_CLUSTER_FOUND; + uint64_t offset = acb->cur_cluster; + BlockDriverAIOCB *file_acb; + + if (ret) { + qed_aio_complete(acb, ret); + return; + } + + offset += qed_offset_into_cluster(s, acb->cur_pos); + file_acb = bdrv_aio_writev(s->bs->file, offset / BDRV_SECTOR_SIZE, + &acb->cur_qiov, + acb->cur_qiov.size / BDRV_SECTOR_SIZE, + need_alloc ? qed_aio_write_l2_update : + qed_aio_next_io, + acb); + if (!file_acb) { + qed_aio_complete(acb, -EIO); + } +} + +/** + * Populate back untouched region of new data cluster + */ +static void qed_aio_write_postfill(void *opaque, int ret) +{ + QEDAIOCB *acb = opaque; + BDRVQEDState *s = acb_to_s(acb); + uint64_t start = acb->cur_pos + acb->cur_qiov.size; + uint64_t len = qed_start_of_cluster(s, start + s->header.cluster_size - 1) - start; + uint64_t offset = acb->cur_cluster + qed_offset_into_cluster(s, acb->cur_pos) + acb->cur_qiov.size; + + if (ret) { + qed_aio_complete(acb, ret); + return; + } + + qed_copy_from_backing_file(s, start, len, offset, + qed_aio_write_main, acb); +} + +/** + * Populate front untouched region of new data cluster + */ +static void qed_aio_write_prefill(QEDAIOCB *acb) +{ + BDRVQEDState *s = acb_to_s(acb); + uint64_t start = qed_start_of_cluster(s, acb->cur_pos); + uint64_t len = qed_offset_into_cluster(s, acb->cur_pos); + + qed_copy_from_backing_file(s, start, len, acb->cur_cluster, + qed_aio_write_postfill, acb); +} + +/** + * Write data cluster + * + * @opaque: Write request + * @ret: QED_CLUSTER_FOUND, QED_CLUSTER_L2, QED_CLUSTER_L1, + * or QED_CLUSTER_ERROR + * @offset: Cluster offset in bytes + * @len: Length in bytes + * + * Callback from qed_find_cluster(). + */ +static void qed_aio_write_data(void *opaque, int ret, + uint64_t offset, size_t len) +{ + QEDAIOCB *acb = opaque; + BDRVQEDState *s = acb_to_s(acb); + bool need_alloc = ret != QED_CLUSTER_FOUND; + + if (ret == QED_CLUSTER_ERROR) { + goto err; + } + + /* Freeze this request if another allocating write is in progress */ + if (need_alloc) { + if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs)) { + QSIMPLEQ_INSERT_TAIL(&s->allocating_write_reqs, acb, next); + } + if (acb != QSIMPLEQ_FIRST(&s->allocating_write_reqs)) { + return; /* wait for existing request to finish */ + } + } + + acb->cur_nclusters = qed_bytes_to_clusters(s, + qed_offset_into_cluster(s, acb->cur_pos) + len); + + if (need_alloc) { + if (qed_alloc_clusters(s, acb->cur_nclusters, &offset) != 0) { + goto err; + } + } + + acb->find_cluster_ret = ret; + acb->cur_cluster = offset; + qed_acb_build_qiov(acb, len); + + if (need_alloc) { + qed_aio_write_prefill(acb); + } else { + qed_aio_write_main(acb, 0); + } + return; + +err: + qed_aio_complete(acb, -EIO); +} + +/** + * Read data cluster + * + * @opaque: Read request + * @ret: QED_CLUSTER_FOUND, QED_CLUSTER_L2, QED_CLUSTER_L1, + * or QED_CLUSTER_ERROR + * @offset: Cluster offset in bytes + * @len: Length in bytes + * + * Callback from qed_find_cluster(). 
+ */ +static void qed_aio_read_data(void *opaque, int ret, + uint64_t offset, size_t len) +{ + QEDAIOCB *acb = opaque; + BDRVQEDState *s = acb_to_s(acb); + BlockDriverState *bs = acb->common.bs; + BlockDriverState *file = bs->file; + BlockDriverAIOCB *file_acb; + + if (ret == QED_CLUSTER_ERROR) { + goto err; + } + + qed_acb_build_qiov(acb, len); + + /* Adjust offset into cluster */ + offset += qed_offset_into_cluster(s, acb->cur_pos); + + /* Handle backing file and unallocated sparse hole reads */ + if (ret != QED_CLUSTER_FOUND) { + if (!bs->backing_hd) { + qemu_iovec_zero(&acb->cur_qiov); + qed_aio_next_io(acb, 0); + return; + } + + /* Pass through read to backing file */ + offset = acb->cur_pos; + file = bs->backing_hd; + } + + file_acb = bdrv_aio_readv(file, offset / BDRV_SECTOR_SIZE, + &acb->cur_qiov, + acb->cur_qiov.size / BDRV_SECTOR_SIZE, + qed_aio_next_io, acb); + if (!file_acb) { + goto err; + } + return; + +err: + qed_aio_complete(acb, -EIO); +} + +/** + * Begin next I/O or complete the request + */ +static void qed_aio_next_io(void *opaque, int ret) +{ + QEDAIOCB *acb = opaque; + BDRVQEDState *s = acb_to_s(acb); + QEDFindClusterFunc *io_fn = + acb->is_write ? qed_aio_write_data : qed_aio_read_data; + + /* Handle I/O error */ + if (ret) { + qed_aio_complete(acb, ret); + return; + } + + acb->cur_pos += acb->cur_qiov.size; + qemu_iovec_reset(&acb->cur_qiov); + + /* Complete request */ + if (acb->cur_pos >= acb->end_pos) { + qed_aio_complete(acb, 0); + return; + } + + /* Find next cluster and start I/O */ + qed_find_cluster(s, &acb->request, + acb->cur_pos, acb->end_pos - acb->cur_pos, + io_fn, acb); +} + +static BlockDriverAIOCB *qed_aio_setup(BlockDriverState *bs, + int64_t sector_num, + QEMUIOVector *qiov, int nb_sectors, + BlockDriverCompletionFunc *cb, + void *opaque, bool is_write) +{ + QEDAIOCB *acb = qemu_aio_get(&qed_aio_pool, bs, cb, opaque); + + acb->is_write = is_write; + acb->qiov = qiov; + acb->cur_iov = acb->qiov->iov; + acb->cur_iov_offset = 0; + acb->cur_pos = (uint64_t)sector_num * BDRV_SECTOR_SIZE; + acb->end_pos = acb->cur_pos + nb_sectors * BDRV_SECTOR_SIZE; + acb->request.l2_table = NULL; + qemu_iovec_init(&acb->cur_qiov, qiov->niov); + + /* Start request */ + qed_aio_next_io(acb, 0); + return &acb->common; +} + +static BlockDriverAIOCB *bdrv_qed_aio_readv(BlockDriverState *bs, + int64_t sector_num, + QEMUIOVector *qiov, int nb_sectors, + BlockDriverCompletionFunc *cb, + void *opaque) +{ + return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, false); +} + +static BlockDriverAIOCB *bdrv_qed_aio_writev(BlockDriverState *bs, + int64_t sector_num, + QEMUIOVector *qiov, int nb_sectors, + BlockDriverCompletionFunc *cb, + void *opaque) +{ + return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, true); +} + +static BlockDriverAIOCB *bdrv_qed_aio_flush(BlockDriverState *bs, + BlockDriverCompletionFunc *cb, + void *opaque) +{ + return bdrv_aio_flush(bs->file, cb, opaque); +} + +static int bdrv_qed_truncate(BlockDriverState *bs, int64_t offset) +{ + return -ENOTSUP; /* TODO */ +} + +static int64_t bdrv_qed_getlength(BlockDriverState *bs) +{ + BDRVQEDState *s = bs->opaque; + return s->header.image_size; +} + +static int bdrv_qed_get_info(BlockDriverState *bs, BlockDriverInfo *bdi) +{ + BDRVQEDState *s = bs->opaque; + + memset(bdi, 0, sizeof(*bdi)); + bdi->cluster_size = s->header.cluster_size; + return 0; +} + +static int bdrv_qed_change_backing_file(BlockDriverState *bs, + const char *backing_file, + const char *backing_fmt) +{ + return -ENOTSUP; 
/* TODO */ +} + +static int bdrv_qed_check(BlockDriverState* bs, BdrvCheckResult *result) +{ + return -ENOTSUP; /* TODO */ +} + +static QEMUOptionParameter qed_create_options[] = { + { + .name = BLOCK_OPT_SIZE, + .type = OPT_SIZE, + .help = "Virtual disk size (in bytes)" + }, { + .name = BLOCK_OPT_BACKING_FILE, + .type = OPT_STRING, + .help = "File name of a base image" + }, { + .name = BLOCK_OPT_BACKING_FMT, + .type = OPT_STRING, + .help = "Image format of the base image" + }, { + .name = BLOCK_OPT_CLUSTER_SIZE, + .type = OPT_SIZE, + .help = "Cluster size (in bytes)" + }, { + .name = "table_size", + .type = OPT_SIZE, + .help = "L1/L2 table size (in clusters)" + }, + { /* end of list */ } +}; + +static BlockDriver bdrv_qed = { + .format_name = "qed", + .instance_size = sizeof(BDRVQEDState), + .create_options = qed_create_options, + + .bdrv_probe = bdrv_qed_probe, + .bdrv_open = bdrv_qed_open, + .bdrv_close = bdrv_qed_close, + .bdrv_create = bdrv_qed_create, + .bdrv_flush = bdrv_qed_flush, + .bdrv_is_allocated = bdrv_qed_is_allocated, + .bdrv_make_empty = bdrv_qed_make_empty, + .bdrv_aio_readv = bdrv_qed_aio_readv, + .bdrv_aio_writev = bdrv_qed_aio_writev, + .bdrv_aio_flush = bdrv_qed_aio_flush, + .bdrv_truncate = bdrv_qed_truncate, + .bdrv_getlength = bdrv_qed_getlength, + .bdrv_get_info = bdrv_qed_get_info, + .bdrv_change_backing_file = bdrv_qed_change_backing_file, + .bdrv_check = bdrv_qed_check, +}; + +static void bdrv_qed_init(void) +{ + bdrv_register(&bdrv_qed); +} + +block_init(bdrv_qed_init); diff --git a/block/qed.h b/block/qed.h new file mode 100644 index 0000000..4711fbd --- /dev/null +++ b/block/qed.h @@ -0,0 +1,212 @@ +/* + * QEMU Enhanced Disk Format + * + * Copyright IBM, Corp. 2010 + * + * Authors: + * Stefan Hajnoczi + * Anthony Liguori + * + * This work is licensed under the terms of the GNU LGPL, version 2 or later. + * See the COPYING file in the top-level directory. + * + */ + +#ifndef BLOCK_QED_H +#define BLOCK_QED_H + +#include "block_int.h" + +/* The layout of a QED file is as follows: + * + * +--------+----------+----------+----------+-----+ + * | header | L1 table | cluster0 | cluster1 | ... | + * +--------+----------+----------+----------+-----+ + * + * There is a 2-level pagetable for cluster allocation: + * + * +----------+ + * | L1 table | + * +----------+ + * ,------' | '------. + * +----------+ | +----------+ + * | L2 table | ... | L2 table | + * +----------+ +----------+ + * ,------' | '------. + * +----------+ | +----------+ + * | Data | ... | Data | + * +----------+ +----------+ + * + * The L1 table is fixed size and always present. L2 tables are allocated on + * demand. The L1 table size determines the maximum possible image size; it + * can be influenced using the cluster_size and table_size values. + * + * All fields are little-endian on disk. 
+ */ + +typedef struct { + uint32_t magic; /* QED */ + + uint32_t cluster_size; /* in bytes */ + uint32_t table_size; /* table size, in clusters */ + uint32_t first_cluster; /* first usable cluster */ + + uint64_t features; /* format feature bits */ + uint64_t compat_features; /* compatible feature bits */ + uint64_t l1_table_offset; /* L1 table offset, in bytes */ + uint64_t image_size; /* total image size, in bytes */ + + uint32_t backing_file_offset; /* in bytes from start of header */ + uint32_t backing_file_size; /* in bytes */ + uint32_t backing_fmt_offset; /* in bytes from start of header */ + uint32_t backing_fmt_size; /* in bytes */ +} QEDHeader; + +typedef struct { + uint64_t offsets[0]; /* in bytes */ +} QEDTable; + +/* The L2 cache is a simple write-through cache for L2 structures */ +typedef struct CachedL2Table { + QEDTable *table; + uint64_t offset; /* offset=0 indicates an invalidate entry */ + QTAILQ_ENTRY(CachedL2Table) node; + int ref; +} CachedL2Table; + +/** + * Allocate an L2 table + * + * This callback is used by the L2 cache to allocate tables without knowing + * their size or alignment requirements. + */ +typedef QEDTable *L2TableAllocFunc(void *opaque); + +typedef struct { + QTAILQ_HEAD(, CachedL2Table) entries; + unsigned int n_entries; + L2TableAllocFunc *alloc_l2_table; + void *alloc_l2_table_opaque; +} L2TableCache; + +typedef struct QEDRequest { + CachedL2Table *l2_table; +} QEDRequest; + +typedef struct QEDAIOCB { + BlockDriverAIOCB common; + QEMUBH *bh; + int bh_ret; /* final return status for completion bh */ + QSIMPLEQ_ENTRY(QEDAIOCB) next; /* next request */ + bool is_write; /* false - read, true - write */ + + /* User scatter-gather list */ + QEMUIOVector *qiov; + struct iovec *cur_iov; /* current iovec to process */ + size_t cur_iov_offset; /* byte count already processed in iovec */ + + /* Current cluster scatter-gather list */ + QEMUIOVector cur_qiov; + uint64_t cur_pos; /* position on block device, in bytes */ + uint64_t end_pos; + uint64_t cur_cluster; /* cluster offset in image file */ + unsigned int cur_nclusters; /* number of clusters being accessed */ + int find_cluster_ret; /* used for L1/L2 update */ + + QEDRequest request; +} QEDAIOCB; + +typedef struct { + BlockDriverState *bs; /* device */ + uint64_t file_size; /* length of image file, in bytes */ + + QEDHeader header; /* always cpu-endian */ + QEDTable *l1_table; + L2TableCache l2_cache; /* l2 table cache */ + uint32_t table_nelems; + uint32_t l1_shift; + uint32_t l2_shift; + uint32_t l2_mask; + + /* Allocating write request queue */ + QSIMPLEQ_HEAD(, QEDAIOCB) allocating_write_reqs; +} BDRVQEDState; + +enum { + QED_CLUSTER_FOUND, /* cluster found */ + QED_CLUSTER_L2, /* cluster missing in L2 */ + QED_CLUSTER_L1, /* cluster missing in L1 */ + QED_CLUSTER_ERROR, /* error looking up cluster */ +}; + +typedef void QEDFindClusterFunc(void *opaque, int ret, uint64_t offset, size_t len); + +/** + * Generic callback for chaining async callbacks + */ +typedef struct { + BlockDriverCompletionFunc *cb; + void *opaque; +} GenericCB; + +void *gencb_alloc(size_t len, BlockDriverCompletionFunc *cb, void *opaque); +void gencb_complete(void *opaque, int ret); + +/** + * L2 cache functions + */ +void qed_init_l2_cache(L2TableCache *l2_cache, L2TableAllocFunc *alloc_l2_table, void *alloc_l2_table_opaque); +void qed_free_l2_cache(L2TableCache *l2_cache); +CachedL2Table *qed_alloc_l2_cache_entry(L2TableCache *l2_cache); +void qed_unref_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *entry); 
+CachedL2Table *qed_find_l2_cache_entry(L2TableCache *l2_cache, uint64_t offset); +void qed_commit_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *l2_table); + +/** + * Table I/O functions + */ +int qed_read_l1_table(BDRVQEDState *s); +void qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n, + BlockDriverCompletionFunc *cb, void *opaque); +void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset, + BlockDriverCompletionFunc *cb, void *opaque); +void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request, + unsigned int index, unsigned int n, bool flush, + BlockDriverCompletionFunc *cb, void *opaque); + +/** + * Cluster functions + */ +void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos, + size_t len, QEDFindClusterFunc *cb, void *opaque); + +/** + * Utility functions + */ +static inline uint64_t qed_start_of_cluster(BDRVQEDState *s, uint64_t offset) +{ + return offset & ~(uint64_t)(s->header.cluster_size - 1); +} + +static inline uint64_t qed_offset_into_cluster(BDRVQEDState *s, uint64_t offset) +{ + return offset & (s->header.cluster_size - 1); +} + +static inline unsigned int qed_bytes_to_clusters(BDRVQEDState *s, size_t bytes) +{ + return qed_start_of_cluster(s, bytes + (s->header.cluster_size - 1)) / + (s->header.cluster_size - 1); +} + +static inline unsigned int qed_l1_index(BDRVQEDState *s, uint64_t pos) +{ + return pos >> s->l1_shift; +} + +static inline unsigned int qed_l2_index(BDRVQEDState *s, uint64_t pos) +{ + return (pos >> s->l2_shift) & s->l2_mask; +} + +#endif /* BLOCK_QED_H */ diff --git a/cutils.c b/cutils.c index 036ae3c..e5b6fae 100644 --- a/cutils.c +++ b/cutils.c @@ -234,6 +234,14 @@ void qemu_iovec_from_buffer(QEMUIOVector *qiov, const void *buf, size_t count) } } +void qemu_iovec_zero(QEMUIOVector *qiov) +{ + struct iovec *iov; + for (iov = qiov->iov; iov != &qiov->iov[qiov->niov]; iov++) { + memset(iov->iov_base, 0, iov->iov_len); + } +} + #ifndef _WIN32 /* Sets a specific flag */ int fcntl_setfl(int fd, int flag) @@ -251,3 +259,48 @@ int fcntl_setfl(int fd, int flag) } #endif +/** + * Get the number of bits for a power of 2 + * + * The following is true for powers of 2: + * n == 1 << get_bits_from_size(n) + */ +int get_bits_from_size(size_t size) +{ + int res = 0; + + if (size == 0) { + return -1; + } + + while (size != 1) { + /* Not a power of two */ + if (size & 1) { + return -1; + } + + size >>= 1; + res++; + } + + return res; +} + +const char *bytes_to_str(uint64_t size) +{ + static char buffer[64]; + + if (size < (1ULL << 10)) { + snprintf(buffer, sizeof(buffer), "%" PRIu64 " byte(s)", size); + } else if (size < (1ULL << 20)) { + snprintf(buffer, sizeof(buffer), "%" PRIu64 " KB(s)", size >> 10); + } else if (size < (1ULL << 30)) { + snprintf(buffer, sizeof(buffer), "%" PRIu64 " MB(s)", size >> 20); + } else if (size < (1ULL << 40)) { + snprintf(buffer, sizeof(buffer), "%" PRIu64 " GB(s)", size >> 30); + } else { + snprintf(buffer, sizeof(buffer), "%" PRIu64 " TB(s)", size >> 40); + } + + return buffer; +} diff --git a/qemu-common.h b/qemu-common.h index dfd3dc0..754b107 100644 --- a/qemu-common.h +++ b/qemu-common.h @@ -137,6 +137,8 @@ time_t mktimegm(struct tm *tm); int qemu_fls(int i); int qemu_fdatasync(int fd); int fcntl_setfl(int fd, int flag); +int get_bits_from_size(size_t size); +const char *bytes_to_str(uint64_t size); /* path.c */ void init_paths(const char *prefix); @@ -283,6 +285,7 @@ void qemu_iovec_destroy(QEMUIOVector *qiov); void qemu_iovec_reset(QEMUIOVector *qiov); void 
qemu_iovec_to_buffer(QEMUIOVector *qiov, void *buf); void qemu_iovec_from_buffer(QEMUIOVector *qiov, const void *buf, size_t count); +void qemu_iovec_zero(QEMUIOVector *qiov); struct Monitor; typedef struct Monitor Monitor;
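For reference, a small standalone computation (not part of the patch) that mirrors qed_max_image_size() above and shows where the "Each L2 holds 2GB" comment in block/qed-l2-cache.c comes from; with the default geometry the format tops out at 64 TB:

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint32_t cluster_size = 64 * 1024;   /* QED_DEFAULT_CLUSTER_SIZE */
    uint32_t table_size   = 4;           /* QED_DEFAULT_TABLE_SIZE, in clusters */

    /* same arithmetic as qed_max_image_size() */
    uint64_t table_entries = (uint64_t)table_size * cluster_size / sizeof(uint64_t);
    uint64_t l2_coverage   = table_entries * cluster_size;   /* bytes mapped by one L2 table */
    uint64_t max_image     = l2_coverage * table_entries;    /* bytes mapped via the L1 table */

    printf("entries per table:  %" PRIu64 "\n", table_entries);           /* 32768 */
    printf("one L2 table maps:  %" PRIu64 " bytes (2 GB)\n", l2_coverage);
    printf("maximum image size: %" PRIu64 " bytes (64 TB)\n", max_image);
    return 0;
}

This is also why MAX_L2_CACHE_SIZE is 50: fifty cached L2 tables, each covering 2 GB, are enough to fully cache the metadata of a 100 GB disk.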