Message ID: 1251730129-18693-1-git-send-email-kwolf@redhat.com
State: Superseded
On 08/31/2009 05:48 PM, Kevin Wolf wrote:
> When two AIO requests write to the same cluster, and this cluster is
> unallocated, currently both requests allocate a new cluster and the second
> one merges the first one when it is completed. This means a cluster
> allocation, a read and a cluster deallocation, which causes some overhead.
> If we simply let the second request wait until the first one is done, we
> improve overall performance with AIO requests (specifically, qcow2/virtio
> combinations).
>
> This patch maintains a list of in-flight requests that have allocated new
> clusters. A second request touching the same cluster is limited so that it
> either doesn't touch the allocation of the first request (so it can have a
> non-overlapping allocation) or it waits for the first request to complete.

Can't this cause an even/odd pattern where all even-numbered requests run
first, then all the odd-numbered requests?

(0 goes to disk, 1 depends on it, 2 depends on 1, which isn't allocating
now, so it goes to disk, 3 depends on 2, ...)

Do you have performance numbers?
Avi Kivity schrieb:
> On 08/31/2009 05:48 PM, Kevin Wolf wrote:
>> When two AIO requests write to the same cluster, and this cluster is
>> unallocated, currently both requests allocate a new cluster and the second
>> one merges the first one when it is completed. This means a cluster
>> allocation, a read and a cluster deallocation, which causes some overhead.
>> If we simply let the second request wait until the first one is done, we
>> improve overall performance with AIO requests (specifically, qcow2/virtio
>> combinations).
>>
>> This patch maintains a list of in-flight requests that have allocated new
>> clusters. A second request touching the same cluster is limited so that it
>> either doesn't touch the allocation of the first request (so it can have a
>> non-overlapping allocation) or it waits for the first request to complete.
>
> Can't this cause an even/odd pattern where all even-numbered requests
> run first, then all the odd-numbered requests?
>
> (0 goes to disk, 1 depends on it, 2 depends on 1, which isn't allocating
> now, so it goes to disk, 3 depends on 2, ...)

I guess it can happen in theory, not sure if it matters in practice. You
are worried about image fragmentation? I think we could do something about
it with a cleverer cluster allocation.

However, I don't think it's an argument against this patch. What currently
happens isn't much better: allocate n clusters, free n-1. Almost as good
at producing fragmentation.

> Do you have performance numbers?

No really detailed numbers. Installation time for RHEL on qcow2/virtio
went down from 34 min to 19 min, though.

Kevin
On 09/01/2009 01:50 PM, Kevin Wolf wrote:
>> Can't this cause an even/odd pattern where all even-numbered requests
>> run first, then all the odd-numbered requests?
>>
>> (0 goes to disk, 1 depends on it, 2 depends on 1, which isn't allocating
>> now, so it goes to disk, 3 depends on 2, ...)
>
> I guess it can happen in theory, not sure if it matters in practice.

We should check then.

> You are worried about image fragmentation? I think we could do something
> about it with a cleverer cluster allocation.

Not only image fragmentation - the odd requests will require RMW.

> However, I don't think it's an argument against this patch. What
> currently happens isn't much better: Allocate n clusters, free n-1.
> Almost as good in producing fragmentation.

The patch introduces complexity so it makes working towards a
non-fragmenting solution harder. I'm not saying it could be simplified,
it's a side effect of using a state machine design.

>> Do you have performance numbers?
>
> No really detailed numbers. Installation time for RHEL on qcow2/virtio
> went down from 34 min to 19 min, though.

That's very impressive. cache=none or cache=wt?
On Tue, Sep 01, 2009 at 02:24:46PM +0300, Avi Kivity wrote:
> On 09/01/2009 01:50 PM, Kevin Wolf wrote:
> >> Can't this cause an even/odd pattern where all even-numbered requests
> >> run first, then all the odd-numbered requests?
> >>
> >> (0 goes to disk, 1 depends on it, 2 depends on 1, which isn't allocating
> >> now, so it goes to disk, 3 depends on 2, ...)
> >
> > I guess it can happen in theory, not sure if it matters in practice.
>
> We should check then.
>
> > You are worried about image fragmentation? I think we could do something
> > about it with a cleverer cluster allocation.
>
> Not only image fragmentation - the odd requests will require RMW.
>
> > However, I don't think it's an argument against this patch. What
> > currently happens isn't much better: Allocate n clusters, free n-1.
> > Almost as good in producing fragmentation.
>
> The patch introduces complexity so it makes working towards a
> non-fragmenting solution harder. I'm not saying it could be simplified,
> it's a side effect of using a state machine design.
>
> >> Do you have performance numbers?
> >
> > No really detailed numbers. Installation time for RHEL on qcow2/virtio
> > went down from 34 min to 19 min, though.
>
> That's very impressive. cache=none or cache=wt?

And how does it compare with raw?

--
	Gleb.
Avi Kivity schrieb:
> On 09/01/2009 01:50 PM, Kevin Wolf wrote:
>> You are worried about image fragmentation? I think we could do something
>> about it with a cleverer cluster allocation.
>
> Not only image fragmentation - the odd requests will require RMW.

How so?

The case you're thinking of is that the first and third request are
already completed and then the second one starts, right? Assume that
request 2 involves some sectors in the last cluster of 1 and the first
one of 3.

Then request 2 is written in three phases: The first one overwrites the
last sectors of request 1 (cluster already allocated => no RMW). The
second one writes to unallocated, cluster-aligned space (writing complete
clusters => no RMW). The third one overwrites the first sectors of
request 3 (cluster already allocated => no RMW).

>>> Do you have performance numbers?
>>
>> No really detailed numbers. Installation time for RHEL on qcow2/virtio
>> went down from 34 min to 19 min, though.
>
> That's very impressive. cache=none or cache=wt?

This was with cache=none, as are most of my ad-hoc tests (cache=none being
the default for RHEL).

Kevin
On 09/01/2009 02:43 PM, Kevin Wolf wrote:
> Avi Kivity schrieb:
>> On 09/01/2009 01:50 PM, Kevin Wolf wrote:
>>> You are worried about image fragmentation? I think we could do something
>>> about it with a cleverer cluster allocation.
>>
>> Not only image fragmentation - the odd requests will require RMW.
>
> How so?
>
> The case you're thinking of is that the first and third request are
> already completed and then the second one starts, right? Assuming that
> request 2 involves some sectors in the last cluster of 1 and the first
> one of 3.
>
> Then request 2 is written in three phases: The first one overwrites the
> last sectors of request 1 (cluster already allocated => no RMW). The
> second one writes to unallocated, cluster-aligned space (writing complete
> clusters => no RMW). The third one overwrites the first sectors of
> request 3 (cluster already allocated => no RMW).

Ah, ok. No RMW, but a three-way split as well as fragmentation.
Gleb Natapov schrieb:
>>> No really detailed numbers. Installation time for RHEL on qcow2/virtio
>>> went down from 34 min to 19 min, though.
>>
>> That's very impressive. cache=none or cache=wt?
>
> And how it compares with raw?

I just ran another installation on raw and it took about 16:30 min.

And with the next patch series that I'm going to send today we'll be in
the same region with qcow2.

Kevin
On 09/01/2009 03:29 PM, Kevin Wolf wrote:
> Gleb Natapov schrieb:
>>>> No really detailed numbers. Installation time for RHEL on qcow2/virtio
>>>> went down from 34 min to 19 min, though.
>>>
>>> That's very impressive. cache=none or cache=wt?
>>
>> And how it compares with raw?
>
> I just ran another installation on raw and it took about 16:30 min.
>
> And with the next patch series that I'm going to send today we'll be in
> the same region with qcow2.

Looks pretty good then. Was that 'raw' as in 'disk partition', or as
'raw file on ext3'?
Avi Kivity schrieb:
> On 09/01/2009 03:29 PM, Kevin Wolf wrote:
>> Gleb Natapov schrieb:
>>>>> No really detailed numbers. Installation time for RHEL on qcow2/virtio
>>>>> went down from 34 min to 19 min, though.
>>>>
>>>> That's very impressive. cache=none or cache=wt?
>>>
>>> And how it compares with raw?
>>
>> I just ran another installation on raw and it took about 16:30 min.
>>
>> And with the next patch series that I'm going to send today we'll be in
>> the same region with qcow2.
>
> Looks pretty good then. Was that 'raw' as in 'disk partition', or as
> 'raw file on ext3'?

raw file on ext3. As I said, this is just manual ad-hoc testing on my
laptop, so you can certainly do better benchmarks than these. And I think
I'll do some later this week.

Kevin
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 057dac5..d4631c3 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -684,6 +684,7 @@ uint64_t qcow2_alloc_cluster_offset(BlockDriverState *bs,
     int l2_index, ret;
     uint64_t l2_offset, *l2_table, cluster_offset;
     int nb_clusters, i = 0;
+    QCowL2Meta *old_alloc;
 
     ret = get_cluster_table(bs, offset, &l2_table, &l2_offset, &l2_index);
     if (ret == 0)
@@ -732,6 +733,44 @@ uint64_t qcow2_alloc_cluster_offset(BlockDriverState *bs,
     }
     nb_clusters = i;
 
+    /*
+     * Check if there already is an AIO write request in flight which allocates
+     * the same cluster. In this case we need to wait until the previous
+     * request has completed and updated the L2 table accordingly.
+     */
+    LIST_FOREACH(old_alloc, &s->cluster_allocs, next_in_flight) {
+
+        uint64_t end_offset = offset + nb_clusters * s->cluster_size;
+        uint64_t old_offset = old_alloc->offset;
+        uint64_t old_end_offset = old_alloc->offset +
+            old_alloc->nb_clusters * s->cluster_size;
+
+        if (end_offset < old_offset || offset > old_end_offset) {
+            /* No intersection */
+        } else {
+            if (offset < old_offset) {
+                /* Stop at the start of a running allocation */
+                nb_clusters = (old_offset - offset) >> s->cluster_bits;
+            } else {
+                nb_clusters = 0;
+            }
+
+            if (nb_clusters == 0) {
+                /* Set dependency and wait for a callback */
+                m->depends_on = old_alloc;
+                m->nb_clusters = 0;
+                *num = 0;
+                return 0;
+            }
+        }
+    }
+
+    if (!nb_clusters) {
+        abort();
+    }
+
+    LIST_INSERT_HEAD(&s->cluster_allocs, m, next_in_flight);
+
     /* allocate a new cluster */
     cluster_offset = qcow2_alloc_clusters(bs, nb_clusters * s->cluster_size);
diff --git a/block/qcow2.c b/block/qcow2.c
index b8eae90..8579e01 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -219,6 +219,8 @@ static int qcow_open(BlockDriverState *bs, const char *filename, int flags)
     if (qcow2_refcount_init(bs) < 0)
         goto fail;
 
+    LIST_INIT(&s->cluster_allocs);
+
     /* read qcow2 extensions */
     if (header.backing_file_offset)
         ext_end = header.backing_file_offset;
@@ -338,6 +340,7 @@ typedef struct QCowAIOCB {
     QEMUIOVector hd_qiov;
     QEMUBH *bh;
     QCowL2Meta l2meta;
+    LIST_ENTRY(QCowAIOCB) next_depend;
 } QCowAIOCB;
 
 static void qcow_aio_cancel(BlockDriverAIOCB *blockacb)
@@ -500,6 +503,7 @@ static QCowAIOCB *qcow_aio_setup(BlockDriverState *bs,
     acb->n = 0;
     acb->cluster_offset = 0;
     acb->l2meta.nb_clusters = 0;
+    LIST_INIT(&acb->l2meta.dependent_requests);
     return acb;
 }
 
@@ -517,6 +521,33 @@ static BlockDriverAIOCB *qcow_aio_readv(BlockDriverState *bs,
     return &acb->common;
 }
 
+static void qcow_aio_write_cb(void *opaque, int ret);
+
+static void run_dependent_requests(QCowL2Meta *m)
+{
+    QCowAIOCB *req;
+    QCowAIOCB *next;
+
+    /* Take the request off the list of running requests */
+    if (m->nb_clusters != 0) {
+        LIST_REMOVE(m, next_in_flight);
+    }
+
+    /*
+     * Restart all dependent requests.
+     * Can't use LIST_FOREACH here - the next link might not be the same
+     * any more after the callback (request could depend on a different
+     * request now)
+     */
+    for (req = m->dependent_requests.lh_first; req != NULL; req = next) {
+        next = req->next_depend.le_next;
+        qcow_aio_write_cb(req, 0);
+    }
+
+    /* Empty the list for the next part of the request */
+    LIST_INIT(&m->dependent_requests);
+}
+
 static void qcow_aio_write_cb(void *opaque, int ret)
 {
     QCowAIOCB *acb = opaque;
@@ -528,14 +559,15 @@ static void qcow_aio_write_cb(void *opaque, int ret)
 
     acb->hd_aiocb = NULL;
 
+    if (ret >= 0) {
+        ret = qcow2_alloc_cluster_link_l2(bs, acb->cluster_offset, &acb->l2meta);
+    }
+
+    run_dependent_requests(&acb->l2meta);
+
     if (ret < 0)
         goto done;
 
-    if (qcow2_alloc_cluster_link_l2(bs, acb->cluster_offset, &acb->l2meta) < 0) {
-        qcow2_free_any_clusters(bs, acb->cluster_offset, acb->l2meta.nb_clusters);
-        goto done;
-    }
-
     acb->nb_sectors -= acb->n;
     acb->sector_num += acb->n;
     acb->buf += acb->n * 512;
@@ -555,6 +587,14 @@ static void qcow_aio_write_cb(void *opaque, int ret)
     acb->cluster_offset = qcow2_alloc_cluster_offset(bs, acb->sector_num << 9,
                                           index_in_cluster,
                                           n_end, &acb->n, &acb->l2meta);
+
+    /* Need to wait for another request? If so, we are done for now. */
+    if (!acb->cluster_offset && acb->l2meta.depends_on != NULL) {
+        LIST_INSERT_HEAD(&acb->l2meta.depends_on->dependent_requests,
+            acb, next_depend);
+        return;
+    }
+
     if (!acb->cluster_offset || (acb->cluster_offset & 511) != 0) {
         ret = -EIO;
         goto done;
@@ -650,6 +690,7 @@ static int preallocate(BlockDriverState *bs)
 
     nb_sectors = bdrv_getlength(bs) >> 9;
     offset = 0;
+    LIST_INIT(&meta.dependent_requests);
 
     while (nb_sectors) {
         num = MIN(nb_sectors, INT_MAX >> 9);
@@ -665,6 +706,10 @@ static int preallocate(BlockDriverState *bs)
             return -1;
         }
 
+        /* There are no dependent requests, but we need to remove our request
+         * from the list of in-flight requests */
+        run_dependent_requests(&meta);
+
         /* TODO Preallocate data if requested */
 
         nb_sectors -= num;
diff --git a/block/qcow2.h b/block/qcow2.h
index c99b374..965a2f4 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -98,6 +98,7 @@ typedef struct BDRVQcowState {
     uint8_t *cluster_cache;
     uint8_t *cluster_data;
     uint64_t cluster_cache_offset;
+    LIST_HEAD(QCowClusterAlloc, QCowL2Meta) cluster_allocs;
 
     uint64_t *refcount_table;
     uint64_t refcount_table_offset;
@@ -128,6 +129,8 @@ typedef struct QCowCreateState {
     int64_t refcount_block_offset;
 } QCowCreateState;
 
+struct QCowAIOCB;
+
 /* XXX This could be private for qcow2-cluster.c */
 typedef struct QCowL2Meta
 {
@@ -135,6 +138,10 @@ typedef struct QCowL2Meta
     int n_start;
     int nb_available;
     int nb_clusters;
+    struct QCowL2Meta *depends_on;
+    LIST_HEAD(QCowAioDependencies, QCowAIOCB) dependent_requests;
+
+    LIST_ENTRY(QCowL2Meta) next_in_flight;
 } QCowL2Meta;
 
 static inline int size_to_clusters(BDRVQcowState *s, int64_t size)
When two AIO requests write to the same cluster, and this cluster is
unallocated, currently both requests allocate a new cluster and the second
one merges the first one when it is completed. This means a cluster
allocation, a read and a cluster deallocation, which causes some overhead.
If we simply let the second request wait until the first one is done, we
improve overall performance with AIO requests (specifically, qcow2/virtio
combinations).

This patch maintains a list of in-flight requests that have allocated new
clusters. A second request touching the same cluster is limited so that it
either doesn't touch the allocation of the first request (so it can have a
non-overlapping allocation) or it waits for the first request to complete.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2-cluster.c |   39 ++++++++++++++++++++++++++++++++++
 block/qcow2.c         |   55 ++++++++++++++++++++++++++++++++++++++++++++----
 block/qcow2.h         |    7 ++++++
 3 files changed, 96 insertions(+), 5 deletions(-)