diff mbox series

[v3,4/6] util: implement seqcache

Message ID 20210305173507.393137-5-vsementsov@virtuozzo.com
State New
Headers show
Series qcow2: compressed write cache | expand

Commit Message

Vladimir Sementsov-Ogievskiy March 5, 2021, 5:35 p.m. UTC
Implement cache for small sequential unaligned writes, so that they may
be cached until we get a complete cluster and then write it.

The cache is intended to be used for backup to qcow2 compressed target
opened in O_DIRECT mode, but can be reused for any similar (even not
block-layer related) task.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 include/qemu/seqcache.h |  42 +++++
 util/seqcache.c         | 361 ++++++++++++++++++++++++++++++++++++++++
 MAINTAINERS             |   6 +
 util/meson.build        |   1 +
 4 files changed, 410 insertions(+)
 create mode 100644 include/qemu/seqcache.h
 create mode 100644 util/seqcache.c

Comments

Max Reitz March 12, 2021, 1:41 p.m. UTC | #1
On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
> Implement cache for small sequential unaligned writes, so that they may
> be cached until we get a complete cluster and then write it.
> 
> The cache is intended to be used for backup to qcow2 compressed target
> opened in O_DIRECT mode, but can be reused for any similar (even not
> block-layer related) task.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   include/qemu/seqcache.h |  42 +++++
>   util/seqcache.c         | 361 ++++++++++++++++++++++++++++++++++++++++
>   MAINTAINERS             |   6 +
>   util/meson.build        |   1 +
>   4 files changed, 410 insertions(+)
>   create mode 100644 include/qemu/seqcache.h
>   create mode 100644 util/seqcache.c

Looks quite good to me, thanks.  Nice explanations, too. :)

The only design question I have is whether there’s a reason you’re using 
a list again instead of a hash table.  I suppose we do need the list 
anyway because of the next_flush iterator, so using a hash table would 
only complicate the implementation, but still.

[...]

> diff --git a/util/seqcache.c b/util/seqcache.c
> new file mode 100644
> index 0000000000..d923560eab
> --- /dev/null
> +++ b/util/seqcache.c
> @@ -0,0 +1,361 @@
> +/*
> + * Cache for small sequential write requests.
> + *
> + * Copyright (c) 2021 Virtuozzo International GmbH.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + *
> + *
> + * = Description =
> + *
> + * SeqCache is an abbreviation for Sequential Cache.
> + *
> + * The Cache is intended to improve performance of small unaligned sequential
> + * writes. Cache has a cluster_size parameter and the unit of caching is aligned
> + * cluster.  Cache has a list of cached clusters, several "finished" ones and at
> + * most one "unfinished". "unfinished" cluster is a cluster where last byte of
> + * last write operation is cached and there is a free place after that byte to
> + * the end of cluster. "finished" clusters are just stored in cache to be read
> + * or flushed and don't allow any writes to them.
> + *
> + * If write to the cache intersects cluster bounds, it's splat into several

*split

(Though I like “splat”.  Sounds like a wet blotch.)

> + * requests by cluster bounds. So, consider a write that doesn't intersect
> + * cluster bounds to describe the whole picture:
> + *
> + * There are two cases allowed:
> + *
> + * 1. Sequential write to "unfinished" cluster. Actually it's a write sequential
> + *    previous write. The data goes to "unfinished" cluster. If "unfinished" is
> + *    filled up to the cluster bound it becomes "finished".
> + *
> + * 2. Write to new cluster, not existing in the cache. It may be sequential to
> + *    previous write or not. Current "unfinshed" cluster (if exists) becomes

*unfinished

> + *    "finished" and new "unfinished" cluster is started. Note also that offset
> + *    of the write to new cluster is not required to be aligned.
> + *
> + *    Any other write operation (non-sequential write to "unfinished" cluster
> + *    or write to any of "finished" clusters) will crash.
> + */
> +
> +#include "qemu/osdep.h"
> +
> +#include "qemu/queue.h"
> +#include "qemu/seqcache.h"
> +
> +/*
> + * Cluster
> + *
> + * Representation of one cached cluster, aligned to SeqCache::cluster_size.
> + * Caches only one subregion of the cluster, started at @offset (may be
> + * unaligned to cluster_size) and of @bytes length (may be unaligned as well).
> + * The whole subregion always lay in one aligned cluster:
> + *
> + *      QEMU_ALIGN_DOWN(offset, cluster_size) ==
> + *          QEMU_ALIGN_DOWN(offset + bytes - 1, cluster_size)
> + *
> + * @buf is allocated to be able to fill the cluster up to the cluster end, i.e.
> + * allocated @buf length is at least:
> + *
> + *      cluster_size - offset % cluster_size
> + */
> +typedef struct Cluster {
> +   int64_t offset;
> +   int64_t bytes;
> +   uint8_t *buf;
> +   QSIMPLEQ_ENTRY(Cluster) entry;
> +} Cluster;
> +
> +/*
> + * SeqCache caches small sequential writes into "unfinished" @cur_write cluster,
> + * until entire cluster (of @cluster_size bytes) is filled by seqcache_write()
> + * calls.
> + *
> + * @cur_write->offset may be unaligned to @cluster_size if first write offset is
> + * not aligned (for example, if there was a flush request and cache was flushed,
> + * then we continue from the middle of the cluster with an empty cache).
> + *
> + * @cur_write may be NULL, which means we don't have current cluster and next
> + * seqcache_write() will start a new one.
> + *
> + * @all is a list of all clusters cached in the cache, some "finished" and one
> + * "unfinished" @cur_write (if exists). If @cur_write is not NULL it is a last
> + * one in the list.
> + *
> + * @nb_clusters is number of elements in @all list.
> + *
> + * @next_flush is an iterator for flushing. SeqCache knows nothing about how
> + * data should be flushing, so for flush user calls seqcache_get_next_flush()

s/flushing/flushed/

> + * (which moves @next_flush iterator) and when data is flushed somehow and cache
> + * is not needed anymore, user can call seqcache_discard_cluster().

AFAIU, next_flush always points to the first finished cluster that has 
not yet been returned by seqcache_get_next_flush(), is that correct? 
(Yes, at least the latter part is implied by this comment, I’m just 
asking for clarity, because I want to be sure the simple

   s->next_flush = QSIMPLEQ_NEXT(s->next_flush, entry);

in seqcache_get_next_flush() does what I think it does, which is never 
to let s->next_flush be NULL even though there are still flushable 
clusters somewhere.)

> + */
> +typedef struct SeqCache {
> +    int64_t cluster_size;
> +    int64_t nb_clusters;
> +    Cluster *cur_write;
> +    Cluster *next_flush;
> +    QSIMPLEQ_HEAD(, Cluster) all;
> +} SeqCache;

[...]

> +/* Align down @offset to s->cluster_size and search for corresponding cluster */
> +static Cluster *seqcache_find_cluster(SeqCache *s, int64_t offset)
> +{
> +    Cluster *cl;
> +    int64_t cl_start = cluster_start(s, offset);
> +
> +    QSIMPLEQ_FOREACH(cl, &s->all, entry) {
> +        if (cluster_start(s, cl->offset) == cl_start) {
> +            return cl;
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +/* Makes current "unfinished" cluster the "finished" one. */

This sounds a bit like there’d be only a single finished cluster, so I’d 
rather write it as “Mark the current "unfinished" cluster as "finished".”

> +static void seqcache_finalize_current_cluster(SeqCache *s)
> +{
> +    if (s->cur_write && !s->next_flush) {
> +        s->next_flush = s->cur_write;
> +    }
> +    s->cur_write = NULL;
> +}
> +
> +/*
> + * Write an @offset, @bytes, @buf request into the cache. The requests MUST not

s/requests/request/

> + * intersect cluster bounds.
> + */
> +static void seqcache_write_one(SeqCache *s, int64_t offset, int64_t bytes,
> +                               uint8_t *buf)

Could use a const, though not a must.

> +{
> +    assert(bytes > 0);
> +    assert(cluster_start(s, offset) == cluster_start(s, offset + bytes - 1));
> +
> +    if (s->cur_write && offset == cached_end(s->cur_write)) {
> +        /* Continue sequential process */
> +        memcpy(s->cur_write->buf + s->cur_write->bytes, buf, bytes);
> +        s->cur_write->bytes += bytes;
> +
> +        if (cached_end(s->cur_write) == cluster_end(s, s->cur_write->offset)) {
> +            seqcache_finalize_current_cluster(s);
> +        }
> +
> +        return;
> +    }
> +
> +    /* We are starting a new cluster. Check that it's really new */
> +    assert(!seqcache_find_cluster(s, offset));
> +
> +    seqcache_finalize_current_cluster(s);
> +
> +    s->cur_write = g_new(Cluster, 1);
> +    *s->cur_write = (Cluster) {
> +        .offset = offset,
> +        .bytes = bytes,
> +        .buf = g_malloc(s->cluster_size),

I have to ask: Why not s->cluster_size - offset % s->cluster_size as 
advertised by the comment describing Cluster?

More practical question: Should we use qemu_memalign() (aligning either 
at the cluster size or at the block alignment, which would need to be 
passed to seqcache_new()) when offset is aligned to the cluster size? 
(Or with a custom alignment, if it is aligned to that.)

I feel that for O_DIRECT images it might be kind of important to align 
the buffer to the host block size.

> +    };
> +
> +    memcpy(s->cur_write->buf, buf, bytes);
> +    QSIMPLEQ_INSERT_TAIL(&s->all, s->cur_write, entry);
> +    s->nb_clusters++;
> +}
> +
> +/* Write an @offset, @bytes, @buf request into the cache. */
> +void seqcache_write(SeqCache *s, int64_t offset, int64_t bytes, uint8_t *buf)

“const” might again find its place here.

> +{
> +    while (bytes) {
> +        int64_t cl_end = cluster_end(s, offset);
> +        int64_t chunk = MIN(bytes, cl_end - offset);
> +
> +        seqcache_write_one(s, offset, chunk, buf);
> +        offset += chunk;
> +        bytes -= chunk;
> +        buf += chunk;
> +    }
> +}

[...]

> +/*
> + * Get next region for flushing. @offset, @bytes and @buf are out-parameters
> + * to return the region.
> + *
> + * @unfinished is in-out argument which means that user is interested in
> + * flushing unfinished cluster too:
> + *
> + * If there are "finished" clusters, "finished" cluster is returned and
> + * *@unfinished is set to false, independently of its original value.
> + *
> + * If there are no "finished" clusters but "unfinished" exists (i.e.
> + * s->cur_write != NULL and it is the only element of s->all), then *@unfinished
> + * value remains the same and the following logic works:
> + *
> + *    If *@unfinished:
> + *       return s->cur_write unfinished cluster for flushing
> + *    Else
> + *       return nothing
> + *
> + *
> + * Returns true and set @offset, @bytes, @buf and @unfinished if there is
> + * something to flush (accordingly to @unfinished value), returns false
> + * otherwise.
> + *
> + * Nothing is removed from the cache.

Out of curiosity, mainly, is that because the returned *buf is only 
valid as long as the entry is still in the cache, or is there a 
different reason that I’m missing?

(Hm, perhaps the fact that the user may want to keep it available for 
reading through seqcache_read()?)

> + */
> +bool seqcache_get_next_flush(SeqCache *s, int64_t *offset, int64_t *bytes,
> +                             uint8_t **buf, bool *unfinished)

Could be “uint8_t *const *buf”, I suppose.  Don’t know how much the 
callers would hate that, though.

> +{
> +    Cluster *req = s->next_flush;
> +
> +    if (s->next_flush) {
> +        *unfinished = false;
> +        req = s->next_flush;
> +        s->next_flush = QSIMPLEQ_NEXT(req, entry);
> +        if (s->next_flush == s->cur_write) {
> +            s->next_flush = NULL;
> +        }
> +    } else if (s->cur_write && *unfinished) {
> +        req = s->cur_write;

I was wondering whether flushing an unfinished cluster wouldn’t kind of 
finalize it, but I suppose the problem with that would be that you can’t 
add data to a finished cluster, which wouldn’t be that great if you’re 
just flushing the cache without wanting to drop it all.

(The problem I see is that flushing it later will mean all the data that 
already has been written here will have to be rewritten.  Not that bad, 
I suppose.)

> +    } else {
> +        return false;
> +    }
> +
> +    *offset = req->offset;
> +    *bytes = req->bytes;
> +    *buf = req->buf;
> +
> +    return true;
> +}
> +
> +/*
> + * Find corresponding cluster and drop it. No matter does requested @offset is
> + * cached itself or not.

The second sentence sounds strange grammatically, if I understand 
correctly, I’d change this to something like “Find the cluster 
corresponding to @offset and drop it.  It does not matter whether 
@offset itself is actually within that cluster’s cached range or not.”

Max

> + */
Vladimir Sementsov-Ogievskiy March 12, 2021, 2:37 p.m. UTC | #2
12.03.2021 16:41, Max Reitz wrote:
> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>> Implement cache for small sequential unaligned writes, so that they may
>> be cached until we get a complete cluster and then write it.
>>
>> The cache is intended to be used for backup to qcow2 compressed target
>> opened in O_DIRECT mode, but can be reused for any similar (even not
>> block-layer related) task.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   include/qemu/seqcache.h |  42 +++++
>>   util/seqcache.c         | 361 ++++++++++++++++++++++++++++++++++++++++
>>   MAINTAINERS             |   6 +
>>   util/meson.build        |   1 +
>>   4 files changed, 410 insertions(+)
>>   create mode 100644 include/qemu/seqcache.h
>>   create mode 100644 util/seqcache.c
> 
> Looks quite good to me, thanks.  Nice explanations, too. :)
> 
> The only design question I have is whether there’s a reason you’re using a list again instead of a hash table.  I suppose we do need the list anyway because of the next_flush iterator, so using a hash table would only complicate the implementation, but still.

Yes, it seems correct for flush iterator go in same order as writes comes, so we need a list. We can add a hash table, it will only help on read.. But for compressed cache in qcow2 we try to flush often enough, so there should not be many clusters in the cache. So I think addition of hash table may be done later if needed.

> 
> [...]
> 
>> diff --git a/util/seqcache.c b/util/seqcache.c
>> new file mode 100644
>> index 0000000000..d923560eab
>> --- /dev/null
>> +++ b/util/seqcache.c
>> @@ -0,0 +1,361 @@
>> +/*
>> + * Cache for small sequential write requests.
>> + *
>> + * Copyright (c) 2021 Virtuozzo International GmbH.
>> + *
>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
>> + * of this software and associated documentation files (the "Software"), to deal
>> + * in the Software without restriction, including without limitation the rights
>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
>> + * copies of the Software, and to permit persons to whom the Software is
>> + * furnished to do so, subject to the following conditions:
>> + *
>> + * The above copyright notice and this permission notice shall be included in
>> + * all copies or substantial portions of the Software.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>> + * THE SOFTWARE.
>> + *
>> + *
>> + * = Description =
>> + *
>> + * SeqCache is an abbreviation for Sequential Cache.
>> + *
>> + * The Cache is intended to improve performance of small unaligned sequential
>> + * writes. Cache has a cluster_size parameter and the unit of caching is aligned
>> + * cluster.  Cache has a list of cached clusters, several "finished" ones and at
>> + * most one "unfinished". "unfinished" cluster is a cluster where last byte of
>> + * last write operation is cached and there is a free place after that byte to
>> + * the end of cluster. "finished" clusters are just stored in cache to be read
>> + * or flushed and don't allow any writes to them.
>> + *
>> + * If write to the cache intersects cluster bounds, it's splat into several
> 
> *split
> 
> (Though I like “splat”.  Sounds like a wet blotch.)
> 
>> + * requests by cluster bounds. So, consider a write that doesn't intersect
>> + * cluster bounds to describe the whole picture:
>> + *
>> + * There are two cases allowed:
>> + *
>> + * 1. Sequential write to "unfinished" cluster. Actually it's a write sequential
>> + *    previous write. The data goes to "unfinished" cluster. If "unfinished" is
>> + *    filled up to the cluster bound it becomes "finished".
>> + *
>> + * 2. Write to new cluster, not existing in the cache. It may be sequential to
>> + *    previous write or not. Current "unfinshed" cluster (if exists) becomes
> 
> *unfinished
> 
>> + *    "finished" and new "unfinished" cluster is started. Note also that offset
>> + *    of the write to new cluster is not required to be aligned.
>> + *
>> + *    Any other write operation (non-sequential write to "unfinished" cluster
>> + *    or write to any of "finished" clusters) will crash.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +
>> +#include "qemu/queue.h"
>> +#include "qemu/seqcache.h"
>> +
>> +/*
>> + * Cluster
>> + *
>> + * Representation of one cached cluster, aligned to SeqCache::cluster_size.
>> + * Caches only one subregion of the cluster, started at @offset (may be
>> + * unaligned to cluster_size) and of @bytes length (may be unaligned as well).
>> + * The whole subregion always lay in one aligned cluster:
>> + *
>> + *      QEMU_ALIGN_DOWN(offset, cluster_size) ==
>> + *          QEMU_ALIGN_DOWN(offset + bytes - 1, cluster_size)
>> + *
>> + * @buf is allocated to be able to fill the cluster up to the cluster end, i.e.
>> + * allocated @buf length is at least:
>> + *
>> + *      cluster_size - offset % cluster_size
>> + */
>> +typedef struct Cluster {
>> +   int64_t offset;
>> +   int64_t bytes;
>> +   uint8_t *buf;
>> +   QSIMPLEQ_ENTRY(Cluster) entry;
>> +} Cluster;
>> +
>> +/*
>> + * SeqCache caches small sequential writes into "unfinished" @cur_write cluster,
>> + * until entire cluster (of @cluster_size bytes) is filled by seqcache_write()
>> + * calls.
>> + *
>> + * @cur_write->offset may be unaligned to @cluster_size if first write offset is
>> + * not aligned (for example, if there was a flush request and cache was flushed,
>> + * then we continue from the middle of the cluster with an empty cache).
>> + *
>> + * @cur_write may be NULL, which means we don't have current cluster and next
>> + * seqcache_write() will start a new one.
>> + *
>> + * @all is a list of all clusters cached in the cache, some "finished" and one
>> + * "unfinished" @cur_write (if exists). If @cur_write is not NULL it is a last
>> + * one in the list.
>> + *
>> + * @nb_clusters is number of elements in @all list.
>> + *
>> + * @next_flush is an iterator for flushing. SeqCache knows nothing about how
>> + * data should be flushing, so for flush user calls seqcache_get_next_flush()
> 
> s/flushing/flushed/
> 
>> + * (which moves @next_flush iterator) and when data is flushed somehow and cache
>> + * is not needed anymore, user can call seqcache_discard_cluster().
> 
> AFAIU, next_flush always points to the first finished cluster that has not yet been returned by seqcache_get_next_flush(), is that correct?

Yes, right.

  (Yes, at least the latter part is implied by this comment, I’m just asking for clarity, because I want to be sure the simple
> 
>    s->next_flush = QSIMPLEQ_NEXT(s->next_flush, entry);
> 
> in seqcache_get_next_flush() does what I think it does, which is never to let s->next_flush be NULL even though there are still flushable clusters somewhere.)
> 
>> + */
>> +typedef struct SeqCache {
>> +    int64_t cluster_size;
>> +    int64_t nb_clusters;
>> +    Cluster *cur_write;
>> +    Cluster *next_flush;
>> +    QSIMPLEQ_HEAD(, Cluster) all;
>> +} SeqCache;
> 
> [...]
> 
>> +/* Align down @offset to s->cluster_size and search for corresponding cluster */
>> +static Cluster *seqcache_find_cluster(SeqCache *s, int64_t offset)
>> +{
>> +    Cluster *cl;
>> +    int64_t cl_start = cluster_start(s, offset);
>> +
>> +    QSIMPLEQ_FOREACH(cl, &s->all, entry) {
>> +        if (cluster_start(s, cl->offset) == cl_start) {
>> +            return cl;
>> +        }
>> +    }
>> +
>> +    return NULL;
>> +}
>> +
>> +/* Makes current "unfinished" cluster the "finished" one. */
> 
> This sounds a bit like there’d be only a single finished cluster, so I’d rather write it as “Mark the current "unfinished" cluster as "finished".”
> 
>> +static void seqcache_finalize_current_cluster(SeqCache *s)
>> +{
>> +    if (s->cur_write && !s->next_flush) {
>> +        s->next_flush = s->cur_write;
>> +    }
>> +    s->cur_write = NULL;
>> +}
>> +
>> +/*
>> + * Write an @offset, @bytes, @buf request into the cache. The requests MUST not
> 
> s/requests/request/
> 
>> + * intersect cluster bounds.
>> + */
>> +static void seqcache_write_one(SeqCache *s, int64_t offset, int64_t bytes,
>> +                               uint8_t *buf)
> 
> Could use a const, though not a must.
> 
>> +{
>> +    assert(bytes > 0);
>> +    assert(cluster_start(s, offset) == cluster_start(s, offset + bytes - 1));
>> +
>> +    if (s->cur_write && offset == cached_end(s->cur_write)) {
>> +        /* Continue sequential process */
>> +        memcpy(s->cur_write->buf + s->cur_write->bytes, buf, bytes);
>> +        s->cur_write->bytes += bytes;
>> +
>> +        if (cached_end(s->cur_write) == cluster_end(s, s->cur_write->offset)) {
>> +            seqcache_finalize_current_cluster(s);
>> +        }
>> +
>> +        return;
>> +    }
>> +
>> +    /* We are starting a new cluster. Check that it's really new */
>> +    assert(!seqcache_find_cluster(s, offset));
>> +
>> +    seqcache_finalize_current_cluster(s);
>> +
>> +    s->cur_write = g_new(Cluster, 1);
>> +    *s->cur_write = (Cluster) {
>> +        .offset = offset,
>> +        .bytes = bytes,
>> +        .buf = g_malloc(s->cluster_size),
> 
> I have to ask: Why not s->cluster_size - offset % s->cluster_size as advertised by the comment describing Cluster?

I just didn't care, it should be rare case. But for generic code better be precise. I'll update it.

> 
> More practical question: Should we use qemu_memalign() (aligning either at the cluster size or at the block alignment, which would need to be passed to seqcache_new()) when offset is aligned to the cluster size? (Or with a custom alignment, if it is aligned to that.)
> 
> I feel that for O_DIRECT images it might be kind of important to align the buffer to the host block size.

Will do

> 
>> +    };
>> +
>> +    memcpy(s->cur_write->buf, buf, bytes);
>> +    QSIMPLEQ_INSERT_TAIL(&s->all, s->cur_write, entry);
>> +    s->nb_clusters++;
>> +}
>> +
>> +/* Write an @offset, @bytes, @buf request into the cache. */
>> +void seqcache_write(SeqCache *s, int64_t offset, int64_t bytes, uint8_t *buf)
> 
> “const” might again find its place here.
> 
>> +{
>> +    while (bytes) {
>> +        int64_t cl_end = cluster_end(s, offset);
>> +        int64_t chunk = MIN(bytes, cl_end - offset);
>> +
>> +        seqcache_write_one(s, offset, chunk, buf);
>> +        offset += chunk;
>> +        bytes -= chunk;
>> +        buf += chunk;
>> +    }
>> +}
> 
> [...]
> 
>> +/*
>> + * Get next region for flushing. @offset, @bytes and @buf are out-parameters
>> + * to return the region.
>> + *
>> + * @unfinished is in-out argument which means that user is interested in
>> + * flushing unfinished cluster too:
>> + *
>> + * If there are "finished" clusters, "finished" cluster is returned and
>> + * *@unfinished is set to false, independently of its original value.
>> + *
>> + * If there are no "finished" clusters but "unfinished" exists (i.e.
>> + * s->cur_write != NULL and it is the only element of s->all), then *@unfinished
>> + * value remains the same and the following logic works:
>> + *
>> + *    If *@unfinished:
>> + *       return s->cur_write unfinished cluster for flushing
>> + *    Else
>> + *       return nothing
>> + *
>> + *
>> + * Returns true and set @offset, @bytes, @buf and @unfinished if there is
>> + * something to flush (accordingly to @unfinished value), returns false
>> + * otherwise.
>> + *
>> + * Nothing is removed from the cache.
> 
> Out of curiosity, mainly, is that because the returned *buf is only valid as long as the entry is still in the cache, or is there a different reason that I’m missing?
> 
> (Hm, perhaps the fact that the user may want to keep it available for reading through seqcache_read()?)

Yes, that's for read. This way reads are not blocked by in-flight cache flush. We just read from cache if data in the cache, and if not read from the image. We don't have to wait for intersecting in-flight write to finish.

> 
>> + */
>> +bool seqcache_get_next_flush(SeqCache *s, int64_t *offset, int64_t *bytes,
>> +                             uint8_t **buf, bool *unfinished)
> 
> Could be “uint8_t *const *buf”, I suppose.  Don’t know how much the callers would hate that, though.

Will do. And actually I wrote quite big explanation but missed the fact that caller don't get ownership on buf, it should be mentioned.

> 
>> +{
>> +    Cluster *req = s->next_flush;
>> +
>> +    if (s->next_flush) {
>> +        *unfinished = false;
>> +        req = s->next_flush;
>> +        s->next_flush = QSIMPLEQ_NEXT(req, entry);
>> +        if (s->next_flush == s->cur_write) {
>> +            s->next_flush = NULL;
>> +        }
>> +    } else if (s->cur_write && *unfinished) {
>> +        req = s->cur_write;
> 
> I was wondering whether flushing an unfinished cluster wouldn’t kind of finalize it, but I suppose the problem with that would be that you can’t add data to a finished cluster, which wouldn’t be that great if you’re just flushing the cache without wanting to drop it all.
> 
> (The problem I see is that flushing it later will mean all the data that already has been written here will have to be rewritten.  Not that bad, I suppose.)

Yes that's all correct. Also there is additional strong reason: qcow2 depends on the fact that clusters become "finished" by defined rules: only when it really finished up the the end or when qcow2 starts writing another cluster.

For "finished" clusters with unaligned end we can safely align this end up to some good alignment writing a bit more data than needed. It's safe because tail of the cluster is never used. And we'll perform better with aligned write avoiding RMW.

But when flushing "unfinished" cluster, we should write exactly what we have in the cache, as there may happen parallel write to the same cluster, which will continue the sequential process.

> 
>> +    } else {
>> +        return false;
>> +    }
>> +
>> +    *offset = req->offset;
>> +    *bytes = req->bytes;
>> +    *buf = req->buf;
>> +
>> +    return true;
>> +}
>> +
>> +/*
>> + * Find corresponding cluster and drop it. No matter does requested @offset is
>> + * cached itself or not.
> 
> The second sentence sounds strange grammatically, if I understand correctly, I’d change this to something like “Find the cluster corresponding to @offset and drop it.  It does not matter whether @offset itself is actually within that cluster’s cached range or not.”
> 

Right, thanks for good rewording. I'm afraid, I often use russian language constructions with english words.
Max Reitz March 12, 2021, 3:13 p.m. UTC | #3
On 12.03.21 15:37, Vladimir Sementsov-Ogievskiy wrote:
> 12.03.2021 16:41, Max Reitz wrote:
>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>> Implement cache for small sequential unaligned writes, so that they may
>>> be cached until we get a complete cluster and then write it.
>>>
>>> The cache is intended to be used for backup to qcow2 compressed target
>>> opened in O_DIRECT mode, but can be reused for any similar (even not
>>> block-layer related) task.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>   include/qemu/seqcache.h |  42 +++++
>>>   util/seqcache.c         | 361 ++++++++++++++++++++++++++++++++++++++++
>>>   MAINTAINERS             |   6 +
>>>   util/meson.build        |   1 +
>>>   4 files changed, 410 insertions(+)
>>>   create mode 100644 include/qemu/seqcache.h
>>>   create mode 100644 util/seqcache.c
>>
>> Looks quite good to me, thanks.  Nice explanations, too. :)
>>
>> The only design question I have is whether there’s a reason you’re 
>> using a list again instead of a hash table.  I suppose we do need the 
>> list anyway because of the next_flush iterator, so using a hash table 
>> would only complicate the implementation, but still.
> 
> Yes, it seems correct for flush iterator go in same order as writes 
> comes, so we need a list. We can add a hash table, it will only help on 
> read.. But for compressed cache in qcow2 we try to flush often enough, 
> so there should not be many clusters in the cache. So I think addition 
> of hash table may be done later if needed.

Sure.  The problem I see is that we’ll probably never reach the point of 
it really being needed. O:)

So I think it’s a question of now or never.

[...]

>>> + */
>>> +bool seqcache_get_next_flush(SeqCache *s, int64_t *offset, int64_t 
>>> *bytes,
>>> +                             uint8_t **buf, bool *unfinished)
>>
>> Could be “uint8_t *const *buf”, I suppose.  Don’t know how much the 
>> callers would hate that, though.
> 
> Will do. And actually I wrote quite big explanation but missed the fact 
> that caller don't get ownership on buf, it should be mentioned.

Great, thanks.

>>> +{
>>> +    Cluster *req = s->next_flush;
>>> +
>>> +    if (s->next_flush) {
>>> +        *unfinished = false;
>>> +        req = s->next_flush;
>>> +        s->next_flush = QSIMPLEQ_NEXT(req, entry);
>>> +        if (s->next_flush == s->cur_write) {
>>> +            s->next_flush = NULL;
>>> +        }
>>> +    } else if (s->cur_write && *unfinished) {
>>> +        req = s->cur_write;
>>
>> I was wondering whether flushing an unfinished cluster wouldn’t kind 
>> of finalize it, but I suppose the problem with that would be that you 
>> can’t add data to a finished cluster, which wouldn’t be that great if 
>> you’re just flushing the cache without wanting to drop it all.
>>
>> (The problem I see is that flushing it later will mean all the data 
>> that already has been written here will have to be rewritten.  Not 
>> that bad, I suppose.)
> 
> Yes that's all correct. Also there is additional strong reason: qcow2 
> depends on the fact that clusters become "finished" by defined rules: 
> only when it really finished up the the end or when qcow2 starts writing 
> another cluster.
> 
> For "finished" clusters with unaligned end we can safely align this end 
> up to some good alignment writing a bit more data than needed. It's safe 
> because tail of the cluster is never used. And we'll perform better with 
> aligned write avoiding RMW.
> 
> But when flushing "unfinished" cluster, we should write exactly what we 
> have in the cache, as there may happen parallel write to the same 
> cluster, which will continue the sequential process.

OK, thanks for the explanation.

Max
Vladimir Sementsov-Ogievskiy June 4, 2021, 2:31 p.m. UTC | #4
05.03.2021 20:35, Vladimir Sementsov-Ogievskiy wrote:
> Implement cache for small sequential unaligned writes, so that they may
> be cached until we get a complete cluster and then write it.
> 
> The cache is intended to be used for backup to qcow2 compressed target
> opened in O_DIRECT mode, but can be reused for any similar (even not
> block-layer related) task.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   include/qemu/seqcache.h |  42 +++++
>   util/seqcache.c         | 361 ++++++++++++++++++++++++++++++++++++++++
>   MAINTAINERS             |   6 +
>   util/meson.build        |   1 +
>   4 files changed, 410 insertions(+)
>   create mode 100644 include/qemu/seqcache.h
>   create mode 100644 util/seqcache.c
> 
> diff --git a/include/qemu/seqcache.h b/include/qemu/seqcache.h
> new file mode 100644
> index 0000000000..a308d54b01
> --- /dev/null
> +++ b/include/qemu/seqcache.h
> @@ -0,0 +1,42 @@
> +/*
> + * Cache for small sequential write requests.
> + *
> + * Copyright (c) 2021 Virtuozzo International GmbH.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + */
> +
> +#ifndef QEMU_SEQCACHE_H
> +#define QEMU_SEQCACHE_H
> +
> +typedef struct SeqCache SeqCache;
> +
> +SeqCache *seqcache_new(int64_t cluster_size);
> +void seqcache_free(SeqCache *s);
> +
> +void seqcache_write(SeqCache *s, int64_t offset, int64_t bytes, uint8_t *buf);
> +int64_t seqcache_read(SeqCache *s, int64_t offset, int64_t bytes, uint8_t *buf);
> +
> +bool seqcache_get_next_flush(SeqCache *s, int64_t *offset, int64_t *bytes,
> +                             uint8_t **buf, bool *unfinished);
> +void seqcache_discard_cluster(SeqCache *s, int64_t offset);
> +
> +uint64_t seqcache_nb_clusters(SeqCache *s);
> +
> +#endif /* QEMU_SEQCACHE_H */
> diff --git a/util/seqcache.c b/util/seqcache.c
> new file mode 100644
> index 0000000000..d923560eab
> --- /dev/null
> +++ b/util/seqcache.c
> @@ -0,0 +1,361 @@
> +/*
> + * Cache for small sequential write requests.
> + *
> + * Copyright (c) 2021 Virtuozzo International GmbH.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + *
> + *
> + * = Description =
> + *
> + * SeqCache is an abbreviation for Sequential Cache.
> + *
> + * The Cache is intended to improve performance of small unaligned sequential
> + * writes. Cache has a cluster_size parameter and the unit of caching is aligned
> + * cluster.  Cache has a list of cached clusters, several "finished" ones and at
> + * most one "unfinished". "unfinished" cluster is a cluster where last byte of
> + * last write operation is cached and there is a free place after that byte to
> + * the end of cluster. "finished" clusters are just stored in cache to be read
> + * or flushed and don't allow any writes to them.
> + *
> + * If write to the cache intersects cluster bounds, it's splat into several
> + * requests by cluster bounds. So, consider a write that doesn't intersect
> + * cluster bounds to describe the whole picture:
> + *
> + * There are two cases allowed:
> + *
> + * 1. Sequential write to "unfinished" cluster. Actually it's a write sequential
> + *    previous write. The data goes to "unfinished" cluster. If "unfinished" is
> + *    filled up to the cluster bound it becomes "finished".
> + *
> + * 2. Write to new cluster, not existing in the cache. It may be sequential to
> + *    previous write or not. Current "unfinshed" cluster (if exists) becomes
> + *    "finished" and new "unfinished" cluster is started. Note also that offset
> + *    of the write to new cluster is not required to be aligned.
> + *
> + *    Any other write operation (non-sequential write to "unfinished" cluster
> + *    or write to any of "finished" clusters) will crash.
> + */
> +
> +#include "qemu/osdep.h"
> +
> +#include "qemu/queue.h"
> +#include "qemu/seqcache.h"
> +
> +/*
> + * Cluster
> + *
> + * Representation of one cached cluster, aligned to SeqCache::cluster_size.
> + * Caches only one subregion of the cluster, started at @offset (may be
> + * unaligned to cluster_size) and of @bytes length (may be unaligned as well).
> + * The whole subregion always lay in one aligned cluster:
> + *
> + *      QEMU_ALIGN_DOWN(offset, cluster_size) ==
> + *          QEMU_ALIGN_DOWN(offset + bytes - 1, cluster_size)
> + *
> + * @buf is allocated to be able to fill the cluster up to the cluster end, i.e.
> + * allocated @buf length is at least:
> + *
> + *      cluster_size - offset % cluster_size
> + */
> +typedef struct Cluster {
> +   int64_t offset;
> +   int64_t bytes;
> +   uint8_t *buf;
> +   QSIMPLEQ_ENTRY(Cluster) entry;
> +} Cluster;
> +
> +/*
> + * SeqCache caches small sequential writes into "unfinished" @cur_write cluster,
> + * until entire cluster (of @cluster_size bytes) is filled by seqcache_write()
> + * calls.
> + *
> + * @cur_write->offset may be unaligned to @cluster_size if first write offset is
> + * not aligned (for example, if there was a flush request and cache was flushed,
> + * then we continue from the middle of the cluster with an empty cache).
> + *
> + * @cur_write may be NULL, which means we don't have current cluster and next
> + * seqcache_write() will start a new one.
> + *
> + * @all is a list of all clusters cached in the cache, some "finished" and one
> + * "unfinished" @cur_write (if exists). If @cur_write is not NULL it is a last
> + * one in the list.
> + *
> + * @nb_clusters is number of elements in @all list.
> + *
> + * @next_flush is an iterator for flushing. SeqCache knows nothing about how
> + * data should be flushing, so for flush user calls seqcache_get_next_flush()
> + * (which moves @next_flush iterator) and when data is flushed somehow and cache
> + * is not needed anymore, user can call seqcache_discard_cluster().
> + */
> +typedef struct SeqCache {
> +    int64_t cluster_size;
> +    int64_t nb_clusters;
> +    Cluster *cur_write;
> +    Cluster *next_flush;
> +    QSIMPLEQ_HEAD(, Cluster) all;
> +} SeqCache;
> +
> +static void cluster_free(Cluster *req)
> +{
> +    if (req) {
> +        g_free(req->buf);
> +        g_free(req);
> +    }
> +}
> +
> +SeqCache *seqcache_new(int64_t cluster_size)
> +{
> +    SeqCache *s = g_new(SeqCache, 1);
> +
> +    *s = (SeqCache) {
> +        .cluster_size = cluster_size,
> +    };
> +
> +    QSIMPLEQ_INIT(&s->all);
> +
> +    return s;
> +}
> +
> +/*
> + * User should clean the cache calling seqcache_get_next_flush() and
> + * seqcache_discard_cluster() sequentially before freeing it.
> + */
> +void seqcache_free(SeqCache *s)
> +{
> +    if (s) {
> +        assert(QSIMPLEQ_EMPTY(&s->all));
> +        g_free(s);
> +    }
> +}
> +
> +/* End of cached region inside one cluster */
> +static int64_t cached_end(Cluster *cl)
> +{
> +    return cl->offset + cl->bytes;
> +}
> +
> +/* Start of aligned cluster containing the @offset */
> +static int64_t cluster_start(SeqCache *s, int64_t offset)
> +{
> +    return QEMU_ALIGN_DOWN(offset, s->cluster_size);
> +}
> +
> +/* End of aligned cluster containing the @offset */
> +static int64_t cluster_end(SeqCache *s, int64_t offset)
> +{
> +    return cluster_start(s, offset) + s->cluster_size;
> +}
> +
> +/* Align down @offset to s->cluster_size and search for corresponding cluster */
> +static Cluster *seqcache_find_cluster(SeqCache *s, int64_t offset)
> +{
> +    Cluster *cl;
> +    int64_t cl_start = cluster_start(s, offset);
> +
> +    QSIMPLEQ_FOREACH(cl, &s->all, entry) {
> +        if (cluster_start(s, cl->offset) == cl_start) {
> +            return cl;
> +        }
> +    }
> +
> +    return NULL;
> +}
> +
> +/* Makes current "unfinished" cluster the "finished" one. */
> +static void seqcache_finalize_current_cluster(SeqCache *s)
> +{
> +    if (s->cur_write && !s->next_flush) {
> +        s->next_flush = s->cur_write;
> +    }
> +    s->cur_write = NULL;
> +}
> +
> +/*
> + * Write an @offset, @bytes, @buf request into the cache. The requests MUST not
> + * intersect cluster bounds.
> + */
> +static void seqcache_write_one(SeqCache *s, int64_t offset, int64_t bytes,
> +                               uint8_t *buf)
> +{
> +    assert(bytes > 0);
> +    assert(cluster_start(s, offset) == cluster_start(s, offset + bytes - 1));
> +
> +    if (s->cur_write && offset == cached_end(s->cur_write)) {
> +        /* Continue sequential process */
> +        memcpy(s->cur_write->buf + s->cur_write->bytes, buf, bytes);
> +        s->cur_write->bytes += bytes;
> +
> +        if (cached_end(s->cur_write) == cluster_end(s, s->cur_write->offset)) {
> +            seqcache_finalize_current_cluster(s);
> +        }
> +
> +        return;
> +    }
> +
> +    /* We are starting a new cluster. Check that it's really new */
> +    assert(!seqcache_find_cluster(s, offset));
> +
> +    seqcache_finalize_current_cluster(s);
> +
> +    s->cur_write = g_new(Cluster, 1);
> +    *s->cur_write = (Cluster) {
> +        .offset = offset,
> +        .bytes = bytes,
> +        .buf = g_malloc(s->cluster_size),
> +    };
> +
> +    memcpy(s->cur_write->buf, buf, bytes);
> +    QSIMPLEQ_INSERT_TAIL(&s->all, s->cur_write, entry);
> +    s->nb_clusters++;

Here, must also do

> +        if (cached_end(s->cur_write) == cluster_end(s, s->cur_write->offset)) {
> +            seqcache_finalize_current_cluster(s);
> +        }


otherwise we'll try to continue that new filled buffer by next cluster on next seqcache_write_one(), and buffer is overflowed [faced this bug in our downstream release].


> +}
> +
> +/* Write an @offset, @bytes, @buf request into the cache. */
> +void seqcache_write(SeqCache *s, int64_t offset, int64_t bytes, uint8_t *buf)
> +{
> +    while (bytes) {
> +        int64_t cl_end = cluster_end(s, offset);
> +        int64_t chunk = MIN(bytes, cl_end - offset);
> +
> +        seqcache_write_one(s, offset, chunk, buf);
> +        offset += chunk;
> +        bytes -= chunk;
> +        buf += chunk;
> +    }
> +}
> +
> +/*
> + * Read from the cache.
> + *
> + * If @offset misses the cache, return 0.
> + *
> + * If @offset is inside the cache, copy corresponding available data to @buf
> + * (may be less than required @bytes if hole reached earlier) and return number
> + * of bytes copied.
> + */
> +int64_t seqcache_read(SeqCache *s, int64_t offset, int64_t bytes, uint8_t *buf)
> +{
> +    uint8_t *pos = buf;
> +
> +    while (bytes) {
> +        Cluster *cl = seqcache_find_cluster(s, offset);
> +        int64_t chunk;
> +
> +        if (!cl || offset < cl->offset || offset >= cached_end(cl)) {
> +            break;
> +        }
> +
> +        chunk = MIN(bytes, cached_end(cl) - offset);
> +        memcpy(pos, cl->buf + offset - cl->offset, chunk);
> +        offset += chunk;
> +        bytes -= chunk;
> +        pos += chunk;
> +
> +        if (!QEMU_IS_ALIGNED(offset, s->cluster_size)) {
> +            break;
> +        }
> +    }
> +
> +    return pos - buf;
> +}
> +
> +/*
> + * Get next region for flushing. @offset, @bytes and @buf are out-parameters
> + * to return the region.
> + *
> + * @unfinished is in-out argument which means that user is interested in
> + * flushing unfinished cluster too:
> + *
> + * If there are "finished" clusters, "finished" cluster is returned and
> + * *@unfinished is set to false, independently of its original value.
> + *
> + * If there are no "finished" clusters but "unfinished" exists (i.e.
> + * s->cur_write != NULL and it is the only element of s->all), then *@unfinished
> + * value remains the same and the following logic works:
> + *
> + *    If *@unfinished:
> + *       return s->cur_write unfinished cluster for flushing
> + *    Else
> + *       return nothing
> + *
> + *
> + * Returns true and set @offset, @bytes, @buf and @unfinished if there is
> + * something to flush (accordingly to @unfinished value), returns false
> + * otherwise.
> + *
> + * Nothing is removed from the cache.
> + */
> +bool seqcache_get_next_flush(SeqCache *s, int64_t *offset, int64_t *bytes,
> +                             uint8_t **buf, bool *unfinished)
> +{
> +    Cluster *req = s->next_flush;
> +
> +    if (s->next_flush) {
> +        *unfinished = false;
> +        req = s->next_flush;
> +        s->next_flush = QSIMPLEQ_NEXT(req, entry);
> +        if (s->next_flush == s->cur_write) {
> +            s->next_flush = NULL;
> +        }
> +    } else if (s->cur_write && *unfinished) {
> +        req = s->cur_write;
> +    } else {
> +        return false;
> +    }
> +
> +    *offset = req->offset;
> +    *bytes = req->bytes;
> +    *buf = req->buf;
> +
> +    return true;
> +}
> +
> +/*
> + * Find corresponding cluster and drop it. No matter does requested @offset is
> + * cached itself or not.
> + */
> +void seqcache_discard_cluster(SeqCache *s, int64_t offset)
> +{
> +    Cluster *cl = seqcache_find_cluster(s, offset);
> +
> +    if (!cl) {
> +        return;
> +    }
> +
> +    if (cl == s->cur_write) {
> +        assert(cl != s->next_flush);
> +        s->cur_write = NULL;
> +    } else if (cl == s->next_flush) {
> +        assert(cl != s->cur_write);
> +        s->next_flush = QSIMPLEQ_NEXT(s->next_flush, entry);
> +        if (s->next_flush == s->cur_write) {
> +            s->next_flush = NULL;
> +        }
> +    }
> +
> +    QSIMPLEQ_REMOVE(&s->all, cl, Cluster, entry);
> +    cluster_free(cl);
> +    s->nb_clusters--;
> +}
> +
> +/* Returns number of cached clusters including unfinished */
> +uint64_t seqcache_nb_clusters(SeqCache *s)
> +{
> +    return s->nb_clusters;
> +}
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 9b2aa18e1f..5e263c3343 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3347,3 +3347,9 @@ Performance Tools and Tests
>   M: Ahmed Karaman <ahmedkhaledkaraman@gmail.com>
>   S: Maintained
>   F: scripts/performance/
> +
> +Sequential Cache
> +M: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> +S: Supported
> +F: util/seqcache.c
> +F: include/qemu/seqcache.h
> diff --git a/util/meson.build b/util/meson.build
> index 984fba965f..6682e463ac 100644
> --- a/util/meson.build
> +++ b/util/meson.build
> @@ -75,6 +75,7 @@ if have_block
>     util_ss.add(files('block-helpers.c'))
>     util_ss.add(files('qemu-coroutine-sleep.c'))
>     util_ss.add(files('qemu-co-shared-resource.c'))
> +  util_ss.add(files('seqcache.c'))
>     util_ss.add(files('thread-pool.c', 'qemu-timer.c'))
>     util_ss.add(files('readline.c'))
>     util_ss.add(files('throttle.c'))
>
diff mbox series

Patch

diff --git a/include/qemu/seqcache.h b/include/qemu/seqcache.h
new file mode 100644
index 0000000000..a308d54b01
--- /dev/null
+++ b/include/qemu/seqcache.h
@@ -0,0 +1,42 @@ 
+/*
+ * Cache for small sequential write requests.
+ *
+ * Copyright (c) 2021 Virtuozzo International GmbH.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#ifndef QEMU_SEQCACHE_H
+#define QEMU_SEQCACHE_H
+
+typedef struct SeqCache SeqCache;
+
+SeqCache *seqcache_new(int64_t cluster_size);
+void seqcache_free(SeqCache *s);
+
+void seqcache_write(SeqCache *s, int64_t offset, int64_t bytes, uint8_t *buf);
+int64_t seqcache_read(SeqCache *s, int64_t offset, int64_t bytes, uint8_t *buf);
+
+bool seqcache_get_next_flush(SeqCache *s, int64_t *offset, int64_t *bytes,
+                             uint8_t **buf, bool *unfinished);
+void seqcache_discard_cluster(SeqCache *s, int64_t offset);
+
+uint64_t seqcache_nb_clusters(SeqCache *s);
+
+#endif /* QEMU_SEQCACHE_H */
diff --git a/util/seqcache.c b/util/seqcache.c
new file mode 100644
index 0000000000..d923560eab
--- /dev/null
+++ b/util/seqcache.c
@@ -0,0 +1,361 @@ 
+/*
+ * Cache for small sequential write requests.
+ *
+ * Copyright (c) 2021 Virtuozzo International GmbH.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ *
+ *
+ * = Description =
+ *
+ * SeqCache is an abbreviation for Sequential Cache.
+ *
+ * The Cache is intended to improve performance of small unaligned sequential
+ * writes. Cache has a cluster_size parameter and the unit of caching is aligned
+ * cluster.  Cache has a list of cached clusters, several "finished" ones and at
+ * most one "unfinished". "unfinished" cluster is a cluster where last byte of
+ * last write operation is cached and there is a free place after that byte to
+ * the end of cluster. "finished" clusters are just stored in cache to be read
+ * or flushed and don't allow any writes to them.
+ *
+ * If write to the cache intersects cluster bounds, it's splat into several
+ * requests by cluster bounds. So, consider a write that doesn't intersect
+ * cluster bounds to describe the whole picture:
+ *
+ * There are two cases allowed:
+ *
+ * 1. Sequential write to "unfinished" cluster. Actually it's a write sequential
+ *    previous write. The data goes to "unfinished" cluster. If "unfinished" is
+ *    filled up to the cluster bound it becomes "finished".
+ *
+ * 2. Write to new cluster, not existing in the cache. It may be sequential to
+ *    previous write or not. Current "unfinshed" cluster (if exists) becomes
+ *    "finished" and new "unfinished" cluster is started. Note also that offset
+ *    of the write to new cluster is not required to be aligned.
+ *
+ *    Any other write operation (non-sequential write to "unfinished" cluster
+ *    or write to any of "finished" clusters) will crash.
+ */
+
+#include "qemu/osdep.h"
+
+#include "qemu/queue.h"
+#include "qemu/seqcache.h"
+
+/*
+ * Cluster
+ *
+ * Representation of one cached cluster, aligned to SeqCache::cluster_size.
+ * Caches only one subregion of the cluster, started at @offset (may be
+ * unaligned to cluster_size) and of @bytes length (may be unaligned as well).
+ * The whole subregion always lay in one aligned cluster:
+ *
+ *      QEMU_ALIGN_DOWN(offset, cluster_size) ==
+ *          QEMU_ALIGN_DOWN(offset + bytes - 1, cluster_size)
+ *
+ * @buf is allocated to be able to fill the cluster up to the cluster end, i.e.
+ * allocated @buf length is at least:
+ *
+ *      cluster_size - offset % cluster_size
+ */
+typedef struct Cluster {
+   int64_t offset;
+   int64_t bytes;
+   uint8_t *buf;
+   QSIMPLEQ_ENTRY(Cluster) entry;
+} Cluster;
+
+/*
+ * SeqCache caches small sequential writes into "unfinished" @cur_write cluster,
+ * until entire cluster (of @cluster_size bytes) is filled by seqcache_write()
+ * calls.
+ *
+ * @cur_write->offset may be unaligned to @cluster_size if first write offset is
+ * not aligned (for example, if there was a flush request and cache was flushed,
+ * then we continue from the middle of the cluster with an empty cache).
+ *
+ * @cur_write may be NULL, which means we don't have current cluster and next
+ * seqcache_write() will start a new one.
+ *
+ * @all is a list of all clusters cached in the cache, some "finished" and one
+ * "unfinished" @cur_write (if exists). If @cur_write is not NULL it is a last
+ * one in the list.
+ *
+ * @nb_clusters is number of elements in @all list.
+ *
+ * @next_flush is an iterator for flushing. SeqCache knows nothing about how
+ * data should be flushing, so for flush user calls seqcache_get_next_flush()
+ * (which moves @next_flush iterator) and when data is flushed somehow and cache
+ * is not needed anymore, user can call seqcache_discard_cluster().
+ */
+typedef struct SeqCache {
+    int64_t cluster_size;
+    int64_t nb_clusters;
+    Cluster *cur_write;
+    Cluster *next_flush;
+    QSIMPLEQ_HEAD(, Cluster) all;
+} SeqCache;
+
+static void cluster_free(Cluster *req)
+{
+    if (req) {
+        g_free(req->buf);
+        g_free(req);
+    }
+}
+
+SeqCache *seqcache_new(int64_t cluster_size)
+{
+    SeqCache *s = g_new(SeqCache, 1);
+
+    *s = (SeqCache) {
+        .cluster_size = cluster_size,
+    };
+
+    QSIMPLEQ_INIT(&s->all);
+
+    return s;
+}
+
+/*
+ * User should clean the cache calling seqcache_get_next_flush() and
+ * seqcache_discard_cluster() sequentially before freeing it.
+ */
+void seqcache_free(SeqCache *s)
+{
+    if (s) {
+        assert(QSIMPLEQ_EMPTY(&s->all));
+        g_free(s);
+    }
+}
+
+/* End of cached region inside one cluster */
+static int64_t cached_end(Cluster *cl)
+{
+    return cl->offset + cl->bytes;
+}
+
+/* Start of aligned cluster containing the @offset */
+static int64_t cluster_start(SeqCache *s, int64_t offset)
+{
+    return QEMU_ALIGN_DOWN(offset, s->cluster_size);
+}
+
+/* End of aligned cluster containing the @offset */
+static int64_t cluster_end(SeqCache *s, int64_t offset)
+{
+    return cluster_start(s, offset) + s->cluster_size;
+}
+
+/* Align down @offset to s->cluster_size and search for corresponding cluster */
+static Cluster *seqcache_find_cluster(SeqCache *s, int64_t offset)
+{
+    Cluster *cl;
+    int64_t cl_start = cluster_start(s, offset);
+
+    QSIMPLEQ_FOREACH(cl, &s->all, entry) {
+        if (cluster_start(s, cl->offset) == cl_start) {
+            return cl;
+        }
+    }
+
+    return NULL;
+}
+
+/* Makes current "unfinished" cluster the "finished" one. */
+static void seqcache_finalize_current_cluster(SeqCache *s)
+{
+    if (s->cur_write && !s->next_flush) {
+        s->next_flush = s->cur_write;
+    }
+    s->cur_write = NULL;
+}
+
+/*
+ * Write an @offset, @bytes, @buf request into the cache. The requests MUST not
+ * intersect cluster bounds.
+ */
+static void seqcache_write_one(SeqCache *s, int64_t offset, int64_t bytes,
+                               uint8_t *buf)
+{
+    assert(bytes > 0);
+    assert(cluster_start(s, offset) == cluster_start(s, offset + bytes - 1));
+
+    if (s->cur_write && offset == cached_end(s->cur_write)) {
+        /* Continue sequential process */
+        memcpy(s->cur_write->buf + s->cur_write->bytes, buf, bytes);
+        s->cur_write->bytes += bytes;
+
+        if (cached_end(s->cur_write) == cluster_end(s, s->cur_write->offset)) {
+            seqcache_finalize_current_cluster(s);
+        }
+
+        return;
+    }
+
+    /* We are starting a new cluster. Check that it's really new */
+    assert(!seqcache_find_cluster(s, offset));
+
+    seqcache_finalize_current_cluster(s);
+
+    s->cur_write = g_new(Cluster, 1);
+    *s->cur_write = (Cluster) {
+        .offset = offset,
+        .bytes = bytes,
+        .buf = g_malloc(s->cluster_size),
+    };
+
+    memcpy(s->cur_write->buf, buf, bytes);
+    QSIMPLEQ_INSERT_TAIL(&s->all, s->cur_write, entry);
+    s->nb_clusters++;
+}
+
+/* Write an @offset, @bytes, @buf request into the cache. */
+void seqcache_write(SeqCache *s, int64_t offset, int64_t bytes, uint8_t *buf)
+{
+    while (bytes) {
+        int64_t cl_end = cluster_end(s, offset);
+        int64_t chunk = MIN(bytes, cl_end - offset);
+
+        seqcache_write_one(s, offset, chunk, buf);
+        offset += chunk;
+        bytes -= chunk;
+        buf += chunk;
+    }
+}
+
+/*
+ * Read from the cache.
+ *
+ * If @offset misses the cache, return 0.
+ *
+ * If @offset is inside the cache, copy corresponding available data to @buf
+ * (may be less than required @bytes if hole reached earlier) and return number
+ * of bytes copied.
+ */
+int64_t seqcache_read(SeqCache *s, int64_t offset, int64_t bytes, uint8_t *buf)
+{
+    uint8_t *pos = buf;
+
+    while (bytes) {
+        Cluster *cl = seqcache_find_cluster(s, offset);
+        int64_t chunk;
+
+        if (!cl || offset < cl->offset || offset >= cached_end(cl)) {
+            break;
+        }
+
+        chunk = MIN(bytes, cached_end(cl) - offset);
+        memcpy(pos, cl->buf + offset - cl->offset, chunk);
+        offset += chunk;
+        bytes -= chunk;
+        pos += chunk;
+
+        if (!QEMU_IS_ALIGNED(offset, s->cluster_size)) {
+            break;
+        }
+    }
+
+    return pos - buf;
+}
+
+/*
+ * Get next region for flushing. @offset, @bytes and @buf are out-parameters
+ * to return the region.
+ *
+ * @unfinished is in-out argument which means that user is interested in
+ * flushing unfinished cluster too:
+ *
+ * If there are "finished" clusters, "finished" cluster is returned and
+ * *@unfinished is set to false, independently of its original value.
+ *
+ * If there are no "finished" clusters but "unfinished" exists (i.e.
+ * s->cur_write != NULL and it is the only element of s->all), then *@unfinished
+ * value remains the same and the following logic works:
+ *
+ *    If *@unfinished:
+ *       return s->cur_write unfinished cluster for flushing
+ *    Else
+ *       return nothing
+ *
+ *
+ * Returns true and set @offset, @bytes, @buf and @unfinished if there is
+ * something to flush (accordingly to @unfinished value), returns false
+ * otherwise.
+ *
+ * Nothing is removed from the cache.
+ */
+bool seqcache_get_next_flush(SeqCache *s, int64_t *offset, int64_t *bytes,
+                             uint8_t **buf, bool *unfinished)
+{
+    Cluster *req = s->next_flush;
+
+    if (s->next_flush) {
+        *unfinished = false;
+        req = s->next_flush;
+        s->next_flush = QSIMPLEQ_NEXT(req, entry);
+        if (s->next_flush == s->cur_write) {
+            s->next_flush = NULL;
+        }
+    } else if (s->cur_write && *unfinished) {
+        req = s->cur_write;
+    } else {
+        return false;
+    }
+
+    *offset = req->offset;
+    *bytes = req->bytes;
+    *buf = req->buf;
+
+    return true;
+}
+
+/*
+ * Find corresponding cluster and drop it. No matter does requested @offset is
+ * cached itself or not.
+ */
+void seqcache_discard_cluster(SeqCache *s, int64_t offset)
+{
+    Cluster *cl = seqcache_find_cluster(s, offset);
+
+    if (!cl) {
+        return;
+    }
+
+    if (cl == s->cur_write) {
+        assert(cl != s->next_flush);
+        s->cur_write = NULL;
+    } else if (cl == s->next_flush) {
+        assert(cl != s->cur_write);
+        s->next_flush = QSIMPLEQ_NEXT(s->next_flush, entry);
+        if (s->next_flush == s->cur_write) {
+            s->next_flush = NULL;
+        }
+    }
+
+    QSIMPLEQ_REMOVE(&s->all, cl, Cluster, entry);
+    cluster_free(cl);
+    s->nb_clusters--;
+}
+
+/* Returns number of cached clusters including unfinished */
+uint64_t seqcache_nb_clusters(SeqCache *s)
+{
+    return s->nb_clusters;
+}
diff --git a/MAINTAINERS b/MAINTAINERS
index 9b2aa18e1f..5e263c3343 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3347,3 +3347,9 @@  Performance Tools and Tests
 M: Ahmed Karaman <ahmedkhaledkaraman@gmail.com>
 S: Maintained
 F: scripts/performance/
+
+Sequential Cache
+M: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
+S: Supported
+F: util/seqcache.c
+F: include/qemu/seqcache.h
diff --git a/util/meson.build b/util/meson.build
index 984fba965f..6682e463ac 100644
--- a/util/meson.build
+++ b/util/meson.build
@@ -75,6 +75,7 @@  if have_block
   util_ss.add(files('block-helpers.c'))
   util_ss.add(files('qemu-coroutine-sleep.c'))
   util_ss.add(files('qemu-co-shared-resource.c'))
+  util_ss.add(files('seqcache.c'))
   util_ss.add(files('thread-pool.c', 'qemu-timer.c'))
   util_ss.add(files('readline.c'))
   util_ss.add(files('throttle.c'))