diff mbox series

[v3,3/6] block/qcow2: introduce inflight writes counters: fix discard

Message ID 20210305173507.393137-4-vsementsov@virtuozzo.com
State New
Headers show
Series qcow2: compressed write cache | expand

Commit Message

Vladimir Sementsov-Ogievskiy March 5, 2021, 5:35 p.m. UTC
There is a bug in qcow2: host cluster can be discarded (refcount
becomes 0) and reused during data write. In this case data write may
pollute another cluster (recently allocated) or even metadata.

To fix the issue let's track inflight writes to host cluster in the
hash table and consider new counter when discarding and reusing host
clusters.

Enable qcow2-discard-during-rewrite as it is fixed.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/qcow2.h                                 |   9 ++
 block/qcow2-refcount.c                        | 149 +++++++++++++++++-
 block/qcow2.c                                 |  26 ++-
 .../tests/qcow2-discard-during-rewrite        |   2 +-
 4 files changed, 181 insertions(+), 5 deletions(-)

Comments

Max Reitz March 11, 2021, 7:58 p.m. UTC | #1
On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
> There is a bug in qcow2: host cluster can be discarded (refcount
> becomes 0) and reused during data write. In this case data write may
> pollute another cluster (recently allocated) or even metadata.

I was about to ask whether we couldn’t somehow let discard requests 
check overlapping I/O with the tracked_request infrastructure from 
block/io.c (or perhaps even just let them be serializing).

But I suppose there may be other reasons for clusters being discarded 
other than explicit discard requests, so we have to have something in 
qcow2...

> To fix the issue let's track inflight writes to host cluster in the
> hash table and consider new counter when discarding and reusing host
> clusters.

This didn’t really explain anything that would help me understand what’s 
going on in this patch.  The patch itself doesn’t really help me with 
comments either.  It’s possible that I’m just daft, but honestly I have 
a hard time really understanding what’s supposed to be going on.

Coming back from jumping all over this patch, I also wonder why this 
isn’t split in multiple patches, where the first introduces the 
infrastructure and explains exactly what’s going on, and the next 
patches make use of it so I could understand what each check etc. is 
supposed to represent and do.


To serve as an example, after reading through the patch, I still have no 
idea how it prevents that discard problem you’re trying to fix.  Maybe 
I’m lazy, but I would have liked exactly that to be explained by this 
commit message.  Actually, I don’t even understand the problem just from 
this commit message alone, but I could resort to HEAD^ and some 
thinking.  Not sure if that’s ideal, though.

So the commit message says that “host cluster can be discarded and 
reused during data write”, which to me just seems like the exact 
behavior we want.  The second sentence describes a problem, but doesn’t 
say how reclaiming discarded clusters leads there.

I suppose what leads to the problem is that there first needs to be a 
background write even before the discard that is settled only after the 
re-claiming write (which isn’t mentioned).

As far as I understand, this patch addresses that problem by preventing 
the discarded clusters from being allocated while said background write 
is not settled yet.  This is done by keeping track of all host clusters 
that are currently the target of some write operation, and preventing 
them from being allocated.  OK, makes sense.
(This latter part does roughly correspond to the second paragraph of 
this commit message, but I couldn’t make sense of it without 
understanding why we’d do it.  What’s notably missing from my 
explanation is the part that you hinted at with “when discarding”, but 
my problem is that that I just don’t understand what that means at all. 
  Perhaps it has to do with how I don’t understand the change to 
update_refcount(), and consequently the whole “slow path” in 
update_inflight_write_cnt().)


Side note: Intuitively, this hash table feels like an unnecessarily big 
hammer to me, but I can’t think of a better solution either, so.  (I 
mainly don’t like how every write request will result in one allocation 
and hash table entry per cluster it touches, even when no data cluster 
is ever discarded.)

> Enable qcow2-discard-during-rewrite as it is fixed.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   block/qcow2.h                                 |   9 ++
>   block/qcow2-refcount.c                        | 149 +++++++++++++++++-
>   block/qcow2.c                                 |  26 ++-
>   .../tests/qcow2-discard-during-rewrite        |   2 +-
>   4 files changed, 181 insertions(+), 5 deletions(-)
> 
> diff --git a/block/qcow2.h b/block/qcow2.h
> index 0678073b74..e18d8dfbae 100644
> --- a/block/qcow2.h
> +++ b/block/qcow2.h
> @@ -420,6 +420,8 @@ typedef struct BDRVQcow2State {
>        * is to convert the image with the desired compression type set.
>        */
>       Qcow2CompressionType compression_type;
> +
> +    GHashTable *inflight_writes_counters;
>   } BDRVQcow2State;
>   
>   typedef struct Qcow2COWRegion {
> @@ -896,6 +898,13 @@ int qcow2_shrink_reftable(BlockDriverState *bs);
>   int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size);
>   int qcow2_detect_metadata_preallocation(BlockDriverState *bs);
>   
> +int qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
> +                              int64_t length);
> +int qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
> +                              int64_t length);
> +int qcow2_inflight_writes_dec_locked(BlockDriverState *bs, int64_t offset,
> +                                     int64_t length);
> +
>   /* qcow2-cluster.c functions */
>   int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
>                           bool exact_size);
> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
> index 8e649b008e..464d133368 100644
> --- a/block/qcow2-refcount.c
> +++ b/block/qcow2-refcount.c
> @@ -799,6 +799,140 @@ found:
>       }
>   }
>   
> +/*
> + * Qcow2InFlightRefcount is a type for values of s->inflight_writes_counters
> + * hasm map. And it's keys are cluster indices.

*hash, *its

I’d rather document this for s->inflight_writes_counters, though (like 
“/* cluster index (int64_t) -> Qcow2InFlightRefcount */”)

> + */
> +typedef struct Qcow2InFlightRefcount {
> +    /*
> +     * Number of in-flight writes to the cluster, always > 0, as when becomes

s/when becomes/when it becomes/

> +     * 0 the entry is removed from s->inflight_writes_counters.
> +     */
> +    uint64_t inflight_writes_cnt;
> +
> +    /* Cluster refcount is known to be zero */
> +    bool refcount_zero;
> +
> +    /* Cluster refcount was made zero with this discard-type */
> +    enum qcow2_discard_type type;
> +} Qcow2InFlightRefcount;
> +
> +static Qcow2InFlightRefcount *find_infl_wr(BDRVQcow2State *s,
> +                                           int64_t cluster_index)
> +{
> +    Qcow2InFlightRefcount *infl;
> +
> +    if (!s->inflight_writes_counters) {
> +        return NULL;
> +    }
> +
> +    infl = g_hash_table_lookup(s->inflight_writes_counters, &cluster_index);
> +
> +    if (infl) {
> +        assert(infl->inflight_writes_cnt > 0);
> +    }
> +
> +    return infl;
> +}
> +
> +/*
> + * Returns true if there are any in-flight writes to the cluster blocking
> + * its reallocation.
> + */
> +static bool has_infl_wr(BDRVQcow2State *s, int64_t cluster_index)
> +{
> +    return !!find_infl_wr(s, cluster_index);

I wonder if g_hash_table_contains() would be quicker. *shrug*

> +}
> +
> +static int update_inflight_write_cnt(BlockDriverState *bs,
> +                                     int64_t offset, int64_t length,
> +                                     bool decrease, bool locked)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    int64_t start, last, cluster_offset;
> +
> +    if (locked) {
> +        qemu_co_mutex_assert_locked(&s->lock);
> +    }
> +
> +    start = start_of_cluster(s, offset);
> +    last = start_of_cluster(s, offset + length - 1);
> +    for (cluster_offset = start; cluster_offset <= last;
> +         cluster_offset += s->cluster_size)
> +    {
> +        int64_t cluster_index = cluster_offset >> s->cluster_bits;
> +        Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
> +
> +        if (!infl) {
> +            infl = g_new0(Qcow2InFlightRefcount, 1);
> +            g_hash_table_insert(s->inflight_writes_counters,
> +                                g_memdup(&cluster_index, sizeof(cluster_index)),
> +                                infl);

I suppose we could save one allocation by putting the cluster index into 
Qcow2InFlightRefcount and then giving the key as &infl->cluster_index. 
Not sure if that’s too nasty, though.

> +        }
> +
> +        if (decrease) {
> +            assert(infl->inflight_writes_cnt >= 1);
> +            infl->inflight_writes_cnt--;
> +        } else {
> +            infl->inflight_writes_cnt++;
> +        }
> +
> +        if (infl->inflight_writes_cnt == 0) {
> +            bool refcount_zero = infl->refcount_zero;
> +            enum qcow2_discard_type type = infl->type;
> +
> +            g_hash_table_remove(s->inflight_writes_counters, &cluster_index);
> +
> +            if (refcount_zero) {
> +                /*
> +                 * Slow path. We must reset normal refcount to actually release

I don’t understand what “slow path” means in this context.  (Nor do I 
really understand the rest, but more on that below.)

> +                 * the cluster.
> +                 */
> +                int ret;
> +
> +                if (!locked) {
> +                    qemu_co_mutex_lock(&s->lock);
> +                }
> +                ret = qcow2_update_cluster_refcount(bs, cluster_index, 0,
> +                                                    true, type);

I don’t understand this, but then again I’m writing this very early in 
my review, and at this point I don’t understand anything, really.  (The 
comment doesn’t explain to me what’s happening here.)

Revisiting later, refcount_zero is set by update_refcount() when the 
refcount drops to zero, so invoking this function here will finalize the 
change.  I don’t quite understand what’s going on in update_refcount(), 
though.

(And even after finding out why, I don’t understand why the comment 
doesn’t explain why we must invoke qcow2_update_cluster_refcount() with 
addend=0.)

> +                if (!locked) {
> +                    qemu_co_mutex_unlock(&s->lock);
> +                }
> +
> +                if (ret < 0) {
> +                    return ret;
> +                }
> +            }
> +        }
> +
> +    }
> +
> +    return 0;
> +}
> +
> +int qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
> +                              int64_t length)
> +{
> +    return update_inflight_write_cnt(bs, offset, length, false, false);
> +}
> +
> +/*
> + * Called with s->lock not locked by caller. Will take s->lock only if need to
> + * release the cluster (refcount is 0 and inflight-write-cnt becomes zero).
> + */

Why doesn’t qcow2_inflight_writes_inc() get the same comment?

...Oh, because @locked doesn’t have an influence there.  Why isn’t it 
worth a comment that *_inc() works both with a locked and an unlocked mutex?

> +int qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
> +                              int64_t length)
> +{
> +    return update_inflight_write_cnt(bs, offset, length, true, false);
> +}
> +
> +/* Called with s->lock locked. */
> +int qcow2_inflight_writes_dec_locked(BlockDriverState *bs, int64_t offset,
> +                                     int64_t length)
> +{
> +    return update_inflight_write_cnt(bs, offset, length, true, true);
> +}
> +
>   /* XXX: cache several refcount block clusters ? */
>   /* @addend is the absolute value of the addend; if @decrease is set, @addend
>    * will be subtracted from the current refcount, otherwise it will be added */
> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>   
>           if (refcount == 0) {
>               void *table;
> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
> +
> +            if (infl) {
> +                infl->refcount_zero = true;	
> +                infl->type = type;
> +                continue;
> +            }

I don’t understand what this is supposed to do exactly.  It seems like 
it wants to keep metadata structures in the cache that are still in use 
(because dropping them from the caches is what happens next), but users 
of metadata structures won’t set in-flight counters for those metadata 
structures, will they?

(I also wondered why the continue wasn’t put before the 
s->set_refcount() call, but I suppose the same effect is had by the 
has_infl_wr() check in alloc_clusters_noref() and 
qcow2_alloc_clusters_at().)

>   
>               table = qcow2_cache_is_table_offset(s->refcount_block_cache,
>                                                   offset);
> @@ -983,7 +1124,7 @@ retry:
>   
>           if (ret < 0) {
>               return ret;
> -        } else if (refcount != 0) {
> +        } else if (refcount != 0 || has_infl_wr(s, next_cluster_index)) {
>               goto retry;
>           }
>       }
> @@ -1046,7 +1187,7 @@ int64_t qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset,
>               ret = qcow2_get_refcount(bs, cluster_index++, &refcount);
>               if (ret < 0) {
>                   return ret;
> -            } else if (refcount != 0) {
> +            } else if (refcount != 0 || has_infl_wr(s, cluster_index)) {
>                   break;
>               }
>           }
> @@ -2294,7 +2435,9 @@ static int64_t alloc_clusters_imrt(BlockDriverState *bs,
>            contiguous_free_clusters < cluster_count;
>            cluster++)
>       {
> -        if (!s->get_refcount(*refcount_table, cluster)) {
> +        if (!s->get_refcount(*refcount_table, cluster) &&
> +            !has_infl_wr(s, cluster))

Is this really necessary?

> +        {
>               contiguous_free_clusters++;
>               if (first_gap) {
>                   /* If this is the first free cluster found, update
> diff --git a/block/qcow2.c b/block/qcow2.c
> index d9f49a52e7..6ee94421e0 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c

[...]

> @@ -4536,13 +4553,20 @@ qcow2_co_pwritev_compressed_task(BlockDriverState *bs,
>       }
>   
>       ret = qcow2_pre_write_overlap_check(bs, 0, cluster_offset, out_len, true);
> -    qemu_co_mutex_unlock(&s->lock);
>       if (ret < 0) {
> +        qemu_co_mutex_unlock(&s->lock);
>           goto fail;
>       }
>   
> +    qcow2_inflight_writes_inc(bs, cluster_offset, out_len);

It was my impression that this function could be called either with or 
without a lock, that it wouldn’t really matter.  Well, at least that’s 
what I figured out for lack of a comment regarding the contract how it 
is to be invoked.

Max

> +
> +    qemu_co_mutex_unlock(&s->lock);
> +
>       BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
>       ret = bdrv_co_pwrite(s->data_file, cluster_offset, out_len, out_buf, 0);
> +
> +    qcow2_inflight_writes_dec(bs, cluster_offset, out_len);
> +
>       if (ret < 0) {
>           goto fail;
>       }
Vladimir Sementsov-Ogievskiy March 12, 2021, 9:09 a.m. UTC | #2
11.03.2021 22:58, Max Reitz wrote:
> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>> There is a bug in qcow2: host cluster can be discarded (refcount
>> becomes 0) and reused during data write. In this case data write may
>> pollute another cluster (recently allocated) or even metadata.
> 
> I was about to ask whether we couldn’t somehow let discard requests check overlapping I/O with the tracked_request infrastructure from block/io.c (or perhaps even just let them be serializing).
> 
> But I suppose there may be other reasons for clusters being discarded other than explicit discard requests, so we have to have something in qcow2...

For example, host cluster which contain exactly one compressed cluster may be discarded when corresponding guest cluster is rewritten by non-compressed write.

Also, (not very good argument for fixing existing bug but still) serializing will not help with compressed-cache, because with compressed cache we'll have data writes (cache flushes) not related to current in-flight request. And that leads us to something like serializing of operations with .file child of qcow2 node. But we can't do it, as "discarding" a host cluster is operation with metadata, which may do no operation with underlying .file child.

> 
>> To fix the issue let's track inflight writes to host cluster in the
>> hash table and consider new counter when discarding and reusing host
>> clusters.
> 
> This didn’t really explain anything that would help me understand what’s going on in this patch.  The patch itself doesn’t really help me with comments either.  It’s possible that I’m just daft, but honestly I have a hard time really understanding what’s supposed to be going on.

Sorry for this. I believe into your experience of understanding my patches, so that shows that something is wrong with the patch itself)

> 
> Coming back from jumping all over this patch, I also wonder why this isn’t split in multiple patches, where the first introduces the infrastructure and explains exactly what’s going on, and the next patches make use of it so I could understand what each check etc. is supposed to represent and do.

OK

> 
> 
> To serve as an example, after reading through the patch, I still have no idea how it prevents that discard problem you’re trying to fix.  Maybe I’m lazy, but I would have liked exactly that to be explained by this commit message.  Actually, I don’t even understand the problem just from this commit message alone, but I could resort to HEAD^ and some thinking.  Not sure if that’s ideal, though.

The problem:

1. Start write to qcow2. Assume guest cluster G and corresponding host cluster is H.

2. The write requests come to the point of data writing to .file. The write to .file is started and qcow2 mutex is unlocked.

3. At this time refcount of H becomes 0. For example, it may be due to discard operation on qcow2 node, or rewriting compressed data by normal write, or some operation with snapshots..

4. Next, some operations occurs and leads to allocation of H for some other needs: it may be another write-to-qcow2-node operation, or allocation of L2 table or some other data or metadata cluster allocation.

5. So, at this point H is used for something other. Assume, L2 table is written into H.

6. And now, our write from [2] finishes. And pollutes L2 table in H. That's a bug.


The problem becomes more possible with compressed write cache, because any flush operation may trigger deferred data writes, so problem can be reproduced like:

1. Write to guest cluster G2, which triggers flushing of data to host cluster H, which is unrelated to G2.

2. Discard cluster G in parallel

[...]

So, problem is simply triggered by parallel write and discard to different guest clusters. That's why I should do something with this old (and most probably never triggered) problem in qcow2 driver prior to implementing compressed cache.


Hmm, probably we can avoid all these difficulties by implementing compressed cache as a filter driver between qcow2 and its .file nodes.. I tried to do it, encountered some problems, but I don't remember what exactly. Probably I should return to this try. Some obvious problems:

1. Metadata write will go through cache, and we should handle it somehow to not break the sequence of unaligned chunks. Simplest way is to add BDRV_REQ_COLORED flag and do compressed data writes to .file with this flag. And cache filter would track this flag for sequential caching.

2. We can't enable caching by default this way, so user will have to explicitly add a filter.. And actually, it's a real problem of qcow2 driver that in O_DIRECT mode it writes compressed data by small unaligned chunks. So, good to fix it inside qcow2..

With filter we avoid the problem of parallel reuse of host cluster, as all writes go through the cache and flushes will be serialized. So, the problem in qcow2 driver is not enlarged by the cache and we can leave it unfixed..

> 
> So the commit message says that “host cluster can be discarded and reused during data write”, which to me just seems like the exact behavior we want.  The second sentence describes a problem, but doesn’t say how reclaiming discarded clusters leads there.
> 
> I suppose what leads to the problem is that there first needs to be a background write even before the discard that is settled only after the re-claiming write (which isn’t mentioned).
> 
> As far as I understand, this patch addresses that problem by preventing the discarded clusters from being allocated while said background write is not settled yet.  This is done by keeping track of all host clusters that are currently the target of some write operation, and preventing them from being allocated.  OK, makes sense.
> (This latter part does roughly correspond to the second paragraph of this commit message, but I couldn’t make sense of it without understanding why we’d do it.  What’s notably missing from my explanation is the part that you hinted at with “when discarding”, but my problem is that that I just don’t understand what that means at all.  Perhaps it has to do with how I don’t understand the change to update_refcount(), and consequently the whole “slow path” in update_inflight_write_cnt().)


Let me explain:

What is the problem? User uses dynamic object, and the object disappear during the usage. Use-after-free follows. This means that we probably want reference counting for the object, so that user will keep the reference during usage, and object is freed when refcount becomes 0. But for qcow2 host clusters we already have reference counting! Still, we don't want to simply reuse qcow2 refcounts as normal reference counters for two reasons:

1. qcow2 refcounts is part of qcow2 metadata and strictly defined. Qcow2 refcounts shows only usage of the host cluster by other objects in qcow2 metadata. But what we want is usage by the process of writing data.
2. we don't want to do additional read/write of refcounts from/to physical drive

So, what comes to my mind is and additional "dynamic" in-RAM reference counter for host cluster. So that actual full reference count of host cluster is qcow2-refcount + inflight-write-cnt. So, what we need for this:

1. calculate these new inflight-write-cnt, increase it before data write and decrease after.

2. update the core thing of reference counting: freeing the object. Currently the object is "freed" when qcow2-refcoutn becomes 0, that's the code in update_recounts(). But we want to postpone freeing until full reference count becomes zero, i.e. both qcow2-refcount is zero and inflight-write-cnt is zero. So, if qcow2-refcount becomes zero but inflight-write-cnt is nonzero, we postpone discarding the cluster, storing needed information in Qcow2InFlightRefcount. And when inflight-write-cnt becomes zero (and we see this information) we call update_refcount(addend=0) to trigger actual discarding of the cluster.

3. we should consider full reference count when search for free host clusters.

> 
> 
> Side note: Intuitively, this hash table feels like an unnecessarily big hammer to me, but I can’t think of a better solution either, so.  (I mainly don’t like how every write request will result in one allocation and hash table entry per cluster it touches, even when no data cluster is ever discarded.)

We can have a list of inflight-data-writes instead. I afraid, we can't reuse tracked requests of underlying .file node, because we should add inflight-write to the list before releasing the qcow2 mutex.

> 
>> Enable qcow2-discard-during-rewrite as it is fixed.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   block/qcow2.h                                 |   9 ++
>>   block/qcow2-refcount.c                        | 149 +++++++++++++++++-
>>   block/qcow2.c                                 |  26 ++-
>>   .../tests/qcow2-discard-during-rewrite        |   2 +-
>>   4 files changed, 181 insertions(+), 5 deletions(-)
>>
>> diff --git a/block/qcow2.h b/block/qcow2.h
>> index 0678073b74..e18d8dfbae 100644
>> --- a/block/qcow2.h
>> +++ b/block/qcow2.h
>> @@ -420,6 +420,8 @@ typedef struct BDRVQcow2State {
>>        * is to convert the image with the desired compression type set.
>>        */
>>       Qcow2CompressionType compression_type;
>> +
>> +    GHashTable *inflight_writes_counters;
>>   } BDRVQcow2State;
>>   typedef struct Qcow2COWRegion {
>> @@ -896,6 +898,13 @@ int qcow2_shrink_reftable(BlockDriverState *bs);
>>   int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size);
>>   int qcow2_detect_metadata_preallocation(BlockDriverState *bs);
>> +int qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
>> +                              int64_t length);
>> +int qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
>> +                              int64_t length);
>> +int qcow2_inflight_writes_dec_locked(BlockDriverState *bs, int64_t offset,
>> +                                     int64_t length);
>> +
>>   /* qcow2-cluster.c functions */
>>   int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
>>                           bool exact_size);
>> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
>> index 8e649b008e..464d133368 100644
>> --- a/block/qcow2-refcount.c
>> +++ b/block/qcow2-refcount.c
>> @@ -799,6 +799,140 @@ found:
>>       }
>>   }
>> +/*
>> + * Qcow2InFlightRefcount is a type for values of s->inflight_writes_counters
>> + * hasm map. And it's keys are cluster indices.
> 
> *hash, *its
> 
> I’d rather document this for s->inflight_writes_counters, though (like “/* cluster index (int64_t) -> Qcow2InFlightRefcount */”)
> 
>> + */
>> +typedef struct Qcow2InFlightRefcount {
>> +    /*
>> +     * Number of in-flight writes to the cluster, always > 0, as when becomes
> 
> s/when becomes/when it becomes/
> 
>> +     * 0 the entry is removed from s->inflight_writes_counters.
>> +     */
>> +    uint64_t inflight_writes_cnt;
>> +
>> +    /* Cluster refcount is known to be zero */
>> +    bool refcount_zero;
>> +
>> +    /* Cluster refcount was made zero with this discard-type */
>> +    enum qcow2_discard_type type;
>> +} Qcow2InFlightRefcount;
>> +
>> +static Qcow2InFlightRefcount *find_infl_wr(BDRVQcow2State *s,
>> +                                           int64_t cluster_index)
>> +{
>> +    Qcow2InFlightRefcount *infl;
>> +
>> +    if (!s->inflight_writes_counters) {
>> +        return NULL;
>> +    }
>> +
>> +    infl = g_hash_table_lookup(s->inflight_writes_counters, &cluster_index);
>> +
>> +    if (infl) {
>> +        assert(infl->inflight_writes_cnt > 0);
>> +    }
>> +
>> +    return infl;
>> +}
>> +
>> +/*
>> + * Returns true if there are any in-flight writes to the cluster blocking
>> + * its reallocation.
>> + */
>> +static bool has_infl_wr(BDRVQcow2State *s, int64_t cluster_index)
>> +{
>> +    return !!find_infl_wr(s, cluster_index);
> 
> I wonder if g_hash_table_contains() would be quicker. *shrug*
> 
>> +}
>> +
>> +static int update_inflight_write_cnt(BlockDriverState *bs,
>> +                                     int64_t offset, int64_t length,
>> +                                     bool decrease, bool locked)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    int64_t start, last, cluster_offset;
>> +
>> +    if (locked) {
>> +        qemu_co_mutex_assert_locked(&s->lock);
>> +    }
>> +
>> +    start = start_of_cluster(s, offset);
>> +    last = start_of_cluster(s, offset + length - 1);
>> +    for (cluster_offset = start; cluster_offset <= last;
>> +         cluster_offset += s->cluster_size)
>> +    {
>> +        int64_t cluster_index = cluster_offset >> s->cluster_bits;
>> +        Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>> +
>> +        if (!infl) {
>> +            infl = g_new0(Qcow2InFlightRefcount, 1);
>> +            g_hash_table_insert(s->inflight_writes_counters,
>> +                                g_memdup(&cluster_index, sizeof(cluster_index)),
>> +                                infl);
> 
> I suppose we could save one allocation by putting the cluster index into Qcow2InFlightRefcount and then giving the key as &infl->cluster_index. Not sure if that’s too nasty, though.

Allocations are a lot faster than IO.. So, I don't think we'll see the difference in normal cases.

> 
>> +        }
>> +
>> +        if (decrease) {
>> +            assert(infl->inflight_writes_cnt >= 1);
>> +            infl->inflight_writes_cnt--;
>> +        } else {
>> +            infl->inflight_writes_cnt++;
>> +        }
>> +
>> +        if (infl->inflight_writes_cnt == 0) {
>> +            bool refcount_zero = infl->refcount_zero;
>> +            enum qcow2_discard_type type = infl->type;
>> +
>> +            g_hash_table_remove(s->inflight_writes_counters, &cluster_index);
>> +
>> +            if (refcount_zero) {
>> +                /*
>> +                 * Slow path. We must reset normal refcount to actually release
> 
> I don’t understand what “slow path” means in this context.  (Nor do I really understand the rest, but more on that below.)


In most cases inc/dev of inflight-writes-cnt is fast: it only update in-memory structure. But when full reference count of the cluster becomes zero we call real update_refcount to trigger discarding of the cluster.. This may be slower. Probably the needed code should be moved from update_refcount to separate function like host_cluster_free(), to not cheat with addend=0.


> 
>> +                 * the cluster.
>> +                 */
>> +                int ret;
>> +
>> +                if (!locked) {
>> +                    qemu_co_mutex_lock(&s->lock);
>> +                }
>> +                ret = qcow2_update_cluster_refcount(bs, cluster_index, 0,
>> +                                                    true, type);
> 
> I don’t understand this, but then again I’m writing this very early in my review, and at this point I don’t understand anything, really.  (The comment doesn’t explain to me what’s happening here.)
> 
> Revisiting later, refcount_zero is set by update_refcount() when the refcount drops to zero, so invoking this function here will finalize the change.  I don’t quite understand what’s going on in update_refcount(), though.
> 
> (And even after finding out why, I don’t understand why the comment doesn’t explain why we must invoke qcow2_update_cluster_refcount() with addend=0.)

Sorry for the mess. Here we want to trigger the code in update_refcount() that is freeing host cluster.. I.e. the code that runs when refcount becomes zero.

> 
>> +                if (!locked) {
>> +                    qemu_co_mutex_unlock(&s->lock);
>> +                }
>> +
>> +                if (ret < 0) {
>> +                    return ret;
>> +                }
>> +            }
>> +        }
>> +
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +int qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
>> +                              int64_t length)
>> +{
>> +    return update_inflight_write_cnt(bs, offset, length, false, false);
>> +}
>> +
>> +/*
>> + * Called with s->lock not locked by caller. Will take s->lock only if need to
>> + * release the cluster (refcount is 0 and inflight-write-cnt becomes zero).
>> + */
> 
> Why doesn’t qcow2_inflight_writes_inc() get the same comment?
> 
> ...Oh, because @locked doesn’t have an influence there.  Why isn’t it worth a comment that *_inc() works both with a locked and an unlocked mutex?

OK, it worth

> 
>> +int qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
>> +                              int64_t length)
>> +{
>> +    return update_inflight_write_cnt(bs, offset, length, true, false);
>> +}
>> +
>> +/* Called with s->lock locked. */
>> +int qcow2_inflight_writes_dec_locked(BlockDriverState *bs, int64_t offset,
>> +                                     int64_t length)
>> +{
>> +    return update_inflight_write_cnt(bs, offset, length, true, true);
>> +}
>> +
>>   /* XXX: cache several refcount block clusters ? */
>>   /* @addend is the absolute value of the addend; if @decrease is set, @addend
>>    * will be subtracted from the current refcount, otherwise it will be added */
>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>           if (refcount == 0) {
>>               void *table;
>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>> +
>> +            if (infl) {
>> +                infl->refcount_zero = true;
>> +                infl->type = type;
>> +                continue;
>> +            }
> 
> I don’t understand what this is supposed to do exactly.  It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they?

Don't follow.

We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt().

> 
> (I also wondered why the continue wasn’t put before the s->set_refcount() call, but I suppose the same effect is had by the has_infl_wr() check in alloc_clusters_noref() and qcow2_alloc_clusters_at().)

The only change of update_refcount() logic is postponing of freeing the cluster. So, handling of qcow2-refacount is kept as is. qcow2-refcount becomes zero, and s->set_refcount() is called. inflight-write-cnt is a separate thing, not the cache for qcow2-refcount.

> 
>>               table = qcow2_cache_is_table_offset(s->refcount_block_cache,
>>                                                   offset);
>> @@ -983,7 +1124,7 @@ retry:
>>           if (ret < 0) {
>>               return ret;
>> -        } else if (refcount != 0) {
>> +        } else if (refcount != 0 || has_infl_wr(s, next_cluster_index)) {
>>               goto retry;
>>           }
>>       }
>> @@ -1046,7 +1187,7 @@ int64_t qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset,
>>               ret = qcow2_get_refcount(bs, cluster_index++, &refcount);
>>               if (ret < 0) {
>>                   return ret;
>> -            } else if (refcount != 0) {
>> +            } else if (refcount != 0 || has_infl_wr(s, cluster_index)) {
>>                   break;
>>               }
>>           }
>> @@ -2294,7 +2435,9 @@ static int64_t alloc_clusters_imrt(BlockDriverState *bs,
>>            contiguous_free_clusters < cluster_count;
>>            cluster++)
>>       {
>> -        if (!s->get_refcount(*refcount_table, cluster)) {
>> +        if (!s->get_refcount(*refcount_table, cluster) &&
>> +            !has_infl_wr(s, cluster))
> 
> Is this really necessary?

Yes. Everywhere when we want free cluster, we should check that full reference count is zero, which is qcow2-refcount + inflight-write-cnt.. Which is equal to check that both qcow2-refcount and inflight-write-cnt are zero. And for for zero inflight-write-cnt it must just be absent in the hash.

> 
>> +        {
>>               contiguous_free_clusters++;
>>               if (first_gap) {
>>                   /* If this is the first free cluster found, update
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index d9f49a52e7..6ee94421e0 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
> 
> [...]
> 
>> @@ -4536,13 +4553,20 @@ qcow2_co_pwritev_compressed_task(BlockDriverState *bs,
>>       }
>>       ret = qcow2_pre_write_overlap_check(bs, 0, cluster_offset, out_len, true);
>> -    qemu_co_mutex_unlock(&s->lock);
>>       if (ret < 0) {
>> +        qemu_co_mutex_unlock(&s->lock);
>>           goto fail;
>>       }
>> +    qcow2_inflight_writes_inc(bs, cluster_offset, out_len);
> 
> It was my impression that this function could be called either with or without a lock, that it wouldn’t really matter.  Well, at least that’s what I figured out for lack of a comment regarding the contract how it is to be invoked.
> 

I'm sorry:(
Max Reitz March 12, 2021, 11:17 a.m. UTC | #3
On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
> 11.03.2021 22:58, Max Reitz wrote:
>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>> There is a bug in qcow2: host cluster can be discarded (refcount
>>> becomes 0) and reused during data write. In this case data write may
>>> pollute another cluster (recently allocated) or even metadata.
>>
>> I was about to ask whether we couldn’t somehow let discard requests 
>> check overlapping I/O with the tracked_request infrastructure from 
>> block/io.c (or perhaps even just let them be serializing).
>>
>> But I suppose there may be other reasons for clusters being discarded 
>> other than explicit discard requests, so we have to have something in 
>> qcow2...
> 
> For example, host cluster which contain exactly one compressed cluster 
> may be discarded when corresponding guest cluster is rewritten by 
> non-compressed write.
> 
> Also, (not very good argument for fixing existing bug but still) 
> serializing will not help with compressed-cache, because with compressed 
> cache we'll have data writes (cache flushes) not related to current 
> in-flight request. And that leads us to something like serializing of 
> operations with .file child of qcow2 node. But we can't do it, as 
> "discarding" a host cluster is operation with metadata, which may do no 
> operation with underlying .file child.

Yes, makes sense.  I think it is a good argument, because the point is 
that qcow2 host cluster discards are not necessarily related to guest 
discards.  The compressed cache is just one new example for that.

>>> To fix the issue let's track inflight writes to host cluster in the
>>> hash table and consider new counter when discarding and reusing host
>>> clusters.
>>
>> This didn’t really explain anything that would help me understand 
>> what’s going on in this patch.  The patch itself doesn’t really help 
>> me with comments either.  It’s possible that I’m just daft, but 
>> honestly I have a hard time really understanding what’s supposed to be 
>> going on.
> 
> Sorry for this. I believe into your experience of understanding my 
> patches, so that shows that something is wrong with the patch itself)
> 
>>
>> Coming back from jumping all over this patch, I also wonder why this 
>> isn’t split in multiple patches, where the first introduces the 
>> infrastructure and explains exactly what’s going on, and the next 
>> patches make use of it so I could understand what each check etc. is 
>> supposed to represent and do.
> 
> OK
> 
>>
>>
>> To serve as an example, after reading through the patch, I still have 
>> no idea how it prevents that discard problem you’re trying to fix.  
>> Maybe I’m lazy, but I would have liked exactly that to be explained by 
>> this commit message.  Actually, I don’t even understand the problem 
>> just from this commit message alone, but I could resort to HEAD^ and 
>> some thinking.  Not sure if that’s ideal, though.
> 
> The problem:
> 
> 1. Start write to qcow2. Assume guest cluster G and corresponding host 
> cluster is H.
> 
> 2. The write requests come to the point of data writing to .file. The 
> write to .file is started and qcow2 mutex is unlocked.
> 
> 3. At this time refcount of H becomes 0. For example, it may be due to 
> discard operation on qcow2 node, or rewriting compressed data by normal 
> write, or some operation with snapshots..
> 
> 4. Next, some operations occurs and leads to allocation of H for some 
> other needs: it may be another write-to-qcow2-node operation, or 
> allocation of L2 table or some other data or metadata cluster allocation.
> 
> 5. So, at this point H is used for something other. Assume, L2 table is 
> written into H.
> 
> 6. And now, our write from [2] finishes. And pollutes L2 table in H. 
> That's a bug.

Understood.

> The problem becomes more possible with compressed write cache, because 
> any flush operation may trigger deferred data writes, so problem can be 
> reproduced like:
> 
> 1. Write to guest cluster G2, which triggers flushing of data to host 
> cluster H, which is unrelated to G2.
> 
> 2. Discard cluster G in parallel
> 
> [...]
> 
> So, problem is simply triggered by parallel write and discard to 
> different guest clusters. That's why I should do something with this old 
> (and most probably never triggered) problem in qcow2 driver prior to 
> implementing compressed cache.

OK!  That makes sense.

> Hmm, probably we can avoid all these difficulties by implementing 
> compressed cache as a filter driver between qcow2 and its .file nodes..

As I see it, you’ve now opened the can of worms, and I’m not sure we can 
just close it and say we haven’t seen any problem. :)

So, regardless of the compressed cache, I think this should be fixed.

> I tried to do it, encountered some problems, but I don't remember what 
> exactly. Probably I should return to this try. Some obvious problems:
> 
> 1. Metadata write will go through cache, and we should handle it somehow 
> to not break the sequence of unaligned chunks. Simplest way is to add 
> BDRV_REQ_COLORED flag and do compressed data writes to .file with this 
> flag. And cache filter would track this flag for sequential caching.
> 
> 2. We can't enable caching by default this way, so user will have to 
> explicitly add a filter.. And actually, it's a real problem of qcow2 
> driver that in O_DIRECT mode it writes compressed data by small 
> unaligned chunks. So, good to fix it inside qcow2..

Agreed.

> With filter we avoid the problem of parallel reuse of host cluster, as 
> all writes go through the cache and flushes will be serialized. So, the 
> problem in qcow2 driver is not enlarged by the cache and we can leave it 
> unfixed..

Not sure I agree. ;)

>> So the commit message says that “host cluster can be discarded and 
>> reused during data write”, which to me just seems like the exact 
>> behavior we want.  The second sentence describes a problem, but 
>> doesn’t say how reclaiming discarded clusters leads there.
>>
>> I suppose what leads to the problem is that there first needs to be a 
>> background write even before the discard that is settled only after 
>> the re-claiming write (which isn’t mentioned).
>>
>> As far as I understand, this patch addresses that problem by 
>> preventing the discarded clusters from being allocated while said 
>> background write is not settled yet.  This is done by keeping track of 
>> all host clusters that are currently the target of some write 
>> operation, and preventing them from being allocated.  OK, makes sense.
>> (This latter part does roughly correspond to the second paragraph of 
>> this commit message, but I couldn’t make sense of it without 
>> understanding why we’d do it.  What’s notably missing from my 
>> explanation is the part that you hinted at with “when discarding”, but 
>> my problem is that that I just don’t understand what that means at 
>> all.  Perhaps it has to do with how I don’t understand the change to 
>> update_refcount(), and consequently the whole “slow path” in 
>> update_inflight_write_cnt().)
> 
> 
> Let me explain:
> 
> What is the problem? User uses dynamic object, and the object disappear 
> during the usage. Use-after-free follows. This means that we probably 
> want reference counting for the object, so that user will keep the 
> reference during usage, and object is freed when refcount becomes 0. But 
> for qcow2 host clusters we already have reference counting!

Makes sense.

> Still, we 
> don't want to simply reuse qcow2 refcounts as normal reference counters 
> for two reasons:
> 
> 1. qcow2 refcounts is part of qcow2 metadata and strictly defined. Qcow2 
> refcounts shows only usage of the host cluster by other objects in qcow2 
> metadata. But what we want is usage by the process of writing data.

Indeed.

> 2. we don't want to do additional read/write of refcounts from/to 
> physical drive

Right.

> So, what comes to my mind is and additional "dynamic" in-RAM reference 
> counter for host cluster. So that actual full reference count of host 
> cluster is qcow2-refcount + inflight-write-cnt.

OK, yes, that makes a lot of sense.

> So, what we need for this:
> 
> 1. calculate these new inflight-write-cnt, increase it before data write 
> and decrease after.
> 
> 2. update the core thing of reference counting: freeing the object.

Makes sense, but I don’t see how your change to update_refcount() does 
that, though.  But OTOH I do see how you prevent use-after-frees (by 
preventing allocations  of clusters with refcount == 0 but with active 
writes on them), which works just as well.

And if freeing the object (clusters) were postponed until all writes are 
settled, I don’t think we’d need the checks in the cluster allocation 
functions.

> Currently the object is "freed" when qcow2-refcoutn becomes 0, that's 
> the code in update_recounts(). But we want to postpone freeing until 
> full reference count becomes zero, i.e. both qcow2-refcount is zero and 
> inflight-write-cnt is zero. So, if qcow2-refcount becomes zero but 
> inflight-write-cnt is nonzero, we postpone discarding the cluster, 
> storing needed information in Qcow2InFlightRefcount. And when 
> inflight-write-cnt becomes zero (and we see this information) we call 
> update_refcount(addend=0) to trigger actual discarding of the cluster.
> 
> 3. we should consider full reference count when search for free host 
> clusters.
> 
>>
>>
>> Side note: Intuitively, this hash table feels like an unnecessarily 
>> big hammer to me, but I can’t think of a better solution either, so.  
>> (I mainly don’t like how every write request will result in one 
>> allocation and hash table entry per cluster it touches, even when no 
>> data cluster is ever discarded.)
> 
> We can have a list of inflight-data-writes instead.

Yes.  I’m not sure which would actually be better in the end.

I feel such a list would be better for the normal case (just writes, no 
allocations or discards), but I don’t know how much worse they’d be when 
something needs to be looked up (i.e. on allocations or discards).

> I afraid, we can't 
> reuse tracked requests of underlying .file node, because we should add 
> inflight-write to the list before releasing the qcow2 mutex.

Yes.

>>> Enable qcow2-discard-during-rewrite as it is fixed.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>   block/qcow2.h                                 |   9 ++
>>>   block/qcow2-refcount.c                        | 149 +++++++++++++++++-
>>>   block/qcow2.c                                 |  26 ++-
>>>   .../tests/qcow2-discard-during-rewrite        |   2 +-
>>>   4 files changed, 181 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/block/qcow2.h b/block/qcow2.h
>>> index 0678073b74..e18d8dfbae 100644
>>> --- a/block/qcow2.h
>>> +++ b/block/qcow2.h
>>> @@ -420,6 +420,8 @@ typedef struct BDRVQcow2State {
>>>        * is to convert the image with the desired compression type set.
>>>        */
>>>       Qcow2CompressionType compression_type;
>>> +
>>> +    GHashTable *inflight_writes_counters;
>>>   } BDRVQcow2State;
>>>   typedef struct Qcow2COWRegion {
>>> @@ -896,6 +898,13 @@ int qcow2_shrink_reftable(BlockDriverState *bs);
>>>   int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size);
>>>   int qcow2_detect_metadata_preallocation(BlockDriverState *bs);
>>> +int qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
>>> +                              int64_t length);
>>> +int qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
>>> +                              int64_t length);
>>> +int qcow2_inflight_writes_dec_locked(BlockDriverState *bs, int64_t 
>>> offset,
>>> +                                     int64_t length);
>>> +
>>>   /* qcow2-cluster.c functions */
>>>   int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
>>>                           bool exact_size);
>>> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
>>> index 8e649b008e..464d133368 100644
>>> --- a/block/qcow2-refcount.c
>>> +++ b/block/qcow2-refcount.c
>>> @@ -799,6 +799,140 @@ found:
>>>       }
>>>   }
>>> +/*
>>> + * Qcow2InFlightRefcount is a type for values of 
>>> s->inflight_writes_counters
>>> + * hasm map. And it's keys are cluster indices.
>>
>> *hash, *its
>>
>> I’d rather document this for s->inflight_writes_counters, though (like 
>> “/* cluster index (int64_t) -> Qcow2InFlightRefcount */”)
>>
>>> + */
>>> +typedef struct Qcow2InFlightRefcount {
>>> +    /*
>>> +     * Number of in-flight writes to the cluster, always > 0, as 
>>> when becomes
>>
>> s/when becomes/when it becomes/
>>
>>> +     * 0 the entry is removed from s->inflight_writes_counters.
>>> +     */
>>> +    uint64_t inflight_writes_cnt;
>>> +
>>> +    /* Cluster refcount is known to be zero */
>>> +    bool refcount_zero;
>>> +
>>> +    /* Cluster refcount was made zero with this discard-type */
>>> +    enum qcow2_discard_type type;
>>> +} Qcow2InFlightRefcount;
>>> +
>>> +static Qcow2InFlightRefcount *find_infl_wr(BDRVQcow2State *s,
>>> +                                           int64_t cluster_index)
>>> +{
>>> +    Qcow2InFlightRefcount *infl;
>>> +
>>> +    if (!s->inflight_writes_counters) {
>>> +        return NULL;
>>> +    }
>>> +
>>> +    infl = g_hash_table_lookup(s->inflight_writes_counters, 
>>> &cluster_index);
>>> +
>>> +    if (infl) {
>>> +        assert(infl->inflight_writes_cnt > 0);
>>> +    }
>>> +
>>> +    return infl;
>>> +}
>>> +
>>> +/*
>>> + * Returns true if there are any in-flight writes to the cluster 
>>> blocking
>>> + * its reallocation.
>>> + */
>>> +static bool has_infl_wr(BDRVQcow2State *s, int64_t cluster_index)
>>> +{
>>> +    return !!find_infl_wr(s, cluster_index);
>>
>> I wonder if g_hash_table_contains() would be quicker. *shrug*
>>
>>> +}
>>> +
>>> +static int update_inflight_write_cnt(BlockDriverState *bs,
>>> +                                     int64_t offset, int64_t length,
>>> +                                     bool decrease, bool locked)
>>> +{
>>> +    BDRVQcow2State *s = bs->opaque;
>>> +    int64_t start, last, cluster_offset;
>>> +
>>> +    if (locked) {
>>> +        qemu_co_mutex_assert_locked(&s->lock);
>>> +    }
>>> +
>>> +    start = start_of_cluster(s, offset);
>>> +    last = start_of_cluster(s, offset + length - 1);
>>> +    for (cluster_offset = start; cluster_offset <= last;
>>> +         cluster_offset += s->cluster_size)
>>> +    {
>>> +        int64_t cluster_index = cluster_offset >> s->cluster_bits;
>>> +        Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>>> +
>>> +        if (!infl) {
>>> +            infl = g_new0(Qcow2InFlightRefcount, 1);
>>> +            g_hash_table_insert(s->inflight_writes_counters,
>>> +                                g_memdup(&cluster_index, 
>>> sizeof(cluster_index)),
>>> +                                infl);
>>
>> I suppose we could save one allocation by putting the cluster index 
>> into Qcow2InFlightRefcount and then giving the key as 
>> &infl->cluster_index. Not sure if that’s too nasty, though.
> 
> Allocations are a lot faster than IO.. So, I don't think we'll see the 
> difference in normal cases.

Sure.

>>> +        }
>>> +
>>> +        if (decrease) {
>>> +            assert(infl->inflight_writes_cnt >= 1);
>>> +            infl->inflight_writes_cnt--;
>>> +        } else {
>>> +            infl->inflight_writes_cnt++;
>>> +        }
>>> +
>>> +        if (infl->inflight_writes_cnt == 0) {
>>> +            bool refcount_zero = infl->refcount_zero;
>>> +            enum qcow2_discard_type type = infl->type;
>>> +
>>> +            g_hash_table_remove(s->inflight_writes_counters, 
>>> &cluster_index);
>>> +
>>> +            if (refcount_zero) {
>>> +                /*
>>> +                 * Slow path. We must reset normal refcount to 
>>> actually release
>>
>> I don’t understand what “slow path” means in this context.  (Nor do I 
>> really understand the rest, but more on that below.)
> 
> 
> In most cases inc/dev of inflight-writes-cnt is fast: it only update 
> in-memory structure. But when full reference count of the cluster 
> becomes zero we call real update_refcount to trigger discarding of the 
> cluster.. This may be slower. Probably the needed code should be moved 
> from update_refcount to separate function like host_cluster_free(), to 
> not cheat with addend=0.

OK.

>>> +                 * the cluster.
>>> +                 */
>>> +                int ret;
>>> +
>>> +                if (!locked) {
>>> +                    qemu_co_mutex_lock(&s->lock);
>>> +                }
>>> +                ret = qcow2_update_cluster_refcount(bs, 
>>> cluster_index, 0,
>>> +                                                    true, type);
>>
>> I don’t understand this, but then again I’m writing this very early in 
>> my review, and at this point I don’t understand anything, really.  
>> (The comment doesn’t explain to me what’s happening here.)
>>
>> Revisiting later, refcount_zero is set by update_refcount() when the 
>> refcount drops to zero, so invoking this function here will finalize 
>> the change.  I don’t quite understand what’s going on in 
>> update_refcount(), though.
>>
>> (And even after finding out why, I don’t understand why the comment 
>> doesn’t explain why we must invoke qcow2_update_cluster_refcount() 
>> with addend=0.)
> 
> Sorry for the mess. Here we want to trigger the code in 
> update_refcount() that is freeing host cluster.. I.e. the code that runs 
> when refcount becomes zero.
> 
>>
>>> +                if (!locked) {
>>> +                    qemu_co_mutex_unlock(&s->lock);
>>> +                }
>>> +
>>> +                if (ret < 0) {
>>> +                    return ret;
>>> +                }
>>> +            }
>>> +        }
>>> +
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +int qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
>>> +                              int64_t length)
>>> +{
>>> +    return update_inflight_write_cnt(bs, offset, length, false, false);
>>> +}
>>> +
>>> +/*
>>> + * Called with s->lock not locked by caller. Will take s->lock only 
>>> if need to
>>> + * release the cluster (refcount is 0 and inflight-write-cnt becomes 
>>> zero).
>>> + */
>>
>> Why doesn’t qcow2_inflight_writes_inc() get the same comment?
>>
>> ...Oh, because @locked doesn’t have an influence there.  Why isn’t it 
>> worth a comment that *_inc() works both with a locked and an unlocked 
>> mutex?
> 
> OK, it worth
> 
>>
>>> +int qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
>>> +                              int64_t length)
>>> +{
>>> +    return update_inflight_write_cnt(bs, offset, length, true, false);
>>> +}
>>> +
>>> +/* Called with s->lock locked. */
>>> +int qcow2_inflight_writes_dec_locked(BlockDriverState *bs, int64_t 
>>> offset,
>>> +                                     int64_t length)
>>> +{
>>> +    return update_inflight_write_cnt(bs, offset, length, true, true);
>>> +}
>>> +
>>>   /* XXX: cache several refcount block clusters ? */
>>>   /* @addend is the absolute value of the addend; if @decrease is 
>>> set, @addend
>>>    * will be subtracted from the current refcount, otherwise it will 
>>> be added */
>>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT 
>>> update_refcount(BlockDriverState *bs,
>>>           if (refcount == 0) {
>>>               void *table;
>>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, 
>>> cluster_index);
>>> +
>>> +            if (infl) {
>>> +                infl->refcount_zero = true;
>>> +                infl->type = type;
>>> +                continue;
>>> +            }
>>
>> I don’t understand what this is supposed to do exactly.  It seems like 
>> it wants to keep metadata structures in the cache that are still in 
>> use (because dropping them from the caches is what happens next), but 
>> users of metadata structures won’t set in-flight counters for those 
>> metadata structures, will they?
> 
> Don't follow.
> 
> We want the code in "if (refcount == 0)" to be triggered only when full 
> reference count of the host cluster becomes 0, including 
> inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we 
> postpone freeing the host cluster, it will be done later from "slow 
> path" in update_inflight_write_cnt().

But the code under “if (refcount == 0)” doesn’t free anything, does it? 
  All I can see is code to remove metadata structures from the metadata 
caches (if the discarded cluster was an L2 table or a refblock), and 
finally the discard on the underlying file.  I don’t see how that 
protocol-level discard has anything to do with our problem, though.

As far as I understand, the freeing happens immediately above the “if 
(refcount == 0)” block by s->set_refcount() setting the refcount to 0. 
(including updating s->free_cluster_index if the refcount is 0).

>> (I also wondered why the continue wasn’t put before the 
>> s->set_refcount() call, but I suppose the same effect is had by the 
>> has_infl_wr() check in alloc_clusters_noref() and 
>> qcow2_alloc_clusters_at().)
> 
> The only change of update_refcount() logic is postponing of freeing the 
> cluster. So, handling of qcow2-refacount is kept as is. qcow2-refcount 
> becomes zero, and s->set_refcount() is called. inflight-write-cnt is a 
> separate thing, not the cache for qcow2-refcount.
> 
>>
>>>               table = 
>>> qcow2_cache_is_table_offset(s->refcount_block_cache,
>>>                                                   offset);
>>> @@ -983,7 +1124,7 @@ retry:
>>>           if (ret < 0) {
>>>               return ret;
>>> -        } else if (refcount != 0) {
>>> +        } else if (refcount != 0 || has_infl_wr(s, 
>>> next_cluster_index)) {
>>>               goto retry;
>>>           }
>>>       }
>>> @@ -1046,7 +1187,7 @@ int64_t 
>>> qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset,
>>>               ret = qcow2_get_refcount(bs, cluster_index++, &refcount);
>>>               if (ret < 0) {
>>>                   return ret;
>>> -            } else if (refcount != 0) {
>>> +            } else if (refcount != 0 || has_infl_wr(s, 
>>> cluster_index)) {
>>>                   break;
>>>               }
>>>           }
>>> @@ -2294,7 +2435,9 @@ static int64_t 
>>> alloc_clusters_imrt(BlockDriverState *bs,
>>>            contiguous_free_clusters < cluster_count;
>>>            cluster++)
>>>       {
>>> -        if (!s->get_refcount(*refcount_table, cluster)) {
>>> +        if (!s->get_refcount(*refcount_table, cluster) &&
>>> +            !has_infl_wr(s, cluster))
>>
>> Is this really necessary?
> 
> Yes. Everywhere when we want free cluster, we should check that full 
> reference count is zero, which is qcow2-refcount + inflight-write-cnt.. 
> Which is equal to check that both qcow2-refcount and inflight-write-cnt 
> are zero. And for for zero inflight-write-cnt it must just be absent in 
> the hash.

I understand the sentiment, but I don’t think that reasoning applies 
here.  The in-memory refcount table (IMRT) used for qemu-img check will 
not be updated on discards anyway.  If there were any I/O to the image 
concurrently to check, it would completely ignore the IMRT, so the check 
feels hypocritical.

>>> +        {
>>>               contiguous_free_clusters++;
>>>               if (first_gap) {
>>>                   /* If this is the first free cluster found, update
>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>> index d9f49a52e7..6ee94421e0 100644
>>> --- a/block/qcow2.c
>>> +++ b/block/qcow2.c
>>
>> [...]
>>
>>> @@ -4536,13 +4553,20 @@ 
>>> qcow2_co_pwritev_compressed_task(BlockDriverState *bs,
>>>       }
>>>       ret = qcow2_pre_write_overlap_check(bs, 0, cluster_offset, 
>>> out_len, true);
>>> -    qemu_co_mutex_unlock(&s->lock);
>>>       if (ret < 0) {
>>> +        qemu_co_mutex_unlock(&s->lock);
>>>           goto fail;
>>>       }
>>> +    qcow2_inflight_writes_inc(bs, cluster_offset, out_len);
>>
>> It was my impression that this function could be called either with or 
>> without a lock, that it wouldn’t really matter.  Well, at least that’s 
>> what I figured out for lack of a comment regarding the contract how it 
>> is to be invoked.
>>
> 
> I'm sorry:(

Oh, geez, now I feel bad, too.  Thanks a lot for your explanations, it 
all makes much more sense to me now! :)

Max
Vladimir Sementsov-Ogievskiy March 12, 2021, 12:32 p.m. UTC | #4
12.03.2021 14:17, Max Reitz wrote:
> On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
>> 11.03.2021 22:58, Max Reitz wrote:
>>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>>> There is a bug in qcow2: host cluster can be discarded (refcount
>>>> becomes 0) and reused during data write. In this case data write may

[..]

>>>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>>>           if (refcount == 0) {
>>>>               void *table;
>>>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>>>> +
>>>> +            if (infl) {
>>>> +                infl->refcount_zero = true;
>>>> +                infl->type = type;
>>>> +                continue;
>>>> +            }
>>>
>>> I don’t understand what this is supposed to do exactly.  It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they?
>>
>> Don't follow.
>>
>> We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt().
> 
> But the code under “if (refcount == 0)” doesn’t free anything, does it?  All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file.  I don’t see how that protocol-level discard has anything to do with our problem, though.

Hmm. Still, if we do this discard, and then our in-flight write, we'll have data instead of a hole. Not a big deal, but seems better to postpone discard.

On the other hand, clearing caches is OK, as its related only to qcow2-refcount, not to inflight-write-cnt

> 
> As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0. (including updating s->free_cluster_index if the refcount is 0).

Hmm.. And that (setting s->free_cluster_index) what I should actually prevent until total reference count becomes zero.

And about s->set_refcount(): it only update a refcount itself, and don't free anything.
Vladimir Sementsov-Ogievskiy March 12, 2021, 12:42 p.m. UTC | #5
12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote:
> 12.03.2021 14:17, Max Reitz wrote:
>> On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
>>> 11.03.2021 22:58, Max Reitz wrote:
>>>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>>>> There is a bug in qcow2: host cluster can be discarded (refcount
>>>>> becomes 0) and reused during data write. In this case data write may
> 
> [..]
> 
>>>>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>>>>           if (refcount == 0) {
>>>>>               void *table;
>>>>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>>>>> +
>>>>> +            if (infl) {
>>>>> +                infl->refcount_zero = true;
>>>>> +                infl->type = type;
>>>>> +                continue;
>>>>> +            }
>>>>
>>>> I don’t understand what this is supposed to do exactly.  It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they?
>>>
>>> Don't follow.
>>>
>>> We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt().
>>
>> But the code under “if (refcount == 0)” doesn’t free anything, does it?  All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file.  I don’t see how that protocol-level discard has anything to do with our problem, though.
> 
> Hmm. Still, if we do this discard, and then our in-flight write, we'll have data instead of a hole. Not a big deal, but seems better to postpone discard.
> 
> On the other hand, clearing caches is OK, as its related only to qcow2-refcount, not to inflight-write-cnt
> 
>>
>> As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0. (including updating s->free_cluster_index if the refcount is 0).
> 
> Hmm.. And that (setting s->free_cluster_index) what I should actually prevent until total reference count becomes zero.
> 
> And about s->set_refcount(): it only update a refcount itself, and don't free anything.
> 
> 


Looking now at this place:


         if (refcount == 0 && cluster_index < s->free_cluster_index) {
             s->free_cluster_index = cluster_index;
         }
         s->set_refcount(refcount_block, block_index, refcount);
                                                                                  
         if (refcount == 0) {
             void *table;
             Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
                                                                                  
             if (infl) {
                 infl->refcount_zero = true;
                 infl->type = type;
                 continue;
             }
                                                                                  
             table = qcow2_cache_is_table_offset(s->refcount_block_cache,
                                                 offset);
             if (table != NULL) {
                 qcow2_cache_put(s->refcount_block_cache, &refcount_block);
                 old_table_index = -1;
                 qcow2_cache_discard(s->refcount_block_cache, table);
             }
                                                                                  
             table = qcow2_cache_is_table_offset(s->l2_table_cache, offset);
             if (table != NULL) {
                 qcow2_cache_discard(s->l2_table_cache, table);
             }
                                                                                  
             if (s->compressed_cache) {
                 seqcache_discard_cluster(s->compressed_cache, cluster_offset);
             }
                                                                                  
             if (s->discard_passthrough[type]) {
                 update_refcount_discard(bs, cluster_offset, s->cluster_size);
             }
         }


Hmm. Is it OK that we use "offset" to discard qcow2 metadata caches? offset is the start of the whole loop, and is not updated on iterations. Isn't it more correct to use cluster_offset here? Or we are sure that refcount and l2 metadata is always discarded by exactly one cluster? Than we are OK, but still worth an assertion that offset == cluster_offset.
Vladimir Sementsov-Ogievskiy March 12, 2021, 12:46 p.m. UTC | #6
12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote:
> 12.03.2021 14:17, Max Reitz wrote:
>> On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
>>> 11.03.2021 22:58, Max Reitz wrote:
>>>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>>>> There is a bug in qcow2: host cluster can be discarded (refcount
>>>>> becomes 0) and reused during data write. In this case data write may
> 
> [..]
> 
>>>>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>>>>           if (refcount == 0) {
>>>>>               void *table;
>>>>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>>>>> +
>>>>> +            if (infl) {
>>>>> +                infl->refcount_zero = true;
>>>>> +                infl->type = type;
>>>>> +                continue;
>>>>> +            }
>>>>
>>>> I don’t understand what this is supposed to do exactly.  It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they?
>>>
>>> Don't follow.
>>>
>>> We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt().
>>
>> But the code under “if (refcount == 0)” doesn’t free anything, does it?  All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file.  I don’t see how that protocol-level discard has anything to do with our problem, though.
> 
> Hmm. Still, if we do this discard, and then our in-flight write, we'll have data instead of a hole. Not a big deal, but seems better to postpone discard.
> 
> On the other hand, clearing caches is OK, as its related only to qcow2-refcount, not to inflight-write-cnt
> 
>>
>> As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0. (including updating s->free_cluster_index if the refcount is 0).
> 
> Hmm.. And that (setting s->free_cluster_index) what I should actually prevent until total reference count becomes zero.
> 
> And about s->set_refcount(): it only update a refcount itself, and don't free anything.
> 
> 

So, it is more correct like this:

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 464d133368..1da282446d 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1012,21 +1012,12 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
          } else {
              refcount += addend;
          }
-        if (refcount == 0 && cluster_index < s->free_cluster_index) {
-            s->free_cluster_index = cluster_index;
-        }
          s->set_refcount(refcount_block, block_index, refcount);
  
          if (refcount == 0) {
              void *table;
              Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
  
-            if (infl) {
-                infl->refcount_zero = true;
-                infl->type = type;
-                continue;
-            }
-
              table = qcow2_cache_is_table_offset(s->refcount_block_cache,
                                                  offset);
              if (table != NULL) {
@@ -1040,6 +1031,16 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
                  qcow2_cache_discard(s->l2_table_cache, table);
              }
  
+            if (infl) {
+                infl->refcount_zero = true;
+                infl->type = type;
+                continue;
+            }
+
+            if (cluster_index < s->free_cluster_index) {
+                s->free_cluster_index = cluster_index;
+            }
+
              if (s->discard_passthrough[type]) {
                  update_refcount_discard(bs, cluster_offset, s->cluster_size);
              }
Max Reitz March 12, 2021, 2:58 p.m. UTC | #7
On 12.03.21 13:32, Vladimir Sementsov-Ogievskiy wrote:
> 12.03.2021 14:17, Max Reitz wrote:
>> On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
>>> 11.03.2021 22:58, Max Reitz wrote:
>>>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>>>> There is a bug in qcow2: host cluster can be discarded (refcount
>>>>> becomes 0) and reused during data write. In this case data write may
> 
> [..]
> 
>>>>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT 
>>>>> update_refcount(BlockDriverState *bs,
>>>>>           if (refcount == 0) {
>>>>>               void *table;
>>>>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, 
>>>>> cluster_index);
>>>>> +
>>>>> +            if (infl) {
>>>>> +                infl->refcount_zero = true;
>>>>> +                infl->type = type;
>>>>> +                continue;
>>>>> +            }
>>>>
>>>> I don’t understand what this is supposed to do exactly.  It seems 
>>>> like it wants to keep metadata structures in the cache that are 
>>>> still in use (because dropping them from the caches is what happens 
>>>> next), but users of metadata structures won’t set in-flight counters 
>>>> for those metadata structures, will they?
>>>
>>> Don't follow.
>>>
>>> We want the code in "if (refcount == 0)" to be triggered only when 
>>> full reference count of the host cluster becomes 0, including 
>>> inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, 
>>> we postpone freeing the host cluster, it will be done later from 
>>> "slow path" in update_inflight_write_cnt().
>>
>> But the code under “if (refcount == 0)” doesn’t free anything, does 
>> it?  All I can see is code to remove metadata structures from the 
>> metadata caches (if the discarded cluster was an L2 table or a 
>> refblock), and finally the discard on the underlying file.  I don’t 
>> see how that protocol-level discard has anything to do with our 
>> problem, though.
> 
> Hmm. Still, if we do this discard, and then our in-flight write, we'll 
> have data instead of a hole. Not a big deal, but seems better to 
> postpone discard.
> 
> On the other hand, clearing caches is OK, as its related only to 
> qcow2-refcount, not to inflight-write-cnt
> 
>>
>> As far as I understand, the freeing happens immediately above the “if 
>> (refcount == 0)” block by s->set_refcount() setting the refcount to 0. 
>> (including updating s->free_cluster_index if the refcount is 0).
> 
> Hmm.. And that (setting s->free_cluster_index) what I should actually 
> prevent until total reference count becomes zero.
> 
> And about s->set_refcount(): it only update a refcount itself, and don't 
> free anything.

That is what freeing is, though.  I consider something to be free when 
allocation functions will allocate it.  The allocation functions look at 
the refcount, so once a cluster’s refcount is 0, it is free.

If that isn’t what freeing is, nothing in update_refcount() frees 
anything (when looking at how data clusters are handled).  Passing the 
discard through to the protocol layer isn’t “freeing”, because it’s 
independent of qcow2.

Now, your patch adds an additional check to the allocation functions 
(whether there are ongoing writes on the cluster), so it’s indeed 
possible that a cluster can have a refcount of 0 but still won’t be used 
by allocation functions.

But that means you’ve just changed the definition of what a free cluster 
is.  In fact, that means that nothing in update_refcount() can free a 
cluster that has active writes to it, because now a cluster is only free 
if there are no such writes.  It follows that you needn’t change 
update_refcount() to prevent clusters with such writes from being freed, 
because with this new definition of what a free cluster is, it’s 
impossible for update_refcount() to free them.

(Yes, you’re right that it would be nice to postpone the protocol-level 
discard still, but not doing so wouldn’t be a catastrophe – which shows 
that it has little to do with actually freeing something, as far as 
qcow2 is concerned.

If it’s just about postponing the discard, we can do exactly that: Let 
update_refcount() skip discarding for clusters that are still in use, 
and then let update_inflight_write_cnt() only do that discard instead of 
invoking all of qcow2_update_cluster_refcount().)

Alternatively, we could also not change the definition of what a free 
cluster is, which means we wouldn’t need to change the allocation 
functions, but instead postpone the refcount update that 
update_refcount() does.  That would mean we’d actually need to really 
drop the refcount in update_inflight_write_cnt() instead of doing a -0.

Max
Max Reitz March 12, 2021, 3:01 p.m. UTC | #8
On 12.03.21 13:42, Vladimir Sementsov-Ogievskiy wrote:
> 12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote:
>> 12.03.2021 14:17, Max Reitz wrote:
>>> On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
>>>> 11.03.2021 22:58, Max Reitz wrote:
>>>>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> There is a bug in qcow2: host cluster can be discarded (refcount
>>>>>> becomes 0) and reused during data write. In this case data write may
>>
>> [..]
>>
>>>>>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT 
>>>>>> update_refcount(BlockDriverState *bs,
>>>>>>           if (refcount == 0) {
>>>>>>               void *table;
>>>>>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, 
>>>>>> cluster_index);
>>>>>> +
>>>>>> +            if (infl) {
>>>>>> +                infl->refcount_zero = true;
>>>>>> +                infl->type = type;
>>>>>> +                continue;
>>>>>> +            }
>>>>>
>>>>> I don’t understand what this is supposed to do exactly.  It seems 
>>>>> like it wants to keep metadata structures in the cache that are 
>>>>> still in use (because dropping them from the caches is what happens 
>>>>> next), but users of metadata structures won’t set in-flight 
>>>>> counters for those metadata structures, will they?
>>>>
>>>> Don't follow.
>>>>
>>>> We want the code in "if (refcount == 0)" to be triggered only when 
>>>> full reference count of the host cluster becomes 0, including 
>>>> inflight-write-cnt. So, if at this point inflight-write-cnt is not 
>>>> 0, we postpone freeing the host cluster, it will be done later from 
>>>> "slow path" in update_inflight_write_cnt().
>>>
>>> But the code under “if (refcount == 0)” doesn’t free anything, does 
>>> it?  All I can see is code to remove metadata structures from the 
>>> metadata caches (if the discarded cluster was an L2 table or a 
>>> refblock), and finally the discard on the underlying file.  I don’t 
>>> see how that protocol-level discard has anything to do with our 
>>> problem, though.
>>
>> Hmm. Still, if we do this discard, and then our in-flight write, we'll 
>> have data instead of a hole. Not a big deal, but seems better to 
>> postpone discard.
>>
>> On the other hand, clearing caches is OK, as its related only to 
>> qcow2-refcount, not to inflight-write-cnt
>>
>>>
>>> As far as I understand, the freeing happens immediately above the “if 
>>> (refcount == 0)” block by s->set_refcount() setting the refcount to 
>>> 0. (including updating s->free_cluster_index if the refcount is 0).
>>
>> Hmm.. And that (setting s->free_cluster_index) what I should actually 
>> prevent until total reference count becomes zero.
>>
>> And about s->set_refcount(): it only update a refcount itself, and 
>> don't free anything.
>>
>>
> 
> 
> Looking now at this place:
> 
> 
>          if (refcount == 0 && cluster_index < s->free_cluster_index) {
>              s->free_cluster_index = cluster_index;
>          }
>          s->set_refcount(refcount_block, block_index, refcount);
>          if (refcount == 0) {
>              void *table;
>              Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>              if (infl) {
>                  infl->refcount_zero = true;
>                  infl->type = type;
>                  continue;
>              }
>              table = qcow2_cache_is_table_offset(s->refcount_block_cache,
>                                                  offset);
>              if (table != NULL) {
>                  qcow2_cache_put(s->refcount_block_cache, &refcount_block);
>                  old_table_index = -1;
>                  qcow2_cache_discard(s->refcount_block_cache, table);
>              }
>              table = qcow2_cache_is_table_offset(s->l2_table_cache, 
> offset);
>              if (table != NULL) {
>                  qcow2_cache_discard(s->l2_table_cache, table);
>              }
>              if (s->compressed_cache) {
>                  seqcache_discard_cluster(s->compressed_cache, 
> cluster_offset);
>              }
>              if (s->discard_passthrough[type]) {
>                  update_refcount_discard(bs, cluster_offset, 
> s->cluster_size);
>              }
>          }
> 
> 
> Hmm. Is it OK that we use "offset" to discard qcow2 metadata caches? 
> offset is the start of the whole loop, and is not updated on iterations. 
> Isn't it more correct to use cluster_offset here? Or we are sure that 
> refcount and l2 metadata is always discarded by exactly one cluster? 
> Than we are OK, but still worth an assertion that offset == cluster_offset.

Spontaneously, I think it’s a bug that hasn’t made any problems yet, 
because I suppose L2 tables and refblocks are indeed always discarded 
one by one (i.e., cluster by cluster).  It definitely looks like this 
should be cluster_offset, yes.

Max
Max Reitz March 12, 2021, 3:10 p.m. UTC | #9
On 12.03.21 13:46, Vladimir Sementsov-Ogievskiy wrote:
> 12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote:
>> 12.03.2021 14:17, Max Reitz wrote:
>>> On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
>>>> 11.03.2021 22:58, Max Reitz wrote:
>>>>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> There is a bug in qcow2: host cluster can be discarded (refcount
>>>>>> becomes 0) and reused during data write. In this case data write may
>>
>> [..]
>>
>>>>>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT 
>>>>>> update_refcount(BlockDriverState *bs,
>>>>>>           if (refcount == 0) {
>>>>>>               void *table;
>>>>>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, 
>>>>>> cluster_index);
>>>>>> +
>>>>>> +            if (infl) {
>>>>>> +                infl->refcount_zero = true;
>>>>>> +                infl->type = type;
>>>>>> +                continue;
>>>>>> +            }
>>>>>
>>>>> I don’t understand what this is supposed to do exactly.  It seems 
>>>>> like it wants to keep metadata structures in the cache that are 
>>>>> still in use (because dropping them from the caches is what happens 
>>>>> next), but users of metadata structures won’t set in-flight 
>>>>> counters for those metadata structures, will they?
>>>>
>>>> Don't follow.
>>>>
>>>> We want the code in "if (refcount == 0)" to be triggered only when 
>>>> full reference count of the host cluster becomes 0, including 
>>>> inflight-write-cnt. So, if at this point inflight-write-cnt is not 
>>>> 0, we postpone freeing the host cluster, it will be done later from 
>>>> "slow path" in update_inflight_write_cnt().
>>>
>>> But the code under “if (refcount == 0)” doesn’t free anything, does 
>>> it?  All I can see is code to remove metadata structures from the 
>>> metadata caches (if the discarded cluster was an L2 table or a 
>>> refblock), and finally the discard on the underlying file.  I don’t 
>>> see how that protocol-level discard has anything to do with our 
>>> problem, though.
>>
>> Hmm. Still, if we do this discard, and then our in-flight write, we'll 
>> have data instead of a hole. Not a big deal, but seems better to 
>> postpone discard.
>>
>> On the other hand, clearing caches is OK, as its related only to 
>> qcow2-refcount, not to inflight-write-cnt
>>
>>>
>>> As far as I understand, the freeing happens immediately above the “if 
>>> (refcount == 0)” block by s->set_refcount() setting the refcount to 
>>> 0. (including updating s->free_cluster_index if the refcount is 0).
>>
>> Hmm.. And that (setting s->free_cluster_index) what I should actually 
>> prevent until total reference count becomes zero.
>>
>> And about s->set_refcount(): it only update a refcount itself, and 
>> don't free anything.
>>
>>
> 
> So, it is more correct like this:
> 
> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
> index 464d133368..1da282446d 100644
> --- a/block/qcow2-refcount.c
> +++ b/block/qcow2-refcount.c
> @@ -1012,21 +1012,12 @@ static int QEMU_WARN_UNUSED_RESULT 
> update_refcount(BlockDriverState *bs,
>           } else {
>               refcount += addend;
>           }
> -        if (refcount == 0 && cluster_index < s->free_cluster_index) {
> -            s->free_cluster_index = cluster_index;
> -        }
>           s->set_refcount(refcount_block, block_index, refcount);
> 
>           if (refcount == 0) {
>               void *table;
>               Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
> 
> -            if (infl) {
> -                infl->refcount_zero = true;
> -                infl->type = type;
> -                continue;
> -            }
> -
>               table = qcow2_cache_is_table_offset(s->refcount_block_cache,
>                                                   offset);
>               if (table != NULL) {
> @@ -1040,6 +1031,16 @@ static int QEMU_WARN_UNUSED_RESULT 
> update_refcount(BlockDriverState *bs,
>                   qcow2_cache_discard(s->l2_table_cache, table);
>               }
> 
> +            if (infl) {
> +                infl->refcount_zero = true;
> +                infl->type = type;
> +                continue;
> +            }
> +
> +            if (cluster_index < s->free_cluster_index) {
> +                s->free_cluster_index = cluster_index;
> +            }
> +
>               if (s->discard_passthrough[type]) {
>                   update_refcount_discard(bs, cluster_offset, 
> s->cluster_size);
>               }

I don’t think I like using s->free_cluster_index as a protection against 
allocating something before it.

First, it comes back the problem I just described in my mail from 15:58 
GMT+1, which is that you’re changing the definition of what a free 
cluster is.  With this proposal, you’re proposing yet a new definition: 
A free cluster is anything with refcount == 0 after free_cluster_index.

Now looking only at the allocation functions, it may look like that kind 
of is the definition already.  But I don’t think that was the intention 
when free_cluster_index was introduced, so we’d have to check every 
place that sets free_cluster_index, to see whether it adheres to this 
definition.

And I think it’s clear that there is a place that won’t adhere to this 
definition, and that is this very place here, in update_refcount().  Say 
free_cluster_index is 42.  Then you free cluster 39, but there is a 
write to it, so free_cluster_index isn’t update.  Then you free cluster 
38, and there are writes to that cluster, so free_cluster_index is 
updated to 38.  Suddenly, 39 is free to be allocated, too.

(The precise problem is that with this new definition decreasing 
free_cluster_index suddenly has the power to free any cluster between 
its new and all value.  With the old definition, changing 
free_cluster_index would never free any cluster.  So when you decrease 
free_cluster_index, you suddenly have to be sure that all clusters 
between the new and old value that have refcount 0 are indeed to be 
considered free.)

Max
Vladimir Sementsov-Ogievskiy March 12, 2021, 3:24 p.m. UTC | #10
12.03.2021 18:10, Max Reitz wrote:
> On 12.03.21 13:46, Vladimir Sementsov-Ogievskiy wrote:
>> 12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote:
>>> 12.03.2021 14:17, Max Reitz wrote:
>>>> On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 11.03.2021 22:58, Max Reitz wrote:
>>>>>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>> There is a bug in qcow2: host cluster can be discarded (refcount
>>>>>>> becomes 0) and reused during data write. In this case data write may
>>>
>>> [..]
>>>
>>>>>>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>>>>>>           if (refcount == 0) {
>>>>>>>               void *table;
>>>>>>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>>>>>>> +
>>>>>>> +            if (infl) {
>>>>>>> +                infl->refcount_zero = true;
>>>>>>> +                infl->type = type;
>>>>>>> +                continue;
>>>>>>> +            }
>>>>>>
>>>>>> I don’t understand what this is supposed to do exactly.  It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they?
>>>>>
>>>>> Don't follow.
>>>>>
>>>>> We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt().
>>>>
>>>> But the code under “if (refcount == 0)” doesn’t free anything, does it?  All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file.  I don’t see how that protocol-level discard has anything to do with our problem, though.
>>>
>>> Hmm. Still, if we do this discard, and then our in-flight write, we'll have data instead of a hole. Not a big deal, but seems better to postpone discard.
>>>
>>> On the other hand, clearing caches is OK, as its related only to qcow2-refcount, not to inflight-write-cnt
>>>
>>>>
>>>> As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0. (including updating s->free_cluster_index if the refcount is 0).
>>>
>>> Hmm.. And that (setting s->free_cluster_index) what I should actually prevent until total reference count becomes zero.
>>>
>>> And about s->set_refcount(): it only update a refcount itself, and don't free anything.
>>>
>>>
>>
>> So, it is more correct like this:
>>
>> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
>> index 464d133368..1da282446d 100644
>> --- a/block/qcow2-refcount.c
>> +++ b/block/qcow2-refcount.c
>> @@ -1012,21 +1012,12 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>           } else {
>>               refcount += addend;
>>           }
>> -        if (refcount == 0 && cluster_index < s->free_cluster_index) {
>> -            s->free_cluster_index = cluster_index;
>> -        }
>>           s->set_refcount(refcount_block, block_index, refcount);
>>
>>           if (refcount == 0) {
>>               void *table;
>>               Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>>
>> -            if (infl) {
>> -                infl->refcount_zero = true;
>> -                infl->type = type;
>> -                continue;
>> -            }
>> -
>>               table = qcow2_cache_is_table_offset(s->refcount_block_cache,
>>                                                   offset);
>>               if (table != NULL) {
>> @@ -1040,6 +1031,16 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>                   qcow2_cache_discard(s->l2_table_cache, table);
>>               }
>>
>> +            if (infl) {
>> +                infl->refcount_zero = true;
>> +                infl->type = type;
>> +                continue;
>> +            }
>> +
>> +            if (cluster_index < s->free_cluster_index) {
>> +                s->free_cluster_index = cluster_index;
>> +            }
>> +
>>               if (s->discard_passthrough[type]) {
>>                   update_refcount_discard(bs, cluster_offset, s->cluster_size);
>>               }
> 
> I don’t think I like using s->free_cluster_index as a protection against allocating something before it.

Hmm, I just propose not to update it, if refcount reached 0 but we still have inflight writes.


> 
> First, it comes back the problem I just described in my mail from 15:58 GMT+1, which is that you’re changing the definition of what a free cluster is.  With this proposal, you’re proposing yet a new definition: A free cluster is anything with refcount == 0 after free_cluster_index.

I think that free cluster is anything with refcount = 0 and inflight-write-cnt = 0. And free_cluster_index is a hint where start to search for such cluster.

> 
> Now looking only at the allocation functions, it may look like that kind of is the definition already.  But I don’t think that was the intention when free_cluster_index was introduced, so we’d have to check every place that sets free_cluster_index, to see whether it adheres to this definition.
> 
> And I think it’s clear that there is a place that won’t adhere to this definition, and that is this very place here, in update_refcount().  Say free_cluster_index is 42.  Then you free cluster 39, but there is a write to it, so free_cluster_index isn’t update.  Then you free cluster 38, and there are writes to that cluster, so free_cluster_index is updated to 38.  Suddenly, 39 is free to be allocated, too.

Why? 39 is protected by inflight-cnt, and we do has_infl_wr() check together with refcount==0 check when allocate clusters.

> 
> (The precise problem is that with this new definition decreasing free_cluster_index suddenly has the power to free any cluster between its new and all value.  With the old definition, changing free_cluster_index would never free any cluster.  So when you decrease free_cluster_index, you suddenly have to be sure that all clusters between the new and old value that have refcount 0 are indeed to be considered free.)
> 
> Max
>
Vladimir Sementsov-Ogievskiy March 12, 2021, 3:39 p.m. UTC | #11
12.03.2021 17:58, Max Reitz wrote:
> On 12.03.21 13:32, Vladimir Sementsov-Ogievskiy wrote:
>> 12.03.2021 14:17, Max Reitz wrote:
>>> On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
>>>> 11.03.2021 22:58, Max Reitz wrote:
>>>>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> There is a bug in qcow2: host cluster can be discarded (refcount
>>>>>> becomes 0) and reused during data write. In this case data write may
>>
>> [..]
>>
>>>>>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>>>>>           if (refcount == 0) {
>>>>>>               void *table;
>>>>>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>>>>>> +
>>>>>> +            if (infl) {
>>>>>> +                infl->refcount_zero = true;
>>>>>> +                infl->type = type;
>>>>>> +                continue;
>>>>>> +            }
>>>>>
>>>>> I don’t understand what this is supposed to do exactly.  It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they?
>>>>
>>>> Don't follow.
>>>>
>>>> We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt().
>>>
>>> But the code under “if (refcount == 0)” doesn’t free anything, does it?  All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file.  I don’t see how that protocol-level discard has anything to do with our problem, though.
>>
>> Hmm. Still, if we do this discard, and then our in-flight write, we'll have data instead of a hole. Not a big deal, but seems better to postpone discard.
>>
>> On the other hand, clearing caches is OK, as its related only to qcow2-refcount, not to inflight-write-cnt
>>
>>>
>>> As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0. (including updating s->free_cluster_index if the refcount is 0).
>>
>> Hmm.. And that (setting s->free_cluster_index) what I should actually prevent until total reference count becomes zero.
>>
>> And about s->set_refcount(): it only update a refcount itself, and don't free anything.
> 
> That is what freeing is, though.  I consider something to be free when allocation functions will allocate it.  The allocation functions look at the refcount, so once a cluster’s refcount is 0, it is free.

And with this patch I try to update allocation function to look also at inflight-write-counters. If I missed something its a bug in the patch.

> 
> If that isn’t what freeing is, nothing in update_refcount() frees anything (when looking at how data clusters are handled).  Passing the discard through to the protocol layer isn’t “freeing”, because it’s independent of qcow2.
> 
> Now, your patch adds an additional check to the allocation functions (whether there are ongoing writes on the cluster), so it’s indeed possible that a cluster can have a refcount of 0 but still won’t be used by allocation functions.
> 
> But that means you’ve just changed the definition of what a free cluster is.  In fact, that means that nothing in update_refcount() can free a cluster that has active writes to it, because now a cluster is only free if there are no such writes.  It follows that you needn’t change update_refcount() to prevent clusters with such writes from being freed, because with this new definition of what a free cluster is, it’s impossible for update_refcount() to free them.

But as I noted somewhere else, update_refcount should not discard the host cluster in parallel with inflight write. It's not completely wrong, but it's inefficient.

> 
> (Yes, you’re right that it would be nice to postpone the protocol-level discard still, but not doing so wouldn’t be a catastrophe – which shows that it has little to do with actually freeing something, as far as qcow2 is concerned.
> 
> If it’s just about postponing the discard, we can do exactly that: Let update_refcount() skip discarding for clusters that are still in use, and then let update_inflight_write_cnt() only do that discard instead of invoking all of qcow2_update_cluster_refcount().)

Agree, yes.

> 
> Alternatively, we could also not change the definition of what a free cluster is, which means we wouldn’t need to change the allocation functions, but instead postpone the refcount update that update_refcount() does.  That would mean we’d actually need to really drop the refcount in update_inflight_write_cnt() instead of doing a -0.
> 

Hmm, that should work too. Do you think it is better? With first approach meaning of zero refcount is changed (it's not a free cluster now, keep in mind inflight-write-cnt too). So we should update functions interested in zero refcount. With second approach refcount=1 changes the meaning (it my be actually referenced by inflight-write-cnt object, not by some qcow2 metadata object).. Shouldn't we update some functions that are interested in refcount=1? Intuitively it seems safe enough. Nothing dangerous is in refcount=1 for cluster which is actually unused at all.
Max Reitz March 12, 2021, 3:52 p.m. UTC | #12
On 12.03.21 16:24, Vladimir Sementsov-Ogievskiy wrote:
> 12.03.2021 18:10, Max Reitz wrote:
>> On 12.03.21 13:46, Vladimir Sementsov-Ogievskiy wrote:
>>> 12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote:
>>>> 12.03.2021 14:17, Max Reitz wrote:
>>>>> On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> 11.03.2021 22:58, Max Reitz wrote:
>>>>>>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>>> There is a bug in qcow2: host cluster can be discarded (refcount
>>>>>>>> becomes 0) and reused during data write. In this case data write 
>>>>>>>> may
>>>>
>>>> [..]
>>>>
>>>>>>>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT 
>>>>>>>> update_refcount(BlockDriverState *bs,
>>>>>>>>           if (refcount == 0) {
>>>>>>>>               void *table;
>>>>>>>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, 
>>>>>>>> cluster_index);
>>>>>>>> +
>>>>>>>> +            if (infl) {
>>>>>>>> +                infl->refcount_zero = true;
>>>>>>>> +                infl->type = type;
>>>>>>>> +                continue;
>>>>>>>> +            }
>>>>>>>
>>>>>>> I don’t understand what this is supposed to do exactly.  It seems 
>>>>>>> like it wants to keep metadata structures in the cache that are 
>>>>>>> still in use (because dropping them from the caches is what 
>>>>>>> happens next), but users of metadata structures won’t set 
>>>>>>> in-flight counters for those metadata structures, will they?
>>>>>>
>>>>>> Don't follow.
>>>>>>
>>>>>> We want the code in "if (refcount == 0)" to be triggered only when 
>>>>>> full reference count of the host cluster becomes 0, including 
>>>>>> inflight-write-cnt. So, if at this point inflight-write-cnt is not 
>>>>>> 0, we postpone freeing the host cluster, it will be done later 
>>>>>> from "slow path" in update_inflight_write_cnt().
>>>>>
>>>>> But the code under “if (refcount == 0)” doesn’t free anything, does 
>>>>> it?  All I can see is code to remove metadata structures from the 
>>>>> metadata caches (if the discarded cluster was an L2 table or a 
>>>>> refblock), and finally the discard on the underlying file.  I don’t 
>>>>> see how that protocol-level discard has anything to do with our 
>>>>> problem, though.
>>>>
>>>> Hmm. Still, if we do this discard, and then our in-flight write, 
>>>> we'll have data instead of a hole. Not a big deal, but seems better 
>>>> to postpone discard.
>>>>
>>>> On the other hand, clearing caches is OK, as its related only to 
>>>> qcow2-refcount, not to inflight-write-cnt
>>>>
>>>>>
>>>>> As far as I understand, the freeing happens immediately above the 
>>>>> “if (refcount == 0)” block by s->set_refcount() setting the 
>>>>> refcount to 0. (including updating s->free_cluster_index if the 
>>>>> refcount is 0).
>>>>
>>>> Hmm.. And that (setting s->free_cluster_index) what I should 
>>>> actually prevent until total reference count becomes zero.
>>>>
>>>> And about s->set_refcount(): it only update a refcount itself, and 
>>>> don't free anything.
>>>>
>>>>
>>>
>>> So, it is more correct like this:
>>>
>>> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
>>> index 464d133368..1da282446d 100644
>>> --- a/block/qcow2-refcount.c
>>> +++ b/block/qcow2-refcount.c
>>> @@ -1012,21 +1012,12 @@ static int QEMU_WARN_UNUSED_RESULT 
>>> update_refcount(BlockDriverState *bs,
>>>           } else {
>>>               refcount += addend;
>>>           }
>>> -        if (refcount == 0 && cluster_index < s->free_cluster_index) {
>>> -            s->free_cluster_index = cluster_index;
>>> -        }
>>>           s->set_refcount(refcount_block, block_index, refcount);
>>>
>>>           if (refcount == 0) {
>>>               void *table;
>>>               Qcow2InFlightRefcount *infl = find_infl_wr(s, 
>>> cluster_index);
>>>
>>> -            if (infl) {
>>> -                infl->refcount_zero = true;
>>> -                infl->type = type;
>>> -                continue;
>>> -            }
>>> -
>>>               table = 
>>> qcow2_cache_is_table_offset(s->refcount_block_cache,
>>>                                                   offset);
>>>               if (table != NULL) {
>>> @@ -1040,6 +1031,16 @@ static int QEMU_WARN_UNUSED_RESULT 
>>> update_refcount(BlockDriverState *bs,
>>>                   qcow2_cache_discard(s->l2_table_cache, table);
>>>               }
>>>
>>> +            if (infl) {
>>> +                infl->refcount_zero = true;
>>> +                infl->type = type;
>>> +                continue;
>>> +            }
>>> +
>>> +            if (cluster_index < s->free_cluster_index) {
>>> +                s->free_cluster_index = cluster_index;
>>> +            }
>>> +
>>>               if (s->discard_passthrough[type]) {
>>>                   update_refcount_discard(bs, cluster_offset, 
>>> s->cluster_size);
>>>               }
>>
>> I don’t think I like using s->free_cluster_index as a protection 
>> against allocating something before it.
> 
> Hmm, I just propose not to update it, if refcount reached 0 but we still 
> have inflight writes.
> 
> 
>>
>> First, it comes back the problem I just described in my mail from 
>> 15:58 GMT+1, which is that you’re changing the definition of what a 
>> free cluster is.  With this proposal, you’re proposing yet a new 
>> definition: A free cluster is anything with refcount == 0 after 
>> free_cluster_index.
> 
> I think that free cluster is anything with refcount = 0 and 
> inflight-write-cnt = 0.

Then, as I said in my other mail, update_refcount() just cannot free any 
cluster.  So changes to that function can’t be justified by preventing 
it from freeing clusters.

You need to clearly define what it is that update_refcount() should or 
shouldn’t do, and then we have to think about whether when all writes 
have settled, we really have to invoke qcow2_update_cluster_refcount() 
or whether we should do the small outstanding changes just directly in 
update_inflight_write_cnt().

I think this needs to be more formalized, or it doesn’t make sense.

For example, say we do define a free cluster to be refcount (RC) = 0 and 
inflight-write-cnt (IFWC) = 0.  Then everything that is done to a 
cluster because it is considered being freed right now because its RC 
drops to 0 must probably be changed to only be done if also its IFWC is 
0.  For example, we should only discard host clusters on the protocol 
layer if a cluster becomes free.  update_refcount() will no longer be 
able to free clusters with IFWC > 0, so it must never issue a 
protocol-level discard for them.  And, yes, it also shouldn’t adjust 
first_free_cluster_index, as you propose here.  (But you didn’t explain 
why, and it seems like it was just intuition to you instead of looking 
at it more formally.)

Instead, for clusters with RC = 0 and IFWC > 0, 
update_inflight_write_cnt() will take on the role of freeing them.  So 
now that function must adjust first_free_cluster_index and issue the 
protocol-level discard for such clusters.

I suppose in practice we could invoke qcow2_update_cluster_refcount() 
with -0, as you do, because now the cluster has RC = 0 and IFWC = 0, so 
now that function will be capable of freeing it.  But to me, that just 
looks like a bit of abuse.


I suppose we could create a new function qcow2_cluster_freed() where we 
collect everything that needs to be done once a cluster is considered 
freed (which so far was whenever its RC dropped to 0, which only happens 
in update_refcount(); and then will be whenever its RC and its IFWC drop 
to 0, which can happen in either update_refcount() or 
update_inflight_write_cnt()).  What would belong in there is discarding 
the cluster on the protocol level, and adjusting 
first_free_cluster_index.  (Perhaps more, I don’t know.)  With such a 
function, it would seem clear to me that there is no need to invoke 
qcow2_update_cluster_refcount() just to get precisely that effect.


(The alternative would be to keep RC == 0 the definition of a freed 
cluster.  Then we’d have to postpone the s->set_refcount() in 
update_refcount(), and update the refcount again in 
update_inflight_write_cnt(), but invoking 
qcow2_update_cluster_refcount().  We wouldn’t need to change the 
allocation functions.

I’m not saying that alternative is better – I don’t think it is, I think 
you’re right that the definition of a freed cluster should be changed. 
I’m just presenting it in contrast, to show when it would make sense to 
call qcow2_update_cluster_refcount().)

> And free_cluster_index is a hint where start to 
> search for such cluster.
> 
>>
>> Now looking only at the allocation functions, it may look like that 
>> kind of is the definition already.  But I don’t think that was the 
>> intention when free_cluster_index was introduced, so we’d have to 
>> check every place that sets free_cluster_index, to see whether it 
>> adheres to this definition.
>>
>> And I think it’s clear that there is a place that won’t adhere to this 
>> definition, and that is this very place here, in update_refcount().  
>> Say free_cluster_index is 42.  Then you free cluster 39, but there is 
>> a write to it, so free_cluster_index isn’t update.  Then you free 
>> cluster 38, and there are writes to that cluster, so 
>> free_cluster_index is updated to 38.  Suddenly, 39 is free to be 
>> allocated, too.
> 
> Why? 39 is protected by inflight-cnt, and we do has_infl_wr() check 
> together with refcount==0 check when allocate clusters.

I was (wrongly) assuming that with this change you’d drop the check in 
the allocation functions.

Max

>> (The precise problem is that with this new definition decreasing 
>> free_cluster_index suddenly has the power to free any cluster between 
>> its new and all value.  With the old definition, changing 
>> free_cluster_index would never free any cluster.  So when you decrease 
>> free_cluster_index, you suddenly have to be sure that all clusters 
>> between the new and old value that have refcount 0 are indeed to be 
>> considered free.)
>>
>> Max
>>
> 
>
Vladimir Sementsov-Ogievskiy March 12, 2021, 4:03 p.m. UTC | #13
12.03.2021 18:52, Max Reitz wrote:
> On 12.03.21 16:24, Vladimir Sementsov-Ogievskiy wrote:
>> 12.03.2021 18:10, Max Reitz wrote:
>>> On 12.03.21 13:46, Vladimir Sementsov-Ogievskiy wrote:
>>>> 12.03.2021 15:32, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 12.03.2021 14:17, Max Reitz wrote:
>>>>>> On 12.03.21 10:09, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>> 11.03.2021 22:58, Max Reitz wrote:
>>>>>>>> On 05.03.21 18:35, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>>>> There is a bug in qcow2: host cluster can be discarded (refcount
>>>>>>>>> becomes 0) and reused during data write. In this case data write may
>>>>>
>>>>> [..]
>>>>>
>>>>>>>>> @@ -885,6 +1019,13 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>>>>>>>>           if (refcount == 0) {
>>>>>>>>>               void *table;
>>>>>>>>> +            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>>>>>>>>> +
>>>>>>>>> +            if (infl) {
>>>>>>>>> +                infl->refcount_zero = true;
>>>>>>>>> +                infl->type = type;
>>>>>>>>> +                continue;
>>>>>>>>> +            }
>>>>>>>>
>>>>>>>> I don’t understand what this is supposed to do exactly.  It seems like it wants to keep metadata structures in the cache that are still in use (because dropping them from the caches is what happens next), but users of metadata structures won’t set in-flight counters for those metadata structures, will they?
>>>>>>>
>>>>>>> Don't follow.
>>>>>>>
>>>>>>> We want the code in "if (refcount == 0)" to be triggered only when full reference count of the host cluster becomes 0, including inflight-write-cnt. So, if at this point inflight-write-cnt is not 0, we postpone freeing the host cluster, it will be done later from "slow path" in update_inflight_write_cnt().
>>>>>>
>>>>>> But the code under “if (refcount == 0)” doesn’t free anything, does it?  All I can see is code to remove metadata structures from the metadata caches (if the discarded cluster was an L2 table or a refblock), and finally the discard on the underlying file.  I don’t see how that protocol-level discard has anything to do with our problem, though.
>>>>>
>>>>> Hmm. Still, if we do this discard, and then our in-flight write, we'll have data instead of a hole. Not a big deal, but seems better to postpone discard.
>>>>>
>>>>> On the other hand, clearing caches is OK, as its related only to qcow2-refcount, not to inflight-write-cnt
>>>>>
>>>>>>
>>>>>> As far as I understand, the freeing happens immediately above the “if (refcount == 0)” block by s->set_refcount() setting the refcount to 0. (including updating s->free_cluster_index if the refcount is 0).
>>>>>
>>>>> Hmm.. And that (setting s->free_cluster_index) what I should actually prevent until total reference count becomes zero.
>>>>>
>>>>> And about s->set_refcount(): it only update a refcount itself, and don't free anything.
>>>>>
>>>>>
>>>>
>>>> So, it is more correct like this:
>>>>
>>>> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
>>>> index 464d133368..1da282446d 100644
>>>> --- a/block/qcow2-refcount.c
>>>> +++ b/block/qcow2-refcount.c
>>>> @@ -1012,21 +1012,12 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>>>           } else {
>>>>               refcount += addend;
>>>>           }
>>>> -        if (refcount == 0 && cluster_index < s->free_cluster_index) {
>>>> -            s->free_cluster_index = cluster_index;
>>>> -        }
>>>>           s->set_refcount(refcount_block, block_index, refcount);
>>>>
>>>>           if (refcount == 0) {
>>>>               void *table;
>>>>               Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
>>>>
>>>> -            if (infl) {
>>>> -                infl->refcount_zero = true;
>>>> -                infl->type = type;
>>>> -                continue;
>>>> -            }
>>>> -
>>>>               table = qcow2_cache_is_table_offset(s->refcount_block_cache,
>>>>                                                   offset);
>>>>               if (table != NULL) {
>>>> @@ -1040,6 +1031,16 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
>>>>                   qcow2_cache_discard(s->l2_table_cache, table);
>>>>               }
>>>>
>>>> +            if (infl) {
>>>> +                infl->refcount_zero = true;
>>>> +                infl->type = type;
>>>> +                continue;
>>>> +            }
>>>> +
>>>> +            if (cluster_index < s->free_cluster_index) {
>>>> +                s->free_cluster_index = cluster_index;
>>>> +            }
>>>> +
>>>>               if (s->discard_passthrough[type]) {
>>>>                   update_refcount_discard(bs, cluster_offset, s->cluster_size);
>>>>               }
>>>
>>> I don’t think I like using s->free_cluster_index as a protection against allocating something before it.
>>
>> Hmm, I just propose not to update it, if refcount reached 0 but we still have inflight writes.
>>
>>
>>>
>>> First, it comes back the problem I just described in my mail from 15:58 GMT+1, which is that you’re changing the definition of what a free cluster is.  With this proposal, you’re proposing yet a new definition: A free cluster is anything with refcount == 0 after free_cluster_index.
>>
>> I think that free cluster is anything with refcount = 0 and inflight-write-cnt = 0.
> 
> Then, as I said in my other mail, update_refcount() just cannot free any cluster.  So changes to that function can’t be justified by preventing it from freeing clusters.
> 
> You need to clearly define what it is that update_refcount() should or shouldn’t do, and then we have to think about whether when all writes have settled, we really have to invoke qcow2_update_cluster_refcount() or whether we should do the small outstanding changes just directly in update_inflight_write_cnt().
> 
> I think this needs to be more formalized, or it doesn’t make sense.
> 
> For example, say we do define a free cluster to be refcount (RC) = 0 and inflight-write-cnt (IFWC) = 0.  Then everything that is done to a cluster because it is considered being freed right now because its RC drops to 0 must probably be changed to only be done if also its IFWC is 0.  For example, we should only discard host clusters on the protocol layer if a cluster becomes free.  update_refcount() will no longer be able to free clusters with IFWC > 0, so it must never issue a protocol-level discard for them.  And, yes, it also shouldn’t adjust first_free_cluster_index, as you propose here.  (But you didn’t explain why, and it seems like it was just intuition to you instead of looking at it more formally.)
> 
> Instead, for clusters with RC = 0 and IFWC > 0, update_inflight_write_cnt() will take on the role of freeing them.  So now that function must adjust first_free_cluster_index and issue the protocol-level discard for such clusters.

Yes, agree.

> 
> I suppose in practice we could invoke qcow2_update_cluster_refcount() with -0, as you do, because now the cluster has RC = 0 and IFWC = 0, so now that function will be capable of freeing it.  But to me, that just looks like a bit of abuse.

agree

> 
> 
> I suppose we could create a new function qcow2_cluster_freed() where we collect everything that needs to be done once a cluster is considered freed (which so far was whenever its RC dropped to 0, which only happens in update_refcount(); and then will be whenever its RC and its IFWC drop to 0, which can happen in either update_refcount() or update_inflight_write_cnt()).  What would belong in there is discarding the cluster on the protocol level, and adjusting first_free_cluster_index.  (Perhaps more, I don’t know.)  With such a function, it would seem clear to me that there is no need to invoke qcow2_update_cluster_refcount() just to get precisely that effect.

yes

> 
> 
> (The alternative would be to keep RC == 0 the definition of a freed cluster.  Then we’d have to postpone the s->set_refcount() in update_refcount(), and update the refcount again in update_inflight_write_cnt(), but invoking qcow2_update_cluster_refcount().  We wouldn’t need to change the allocation functions.
> 
> I’m not saying that alternative is better – I don’t think it is, I think you’re right that the definition of a freed cluster should be changed. I’m just presenting it in contrast, to show when it would make sense to call qcow2_update_cluster_refcount().)

OK

In the meanwhile Kevin dispelled my "big problems" in "[PATCH v2(RFC) 0/3] qcow2: fix parallel rewrite and discard", so probably next step would be to retry CoRwLock-based approach.

> 
>> And free_cluster_index is a hint where start to search for such cluster.
>>
>>>
>>> Now looking only at the allocation functions, it may look like that kind of is the definition already.  But I don’t think that was the intention when free_cluster_index was introduced, so we’d have to check every place that sets free_cluster_index, to see whether it adheres to this definition.
>>>
>>> And I think it’s clear that there is a place that won’t adhere to this definition, and that is this very place here, in update_refcount(). Say free_cluster_index is 42.  Then you free cluster 39, but there is a write to it, so free_cluster_index isn’t update.  Then you free cluster 38, and there are writes to that cluster, so free_cluster_index is updated to 38.  Suddenly, 39 is free to be allocated, too.
>>
>> Why? 39 is protected by inflight-cnt, and we do has_infl_wr() check together with refcount==0 check when allocate clusters.
> 
> I was (wrongly) assuming that with this change you’d drop the check in the allocation functions.
> 
> Max
> 
>>> (The precise problem is that with this new definition decreasing free_cluster_index suddenly has the power to free any cluster between its new and all value.  With the old definition, changing free_cluster_index would never free any cluster.  So when you decrease free_cluster_index, you suddenly have to be sure that all clusters between the new and old value that have refcount 0 are indeed to be considered free.)
>>>
>>> Max
>>>
>>
>>
>
diff mbox series

Patch

diff --git a/block/qcow2.h b/block/qcow2.h
index 0678073b74..e18d8dfbae 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -420,6 +420,8 @@  typedef struct BDRVQcow2State {
      * is to convert the image with the desired compression type set.
      */
     Qcow2CompressionType compression_type;
+
+    GHashTable *inflight_writes_counters;
 } BDRVQcow2State;
 
 typedef struct Qcow2COWRegion {
@@ -896,6 +898,13 @@  int qcow2_shrink_reftable(BlockDriverState *bs);
 int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size);
 int qcow2_detect_metadata_preallocation(BlockDriverState *bs);
 
+int qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
+                              int64_t length);
+int qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
+                              int64_t length);
+int qcow2_inflight_writes_dec_locked(BlockDriverState *bs, int64_t offset,
+                                     int64_t length);
+
 /* qcow2-cluster.c functions */
 int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
                         bool exact_size);
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 8e649b008e..464d133368 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -799,6 +799,140 @@  found:
     }
 }
 
+/*
+ * Qcow2InFlightRefcount is a type for values of s->inflight_writes_counters
+ * hasm map. And it's keys are cluster indices.
+ */
+typedef struct Qcow2InFlightRefcount {
+    /*
+     * Number of in-flight writes to the cluster, always > 0, as when becomes
+     * 0 the entry is removed from s->inflight_writes_counters.
+     */
+    uint64_t inflight_writes_cnt;
+
+    /* Cluster refcount is known to be zero */
+    bool refcount_zero;
+
+    /* Cluster refcount was made zero with this discard-type */
+    enum qcow2_discard_type type;
+} Qcow2InFlightRefcount;
+
+static Qcow2InFlightRefcount *find_infl_wr(BDRVQcow2State *s,
+                                           int64_t cluster_index)
+{
+    Qcow2InFlightRefcount *infl;
+
+    if (!s->inflight_writes_counters) {
+        return NULL;
+    }
+
+    infl = g_hash_table_lookup(s->inflight_writes_counters, &cluster_index);
+
+    if (infl) {
+        assert(infl->inflight_writes_cnt > 0);
+    }
+
+    return infl;
+}
+
+/*
+ * Returns true if there are any in-flight writes to the cluster blocking
+ * its reallocation.
+ */
+static bool has_infl_wr(BDRVQcow2State *s, int64_t cluster_index)
+{
+    return !!find_infl_wr(s, cluster_index);
+}
+
+static int update_inflight_write_cnt(BlockDriverState *bs,
+                                     int64_t offset, int64_t length,
+                                     bool decrease, bool locked)
+{
+    BDRVQcow2State *s = bs->opaque;
+    int64_t start, last, cluster_offset;
+
+    if (locked) {
+        qemu_co_mutex_assert_locked(&s->lock);
+    }
+
+    start = start_of_cluster(s, offset);
+    last = start_of_cluster(s, offset + length - 1);
+    for (cluster_offset = start; cluster_offset <= last;
+         cluster_offset += s->cluster_size)
+    {
+        int64_t cluster_index = cluster_offset >> s->cluster_bits;
+        Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
+
+        if (!infl) {
+            infl = g_new0(Qcow2InFlightRefcount, 1);
+            g_hash_table_insert(s->inflight_writes_counters,
+                                g_memdup(&cluster_index, sizeof(cluster_index)),
+                                infl);
+        }
+
+        if (decrease) {
+            assert(infl->inflight_writes_cnt >= 1);
+            infl->inflight_writes_cnt--;
+        } else {
+            infl->inflight_writes_cnt++;
+        }
+
+        if (infl->inflight_writes_cnt == 0) {
+            bool refcount_zero = infl->refcount_zero;
+            enum qcow2_discard_type type = infl->type;
+
+            g_hash_table_remove(s->inflight_writes_counters, &cluster_index);
+
+            if (refcount_zero) {
+                /*
+                 * Slow path. We must reset normal refcount to actually release
+                 * the cluster.
+                 */
+                int ret;
+
+                if (!locked) {
+                    qemu_co_mutex_lock(&s->lock);
+                }
+                ret = qcow2_update_cluster_refcount(bs, cluster_index, 0,
+                                                    true, type);
+                if (!locked) {
+                    qemu_co_mutex_unlock(&s->lock);
+                }
+
+                if (ret < 0) {
+                    return ret;
+                }
+            }
+        }
+
+    }
+
+    return 0;
+}
+
+int qcow2_inflight_writes_inc(BlockDriverState *bs, int64_t offset,
+                              int64_t length)
+{
+    return update_inflight_write_cnt(bs, offset, length, false, false);
+}
+
+/*
+ * Called with s->lock not locked by caller. Will take s->lock only if need to
+ * release the cluster (refcount is 0 and inflight-write-cnt becomes zero).
+ */
+int qcow2_inflight_writes_dec(BlockDriverState *bs, int64_t offset,
+                              int64_t length)
+{
+    return update_inflight_write_cnt(bs, offset, length, true, false);
+}
+
+/* Called with s->lock locked. */
+int qcow2_inflight_writes_dec_locked(BlockDriverState *bs, int64_t offset,
+                                     int64_t length)
+{
+    return update_inflight_write_cnt(bs, offset, length, true, true);
+}
+
 /* XXX: cache several refcount block clusters ? */
 /* @addend is the absolute value of the addend; if @decrease is set, @addend
  * will be subtracted from the current refcount, otherwise it will be added */
@@ -885,6 +1019,13 @@  static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
 
         if (refcount == 0) {
             void *table;
+            Qcow2InFlightRefcount *infl = find_infl_wr(s, cluster_index);
+
+            if (infl) {
+                infl->refcount_zero = true;
+                infl->type = type;
+                continue;
+            }
 
             table = qcow2_cache_is_table_offset(s->refcount_block_cache,
                                                 offset);
@@ -983,7 +1124,7 @@  retry:
 
         if (ret < 0) {
             return ret;
-        } else if (refcount != 0) {
+        } else if (refcount != 0 || has_infl_wr(s, next_cluster_index)) {
             goto retry;
         }
     }
@@ -1046,7 +1187,7 @@  int64_t qcow2_alloc_clusters_at(BlockDriverState *bs, uint64_t offset,
             ret = qcow2_get_refcount(bs, cluster_index++, &refcount);
             if (ret < 0) {
                 return ret;
-            } else if (refcount != 0) {
+            } else if (refcount != 0 || has_infl_wr(s, cluster_index)) {
                 break;
             }
         }
@@ -2294,7 +2435,9 @@  static int64_t alloc_clusters_imrt(BlockDriverState *bs,
          contiguous_free_clusters < cluster_count;
          cluster++)
     {
-        if (!s->get_refcount(*refcount_table, cluster)) {
+        if (!s->get_refcount(*refcount_table, cluster) &&
+            !has_infl_wr(s, cluster))
+        {
             contiguous_free_clusters++;
             if (first_gap) {
                 /* If this is the first free cluster found, update
diff --git a/block/qcow2.c b/block/qcow2.c
index d9f49a52e7..6ee94421e0 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1831,6 +1831,8 @@  static int coroutine_fn qcow2_do_open(BlockDriverState *bs, QDict *options,
 #endif
 
     qemu_co_queue_init(&s->thread_task_queue);
+    s->inflight_writes_counters =
+        g_hash_table_new_full(g_int64_hash, g_int64_equal, g_free, g_free);
 
     return ret;
 
@@ -2546,6 +2548,9 @@  out_unlocked:
 
 out_locked:
     qcow2_handle_l2meta(bs, &l2meta, false);
+
+    qcow2_inflight_writes_dec_locked(bs, host_offset, bytes);
+
     qemu_co_mutex_unlock(&s->lock);
 
     qemu_vfree(crypt_buf);
@@ -2605,6 +2610,8 @@  static coroutine_fn int qcow2_co_pwritev_part(
             goto out_locked;
         }
 
+        qcow2_inflight_writes_inc(bs, host_offset, cur_bytes);
+
         qemu_co_mutex_unlock(&s->lock);
 
         if (!aio && cur_bytes != bytes) {
@@ -2707,6 +2714,9 @@  static void qcow2_close(BlockDriverState *bs)
     g_free(s->image_backing_file);
     g_free(s->image_backing_format);
 
+    assert(g_hash_table_size(s->inflight_writes_counters) == 0);
+    g_hash_table_unref(s->inflight_writes_counters);
+
     if (has_data_file(bs)) {
         bdrv_unref_child(bs, s->data_file);
         s->data_file = NULL;
@@ -4097,10 +4107,17 @@  qcow2_co_copy_range_to(BlockDriverState *bs,
             goto fail;
         }
 
+        qcow2_inflight_writes_inc(bs, host_offset, cur_bytes);
+
         qemu_co_mutex_unlock(&s->lock);
+
         ret = bdrv_co_copy_range_to(src, src_offset, s->data_file, host_offset,
                                     cur_bytes, read_flags, write_flags);
+
         qemu_co_mutex_lock(&s->lock);
+
+        qcow2_inflight_writes_dec_locked(bs, host_offset, cur_bytes);
+
         if (ret < 0) {
             goto fail;
         }
@@ -4536,13 +4553,20 @@  qcow2_co_pwritev_compressed_task(BlockDriverState *bs,
     }
 
     ret = qcow2_pre_write_overlap_check(bs, 0, cluster_offset, out_len, true);
-    qemu_co_mutex_unlock(&s->lock);
     if (ret < 0) {
+        qemu_co_mutex_unlock(&s->lock);
         goto fail;
     }
 
+    qcow2_inflight_writes_inc(bs, cluster_offset, out_len);
+
+    qemu_co_mutex_unlock(&s->lock);
+
     BLKDBG_EVENT(s->data_file, BLKDBG_WRITE_COMPRESSED);
     ret = bdrv_co_pwrite(s->data_file, cluster_offset, out_len, out_buf, 0);
+
+    qcow2_inflight_writes_dec(bs, cluster_offset, out_len);
+
     if (ret < 0) {
         goto fail;
     }
diff --git a/tests/qemu-iotests/tests/qcow2-discard-during-rewrite b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
index 7f0d8a107a..2e2e0d2cb0 100755
--- a/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
+++ b/tests/qemu-iotests/tests/qcow2-discard-during-rewrite
@@ -1,5 +1,5 @@ 
 #!/usr/bin/env bash
-# group: quick disabled
+# group: quick
 #
 # Test discarding (and reusing) host cluster during writing data to it.
 #