Patchwork [RFC,07/17] block: make high level discard operation always zero

Submitter Paolo Bonzini
Date March 8, 2012, 5:15 p.m.
Message ID <1331226917-6658-8-git-send-email-pbonzini@redhat.com>
Permalink /patch/145583/
State New

Comments

Paolo Bonzini - March 8, 2012, 5:15 p.m.
While the formats/protocols are free to implement a discard operation
that does *not* zero the data, providing a common view to the guests is
the only way to avoid configuration and migration nightmares.  (We don't
want the admin to manually set up discard_zeroes_data/discard_granularity!)

This has a couple of drawbacks:

1) QEMU effectively will not try to use discard unless discard_zeroes_data
is true.  To do this we could add a flag such as BDRV_ALWAYS_DISCARD,
which the device models could use if their discard_zeroes_data property is
set to false.  However, I haven't done this, estimating that no strictly
positive integer could represent the number of people using it.

2) it may turn a thin-provisioning operation into a full preallocation,
which is not nice.

3) it may cost memory for a bounce buffer that only needs to be filled
with zeroes.

While (3) can be worked around, the only way around the other two,
unfortunately, is support in the formats and protocols.  We will still
provide device options to opt out of this, but with raw and qed covered
(+ qcow2 without backing file, and qcow3 in the future) it should not
be too bad.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c |   14 +++++++++++---
 1 files changed, 11 insertions(+), 3 deletions(-)
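The semantics the patch establishes can be sketched in isolation: the high-level discard is defined to zero, so it simply forwards to the zero-writing path. This is a minimal standalone model, not QEMU code; all names (`disk`, `do_write_zeroes`, `discard`) are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory "disk"; sizes chosen only for the sketch. */
enum { SECTOR_SIZE = 512, NUM_SECTORS = 8 };
static uint8_t disk[NUM_SECTORS * SECTOR_SIZE];

/* Low-level zeroing path: always leaves the range reading as zeroes. */
static int do_write_zeroes(int64_t sector_num, int nb_sectors)
{
    memset(disk + sector_num * SECTOR_SIZE, 0,
           (size_t)nb_sectors * SECTOR_SIZE);
    return 0;
}

/* High-level discard: guaranteed to zero, so it forwards to the
 * zeroing path -- the same shape as the patch's bdrv_co_discard. */
static int discard(int64_t sector_num, int nb_sectors)
{
    return do_write_zeroes(sector_num, nb_sectors);
}

/* Check that a range reads back as all zeroes. */
static int reads_as_zeroes(int64_t sector_num, int nb_sectors)
{
    for (int64_t i = sector_num * SECTOR_SIZE;
         i < (sector_num + nb_sectors) * SECTOR_SIZE; i++) {
        if (disk[i] != 0) {
            return 0;
        }
    }
    return 1;
}
```

Drawbacks (2) and (3) show up directly in this model: `do_write_zeroes` touches every byte, which is where a format-level zero flag would avoid the allocation and the bounce buffer.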
Avi Kivity - March 8, 2012, 5:55 p.m.
On 03/08/2012 07:15 PM, Paolo Bonzini wrote:
> While the formats/protocols are free to implement a discard operation
> that does *not* zero the data, providing a common view to the guests is
> the only way to avoid configuration and migration nightmares.  (We don't
> want the admin to manually set up discard_zeroes_data/discard_granularity!)
>
> This has a couple of drawbacks:
>
> 1) QEMU effectively will not try to use discard unless discard_zeroes_data
> is true.  To do this we could add a flag such as BDRV_ALWAYS_DISCARD,
> which the device models could use if their discard_zeroes_data property is
> set to false.  However, I haven't done this, estimating that no strictly
> positive integer could represent the number of people using it.
>
> 2) it may turn a thin-provisioning operation into a full preallocation,
> which is not nice.
>
> 3) it may cost memory for a bounce buffer that only needs to be filled
> with zeroes.
>
> While (3) can be worked around, the only way around the other two,
> unfortunately, is support in the formats and protocols.  We will still
> provide device options to opt out of this, but with raw and qed covered
> (+ qcow2 without backing file, and qcow3 in the future) it should not
> be too bad.

Can't qcow2 with a backing file also be supported?  Zero out the first
cluster, and remember it.  The following discards can reuse this zero
cluster, as long as it hasn't been overwritten.
Kevin Wolf - March 9, 2012, 4:42 p.m.
Am 08.03.2012 18:55, schrieb Avi Kivity:
> On 03/08/2012 07:15 PM, Paolo Bonzini wrote:
>> While the formats/protocols are free to implement a discard operation
>> that does *not* zero the data, providing a common view to the guests is
>> the only way to avoid configuration and migration nightmares.  (We don't
>> want the admin to manually set up discard_zeroes_data/discard_granularity!)
>>
>> This has a couple of drawbacks:
>>
>> 1) QEMU effectively will not try to use discard unless discard_zeroes_data
>> is true.  To do this we could add a flag such as BDRV_ALWAYS_DISCARD,
>> which the device models could use if their discard_zeroes_data property is
>> set to false.  However, I haven't done this, estimating that no strictly
>> positive integer could represent the number of people using it.
>>
>> 2) it may turn a thin-provisioning operation into a full preallocation,
>> which is not nice.
>>
>> 3) it may cost memory for a bounce buffer that only needs to be filled
>> with zeroes.
>>
>> While (3) can be worked around, the only way around the other two,
>> unfortunately, is support in the formats and protocols.  We will still
>> provide device options to opt out of this, but with raw and qed covered
>> (+ qcow2 without backing file, and qcow3 in the future) it should not
>> be too bad.
> 
> Can't qcow2 with a backing file also be supported?  Zero out the first
> cluster, and remember it.  The following discards can reuse this zero
> cluster, as long as it hasn't been overwritten.

qcow2 can't handle clusters that are referenced twice from the same L1
table. This would require a reverse lookup to adjust the QCOW_O_COPIED
flags in the L2 tables containing the other references.

Kevin
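The invariant Kevin is describing can be modelled in a few lines: the "copied" flag in an L2 entry must be set if and only if the cluster's refcount is exactly 1. The layout below is deliberately simplified (one flag, one refcount field), not the on-disk qcow2 format.

```c
#include <assert.h>

#define FLAG_COPIED (1u << 0)

struct cluster {
    unsigned refcount;
    unsigned l2_flags;   /* flags of one L2 entry pointing at it */
};

/* Simplified rule: flag set <=> refcount == 1. A cleared flag with
 * refcount 1 is also flagged as inconsistent, matching what
 * qemu-img check complains about. */
static int entry_is_consistent(const struct cluster *c)
{
    int copied = (c->l2_flags & FLAG_COPIED) != 0;
    return copied == (c->refcount == 1);
}

/* Sharing one zero cluster from a second L2 entry bumps the refcount
 * past 1, so any entry still carrying FLAG_COPIED goes stale -- and
 * clearing it would need the reverse lookup Kevin mentions. */
static int share_cluster(struct cluster *c)
{
    c->refcount++;
    return entry_is_consistent(c);
}
```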
Avi Kivity - March 12, 2012, 10:42 a.m.
On 03/09/2012 06:42 PM, Kevin Wolf wrote:
> >>
> >> While (3) can be worked around, the only way around the other two,
> >> unfortunately, is support in the formats and protocols.  We will still
> >> provide device options to opt out of this, but with raw and qed covered
> >> (+ qcow2 without backing file, and qcow3 in the future) it should not
> >> be too bad.
> > 
> > Can't qcow2 with a backing file also be supported?  Zero out the first
> > cluster, and remember it.  The following discards can reuse this zero
> > cluster, as long as it hasn't been overwritten.
>
> qcow2 can't handle clusters that are referenced twice from the same L1
> table. This would require a reverse lookup to adjust the QCOW_O_COPIED
> flags in the L2 tables containing the other references.

Don't follow, sorry.  What adjustment are you talking about?

If it's a 1->0 transition, is it mandatory to adjust the flag?  That is,
is it legal to have a refcount of exactly one, but have the flag clear?
Kevin Wolf - March 12, 2012, 11:04 a.m.
Am 12.03.2012 11:42, schrieb Avi Kivity:
> On 03/09/2012 06:42 PM, Kevin Wolf wrote:
>>>>
>>>> While (3) can be worked around, the only way around the other two,
>>>> unfortunately, is support in the formats and protocols.  We will still
>>>> provide device options to opt out of this, but with raw and qed covered
>>>> (+ qcow2 without backing file, and qcow3 in the future) it should not
>>>> be too bad.
>>>
>>> Can't qcow2 with a backing file also be supported?  Zero out the first
>>> cluster, and remember it.  The following discards can reuse this zero
>>> cluster, as long as it hasn't been overwritten.
>>
>> qcow2 can't handle clusters that are referenced twice from the same L1
>> table. This would require a reverse lookup to adjust the QCOW_O_COPIED
>> flags in the L2 tables containing the other references.
> 
> Don't follow, sorry.  What adjustment are you talking about?
> 
> If it's a 1->0 transition, is it mandatory to adjust the flag?  That is,
> is it legal to have a refcount of exactly one, but have the flag clear?

According to the spec it's illegal and qemu-img check will complain,
too. In practice, I'm not entirely sure if it will cause real corruption
or just unnecessary COWs. I believe it may be the latter.

But it's not worth the trouble anyway when we can have a real zero flag
in qcow3.

Kevin
Avi Kivity - March 12, 2012, 12:03 p.m.
On 03/12/2012 01:04 PM, Kevin Wolf wrote:
> >>
> >> qcow2 can't handle clusters that are referenced twice from the same L1
> >> table. This would require a reverse lookup to adjust the QCOW_O_COPIED
> >> flags in the L2 tables containing the other references.
> > 
> > Don't follow, sorry.  What adjustment are you talking about?
> > 
> > If it's a 1->0 transition, is it mandatory to adjust the flag?  That is,
> > is it legal to have a refcount of exactly one, but have the flag clear?
>
> According to the spec it's illegal and qemu-img check will complain,
> too. In practice, I'm not entirely sure if it will cause real corruption
> or just unnecessary COWs. I believe it may be the latter.

We could retro-doc it but I'm not a huge fan of this practice.

> But it's not worth the trouble anyway when we can have a real zero flag
> in qcow3.

ok.  Thanks for the explanations.
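The "real zero flag" Kevin refers to landed in the qcow2 version 3 format, where bit 0 of an L2 table entry marks a cluster that reads as all zeroes without any host cluster being allocated. A simplified sketch of how discard works under such a flag (the constant name and entry handling here are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define L2E_ZERO (1ull << 0)   /* simplified stand-in for the v3 zero bit */

/* Discard with a zero flag: no allocation, no bounce buffer, no
 * shared cluster to refcount -- just rewrite the L2 entry. Releasing
 * the previously mapped cluster is assumed to happen elsewhere. */
static uint64_t discard_l2_entry(uint64_t entry)
{
    (void)entry;
    return L2E_ZERO;          /* entry now means "reads as zeroes" */
}

static int entry_reads_as_zeroes(uint64_t entry)
{
    return (entry & L2E_ZERO) != 0;
}
```

This avoids both remaining drawbacks from the cover text: the operation stays a metadata update rather than a preallocation, and it works even with a backing file, since the flag overrides whatever the backing file contains.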

Patch

diff --git a/block.c b/block.c
index 34db914..30b137f 100644
--- a/block.c
+++ b/block.c
@@ -74,6 +74,8 @@  static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
     BdrvRequestFlags flags);
 static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
+static int coroutine_fn bdrv_co_do_discard(BlockDriverState *bs,
+    int64_t sector_num, int nb_sectors);
 static BlockDriverAIOCB *bdrv_co_aio_rw_vector(BlockDriverState *bs,
                                                int64_t sector_num,
                                                QEMUIOVector *qiov,
@@ -1797,7 +1799,7 @@  static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
         bdi.discard_granularity &&
         (sector_num & (bdi.discard_granularity - 1)) == 0 &&
         (nb_sectors & (bdi.discard_granularity - 1)) == 0) {
-        return bdrv_co_discard(bs, sector_num, nb_sectors);
+        return bdrv_co_do_discard(bs, sector_num, nb_sectors);
     }
 
     if (qiov) {
@@ -3621,8 +3623,8 @@  static void coroutine_fn bdrv_discard_co_entry(void *opaque)
     rwco->ret = bdrv_co_discard(rwco->bs, rwco->sector_num, rwco->nb_sectors);
 }
 
-int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
-                                 int nb_sectors)
+static int coroutine_fn bdrv_co_do_discard(BlockDriverState *bs,
+                                           int64_t sector_num, int nb_sectors)
 {
     if (!bs->drv) {
         return -ENOMEDIUM;
@@ -3651,6 +3653,12 @@  int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
     }
 }
 
+int coroutine_fn bdrv_co_discard(BlockDriverState *bs,
+    int64_t sector_num, int nb_sectors)
+{
+    return bdrv_co_do_write_zeroes(bs, sector_num, nb_sectors, NULL);
+}
+
 int bdrv_discard(BlockDriverState *bs, int64_t sector_num, int nb_sectors)
 {
     Coroutine *co;
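The hunk in bdrv_co_do_write_zeroes above gates the discard fast path on alignment to bdi.discard_granularity using the `(x & (g - 1)) == 0` idiom, which is a correct alignment test only when the granularity is a power of two. As a standalone sketch (function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

static int aligned_to_granularity(int64_t sector_num, int nb_sectors,
                                  unsigned granularity)
{
    if (granularity == 0) {
        return 0;   /* device reported no discard granularity */
    }
    /* For power-of-two g, (x & (g - 1)) == 0 iff x is a multiple of g. */
    return (sector_num & (granularity - 1)) == 0 &&
           (nb_sectors & (granularity - 1)) == 0;
}
```

Both the start and the length must be aligned; a request that is misaligned in either falls through to the explicit zero-writing path instead of being discarded.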