
[RFC] block: optimize zero writes with bdrv_write_zeroes

Message ID 1393074022-32388-1-git-send-email-pl@kamp.de
State New

Commit Message

Peter Lieven Feb. 22, 2014, 1 p.m. UTC
This patch tries to optimize zero write requests
by automatically using bdrv_write_zeroes if it is
supported by the format.

I know that there is a lot of potential for discussion, but I would
like to know what others think.

This should significantly speed up file system initialization and
should also speed up the zero-write tests used to benchmark backend
storage performance.

The difference can simply be tested with, e.g.:

dd if=/dev/zero of=/dev/vdX bs=1M

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 block.c               |    8 ++++++++
 include/qemu-common.h |    1 +
 util/iov.c            |   20 ++++++++++++++++++++
 3 files changed, 29 insertions(+)

Comments

Fam Zheng Feb. 22, 2014, 4:45 p.m. UTC | #1
On Sat, 02/22 14:00, Peter Lieven wrote:
> this patch tries to optimize zero write requests
> by automatically using bdrv_write_zeroes if it is
> supported by the format.
> 
> i know that there is a lot of potential for discussion, but i would
> like to know what the others think.
> 
> this should significantly speed up file system initialization and
> should speed zero write test used to test backend storage performance.
> 
> the difference can simply be tested by e.g.
> 
> dd if=/dev/zero of=/dev/vdX bs=1M
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---

With this patch, is it still possible to actually do a zero fill? A prefill
usually writes zeroes too, but according to the semantics, bdrv_write_zeroes
may just set the L2 entry flag without allocating clusters, which won't satisfy
that.

Thanks,
Fam
Peter Lieven Feb. 23, 2014, 7:10 p.m. UTC | #2
Am 22.02.2014 17:45, schrieb Fam Zheng:
> On Sat, 02/22 14:00, Peter Lieven wrote:
>> this patch tries to optimize zero write requests
>> by automatically using bdrv_write_zeroes if it is
>> supported by the format.
>>
>> i know that there is a lot of potential for discussion, but i would
>> like to know what the others think.
>>
>> this should significantly speed up file system initialization and
>> should speed zero write test used to test backend storage performance.
>>
>> the difference can simply be tested by e.g.
>>
>> dd if=/dev/zero of=/dev/vdX bs=1M
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
> With this patch, is is still possible to actually do zero fill? Prefill is
> usually writing zeroes too, but according to the semantic, bdrv_write_zeroes
> may just set L2 entry flag without allocating clusters, which won't satisfy
> that.
Can you specify which operation you mean exactly? I don't think that
there is a problem, but maybe it would be better to add a check for
bs->file != NULL so the optimization only takes place for the format,
not for the protocol.

In the meantime I ran a short test which shows that there is potential for significant
improvement.

I created a 60 GB qcow2 container and formatted it with ext4. To show the difference
immediately, I disabled lazy inode table and lazy journal initialization (the zero writes
then take place immediately).

Timing without the patch:
time mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/vda
real 1m5.649s
user 0m0.416s
sys  0m3.148s

Timing with the patch:
time mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/vdX
real 0m1.228s
user 0m0.116s
sys  0m0.732s

Container size after format without the patch: 1150615552 bytes (1097.3 MB)
Container size after format with the patch: 24645632 bytes (23.5 MB)

Peter


>
> Thanks,
> Fam
Fam Zheng Feb. 24, 2014, 1:01 a.m. UTC | #3
On Sun, 02/23 20:10, Peter Lieven wrote:
> Am 22.02.2014 17:45, schrieb Fam Zheng:
> > On Sat, 02/22 14:00, Peter Lieven wrote:
> >> this patch tries to optimize zero write requests
> >> by automatically using bdrv_write_zeroes if it is
> >> supported by the format.
> >>
> >> i know that there is a lot of potential for discussion, but i would
> >> like to know what the others think.
> >>
> >> this should significantly speed up file system initialization and
> >> should speed zero write test used to test backend storage performance.
> >>
> >> the difference can simply be tested by e.g.
> >>
> >> dd if=/dev/zero of=/dev/vdX bs=1M
> >>
> >> Signed-off-by: Peter Lieven <pl@kamp.de>
> >> ---
> > With this patch, is is still possible to actually do zero fill? Prefill is
> > usually writing zeroes too, but according to the semantic, bdrv_write_zeroes
> > may just set L2 entry flag without allocating clusters, which won't satisfy
> > that.
> Can you specify which operation you exactly mean? I don't think that
> there is a problem, but maybe it would be better to add a check for
> bs->file != NULL so the optimization takes only place for the format
> not for the protocol.
> 

Previously, users could do

dd if=/dev/zero of=/dev/vdX

to force backend allocation and mapping. This is meaningful for later I/O
performance, but how long the dd takes doesn't matter as much, since it is a
one-time shot.

The same applies to your test case: yes, the mkfs time may be improved, but it
comes with a tradeoff of slower later I/O: when the real user data arrives, the
allocation still needs to be done.

I would do this in qcow2:

 1. In qcow2_co_writev, allocate the cluster regardless of whether the data is zero or not.
 2. If the data is zero, set QCOW2_OFLAG_ZERO in the L2 entry.

Fam
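
A rough sketch of that two-step flow (the qcow2_* helper names here are
hypothetical placeholders, not functions that exist in block/qcow2.c; the
real write path goes through qcow2_co_writev and the cluster allocation
code, so this only illustrates the intended ordering):

    /* hypothetical helpers; names are illustrative only */
    static coroutine_fn int qcow2_co_writev_sketch(BlockDriverState *bs,
                                                   int64_t sector_num,
                                                   int nb_sectors,
                                                   QEMUIOVector *qiov)
    {
        int ret;

        /* 1. always allocate the backing clusters, even for an all-zero
         *    payload, so a later rewrite needs no new allocation */
        ret = qcow2_allocate_data_clusters(bs, sector_num, nb_sectors);
        if (ret < 0) {
            return ret;
        }

        /* 2. for an all-zero payload, skip the data write and only set
         *    QCOW2_OFLAG_ZERO in the L2 entries, so the clusters read
         *    back as zeroes without touching the data area */
        if (qemu_iovec_is_zero(qiov)) {
            return qcow2_mark_clusters_zero(bs, sector_num, nb_sectors);
        }

        /* non-zero payload: fall through to the normal data write */
        return qcow2_write_data_clusters(bs, sector_num, nb_sectors, qiov);
    }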
Kevin Wolf Feb. 24, 2014, 10:11 a.m. UTC | #4
Am 22.02.2014 um 14:00 hat Peter Lieven geschrieben:
> this patch tries to optimize zero write requests
> by automatically using bdrv_write_zeroes if it is
> supported by the format.
> 
> i know that there is a lot of potential for discussion, but i would
> like to know what the others think.
> 
> this should significantly speed up file system initialization and
> should speed zero write test used to test backend storage performance.
> 
> the difference can simply be tested by e.g.
> 
> dd if=/dev/zero of=/dev/vdX bs=1M
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>

As you have probably expected, there's no way I can let the patch in in
this form. At the very least you need to introduce a boolean option to enable
or disable the zero check. (The default would probably be disabled, but
we can discuss this.)

>  block.c               |    8 ++++++++
>  include/qemu-common.h |    1 +
>  util/iov.c            |   20 ++++++++++++++++++++
>  3 files changed, 29 insertions(+)
> 
> diff --git a/block.c b/block.c
> index 6f4baca..505888e 100644
> --- a/block.c
> +++ b/block.c
> @@ -3145,6 +3145,14 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>  
>      ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
>  
> +    if (!ret && !(flags & BDRV_REQ_ZERO_WRITE) &&
> +        drv->bdrv_co_write_zeroes && qemu_iovec_is_zero(qiov)) {
> +        flags |= BDRV_REQ_ZERO_WRITE;
> +        /* if the device was not opened with discard=on the below flag
> +         * is immediately cleared again in bdrv_co_do_write_zeroes */
> +        flags |= BDRV_REQ_MAY_UNMAP;

I'm not sure about this one. I think it is reasonable to expect that
after an explicit write of a buffer filled with zeros the block is
allocated.

In a simple qcow2-on-file case, we basically have three options for
handling all-zero writes:

- Allocate the cluster on a qcow2 and file level and write literal zeros
  to it. No metadata updates involved in the next write to the cluster.

- Set the qcow2 zero flag, but leave the allocation in place. The next
  write in theory just needs to remove the zero flag (I think in
  practice we're doing an unnecessary COW) from the L2 table and that's
  it.

- Set the qcow2 zero flag and unmap the cluster on both the qcow2 and
  the filesystem layer. The next write causes new allocations in both
  layers, which means multiple metadata updates and possibly added
  fragmentation. The upside is that we use less disk space if there is
  no next write to this cluster.

I think it's pretty clear that the right behaviour depends on your use
case and we can't find a one-size-fits-all solution.

> +    }
> +
>      if (ret < 0) {
>          /* Do nothing, write notifier decided to fail this request */
>      } else if (flags & BDRV_REQ_ZERO_WRITE) {

Kevin
Peter Lieven Feb. 24, 2014, 10:26 a.m. UTC | #5
On 24.02.2014 11:11, Kevin Wolf wrote:
> Am 22.02.2014 um 14:00 hat Peter Lieven geschrieben:
>> this patch tries to optimize zero write requests
>> by automatically using bdrv_write_zeroes if it is
>> supported by the format.
>>
>> i know that there is a lot of potential for discussion, but i would
>> like to know what the others think.
>>
>> this should significantly speed up file system initialization and
>> should speed zero write test used to test backend storage performance.
>>
>> the difference can simply be tested by e.g.
>>
>> dd if=/dev/zero of=/dev/vdX bs=1M
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
> As you probably have expected, there's no way I can let the patch in in
> this form. The least you need to introduce is a boolean option to enable
> or disable the zero check. (The default would probably be disabled, but
> we can discuss this.)
I would have been really surprised *g*. As you and Fam already pointed
out, the desired behaviour is heavily dependent on the use case.

I personally do not need this for QCOW2 but for iSCSI. Here the optimization
is basically saved bandwidth since a zero write becomes a WRITESAME.
Unless the user specifies unmap=on there is no change in what is written to disk.

I would be fine with a default-off boolean variable. For my case it would also be
sufficient to have a boolean flag in the BlockDriver that indicates whether this
optimization is a good idea. For iSCSI I think it is, and I think the same goes for
GlusterFS. In both cases it basically saves bandwidth and lets the backend storage
write zeroes more efficiently if it is capable of doing so. A third use case would be
a raw device on an SSD. In all cases, if unmap=on it would additionally save disk space.

>
>>   block.c               |    8 ++++++++
>>   include/qemu-common.h |    1 +
>>   util/iov.c            |   20 ++++++++++++++++++++
>>   3 files changed, 29 insertions(+)
>>
>> diff --git a/block.c b/block.c
>> index 6f4baca..505888e 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -3145,6 +3145,14 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>>   
>>       ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
>>   
>> +    if (!ret && !(flags & BDRV_REQ_ZERO_WRITE) &&
>> +        drv->bdrv_co_write_zeroes && qemu_iovec_is_zero(qiov)) {
>> +        flags |= BDRV_REQ_ZERO_WRITE;
>> +        /* if the device was not opened with discard=on the below flag
>> +         * is immediately cleared again in bdrv_co_do_write_zeroes */
>> +        flags |= BDRV_REQ_MAY_UNMAP;
> I'm not sure about this one. I think it is reasonable to expect that
> after an explicit write of a buffer filled with zeros the block is
> allocated.
>
> In a simple qcow2-on-file case, we basically have three options for
> handling all-zero writes:
>
> - Allocate the cluster on a qcow2 and file level and write literal zeros
>    to it. No metadata updates involved in the next write to the cluster.
>
> - Set the qcow2 zero flag, but leave the allocation in place. The next
>    write in theory just needs to remove the zero flag (I think in
>    practice we're doing an unnecessary COW) from the L2 table and that's
>    it.
>
> - Set the qcow2 zero flag and unmap the cluster on both the qcow2 and
>    the filesystem layer. The next write causes new allocations in both
>    layers, which means multiple metadata updates and possibly added
>    fragmentation. The upside is that we use less disk space if there is
>    no next write to this cluster.
>
> I think it's pretty clear that the right behaviour depends on your use
> case and we can't find an one-size-fits-all solution.
I wouldn't mind having this optimization only work on the raw format for
the moment.

Peter
>
>> +    }
>> +
>>       if (ret < 0) {
>>           /* Do nothing, write notifier decided to fail this request */
>>       } else if (flags & BDRV_REQ_ZERO_WRITE) {
> Kevin
Paolo Bonzini Feb. 24, 2014, 10:38 a.m. UTC | #6
Il 24/02/2014 11:26, Peter Lieven ha scritto:
>
> I personally do not need this for QCOW2 but for iSCSI. Here the optimization
> is basically saved bandwidth since a zero write becomes a WRITESAME.

It saves bandwidth, but at the potential cost of extra host CPU
utilization.  I would be fine with having this happen automatically, but
drv->bdrv_co_write_zeroes is not the right check because it is true for
the qcow2 and raw formats.  Something using bdrv_get_info is probably
better, because it would have fewer or no false positives.

> In all cases if unmap=on it would additionally save disk space.

It would also cause worse performance, though.  I think the automatic
addition of BDRV_REQ_MAY_UNMAP is what should be a separate option.
Perhaps you could have a three-state option, detect-zeros=no/yes/unmap.

Paolo
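
A sketch of how such a three-state setting could map onto the request
flags in bdrv_aligned_pwritev (the DetectZeroes enum and the
bs->detect_zeroes field are assumptions for illustration; only the
BDRV_REQ_* flags and qemu_iovec_is_zero come from the posted patch):

    /* hypothetical per-device setting, e.g. stored on the
     * BlockDriverState and filled in from a -drive option */
    typedef enum {
        DETECT_ZEROES_OFF,    /* never scan write payloads */
        DETECT_ZEROES_ON,     /* convert zero writes, keep allocation */
        DETECT_ZEROES_UNMAP,  /* additionally allow unmapping */
    } DetectZeroes;

    if (!ret && bs->detect_zeroes != DETECT_ZEROES_OFF &&
        !(flags & BDRV_REQ_ZERO_WRITE) &&
        drv->bdrv_co_write_zeroes && qemu_iovec_is_zero(qiov)) {
        flags |= BDRV_REQ_ZERO_WRITE;
        if (bs->detect_zeroes == DETECT_ZEROES_UNMAP) {
            flags |= BDRV_REQ_MAY_UNMAP;
        }
    }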
Paolo Bonzini Feb. 24, 2014, 10:39 a.m. UTC | #7
Il 24/02/2014 02:01, Fam Zheng ha scritto:
>
> I would do this in qcow2:
>
>  1. In qcow2_co_writev, allocate cluster regardless of if data is zero or not.
>  2. If data is zero, set QCOW2_OFLAG_ZERO in L2.

This is (or should be) bdrv_co_write_zeroes without BDRV_REQ_MAY_UNMAP.

Paolo
Fam Zheng Feb. 24, 2014, 11:33 a.m. UTC | #8
On Mon, 02/24 11:39, Paolo Bonzini wrote:
> Il 24/02/2014 02:01, Fam Zheng ha scritto:
> >
> >I would do this in qcow2:
> >
> > 1. In qcow2_co_writev, allocate cluster regardless of if data is zero or not.
> > 2. If data is zero, set QCOW2_OFLAG_ZERO in L2.
> 
> This is (or should be) bdrv_co_write_zeroes without BDRV_REQ_MAY_UNMAP.

But IIUC, bdrv_co_write_zeroes without BDRV_REQ_MAY_UNMAP doesn't require
cluster allocation if the cluster isn't allocated yet, which is a bit different.

Fam
Peter Lieven Feb. 24, 2014, 11:50 a.m. UTC | #9
On 24.02.2014 11:38, Paolo Bonzini wrote:
> Il 24/02/2014 11:26, Peter Lieven ha scritto:
>>
>> I personally do not need this for QCOW2 but for iSCSI. Here the optimization
>> is basically saved bandwidth since a zero write becomes a WRITESAME.
>
> It saves bandwidth, but at the potential cost of extra host CPU utilization.  I would be fine with having this automatically, but drv->bdrv_co_write_zeroes is not the right check because it is true for qcow2 and raw formats.  Something using 
> bdrv_get_info is probably better, because it would have fewer or no false positives.
>
>> In all cases if unmap=on it would additionally save disk space.
>
> It would also cause worse performance though.  I think the automatic addition BDRV_REQ_MAY_UNMAP is what should be a separate option. Perhaps you can have a three-state option, detect-zeros=no/yes/unmap.
I think this is the best option to start with, defaulting it to 'no' in the beginning.

Peter
Paolo Bonzini Feb. 24, 2014, 11:51 a.m. UTC | #10
Il 24/02/2014 12:33, Fam Zheng ha scritto:
> > This is (or should be) bdrv_co_write_zeroes without BDRV_REQ_MAY_UNMAP.
>
> But IIUC bdrv_co_write_zeroes without BDRV_REQ_MAY_UNMAP doesn't require
> cluster allocation if it's allocated yet, which is a bit different.

Yeah, that's why I wrote "or should be".  Those are the intended 
semantics of bdrv_co_write_zeroes without BDRV_REQ_MAY_UNMAP: always 
allocate a cluster that will read as zeroes (allocating even if it does 
not necessarily write the zeroes).

For legacy reasons it may not be exactly what is implemented.  I asked
Kevin a couple of weeks ago and he sent a patch, but even he wasn't sure
of what qcow2 was doing until he looked at the code. :)

Paolo
Fam Zheng Feb. 24, 2014, 12:04 p.m. UTC | #11
On Mon, 02/24 12:51, Paolo Bonzini wrote:
> Il 24/02/2014 12:33, Fam Zheng ha scritto:
> >> This is (or should be) bdrv_co_write_zeroes without BDRV_REQ_MAY_UNMAP.
> >
> >But IIUC bdrv_co_write_zeroes without BDRV_REQ_MAY_UNMAP doesn't require
> >cluster allocation if it's allocated yet, which is a bit different.
> 
> Yeah, that's why I wrote "or should be".  Those are the intended semantics
> of bdrv_co_write_zeroes without BDRV_REQ_MAY_UNMAP: always allocate a
> cluster that will read as zeroes (allocating even if it does not necessarily
> write the zeroes).
> 
> For legacy reasons it may not be exactly what is implemented.  I asked Kevin
> a couple of weeks ago and he sent a patch, but even he wasn't sure of what
> qcow2 was doing util he looked at the code. :)
> 

I see. I can only tell that in VMDK a cluster doesn't have this "mapped and
zeroed" state, so maybe we need some flexibility here and fewer assumptions.

Fam
Kevin Wolf Feb. 24, 2014, 12:07 p.m. UTC | #12
Am 24.02.2014 um 12:51 hat Paolo Bonzini geschrieben:
> Il 24/02/2014 12:33, Fam Zheng ha scritto:
> >> This is (or should be) bdrv_co_write_zeroes without BDRV_REQ_MAY_UNMAP.
> >
> >But IIUC bdrv_co_write_zeroes without BDRV_REQ_MAY_UNMAP doesn't require
> >cluster allocation if it's allocated yet, which is a bit different.
> 
> Yeah, that's why I wrote "or should be".  Those are the intended
> semantics of bdrv_co_write_zeroes without BDRV_REQ_MAY_UNMAP: always
> allocate a cluster that will read as zeroes (allocating even if it
> does not necessarily write the zeroes).

Which would mean that there is no way to say "give me zeroes, and do it
in the cheapest way possible". Because that would be to leave the
allocation status as it is and just toggle the zero bit.

> For legacy reasons it may not be exactly what is implemented.  I
> asked Kevin a couple of weeks ago and he sent a patch, but even he
> wasn't sure of what qcow2 was doing util he looked at the code. :)

Even with my patch, it doesn't actually allocate the cluster, it just
sets the flag.

Kevin
Paolo Bonzini Feb. 24, 2014, 12:10 p.m. UTC | #13
Il 24/02/2014 13:07, Kevin Wolf ha scritto:
>> > Yeah, that's why I wrote "or should be".  Those are the intended
>> > semantics of bdrv_co_write_zeroes without BDRV_REQ_MAY_UNMAP: always
>> > allocate a cluster that will read as zeroes (allocating even if it
>> > does not necessarily write the zeroes).
> Which would mean that there is no way to say "give me zeroes, and do it
> in the cheapest way possible". Because that would be to leave the
> allocation status as it is and just toggle the zero bit.

If bdrv_co_write_zeroes is SCSI's "WRITE SAME without UNMAP", then it 
must allocate.  I think "give me zeroes and do it in the cheapest way 
possible" is BDRV_REQ_MAY_UNMAP (which *may* unmap but it doesn't have to).

That said, the really expensive part of unmapping is probably 
re-allocating the clusters on subsequent writes.  The unmap itself isn't 
that expensive (even if you have to flush the refcount blocks before the 
L2 tables), is it?

Paolo
Kevin Wolf Feb. 24, 2014, 12:22 p.m. UTC | #14
Am 24.02.2014 um 13:10 hat Paolo Bonzini geschrieben:
> Il 24/02/2014 13:07, Kevin Wolf ha scritto:
> >>> Yeah, that's why I wrote "or should be".  Those are the intended
> >>> semantics of bdrv_co_write_zeroes without BDRV_REQ_MAY_UNMAP: always
> >>> allocate a cluster that will read as zeroes (allocating even if it
> >>> does not necessarily write the zeroes).
> >Which would mean that there is no way to say "give me zeroes, and do it
> >in the cheapest way possible". Because that would be to leave the
> >allocation status as it is and just toggle the zero bit.
> 
> If bdrv_co_write_zeroes is SCSI's "WRITE SAME without UNMAP", then
> it must allocate.  I think "give me zeroes and do it in the cheapest
> way possible" is BDRV_REQ_MAY_UNMAP (which *may* unmap but it
> doesn't have to).

Hm, okay. So the intended behaviour for qcow2 is:
- without MAY_UNMAP: Preallocate the cluster and set the zero flag so
  that its content isn't valid
- with MAY_UNMAP: If the cluster is allocated, leave it allocated and
  set the zero flag; if it isn't, leave it unallocated and set the zero
  flag.

> That said, the really expensive part of unmapping is probably
> re-allocating the clusters on subsequent writes.  The unmap itself
> isn't that expensive (even if you have to flush the refcount blocks
> before the L2 tables), is it?

Flushes are pretty expensive. If you only have many discards in a row
and then many allocations in a row, it's probably okay. Mixing them so
that every other write is a discard might well kill performance.

Kevin
Peter Lieven Feb. 24, 2014, 1:01 p.m. UTC | #15
On 24.02.2014 11:38, Paolo Bonzini wrote:
> Il 24/02/2014 11:26, Peter Lieven ha scritto:
>>
>> I personally do not need this for QCOW2 but for iSCSI. Here the optimization
>> is basically saved bandwidth since a zero write becomes a WRITESAME.
>
> It saves bandwidth, but at the potential cost of extra host CPU utilization.  I would be fine with having this automatically, but drv->bdrv_co_write_zeroes is not the right check because it is true for qcow2 and raw formats.  Something using 
> bdrv_get_info is probably better, because it would have fewer or no false positives.
>
>> In all cases if unmap=on it would additionally save disk space.
>
> It would also cause worse performance though.  I think the automatic addition BDRV_REQ_MAY_UNMAP is what should be a separate option. Perhaps you can have a three-state option, detect-zeros=no/yes/unmap.

What would be the desired way to store this flag in the BlockDriverState?

Peter
Kevin Wolf Feb. 25, 2014, 1:41 p.m. UTC | #16
Am 24.02.2014 um 14:01 hat Peter Lieven geschrieben:
> On 24.02.2014 11:38, Paolo Bonzini wrote:
> >Il 24/02/2014 11:26, Peter Lieven ha scritto:
> >>
> >>I personally do not need this for QCOW2 but for iSCSI. Here the optimization
> >>is basically saved bandwidth since a zero write becomes a WRITESAME.
> >
> >It saves bandwidth, but at the potential cost of extra host CPU
> >utilization.  I would be fine with having this automatically, but
> >drv->bdrv_co_write_zeroes is not the right check because it is
> >true for qcow2 and raw formats.  Something using bdrv_get_info is
> >probably better, because it would have fewer or no false
> >positives.
> >
> >>In all cases if unmap=on it would additionally save disk space.
> >
> >It would also cause worse performance though.  I think the automatic addition BDRV_REQ_MAY_UNMAP is what should be a separate option. Perhaps you can have a three-state option, detect-zeros=no/yes/unmap.
> 
> What would be the desired way to store this flag in the BlockDriverState?

Some new enum field?

Kevin
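
For illustration, a minimal sketch of what such a field could look like,
reusing the DetectZeroes sketch from earlier in the thread (the names are
assumptions; in the real tree the enum would more likely be generated from
QAPI so the setting can be wired up to a -drive or blockdev option):

    /* include/block/block_int.h (sketch) */
    struct BlockDriverState {
        /* ... existing fields ... */

        /* how aggressively all-zero writes are converted into
         * bdrv_co_write_zeroes requests; default is OFF */
        DetectZeroes detect_zeroes;
    };

    /* block.c, when the device is opened (sketch) */
    bs->detect_zeroes = DETECT_ZEROES_OFF;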
Peter Lieven Feb. 25, 2014, 5:03 p.m. UTC | #17
Am 25.02.2014 14:41, schrieb Kevin Wolf:
> Am 24.02.2014 um 14:01 hat Peter Lieven geschrieben:
>> On 24.02.2014 11:38, Paolo Bonzini wrote:
>>> Il 24/02/2014 11:26, Peter Lieven ha scritto:
>>>> I personally do not need this for QCOW2 but for iSCSI. Here the optimization
>>>> is basically saved bandwidth since a zero write becomes a WRITESAME.
>>> It saves bandwidth, but at the potential cost of extra host CPU
>>> utilization.  I would be fine with having this automatically, but
>>> drv->bdrv_co_write_zeroes is not the right check because it is
>>> true for qcow2 and raw formats.  Something using bdrv_get_info is
>>> probably better, because it would have fewer or no false
>>> positives.
>>>
>>>> In all cases if unmap=on it would additionally save disk space.
>>> It would also cause worse performance though.  I think the automatic addition BDRV_REQ_MAY_UNMAP is what should be a separate option. Perhaps you can have a three-state option, detect-zeros=no/yes/unmap.
>> What would be the desired way to store this flag in the BlockDriverState?
> Some new enum field?
Is there an already-implemented example that I can copy from? It's quite difficult
to search through all the involved functions when doing it for the first time.

Peter

Patch

diff --git a/block.c b/block.c
index 6f4baca..505888e 100644
--- a/block.c
+++ b/block.c
@@ -3145,6 +3145,14 @@  static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
 
     ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
 
+    if (!ret && !(flags & BDRV_REQ_ZERO_WRITE) &&
+        drv->bdrv_co_write_zeroes && qemu_iovec_is_zero(qiov)) {
+        flags |= BDRV_REQ_ZERO_WRITE;
+        /* if the device was not opened with discard=on the below flag
+         * is immediately cleared again in bdrv_co_do_write_zeroes */
+        flags |= BDRV_REQ_MAY_UNMAP;
+    }
+
     if (ret < 0) {
         /* Do nothing, write notifier decided to fail this request */
     } else if (flags & BDRV_REQ_ZERO_WRITE) {
diff --git a/include/qemu-common.h b/include/qemu-common.h
index b0e34b2..f0ad0f9 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -330,6 +330,7 @@  void qemu_iovec_concat(QEMUIOVector *dst,
 void qemu_iovec_concat_iov(QEMUIOVector *dst,
                            struct iovec *src_iov, unsigned int src_cnt,
                            size_t soffset, size_t sbytes);
+bool qemu_iovec_is_zero(QEMUIOVector *qiov);
 void qemu_iovec_destroy(QEMUIOVector *qiov);
 void qemu_iovec_reset(QEMUIOVector *qiov);
 size_t qemu_iovec_to_buf(QEMUIOVector *qiov, size_t offset,
diff --git a/util/iov.c b/util/iov.c
index bb46c04..abbb374 100644
--- a/util/iov.c
+++ b/util/iov.c
@@ -342,6 +342,26 @@  void qemu_iovec_concat(QEMUIOVector *dst,
     qemu_iovec_concat_iov(dst, src->iov, src->niov, soffset, sbytes);
 }
 
+/*
+ *  check if the contents of the iovecs is all zero 
+ */
+bool qemu_iovec_is_zero(QEMUIOVector *qiov) {
+    int i;
+    for (i = 0; i < qiov->niov; i++) {
+        size_t offs = qiov->iov[i].iov_len & ~(4 * sizeof(long) - 1);
+        uint8_t *ptr = qiov->iov[i].iov_base;
+        if (offs && !buffer_is_zero(qiov->iov[i].iov_base, offs)) {
+            return false;
+        }
+        for (; offs < qiov->iov[i].iov_len; offs++) {
+             if (ptr[offs]) {
+                 return false;
+             }
+        }
+    }
+    return true;
+}
+
 void qemu_iovec_destroy(QEMUIOVector *qiov)
 {
     assert(qiov->nalloc != -1);