Patchwork [RFC,06/17] block: use bdrv_{co, aio}_discard for write_zeroes operations

Submitter Paolo Bonzini
Date March 8, 2012, 5:15 p.m.
Message ID <1331226917-6658-7-git-send-email-pbonzini@redhat.com>
Permalink /patch/145594/
State New

Comments

Paolo Bonzini - March 8, 2012, 5:15 p.m.
Remove the bdrv_co_write_zeroes callback.  Instead use the discard
information from bdrv_get_info to choose between bdrv_co_discard
and a normal write.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c     |   12 +++++++++---
 block/qed.c |    8 --------
 block_int.h |    8 --------
 3 files changed, 9 insertions(+), 19 deletions(-)
Kevin Wolf - March 9, 2012, 4:37 p.m.
On 08.03.2012 18:15, Paolo Bonzini wrote:
> Remove the bdrv_co_write_zeroes callback.  Instead use the discard
> information from bdrv_get_info to choose between bdrv_co_discard
> and a normal write.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

I'm not sure if this is a good idea.

The goal of discard is to remove data from the image (or not add it if
it isn't there yet) and ideally deallocate the used clusters. The goal
of write_zeroes is to mark space as zero and explicitly allocate it for
this purpose.

From a guest point of view these are pretty similar, but from a host
perspective I'd say there's a difference.

Kevin
Paolo Bonzini - March 9, 2012, 6:06 p.m.
On 09/03/2012 17:37, Kevin Wolf wrote:
>> > Remove the bdrv_co_write_zeroes callback.  Instead use the discard
>> > information from bdrv_get_info to choose between bdrv_co_discard
>> > and a normal write.
>> > 
>> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> I'm not sure if this is a good idea.
> 
> The goal of discard is to remove data from the image (or not add it if
> it isn't there yet) and ideally deallocate the used clusters. The goal
> of write_zeroes is to mark space as zero and explicitly allocate it for
> this purpose.
> 
> From a guest point of view these are pretty similar, but from a host
> perspective I'd say there's a difference.

True.  However, we need to present a uniform view to the guests,
including the granularity, or discard can never be enabled.  The
granularity must be 512 on IDE (though it can be higher on SCSI), so
there are problems mapping block layer discard straight down to the guest.

There are basically three ways to do this:

1) we could cheat and present a discard_granularity that is smaller than
what the underlying format/protocol supports.  This is fine but forces
discard_zeroes_data to be false.  That's a pity because Linux 3.4 will
start using efficient zero write operations (WRITE SAME on SCSI, but
could be extended to UNMAP/TRIM if discard_zeroes_data is true).

2) we can make an emulated discard that always supports 512 bytes
granularity and always zeroes data.  This patch series takes this route.

3) we can let the user choose between (1) and (2).  I didn't choose this
because of laziness mostly---co_write_zeroes support is not really
complete, for example there's no aio version to use in device
models---and also because I doubt anyone would really use the option.

Paolo
Richard Laager - March 10, 2012, 6:02 p.m.
I believe your patch set provides these behaviors now:
      * QEMU block drivers report discard_granularity.
              * discard_granularity = 0 means no discard
                      * The guest is told there's no discard support.
              * discard_granularity < 0 is undefined.
              * discard_granularity > 0 is reported to the guest as
                discard support.
      * QEMU block drivers report discard_zeros_data.
                This is passed to the guest when discard_granularity > 0.

I propose adding the following behaviors in any event:
      * If a QEMU block device reports a discard_granularity > 0, it
        must be equal to 2^n (n >= 0), or QEMU's block core will change
        it to 0. (Non-power-of-two granularities are not likely to exist
        in the real world, and this assumption greatly simplifies
        ensuring correctness.)
      * For SCSI, report an unmap_granularity to the guest as follows:
      max(logical_block_size, discard_granularity) / logical_block_size
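The two rules proposed above can be sketched in C roughly as follows (illustrative only; the helper names are invented for this sketch and are not QEMU APIs):

```c
#include <stdint.h>

/* Sketch of the proposed rules (illustrative helpers, not QEMU
 * APIs).  A granularity that is not a power of two is coerced to 0,
 * i.e. "no discard". */
static uint32_t sanitize_discard_granularity(uint32_t g)
{
    if (g != 0 && (g & (g - 1)) != 0) {
        return 0;               /* not 2^n: disable discard */
    }
    return g;
}

/* unmap_granularity reported to a SCSI guest, in logical blocks:
 * max(logical_block_size, discard_granularity) / logical_block_size */
static uint32_t scsi_unmap_granularity(uint32_t logical_block_size,
                                       uint32_t discard_granularity)
{
    uint32_t g = discard_granularity > logical_block_size
               ? discard_granularity : logical_block_size;
    return g / logical_block_size;
}
```

For instance, with a 512-byte logical block size and a 4 KiB discard granularity, the guest would see an unmap granularity of 8 blocks.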


Regarding emulating discard_zeros_data...

I agree that when discard_zeros_data is set, we will need to write
zeroes in some cases. As you noted, IDE has a fixed granularity of one
sector. And the SCSI granularity is a hint only; guests are not
guaranteed to align to that value either. [0]

As a design concept, instead of guaranteeing that 512B zero'ing discards
are supported, I think the QEMU block layer should instead guarantee
aligned discards to QEMU block devices, emulating any misaligned
discards (or portions thereof) by writing zeroes if (and only if)
discard_zeros_data is set. When the QEMU block layer gets a discard:
      * Of the specified discard range, see if it includes an aligned
        multiple of discard granularity. If so, save that as the
        starting point of a subrange. Then find the last aligned
        multiple, if any, and pass that subrange (if start != end) down
        to the block driver's discard function.
      * If the discard really fails (i.e. returns failure and sets errno
        to something other than "not supported" or equivalent), return
        failure to the guest. For "not supported", fall through to the
        code below with the full range.
      * At this point, we have zero, one, or two subranges to handle.
      * If and only if discard_zeros_data is set, write zeros to the
        remaining subranges, if any. (This would use a lower-level
        write_zeroes call which does not attempt to use discard.) If
        this fails, return failure to the guest.
      * Return success.
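The subrange computation in the first step might look like this (a sketch relying on the power-of-two granularity assumption above; not QEMU code). The head and tail remainders outside the returned subrange are the parts that would be zero-written when discard_zeros_data is set:

```c
#include <stdint.h>

/* Largest granularity-aligned subrange of a discard request
 * [offset, offset + length).  Granularity must be a power of two.
 * An empty result is signalled by start == end. */
struct subrange { uint64_t start, end; };

static struct subrange aligned_part(uint64_t offset, uint64_t length,
                                    uint64_t granularity)
{
    /* round the start up and the end down to the granularity */
    uint64_t start = (offset + granularity - 1) & ~(granularity - 1);
    uint64_t end = (offset + length) & ~(granularity - 1);
    struct subrange r = { start, end > start ? end : start };
    return r;
}
```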

This leaves one remaining issue: In raw-posix.c, for files (i.e. not
devices), I assume you're going to advertise discard_granularity=1 and
discard_zeros_data=1 when compiled with support for
fallocate(FALLOC_FL_PUNCH_HOLE). Note, I'm assuming fallocate() actually
guarantees that it zeros the data when punching holes. I haven't
verified this.

If the guest does a big discard (think mkfs) and fallocate() returns
EOPNOTSUPP, you'll have to zero essentially the whole virtual disk,
which, as you noted, will also allocate it (unless you explicitly check
for holes). This is bad. It can be avoided by not advertising
discard_zeros_data, but as you noted, that's unfortunate.

If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could avoid
advertising discard support based on FALLOC_FL_PUNCH_HOLE when it is not
going to work. This would side step these problems. You said it wasn't
possible to probe for FALLOC_FL_PUNCH_HOLE. Have you considered probing
by extending the file by one byte and then punching that:
        char buf = 0;
        fstat(s->fd, &st);
        pwrite(s->fd, &buf, 1, st.st_size + 1);
        has_discard = !fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                                 st.st_size + 1, 1);
        ftruncate(s->fd, st.st_size);
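A more defensive variant of this probe, with error checking, might look like the following (a sketch assuming Linux fallocate(2) semantics; real code would also have to worry about concurrent access to the file):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* Append one byte past the current EOF, try to punch it back out,
 * then restore the original size.  Any error is treated as "no
 * FALLOC_FL_PUNCH_HOLE support". */
static bool probe_punch_hole(int fd)
{
    struct stat st;
    char buf = 0;
    bool ok;

    if (fstat(fd, &st) < 0) {
        return false;
    }
    if (pwrite(fd, &buf, 1, st.st_size) != 1) {
        return false;
    }
    ok = fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                   st.st_size, 1) == 0;
    (void)ftruncate(fd, st.st_size);   /* undo the probe write */
    return ok;
}
```

Unlike the snippet above, this writes at st.st_size rather than st.st_size + 1, so no one-byte hole is left between the old EOF and the probe byte; either offset works for the probe itself.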


[0] See the last paragraph starting on page 8:
    http://mkp.net/pubs/linux-advanced-storage.pdf
Paolo Bonzini - March 12, 2012, 12:27 p.m.
On 10/03/2012 19:02, Richard Laager wrote:
> I propose adding the following behaviors in any event:
>       * If a QEMU block device reports a discard_granularity > 0, it
>         must be equal to 2^n (n >= 0), or QEMU's block core will change
>         it to 0. (Non-power-of-two granularities are not likely to exist
>         in the real world, and this assumption greatly simplifies
>         ensuring correctness.)

Yeah, I was considering this to be simply a bug in the block device.

>       * For SCSI, report an unmap_granularity to the guest as follows:
>       max(logical_block_size, discard_granularity) / logical_block_size

This is more or less already in place later in the series.

> As a design concept, instead of guaranteeing that 512B zero'ing discards
> are supported, I think the QEMU block layer should instead guarantee
> aligned discards to QEMU block devices, emulating any misaligned
> discards (or portions thereof) by writing zeroes if (and only if)
> discard_zeros_data is set.

Yes, this can be done of course.  This series does not include it yet.

> This leaves one remaining issue: In raw-posix.c, for files (i.e. not
> devices), I assume you're going to advertise discard_granularity=1 and
> discard_zeros_data=1 when compiled with support for
> fallocate(FALLOC_FL_PUNCH_HOLE). Note, I'm assuming fallocate() actually
> guarantees that it zeros the data when punching holes.

It does, that's pretty much the definition of a hole.

> If the guest does a big discard (think mkfs) and fallocate() returns
> EOPNOTSUPP, you'll have to zero essentially the whole virtual disk,
> which, as you noted, will also allocate it (unless you explicitly check
> for holes). This is bad. It can be avoided by not advertising
> discard_zeros_data, but as you noted, that's unfortunate.

If you have a new kernel that supports SEEK_HOLE/SEEK_DATA, it can also
be done by skipping the zero write on known holes.

This could even be done at the block layer level using bdrv_is_allocated.

> If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could avoid
> advertising discard support based on FALLOC_FL_PUNCH_HOLE when it is not
> going to work. This would side step these problems. 

... and introduce others when migrating if your datacenter doesn't have
homogeneous kernel versions and/or filesystems. :(

> You said it wasn't
> possible to probe for FALLOC_FL_PUNCH_HOLE. Have you considered probing
> by extending the file by one byte and then punching that:
>         char buf = 0;
>         fstat(s->fd, &st);
>         pwrite(s->fd, &buf, 1, st.st_size + 1);
>         has_discard = !fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>                                  st.st_size + 1, 1);
>         ftruncate(s->fd, st.st_size);

Nice trick. :)   Yes, that could work.

Do you know if non-Linux operating systems have something similar to
BLKDISCARDZEROES?

Paolo
Kevin Wolf - March 12, 2012, 1:04 p.m.
On 12.03.2012 13:27, Paolo Bonzini wrote:
> On 10/03/2012 19:02, Richard Laager wrote:
>> I propose adding the following behaviors in any event:
>>       * If a QEMU block device reports a discard_granularity > 0, it
>>         must be equal to 2^n (n >= 0), or QEMU's block core will change
>>         it to 0. (Non-power-of-two granularities are not likely to exist
>>         in the real world, and this assumption greatly simplifies
>>         ensuring correctness.)
> 
> Yeah, I was considering this to be simply a bug in the block device.

Block driver you mean? Yes, I think we can just assert() this.

Kevin
Richard Laager - March 13, 2012, 7:13 p.m.
On Mon, 2012-03-12 at 10:34 +0100, Paolo Bonzini wrote:
> To be completely correct, I suggest the following behavior:
> >      1. Add a discard boolean option to the disk layer.
> >      2. If discard is not specified:
> >               * For files, detect a true/false value by comparing
> >                 stat.st_blocks != stat.st_size>>9.
> >               * For devices, assume a fixed value (true?).
> >      3. If discard is true, issue discards.
> >      4. If discard is false, do not issue discards.
> 
> The problem is, who will use this interface?

I'm a libvirt and virt-manager user; virt-manager already differentiates
between thin and thick provisioning. So I'm envisioning passing that
information to libvirt, which would save it in a config file and use
that to set discard=true vs. discard=false when using QEMU.

On Mon, 2012-03-12 at 13:27 +0100, Paolo Bonzini wrote:
> On 10/03/2012 19:02, Richard Laager wrote:
> > I propose adding the following behaviors in any event:
> >       * If a QEMU block device reports a discard_granularity > 0, it
> >         must be equal to 2^n (n >= 0), or QEMU's block core will change
> >         it to 0. (Non-power-of-two granularities are not likely to exist
> >         in the real world, and this assumption greatly simplifies
> >         ensuring correctness.)
> 
> Yeah, I was considering this to be simply a bug in the block device.
> 
> >       * For SCSI, report an unmap_granularity to the guest as follows:
> >       max(logical_block_size, discard_granularity) / logical_block_size
> 
> This is more or less already in place later in the series.

I didn't see it. Which patch number?

> > Note, I'm assuming fallocate() actually
> > guarantees that it zeros the data when punching holes.
> 
> It does, that's pretty much the definition of a hole.

Agreed. I verified this fact after sending that email. At the time, I
just wanted to be very clear on what I knew for sure vs. what I had not
yet verified.

> If you have a new kernel that supports SEEK_HOLE/SEEK_DATA, it can also
> be done by skipping the zero write on known holes.
> 
> This could even be done at the block layer level using bdrv_is_allocated.

Would we want to make all write_zeros operations check for and skip
holes, or is write_zeros different from a discard in that it SHOULD/MUST
allocate space?

> > If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could avoid
> > advertising discard support based on FALLOC_FL_PUNCH_HOLE when it is not
> > going to work. This would side step these problems. 
> 
> ... and introduce others when migrating if your datacenter doesn't have
> homogeneous kernel versions and/or filesystems. :(

I hadn't thought of the migration issues. Thanks for bringing that up.

Worst case, you end up doing a bunch of zero writing if and only if you
migrate from a discard_zeros_data host to one that doesn't (or doesn't
do discard at all). But this only lasts until the guest reboots
(assuming we also add a behavior of re-probing on guest reboot--or until
it shuts down if we don't or can't). As far as I can see, this is
unavoidable, though. And this is no worse than writing zeros ALL of the
time that fallocate() fails, which is the behavior of your patch series,
right?

This might be another use case for a discard option on the disk. If some
but not all of one's hosts support discard, a system administrator might
want to set discard=0 to avoid this.

> Do you know if non-Linux operating systems have something similar to
> BLKDISCARDZEROES?

As far as I know, no. The SunOS one is only on Illumos (the open source
kernel forked from the now dead OpenSolaris) and only implemented for
ZFS zvols. So currently, it's roughly equivalent to fallocate() on Linux
in that it's happening at the filesystem level. (It doesn't actually
reach the platters yet. But even if it did, that's unrelated to the
guarantees provided by ZFS.) Thus, it always zeros, so we could set
discard_zeros_data = 1 unconditionally there. I should probably run that
by the Illumos developers, though, to ensure they're comfortable with
that ioctl() guaranteeing zeroing.

I haven't looked into the FreeBSD one as much yet. Worst case, we
unconditionally set discard_zeros_data = 0.
Paolo Bonzini - March 14, 2012, 7:41 a.m.
On 13/03/2012 20:13, Richard Laager wrote:
>> > To be completely correct, I suggest the following behavior:
>>> > >      1. Add a discard boolean option to the disk layer.
>>> > >      2. If discard is not specified:
>>> > >               * For files, detect a true/false value by comparing
>>> > >                 stat.st_blocks != stat.st_size>>9.
>>> > >               * For devices, assume a fixed value (true?).
>>> > >      3. If discard is true, issue discards.
>>> > >      4. If discard is false, do not issue discards.
>> > 
>> > The problem is, who will use this interface?
> I'm a libvirt and virt-manager user; virt-manager already differentiates
> between thin and thick provisioning. So I'm envisioning passing that
> information to libvirt, which would save it in a config file and use
> that to set discard=true vs. discard=false when using QEMU.

Yeah, it could be set also at the pool level for libvirt.

>>> > >       * For SCSI, report an unmap_granularity to the guest as follows:
>>> > >       max(logical_block_size, discard_granularity) / logical_block_size
>> > 
>> > This is more or less already in place later in the series.
> I didn't see it. Which patch number?

Patch 11:

+    discard_granularity = s->qdev.conf.discard_granularity;
+    if (discard_granularity == -1) {
+        s->qdev.conf.discard_granularity = s->qdev.conf.logical_block_size;
+    } else if (discard_granularity < s->qdev.conf.logical_block_size) {
+        error_report("scsi-block: invalid discard_granularity");
+        return -1;
+    } else if (discard_granularity & (discard_granularity - 1)) {
+        error_report("scsi-block: discard_granularity not a power of two");
+        return -1;
+    }

> > If you have a new kernel that supports SEEK_HOLE/SEEK_DATA, it can also
> > be done by skipping the zero write on known holes.
> > 
> > This could even be done at the block layer level using bdrv_is_allocated.
> 
> Would we want to make all write_zeros operations check for and skip
> holes, or is write_zeros different from a discard in that it SHOULD/MUST
> allocate space?

I think that's pretty much the question to answer for this patch to graduate
from the RFC state (the rest is just technicalities, so to speak).  So far,
write_zeros was intended to be an efficient operation (it avoids allocating
a cluster in qed and will do the same in qcow3, which is why I decided to
merge it with discard).

>>> > > If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could avoid
>>> > > advertising discard support based on FALLOC_FL_PUNCH_HOLE when it is not
>>> > > going to work. This would side step these problems. 
>> > 
>> > ... and introduce others when migrating if your datacenter doesn't have
>> > homogeneous kernel versions and/or filesystems. :(
> I hadn't thought of the migration issues. Thanks for bringing that up.
> 
> Worst case, you end up doing a bunch of zero writing if and only if you
> migrate from a discard_zeros_data host to one that doesn't (or doesn't
> do discard at all). But this only lasts until the guest reboots
> (assuming we also add a behavior of re-probing on guest reboot--or until
> it shuts down if we don't or can't). As far as I can see, this is
> unavoidable, though. And this is no worse than writing zeros ALL of the
> time that fallocate() fails, which is the behavior of your patch series,
> right?

It is worse in that we do not want the hardware parameters exposed to the
guest to change behind the scenes, except if you change the machine type
or if you use the default unversioned type.

> This might be another use case for a discard option on the disk. If some
> but not all of one's hosts support discard, a system administrator might
> want to set discard=0 to avoid this.

A discard option is already there (discard_granularity=0).  Libvirt could
choose to expose it even now, but it would be of little use in the absence
of support for unaligned discards.

Paolo
Kevin Wolf - March 14, 2012, 12:01 p.m.
On 14.03.2012 08:41, Paolo Bonzini wrote:
> On 13/03/2012 20:13, Richard Laager wrote:
>>> If you have a new kernel that supports SEEK_HOLE/SEEK_DATA, it can also
>>> be done by skipping the zero write on known holes.
>>>
>>> This could even be done at the block layer level using bdrv_is_allocated.
>>
>> Would we want to make all write_zeros operations check for and skip
>> holes, or is write_zeros different from a discard in that it SHOULD/MUST
>> allocate space?
> 
> I think that's pretty much the question to answer for this patch to graduate
> from the RFC state (the rest is just technicalities, so to speak).  So far,
> write_zeros was intended to be an efficient operation (it avoids allocating
> a cluster in qed and will do the same in qcow3, which is why I decided to
> merge it with discard).

Yes, for qcow3 and to some degree also for QED, setting the zero flag is
the natural implementation for both discard and write_zeros. The big
question is what happens with other formats.

Paolo mentioned a use case as a fast way for guests to write zeros, but
is it really faster than a normal write when we have to emulate it by a
bdrv_write with a temporary buffer of zeros? On the other hand we have
the cases where discard really means "I don't care about the data any
more" and emulating it by writing zeros is just a waste of resources there.

So I think we only want to advertise that discard zeroes data if we can
do it efficiently. This means that the format does support it, and that
the device is able to communicate the discard granularity (= cluster
size) to the guest OS.

Kevin
Paolo Bonzini - March 14, 2012, 12:14 p.m.
On 14/03/2012 13:01, Kevin Wolf wrote:
> On 14.03.2012 08:41, Paolo Bonzini wrote:
>> On 13/03/2012 20:13, Richard Laager wrote:
>>>> If you have a new kernel that supports SEEK_HOLE/SEEK_DATA, it can also
>>>> be done by skipping the zero write on known holes.
>>>>
>>>> This could even be done at the block layer level using bdrv_is_allocated.
>>>
>>> Would we want to make all write_zeros operations check for and skip
>>> holes, or is write_zeros different from a discard in that it SHOULD/MUST
>>> allocate space?
>>
>> I think that's pretty much the question to answer for this patch to graduate
>> from the RFC state (the rest is just technicalities, so to speak).  So far,
>> write_zeros was intended to be an efficient operation (it avoids allocating
>> a cluster in qed and will do the same in qcow3, which is why I decided to
>> merge it with discard).
> 
> Yes, for qcow3 and to some degree also for QED, setting the zero flag is
> the natural implementation for both discard and write_zeros. The big
> question is what happens with other formats.

Also raw if used with a sparse file.

> Paolo mentioned a use case as a fast way for guests to write zeros, but
> is it really faster than a normal write when we have to emulate it by a
> bdrv_write with a temporary buffer of zeros? 

No, of course not.

> On the other hand we have
> the cases where discard really means "I don't care about the data any
> more" and emulating it by writing zeros is just a waste of resources there.
> 
> So I think we only want to advertise that discard zeroes data if we can
> do it efficiently. This means that the format does support it, and that
> the device is able to communicate the discard granularity (= cluster
> size) to the guest OS.

Note that the discard granularity is only a hint, so it's really more a
maximum suggested value than a granularity.  Outside of a cluster
boundary the format would still have to write zeros manually.

Also, Linux for example will only round the number of sectors down to
the granularity, not the start sector.  Rereading the code, for SCSI we
want to advertise a zero granularity (aka do whatever you want),
otherwise we may get only misaligned discard requests and end up writing
zeroes inefficiently all the time.

The problem is that advertising discard_zeroes_data based on the backend
calls for trouble as soon as you migrate between storage formats,
filesystems or disks.

(BTW, if the backing file allows discard and zeroes data, efficient
write-zeroes could be done in qcow2 by allocating a cluster and
discarding its contents.  It's similar to how you do preallocated metadata).

Paolo
Kevin Wolf - March 14, 2012, 12:37 p.m.
On 14.03.2012 13:14, Paolo Bonzini wrote:
>> Paolo mentioned a use case as a fast way for guests to write zeros, but
>> is it really faster than a normal write when we have to emulate it by a
>> bdrv_write with a temporary buffer of zeros? 
> 
> No, of course not.
> 
>> On the other hand we have
>> the cases where discard really means "I don't care about the data any
>> more" and emulating it by writing zeros is just a waste of resources there.
>>
>> So I think we only want to advertise that discard zeroes data if we can
>> do it efficiently. This means that the format does support it, and that
>> the device is able to communicate the discard granularity (= cluster
>> size) to the guest OS.
> 
> Note that the discard granularity is only a hint, so it's really more a
> maximum suggested value than a granularity.  Outside of a cluster
> boundary the format would still have to write zeros manually.

You're talking about SCSI here, I guess? Would be one case where being
able to define sane semantics for virtio-blk would have been an
advantage... I had hoped that SCSI was already sane, but if it doesn't
distinguish between "I don't care about this any more" and "I want to
have zeros here", then I'm afraid I can't call it sane any more.

We can make the conditions even stricter, i.e. allow it only if protocol
can pass through discards for unaligned requests. This wouldn't free
clusters on an image format level, but at least on a file system level.

> Also, Linux for example will only round the number of sectors down to
> the granularity, not the start sector.  Rereading the code, for SCSI we
> want to advertise a zero granularity (aka do whatever you want),
> otherwise we may get only misaligned discard requests and end up writing
> zeroes inefficiently all the time.

Does this make sense with real hardware or is it a Linux bug?

> The problem is that advertising discard_zeroes_data based on the backend
> calls for trouble as soon as you migrate between storage formats,
> filesystems or disks.

True. You would have to emulate if you migrate from a source that can
discard to zeros efficiently to a destination that can't.

In the end, I guess we'll just have to accept that we can't fix bad
semantics of ATA and SCSI, and just need to decide whether "I don't
care" or "I want to have zeros" is more common. My feeling is that "I
don't care" is the more useful operation because it can't be expressed
otherwise, but I haven't checked what guests really do.

> (BTW, if the backing file allows discard and zeroes data, efficient
> write-zeroes could be done in qcow2 by allocating a cluster and
> discarding its contents.  It's similar to how you do preallocated metadata).

Yes.

Kevin
Paolo Bonzini - March 14, 2012, 12:49 p.m.
On 14/03/2012 13:37, Kevin Wolf wrote:
> On 14.03.2012 13:14, Paolo Bonzini wrote:
>>> Paolo mentioned a use case as a fast way for guests to write zeros, but
>>> is it really faster than a normal write when we have to emulate it by a
>>> bdrv_write with a temporary buffer of zeros? 
>>
>> No, of course not.
>>
>>> On the other hand we have
>>> the cases where discard really means "I don't care about the data any
>>> more" and emulating it by writing zeros is just a waste of resources there.
>>>
>>> So I think we only want to advertise that discard zeroes data if we can
>>> do it efficiently. This means that the format does support it, and that
>>> the device is able to communicate the discard granularity (= cluster
>>> size) to the guest OS.
>>
>> Note that the discard granularity is only a hint, so it's really more a
>> maximum suggested value than a granularity.  Outside of a cluster
>> boundary the format would still have to write zeros manually.
> 
> You're talking about SCSI here, I guess? Would be one case where being
> able to define sane semantics for virtio-blk would have been an
> advantage... I had hoped that SCSI was already sane, but if it doesn't
> distinguish between "I don't care about this any more" and "I want to
> have zeros here", then I'm afraid I can't call it sane any more.

It does make the distinction.  "I don't care" is UNMAP (or WRITE
SAME(16) with the UNMAP bit set); "I want to have zeroes" is WRITE
SAME(10) or WRITE SAME(16) with an all-zero payload.

> We can make the conditions even stricter, i.e. allow it only if protocol
> can pass through discards for unaligned requests. This wouldn't free
> clusters on an image format level, but at least on a file system level.
> 
>> Also, Linux for example will only round the number of sectors down to
>> the granularity, not the start sector.  Rereading the code, for SCSI we
>> want to advertise a zero granularity (aka do whatever you want),
>> otherwise we may get only misaligned discard requests and end up writing
>> zeroes inefficiently all the time.
> 
> Does this make sense with real hardware or is it a Linux bug?

It's a bug; SCSI defines the "optimal unmap request starting LBA" to be
"(n × optimal unmap granularity) + unmap granularity alignment".
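That rule can be expressed as a small check (an illustrative helper, not anything from the QEMU tree):

```c
#include <stdbool.h>
#include <stdint.h>

/* SBC's "optimal unmap request starting LBA":
 * (n * optimal unmap granularity) + unmap granularity alignment.
 * With no granularity advertised, any starting LBA is acceptable. */
static bool unmap_lba_is_optimal(uint64_t lba, uint32_t granularity,
                                 uint32_t alignment)
{
    if (granularity == 0) {
        return true;
    }
    return lba % granularity == alignment % granularity;
}
```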

>> The problem is that advertising discard_zeroes_data based on the backend
>> calls for trouble as soon as you migrate between storage formats,
>> filesystems or disks.
> 
> True. You would have to emulate if you migrate from a source that can
> discard to zeros efficiently to a destination that can't.
> 
> In the end, I guess we'll just have to accept that we can't fix bad
> semantics of ATA and SCSI, and just need to decide whether "I don't
> care" or "I want to have zeros" is more common. My feeling is that "I
> don't care" is the more useful operation because it can't be expressed
> otherwise, but I haven't checked what guests really do.

Yeah, guests right now only use it for unused filesystem pieces, so the
"do not care" semantics are fine.  I also hoped to use discard to avoid
blowing up thin-provisioned images when streaming.  Perhaps we can use
bdrv_has_zero_init instead, and/or pass down the copy-on-read flag to
the block driver.

Anyhow, there are some patches from this series that are relatively
independent and ready for inclusion, I'll extract them and post them
separately.

Paolo
Kevin Wolf - March 14, 2012, 1:04 p.m.
On 14.03.2012 13:49, Paolo Bonzini wrote:
> On 14/03/2012 13:37, Kevin Wolf wrote:
>> On 14.03.2012 13:14, Paolo Bonzini wrote:
>>>> Paolo mentioned a use case as a fast way for guests to write zeros, but
>>>> is it really faster than a normal write when we have to emulate it by a
>>>> bdrv_write with a temporary buffer of zeros? 
>>>
>>> No, of course not.
>>>
>>>> On the other hand we have
>>>> the cases where discard really means "I don't care about the data any
>>>> more" and emulating it by writing zeros is just a waste of resources there.
>>>>
>>>> So I think we only want to advertise that discard zeroes data if we can
>>>> do it efficiently. This means that the format does support it, and that
>>>> the device is able to communicate the discard granularity (= cluster
>>>> size) to the guest OS.
>>>
>>> Note that the discard granularity is only a hint, so it's really more a
>>> maximum suggested value than a granularity.  Outside of a cluster
>>> boundary the format would still have to write zeros manually.
>>
>> You're talking about SCSI here, I guess? Would be one case where being
>> able to define sane semantics for virtio-blk would have been an
>> advantage... I had hoped that SCSI was already sane, but if it doesn't
>> distinguish between "I don't care about this any more" and "I want to
>> have zeros here", then I'm afraid I can't call it sane any more.
> 
> It does make the distinction.  "I don't care" is UNMAP (or WRITE
> SAME(16) with the UNMAP bit set); "I want to have zeroes" is WRITE
> SAME(10) or WRITE SAME(16) with an all-zero payload.

Good. So why are we trying to make both the same in qemu? Keeping two
different callbacks in the block drivers for two different operations
seems to make much more sense to me.

>> We can make the conditions even stricter, i.e. allow it only if protocol
>> can pass through discards for unaligned requests. This wouldn't free
>> clusters on an image format level, but at least on a file system level.
>>
>>> Also, Linux for example will only round the number of sectors down to
>>> the granularity, not the start sector.  Rereading the code, for SCSI we
>>> want to advertise a zero granularity (aka do whatever you want),
>>> otherwise we may get only misaligned discard requests and end up writing
>>> zeroes inefficiently all the time.
>>
>> Does this make sense with real hardware or is it a Linux bug?
> 
> It's a bug, SCSI defines the "optimal unmap request starting LBA" to be
> "(n × optimal unmap granularity) + unmap granularity alignment".

Okay. Needs to get fixed in Linux then, not in qemu. Means that qemu
might ignore some discard requests by broken Linux versions, but
ignoring is a valid action for "don't care".

>>> The problem is that advertising discard_zeroes_data based on the backend
>>> calls for trouble as soon as you migrate between storage formats,
>>> filesystems or disks.
>>
>> True. You would have to emulate if you migrate from a source that can
>> discard to zeros efficiently to a destination that can't.
>>
>> In the end, I guess we'll just have to accept that we can't fix bad
>> semantics of ATA and SCSI, and just need to decide whether "I don't
>> care" or "I want to have zeros" is more common. My feeling is that "I
>> don't care" is the more useful operation because it can't be expressed
>> otherwise, but I haven't checked what guests really do.
> 
> Yeah, guests right now only use it for unused filesystem pieces, so the
> "do not care" semantics are fine.  I also hoped to use discard to avoid
> blowing up thin-provisioned images when streaming.  Perhaps we can use
> bdrv_has_zero_init instead, and/or pass down the copy-on-read flag to
> the block driver.

You mean for image formats that don't support explicit zero clusters? It
only works without backing files, and I'm not sure how common this case is.

> Anyhow, there are some patches from this series that are relatively
> independent and ready for inclusion, I'll extract them and post them
> separately.

Okay.

Kevin
Christoph Hellwig - March 24, 2012, 3:27 p.m.
On Sat, Mar 10, 2012 at 12:02:40PM -0600, Richard Laager wrote:
> If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could avoid
> advertising discard support based on FALLOC_FL_PUNCH_HOLE when it is not
> going to work. This would side step these problems. You said it wasn't
> possible to probe for FALLOC_FL_PUNCH_HOLE. Have you considered probing
> by extending the file by one byte and then punching that:
>         char buf = 0;
>         fstat(s->fd, &st);
>         pwrite(s->fd, &buf, 1, st.st_size + 1);
>         has_discard = !fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,

There is no point in using FALLOC_FL_KEEP_SIZE together with
FALLOC_FL_PUNCH_HOLE.
Christoph Hellwig - March 24, 2012, 3:29 p.m.
On Wed, Mar 14, 2012 at 01:01:35PM +0100, Kevin Wolf wrote:
> Paolo mentioned a use case as a fast way for guests to write zeros, but
> is it really faster than a normal write when we have to emulate it by a
> bdrv_write with a temporary buffer of zeros? On the other hand we have
> the cases where discard really means "I don't care about the data any
> more" and emulating it by writing zeros is just a waste of resources there.

On raw a real discard also will destroy the preallocation, which might
absolutely kill your performance on filesystems.

XFS provides a XFS_IOC_ZERO_RANGE ioctl (same calling convention as
XFS_IOC_RESVSP), which would provide a good backend for WRITE SAME emulation
in this context.
Christoph Hellwig - March 24, 2012, 3:30 p.m.
On Wed, Mar 14, 2012 at 01:14:18PM +0100, Paolo Bonzini wrote:
> 
> Note that the discard granularity is only a hint, so it's really more a
> maximum suggested value than a granularity.  Outside of a cluster
> boundary the format would still have to write zeros manually.
> 
> Also, Linux for example will only round the number of sectors down to
> the granularity, not the start sector.

Which really is more of a bug than a feature.  The current discard
implementation in Linux is still very poor.
Christoph Hellwig - March 24, 2012, 3:33 p.m.
On Wed, Mar 14, 2012 at 01:49:48PM +0100, Paolo Bonzini wrote:
> It does make the distinction.  "I don't care" is UNMAP (or WRITE
> SAME(16) with the UNMAP bit set); "I want to have zeroes" is WRITE
> SAME(10) or WRITE SAME(16) with an all-zero payload.

But once the target sets the unmap zeroes data bit even UNMAP / WRITE SAME with
unmap have to zero all blocks touched.

Also strictly speaking even the WRITE SAME with unmap requires a payload
of zeroes, even if most targets don't bother to check for this.
Daniel P. Berrange - March 26, 2012, 9:44 a.m.
On Sat, Mar 24, 2012 at 04:29:26PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 14, 2012 at 01:01:35PM +0100, Kevin Wolf wrote:
> > Paolo mentioned a use case as a fast way for guests to write zeros, but
> > is it really faster than a normal write when we have to emulate it by a
> > bdrv_write with a temporary buffer of zeros? On the other hand we have
> > the cases where discard really means "I don't care about the data any
> > more" and emulating it by writing zeros is just a waste of resources there.
> 
> On raw a real discard also will destroy the preallocation, which might
> absolutely kill your performance on filesystems.

This suggests that there be a new command line param to '-drive' to turn
discard support on/off, since QEMU can't reliably know if the raw file
it is given is intended to be fully pre-allocated by the mgmt app.

Regards,
Daniel
Christoph Hellwig - March 26, 2012, 9:56 a.m.
On Mon, Mar 26, 2012 at 10:44:07AM +0100, Daniel P. Berrange wrote:
> This suggests that there be a new command line param to '-drive' to turn
> discard support on/off, since QEMU can't reliably know if the raw file
> it is given is intended to be fully pre-allocated by the mgmt app.

Yes.
Richard Laager - March 26, 2012, 7:40 p.m.
On Sat, 2012-03-24 at 16:27 +0100, Christoph Hellwig wrote:
> >         has_discard = !fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> 
> There is no point in using FALLOC_FL_KEEP_SIZE together with
> FALLOC_FL_PUNCH_HOLE.

It's *required*. From the man page [0], "The FALLOC_FL_PUNCH_HOLE flag
must be ORed with FALLOC_FL_KEEP_SIZE in mode;"

[0] http://man7.org/linux/man-pages/man2/fallocate.2.html
Richard Laager - March 26, 2012, 7:40 p.m.
On Sat, 2012-03-24 at 16:30 +0100, Christoph Hellwig wrote:
> On Wed, Mar 14, 2012 at 01:14:18PM +0100, Paolo Bonzini wrote:
> > 
> > Note that the discard granularity is only a hint, so it's really more a
> > maximum suggested value than a granularity.  Outside of a cluster
> > boundary the format would still have to write zeros manually.
> > 
> > Also, Linux for example will only round the number of sectors down to
> > the granularity, not the start sector.
> 
> Which really is more of a bug than a feature.  The current discard
> implementation in Linux is still very poor.

The disk protocols do not require the granularity to be respected. It is
*only a hint*. Therefore, QEMU has to handle non-aligned discards. It
doesn't really matter what we wish was the case; this is the reality.
Christoph Hellwig - March 27, 2012, 9:08 a.m.
On Mon, Mar 26, 2012 at 02:40:47PM -0500, Richard Laager wrote:
> On Sat, 2012-03-24 at 16:27 +0100, Christoph Hellwig wrote:
> > >         has_discard = !fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> > 
> > There is no point in using FALLOC_FL_KEEP_SIZE together with
> > FALLOC_FL_PUNCH_HOLE.
> 
> It's *required*. From the man page [0], "The FALLOC_FL_PUNCH_HOLE flag
> must be ORed with FALLOC_FL_KEEP_SIZE in mode;"
> 
> [0] http://man7.org/linux/man-pages/man2/fallocate.2.html

You're right - we check that it is set in common code before entering the
filesystem.

It's utterly pointless, but too late to change now - I should have reviewed
the patch to add it better.
Kevin Wolf - March 27, 2012, 10:20 a.m.
Am 26.03.2012 21:40, schrieb Richard Laager:
> On Sat, 2012-03-24 at 16:30 +0100, Christoph Hellwig wrote:
>> On Wed, Mar 14, 2012 at 01:14:18PM +0100, Paolo Bonzini wrote:
>>>
>>> Note that the discard granularity is only a hint, so it's really more a
>>> maximum suggested value than a granularity.  Outside of a cluster
>>> boundary the format would still have to write zeros manually.
>>>
>>> Also, Linux for example will only round the number of sectors down to
>>> the granularity, not the start sector.
>>
>> Which really is more of a bug than a feature.  The current discard
>> implementation in Linux is still very poor.
> 
> The disk protocols do not require the granularity to be respected. It is
> *only a hint*. Therefore, QEMU has to handle non-aligned discards. It
> doesn't really matter what we wish was the case; this is the reality.

But we can do suboptimal things (like just ignoring the request if we
didn't promise to zero data, or turning it into a non-discarding zero
buffer write otherwise) if the OS knowingly uses misaligned discards.
Not our bug then.

Kevin

Patch

diff --git a/block.c b/block.c
index e4f7782..34db914 100644
--- a/block.c
+++ b/block.c
@@ -1656,6 +1656,7 @@  static int coroutine_fn bdrv_co_do_copy_on_readv(BlockDriverState *bs,
     int cluster_nb_sectors;
     size_t skip_bytes;
     int ret;
+    BlockDriverInfo bdi;
 
     /* Cover entire cluster so no additional backing file I/O is required when
      * allocating cluster in the image file.
@@ -1676,7 +1677,8 @@  static int coroutine_fn bdrv_co_do_copy_on_readv(BlockDriverState *bs,
         goto err;
     }
 
-    if (drv->bdrv_co_write_zeroes &&
+    /* If it is worthless, do not check if the buffer is zero.  */
+    if (bdrv_get_info(bs, &bdi) >= 0 && bdi.discard_zeroes_data &&
         buffer_is_zero(bounce_buffer, iov.iov_len)) {
         ret = bdrv_co_do_write_zeroes(bs, cluster_sector_num,
                                       cluster_nb_sectors, &bounce_qiov);
@@ -1785,13 +1787,17 @@  static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
 {
     BlockDriver *drv = bs->drv;
+    BlockDriverInfo bdi;
     QEMUIOVector my_qiov;
     struct iovec iov;
     int ret;
 
     /* First try the efficient write zeroes operation */
-    if (drv->bdrv_co_write_zeroes) {
-        return drv->bdrv_co_write_zeroes(bs, sector_num, nb_sectors);
+    if (bdrv_get_info(bs, &bdi) >= 0 && bdi.discard_zeroes_data &&
+        bdi.discard_granularity &&
+        (sector_num & (bdi.discard_granularity - 1)) == 0 &&
+        (nb_sectors & (bdi.discard_granularity - 1)) == 0) {
+        return bdrv_co_discard(bs, sector_num, nb_sectors);
     }
 
     if (qiov) {
diff --git a/block/qed.c b/block/qed.c
index 9944be5..33fe03f 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -1395,13 +1395,6 @@  static BlockDriverAIOCB *bdrv_qed_aio_flush(BlockDriverState *bs,
     return bdrv_aio_flush(bs->file, cb, opaque);
 }
 
-static int coroutine_fn bdrv_qed_co_write_zeroes(BlockDriverState *bs,
-                                                 int64_t sector_num,
-                                                 int nb_sectors)
-{
-    return bdrv_co_discard(bs, sector_num, nb_sectors);
-}
-
 static int bdrv_qed_truncate(BlockDriverState *bs, int64_t offset)
 {
     BDRVQEDState *s = bs->opaque;
@@ -1563,7 +1556,6 @@  static BlockDriver bdrv_qed = {
     .bdrv_aio_writev          = bdrv_qed_aio_writev,
     .bdrv_aio_flush           = bdrv_qed_aio_flush,
     .bdrv_aio_discard         = bdrv_qed_aio_discard,
-    .bdrv_co_write_zeroes     = bdrv_qed_co_write_zeroes,
     .bdrv_truncate            = bdrv_qed_truncate,
     .bdrv_getlength           = bdrv_qed_getlength,
     .bdrv_get_info            = bdrv_qed_get_info,
diff --git a/block_int.h b/block_int.h
index b460c36..2e8cdfe 100644
--- a/block_int.h
+++ b/block_int.h
@@ -131,14 +131,6 @@  struct BlockDriver {
         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
     int coroutine_fn (*bdrv_co_writev)(BlockDriverState *bs,
         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
-    /*
-     * Efficiently zero a region of the disk image.  Typically an image format
-     * would use a compact metadata representation to implement this.  This
-     * function pointer may be NULL and .bdrv_co_writev() will be called
-     * instead.
-     */
-    int coroutine_fn (*bdrv_co_write_zeroes)(BlockDriverState *bs,
-        int64_t sector_num, int nb_sectors);
     int coroutine_fn (*bdrv_co_discard)(BlockDriverState *bs,
         int64_t sector_num, int nb_sectors);
     int coroutine_fn (*bdrv_co_is_allocated)(BlockDriverState *bs,