Message ID: 1331226917-6658-7-git-send-email-pbonzini@redhat.com
State: New
On 08.03.2012 18:15, Paolo Bonzini wrote:
> Remove the bdrv_co_write_zeroes callback. Instead use the discard
> information from bdrv_get_info to choose between bdrv_co_discard
> and a normal write.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

I'm not sure if this is a good idea.

The goal of discard is to remove data from the image (or not add it if
it isn't there yet) and ideally deallocate the used clusters. The goal
of write_zeroes is to mark space as zero and explicitly allocate it for
this purpose.

From a guest point of view these are pretty similar, but from a host
perspective I'd say there's a difference.

Kevin
On 09/03/2012 17:37, Kevin Wolf wrote:
>> Remove the bdrv_co_write_zeroes callback. Instead use the discard
>> information from bdrv_get_info to choose between bdrv_co_discard
>> and a normal write.
>>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>
> I'm not sure if this is a good idea.
>
> The goal of discard is to remove data from the image (or not add it if
> it isn't there yet) and ideally deallocate the used clusters. The goal
> of write_zeroes is to mark space as zero and explicitly allocate it for
> this purpose.
>
> From a guest point of view these are pretty similar, but from a host
> perspective I'd say there's a difference.

True. However, we need to present a uniform view to the guests,
including the granularity, or discard can never be enabled. The
granularity must be 512 on IDE (though it can be higher on SCSI), so
there are problems mapping block layer discard straight down to the
guest. There are basically three ways to do this:

1) we could cheat and present a discard_granularity that is smaller
than what the underlying format/protocol supports. This is fine but
forces discard_zeroes_data to be false. That's a pity because Linux 3.4
will start using efficient zero write operations (WRITE SAME on SCSI,
but could be extended to UNMAP/TRIM if discard_zeroes_data is true).

2) we can make an emulated discard that always supports 512-byte
granularity and always zeroes data. This patch series takes this route.

3) we can let the user choose between (1) and (2). I didn't choose this
because of laziness mostly---co_write_zeroes support is not really
complete, for example there's no aio version to use in device
models---and also because I doubt anyone would really use the option.

Paolo
I believe your patch set provides these behaviors now:

      * QEMU block drivers report discard_granularity.
              * discard_granularity = 0 means no discard. The guest is
                told there's no discard support.
              * discard_granularity < 0 is undefined.
              * discard_granularity > 0 is reported to the guest as
                discard support.
      * QEMU block drivers report discard_zeros_data. This is passed
        to the guest when discard_granularity > 0.

I propose adding the following behaviors in any event:

      * If a QEMU block device reports a discard_granularity > 0, it
        must be equal to 2^n (n >= 0), or QEMU's block core will change
        it to 0. (Non-power-of-two granularities are not likely to
        exist in the real world, and this assumption greatly simplifies
        ensuring correctness.)
      * For SCSI, report an unmap_granularity to the guest as follows:
        max(logical_block_size, discard_granularity) / logical_block_size

Regarding emulating discard_zeros_data... I agree that when
discard_zeros_data is set, we will need to write zeroes in some cases.
As you noted, IDE has a fixed granularity of one sector. And the SCSI
granularity is a hint only; guests are not guaranteed to align to that
value either. [0]

As a design concept, instead of guaranteeing that 512B zero'ing
discards are supported, I think the QEMU block layer should instead
guarantee aligned discards to QEMU block devices, emulating any
misaligned discards (or portions thereof) by writing zeroes if (and
only if) discard_zeros_data is set.

When the QEMU block layer gets a discard:

      * Of the specified discard range, see if it includes an aligned
        multiple of the discard granularity. If so, save that as the
        starting point of a subrange. Then find the last aligned
        multiple, if any, and pass that subrange (if start != end)
        down to the block driver's discard function.
      * If the discard really fails (i.e. returns failure and sets
        errno to something other than "not supported" or equivalent),
        return failure to the guest. For "not supported", fall through
        to the code below with the full range.
      * At this point, we have zero, one, or two subranges to handle.
      * If and only if discard_zeros_data is set, write zeros to the
        remaining subranges, if any. (This would use a lower-level
        write_zeroes call which does not attempt to use discard.) If
        this fails, return failure to the guest.
      * Return success.

This leaves one remaining issue: In raw-posix.c, for files (i.e. not
devices), I assume you're going to advertise discard_granularity=1 and
discard_zeros_data=1 when compiled with support for
fallocate(FALLOC_FL_PUNCH_HOLE). Note, I'm assuming fallocate()
actually guarantees that it zeros the data when punching holes. I
haven't verified this.

If the guest does a big discard (think mkfs) and fallocate() returns
EOPNOTSUPP, you'll have to zero essentially the whole virtual disk,
which, as you noted, will also allocate it (unless you explicitly
check for holes). This is bad. It can be avoided by not advertising
discard_zeros_data, but as you noted, that's unfortunate.

If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could
avoid advertising discard support based on FALLOC_FL_PUNCH_HOLE when
it is not going to work. This would sidestep these problems. You said
it wasn't possible to probe for FALLOC_FL_PUNCH_HOLE. Have you
considered probing by extending the file by one byte and then punching
that:

    char buf = 0;
    fstat(s->fd, &st);
    pwrite(s->fd, &buf, 1, st.st_size + 1);
    has_discard = !fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                             st.st_size + 1, 1);
    ftruncate(s->fd, st.st_size);

[0] See the last paragraph starting on page 8:
http://mkp.net/pubs/linux-advanced-storage.pdf
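The subrange computation in the first bullet above can be sketched as
follows. This is an illustrative sketch only, not QEMU code: the names
are invented here, and a power-of-two granularity is assumed, per the
earlier proposal. The head and tail remainders are the parts that would
be zero-written if (and only if) discard_zeros_data is set.

```c
#include <stdint.h>

/*
 * Split a guest discard request [offset, offset+len) into an aligned
 * middle subrange (handed to the block driver's discard function) and
 * unaligned head/tail remainders.  granularity must be a power of two.
 */
typedef struct DiscardSplit {
    uint64_t head_len;   /* unaligned bytes before the aligned middle */
    uint64_t mid_start;  /* start of the aligned subrange */
    uint64_t mid_len;    /* length of the aligned subrange */
    uint64_t tail_len;   /* unaligned bytes after the aligned middle */
} DiscardSplit;

static DiscardSplit split_discard(uint64_t offset, uint64_t len,
                                  uint64_t granularity)
{
    uint64_t end = offset + len;
    /* round offset up and end down to the granularity */
    uint64_t mid_start = (offset + granularity - 1) & ~(granularity - 1);
    uint64_t mid_end = end & ~(granularity - 1);
    DiscardSplit s = { 0, offset, 0, 0 };

    if (mid_start < mid_end) {
        s.head_len = mid_start - offset;
        s.mid_start = mid_start;
        s.mid_len = mid_end - mid_start;
        s.tail_len = end - mid_end;
    } else {
        /* no aligned multiple inside the range: everything is remainder */
        s.head_len = len;
    }
    return s;
}
```

For example, with a 4096-byte granularity, a discard of 12288 bytes at
offset 512 yields a 3584-byte head, an aligned middle of 8192 bytes at
offset 4096, and a 512-byte tail.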
On 10/03/2012 19:02, Richard Laager wrote:
> I propose adding the following behaviors in any event:
>       * If a QEMU block device reports a discard_granularity > 0, it
>         must be equal to 2^n (n >= 0), or QEMU's block core will change
>         it to 0. (Non-power-of-two granularities are not likely to exist
>         in the real world, and this assumption greatly simplifies
>         ensuring correctness.)

Yeah, I was considering this to be simply a bug in the block device.

>       * For SCSI, report an unmap_granularity to the guest as follows:
>         max(logical_block_size, discard_granularity) / logical_block_size

This is more or less already in place later in the series.

> As a design concept, instead of guaranteeing that 512B zero'ing discards
> are supported, I think the QEMU block layer should instead guarantee
> aligned discards to QEMU block devices, emulating any misaligned
> discards (or portions thereof) by writing zeroes if (and only if)
> discard_zeros_data is set.

Yes, this can be done of course. This series does not include it yet.

> This leaves one remaining issue: In raw-posix.c, for files (i.e. not
> devices), I assume you're going to advertise discard_granularity=1 and
> discard_zeros_data=1 when compiled with support for
> fallocate(FALLOC_FL_PUNCH_HOLE). Note, I'm assuming fallocate() actually
> guarantees that it zeros the data when punching holes.

It does, that's pretty much the definition of a hole.

> If the guest does a big discard (think mkfs) and fallocate() returns
> EOPNOTSUPP, you'll have to zero essentially the whole virtual disk,
> which, as you noted, will also allocate it (unless you explicitly check
> for holes). This is bad. It can be avoided by not advertising
> discard_zeros_data, but as you noted, that's unfortunate.

If you have a new kernel that supports SEEK_HOLE/SEEK_DATA, it can
also be done by skipping the zero write on known holes. This could
even be done at the block layer level using bdrv_is_allocated.

> If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could avoid
> advertising discard support based on FALLOC_FL_PUNCH_HOLE when it is not
> going to work. This would sidestep these problems.

... and introduce others when migrating if your datacenter doesn't
have homogeneous kernel versions and/or filesystems. :(

> You said it wasn't
> possible to probe for FALLOC_FL_PUNCH_HOLE. Have you considered probing
> by extending the file by one byte and then punching that:
>
>     char buf = 0;
>     fstat(s->fd, &st);
>     pwrite(s->fd, &buf, 1, st.st_size + 1);
>     has_discard = !fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>                              st.st_size + 1, 1);
>     ftruncate(s->fd, st.st_size);

Nice trick. :) Yes, that could work.

Do you know if non-Linux operating systems have something similar to
BLKDISCARDZEROES?

Paolo
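The SEEK_HOLE/SEEK_DATA idea above can be sketched like this. This is
an illustrative sketch, not QEMU code: the function name is invented
here, and it assumes a kernel and filesystem that support SEEK_DATA
(on older kernels lseek() fails with EINVAL, and a caller would have
to fall back to writing unconditionally).

```c
#define _GNU_SOURCE          /* for SEEK_DATA */
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Before emulating a discard by writing zeros, check whether the byte
 * range [start, start+len) is already a hole and can be skipped.
 * SEEK_DATA returns the offset of the first data at or after 'start';
 * if that offset lies past the range, the range is entirely a hole.
 */
static bool range_is_hole(int fd, off_t start, off_t len)
{
    off_t data = lseek(fd, start, SEEK_DATA);
    if (data < 0) {
        /* ENXIO: no data between start and EOF; EINVAL: no support */
        return errno == ENXIO;
    }
    return data >= start + len;
}
```

Note that a filesystem without real hole tracking may report the whole
file as data, so this is only an optimization, never a correctness
requirement.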
On 12.03.2012 13:27, Paolo Bonzini wrote:
> On 10/03/2012 19:02, Richard Laager wrote:
>> I propose adding the following behaviors in any event:
>>       * If a QEMU block device reports a discard_granularity > 0, it
>>         must be equal to 2^n (n >= 0), or QEMU's block core will change
>>         it to 0. (Non-power-of-two granularities are not likely to exist
>>         in the real world, and this assumption greatly simplifies
>>         ensuring correctness.)
>
> Yeah, I was considering this to be simply a bug in the block device.

Block driver, you mean? Yes, I think we can just assert() this.

Kevin
On Mon, 2012-03-12 at 10:34 +0100, Paolo Bonzini wrote:
> To be completely correct, I suggest the following behavior:
>
> 1. Add a discard boolean option to the disk layer.
>
> 2. If discard is not specified:
>
>    * For files, detect a true/false value by comparing
>      stat.st_blocks != stat.st_size >> 9.
>
>    * For devices, assume a fixed value (true?).
>
> 3. If discard is true, issue discards.
>
> 4. If discard is false, do not issue discards.
>
> The problem is, who will use this interface?

I'm a libvirt and virt-manager user; virt-manager already
differentiates between thin and thick provisioning. So I'm envisioning
passing that information to libvirt, which would save it in a config
file and use that to set discard=true vs. discard=false when using
QEMU.

On Mon, 2012-03-12 at 13:27 +0100, Paolo Bonzini wrote:
> On 10/03/2012 19:02, Richard Laager wrote:
>> I propose adding the following behaviors in any event:
>>       * If a QEMU block device reports a discard_granularity > 0, it
>>         must be equal to 2^n (n >= 0), or QEMU's block core will change
>>         it to 0. (Non-power-of-two granularities are not likely to exist
>>         in the real world, and this assumption greatly simplifies
>>         ensuring correctness.)
>
> Yeah, I was considering this to be simply a bug in the block device.
>
>>       * For SCSI, report an unmap_granularity to the guest as follows:
>>         max(logical_block_size, discard_granularity) / logical_block_size
>
> This is more or less already in place later in the series.

I didn't see it. Which patch number?

>> Note, I'm assuming fallocate() actually
>> guarantees that it zeros the data when punching holes.
>
> It does, that's pretty much the definition of a hole.

Agreed. I verified this fact after sending that email. At the time, I
just wanted to be very clear on what I knew for sure vs. what I had
not yet verified.

> If you have a new kernel that supports SEEK_HOLE/SEEK_DATA, it can also
> be done by skipping the zero write on known holes.
>
> This could even be done at the block layer level using bdrv_is_allocated.

Would we want to make all write_zeros operations check for and skip
holes, or is write_zeros different from a discard in that it
SHOULD/MUST allocate space?

>> If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could avoid
>> advertising discard support based on FALLOC_FL_PUNCH_HOLE when it is not
>> going to work. This would sidestep these problems.
>
> ... and introduce others when migrating if your datacenter doesn't have
> homogeneous kernel versions and/or filesystems. :(

I hadn't thought of the migration issues. Thanks for bringing that up.

Worst case, you end up doing a bunch of zero writing if and only if
you migrate from a discard_zeros_data host to one that doesn't (or
doesn't do discard at all). But this only lasts until the guest
reboots (assuming we also add a behavior of re-probing on guest
reboot--or until it shuts down if we don't or can't). As far as I can
see, this is unavoidable, though. And this is no worse than writing
zeros ALL of the time that fallocate() fails, which is the behavior of
your patch series, right?

This might be another use case for a discard option on the disk. If
some but not all of one's hosts support discard, a system
administrator might want to set discard=0 to avoid this.

> Do you know if non-Linux operating systems have something similar to
> BLKDISCARDZEROES?

As far as I know, no. The SunOS one is only on Illumos (the open
source kernel forked from the now-dead OpenSolaris) and only
implemented for ZFS zvols. So currently, it's roughly equivalent to
fallocate() on Linux in that it's happening at the filesystem level.
(It doesn't actually reach the platters yet. But even if it did,
that's unrelated to the guarantees provided by ZFS.) Thus, it always
zeros, so we could set discard_zeros_data = 1 unconditionally there. I
should probably run that by the Illumos developers, though, to ensure
they're comfortable with that ioctl() guaranteeing zeroing.

I haven't looked into the FreeBSD one as much yet. Worst case, we
unconditionally set discard_zeros_data = 0.
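The file-detection heuristic quoted at the top of this message can be
sketched as below. This is an illustrative sketch, not QEMU code; the
function name is invented here, and it uses a strict `<` rather than
the quoted `!=` so that a fully allocated file whose metadata pushes
st_blocks slightly above the apparent size is not misdetected as
sparse. st_blocks counts 512-byte units.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Guess that a raw file is thin-provisioned (so discards are worth
 * issuing) when fewer blocks are allocated than its apparent size
 * implies.
 */
static bool guess_discard(const struct stat *st)
{
    return (uint64_t)st->st_blocks < (uint64_t)(st->st_size >> 9);
}
```

A preallocated 1 MB file has st_blocks >= 2048 and would default to
discard=false; a freshly truncated sparse file has st_blocks near 0
and would default to discard=true.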
On 13/03/2012 20:13, Richard Laager wrote:
>> To be completely correct, I suggest the following behavior:
>>
>> 1. Add a discard boolean option to the disk layer.
>>
>> 2. If discard is not specified:
>>
>>    * For files, detect a true/false value by comparing
>>      stat.st_blocks != stat.st_size >> 9.
>>
>>    * For devices, assume a fixed value (true?).
>>
>> 3. If discard is true, issue discards.
>>
>> 4. If discard is false, do not issue discards.
>>
>> The problem is, who will use this interface?
>
> I'm a libvirt and virt-manager user; virt-manager already differentiates
> between thin and thick provisioning. So I'm envisioning passing that
> information to libvirt, which would save it in a config file and use
> that to set discard=true vs. discard=false when using QEMU.

Yeah, it could be set also at the pool level for libvirt.

>>> * For SCSI, report an unmap_granularity to the guest as follows:
>>>   max(logical_block_size, discard_granularity) / logical_block_size
>>
>> This is more or less already in place later in the series.
>
> I didn't see it. Which patch number?

Patch 11:

+    discard_granularity = s->qdev.conf.discard_granularity;
+    if (discard_granularity == -1) {
+        s->qdev.conf.discard_granularity = s->qdev.conf.logical_block_size;
+    } else if (discard_granularity < s->qdev.conf.logical_block_size) {
+        error_report("scsi-block: invalid discard_granularity");
+        return -1;
+    } else if (discard_granularity & (discard_granularity - 1)) {
+        error_report("scsi-block: discard_granularity not a power of two");
+        return -1;
+    }

>> If you have a new kernel that supports SEEK_HOLE/SEEK_DATA, it can also
>> be done by skipping the zero write on known holes.
>>
>> This could even be done at the block layer level using bdrv_is_allocated.
>
> Would we want to make all write_zeros operations check for and skip
> holes, or is write_zeros different from a discard in that it SHOULD/MUST
> allocate space?

I think that's pretty much the question to answer for this patch to
graduate from the RFC state (the rest is just technicalities, so to
speak). So far, write_zeros was intended to be an efficient operation
(it avoids allocating a cluster in qed and will do the same in qcow3,
which is why I decided to merge it with discard).

>>> If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could avoid
>>> advertising discard support based on FALLOC_FL_PUNCH_HOLE when it is not
>>> going to work. This would sidestep these problems.
>>
>> ... and introduce others when migrating if your datacenter doesn't have
>> homogeneous kernel versions and/or filesystems. :(
>
> I hadn't thought of the migration issues. Thanks for bringing that up.
>
> Worst case, you end up doing a bunch of zero writing if and only if you
> migrate from a discard_zeros_data host to one that doesn't (or doesn't
> do discard at all). But this only lasts until the guest reboots
> (assuming we also add a behavior of re-probing on guest reboot--or until
> it shuts down if we don't or can't). As far as I can see, this is
> unavoidable, though. And this is no worse than writing zeros ALL of the
> time that fallocate() fails, which is the behavior of your patch series,
> right?

It is worse in that we do not want the hardware parameters exposed to
the guest to change behind the scenes, except if you change the
machine type or if you use the default unversioned type.

> This might be another use case for a discard option on the disk. If some
> but not all of one's hosts support discard, a system administrator might
> want to set discard=0 to avoid this.

A discard option is already there (discard_granularity=0). Libvirt
could choose to expose it even now, but it would be of little use in
the absence of support for unaligned discards.

Paolo
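The checks in the patch 11 hunk quoted above, combined with Richard's
formula, can be sketched as one helper. This is an illustrative
sketch, not QEMU code: the name and error convention are invented
here, and a granularity of 0 (discard disabled) is assumed to be
filtered out by the caller.

```c
#include <stdint.h>

/*
 * Granularity reported to a SCSI guest, in logical blocks:
 *     max(logical_block_size, discard_granularity) / logical_block_size
 * Returns -1 for a non-power-of-two granularity, mirroring the
 * error_report() path in the quoted hunk.
 */
static int64_t scsi_unmap_granularity(uint64_t discard_granularity,
                                      uint64_t logical_block_size)
{
    if (discard_granularity & (discard_granularity - 1)) {
        return -1;  /* not a power of two */
    }
    if (discard_granularity < logical_block_size) {
        discard_granularity = logical_block_size;
    }
    return discard_granularity / logical_block_size;
}
```

So a 4096-byte cluster size over 512-byte logical blocks is reported
as a granularity of 8 blocks, and anything at or below the logical
block size collapses to 1.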
On 14.03.2012 08:41, Paolo Bonzini wrote:
> On 13/03/2012 20:13, Richard Laager wrote:
>>> If you have a new kernel that supports SEEK_HOLE/SEEK_DATA, it can also
>>> be done by skipping the zero write on known holes.
>>>
>>> This could even be done at the block layer level using bdrv_is_allocated.
>>
>> Would we want to make all write_zeros operations check for and skip
>> holes, or is write_zeros different from a discard in that it SHOULD/MUST
>> allocate space?
>
> I think that's pretty much the question to answer for this patch to graduate
> from the RFC state (the rest is just technicalities, so to speak). So far,
> write_zeros was intended to be an efficient operation (it avoids allocating
> a cluster in qed and will do the same in qcow3, which is why I decided to
> merge it with discard).

Yes, for qcow3 and to some degree also for QED, setting the zero flag
is the natural implementation for both discard and write_zeros. The
big question is what happens with other formats.

Paolo mentioned a use case as a fast way for guests to write zeros,
but is it really faster than a normal write when we have to emulate it
by a bdrv_write with a temporary buffer of zeros? On the other hand we
have the cases where discard really means "I don't care about the data
any more" and emulating it by writing zeros is just a waste of
resources there.

So I think we only want to advertise that discard zeroes data if we
can do it efficiently. This means that the format does support it, and
that the device is able to communicate the discard granularity
(= cluster size) to the guest OS.

Kevin
On 14/03/2012 13:01, Kevin Wolf wrote:
> On 14.03.2012 08:41, Paolo Bonzini wrote:
>> On 13/03/2012 20:13, Richard Laager wrote:
>>>> If you have a new kernel that supports SEEK_HOLE/SEEK_DATA, it can also
>>>> be done by skipping the zero write on known holes.
>>>>
>>>> This could even be done at the block layer level using bdrv_is_allocated.
>>>
>>> Would we want to make all write_zeros operations check for and skip
>>> holes, or is write_zeros different from a discard in that it SHOULD/MUST
>>> allocate space?
>>
>> I think that's pretty much the question to answer for this patch to graduate
>> from the RFC state (the rest is just technicalities, so to speak). So far,
>> write_zeros was intended to be an efficient operation (it avoids allocating
>> a cluster in qed and will do the same in qcow3, which is why I decided to
>> merge it with discard).
>
> Yes, for qcow3 and to some degree also for QED, setting the zero flag is
> the natural implementation for both discard and write_zeros. The big
> question is what happens with other formats.

Also raw if used with a sparse file.

> Paolo mentioned a use case as a fast way for guests to write zeros, but
> is it really faster than a normal write when we have to emulate it by a
> bdrv_write with a temporary buffer of zeros?

No, of course not.

> On the other hand we have
> the cases where discard really means "I don't care about the data any
> more" and emulating it by writing zeros is just a waste of resources there.
>
> So I think we only want to advertise that discard zeroes data if we can
> do it efficiently. This means that the format does support it, and that
> the device is able to communicate the discard granularity (= cluster
> size) to the guest OS.

Note that the discard granularity is only a hint, so it's really more
a maximum suggested value than a granularity. Outside of a cluster
boundary the format would still have to write zeros manually.

Also, Linux for example will only round the number of sectors down to
the granularity, not the start sector. Rereading the code, for SCSI we
want to advertise a zero granularity (aka do whatever you want),
otherwise we may get only misaligned discard requests and end up
writing zeroes inefficiently all the time.

The problem is that advertising discard_zeroes_data based on the
backend calls for trouble as soon as you migrate between storage
formats, filesystems or disks.

(BTW, if the backing file allows discard and zeroes data, efficient
write-zeroes could be done in qcow2 by allocating a cluster and
discarding its contents. It's similar to how you do preallocated
metadata.)

Paolo
On 14.03.2012 13:14, Paolo Bonzini wrote:
>> Paolo mentioned a use case as a fast way for guests to write zeros, but
>> is it really faster than a normal write when we have to emulate it by a
>> bdrv_write with a temporary buffer of zeros?
>
> No, of course not.
>
>> On the other hand we have
>> the cases where discard really means "I don't care about the data any
>> more" and emulating it by writing zeros is just a waste of resources there.
>>
>> So I think we only want to advertise that discard zeroes data if we can
>> do it efficiently. This means that the format does support it, and that
>> the device is able to communicate the discard granularity (= cluster
>> size) to the guest OS.
>
> Note that the discard granularity is only a hint, so it's really more a
> maximum suggested value than a granularity. Outside of a cluster
> boundary the format would still have to write zeros manually.

You're talking about SCSI here, I guess? Would be one case where being
able to define sane semantics for virtio-blk would have been an
advantage... I had hoped that SCSI was already sane, but if it doesn't
distinguish between "I don't care about this any more" and "I want to
have zeros here", then I'm afraid I can't call it sane any more.

We can make the conditions even stricter, i.e. allow it only if the
protocol can pass through discards for unaligned requests. This
wouldn't free clusters on an image format level, but at least on a
file system level.

> Also, Linux for example will only round the number of sectors down to
> the granularity, not the start sector. Rereading the code, for SCSI we
> want to advertise a zero granularity (aka do whatever you want),
> otherwise we may get only misaligned discard requests and end up writing
> zeroes inefficiently all the time.

Does this make sense with real hardware or is it a Linux bug?

> The problem is that advertising discard_zeroes_data based on the backend
> calls for trouble as soon as you migrate between storage formats,
> filesystems or disks.

True. You would have to emulate if you migrate from a source that can
discard to zeros efficiently to a destination that can't.

In the end, I guess we'll just have to accept that we can't fix bad
semantics of ATA and SCSI, and just need to decide whether "I don't
care" or "I want to have zeros" is more common. My feeling is that "I
don't care" is the more useful operation because it can't be expressed
otherwise, but I haven't checked what guests really do.

> (BTW, if the backing file allows discard and zeroes data, efficient
> write-zeroes could be done in qcow2 by allocating a cluster and
> discarding its contents. It's similar to how you do preallocated
> metadata.)

Yes.

Kevin
On 14/03/2012 13:37, Kevin Wolf wrote:
> On 14.03.2012 13:14, Paolo Bonzini wrote:
>>> Paolo mentioned a use case as a fast way for guests to write zeros, but
>>> is it really faster than a normal write when we have to emulate it by a
>>> bdrv_write with a temporary buffer of zeros?
>>
>> No, of course not.
>>
>>> On the other hand we have
>>> the cases where discard really means "I don't care about the data any
>>> more" and emulating it by writing zeros is just a waste of resources there.
>>>
>>> So I think we only want to advertise that discard zeroes data if we can
>>> do it efficiently. This means that the format does support it, and that
>>> the device is able to communicate the discard granularity (= cluster
>>> size) to the guest OS.
>>
>> Note that the discard granularity is only a hint, so it's really more a
>> maximum suggested value than a granularity. Outside of a cluster
>> boundary the format would still have to write zeros manually.
>
> You're talking about SCSI here, I guess? Would be one case where being
> able to define sane semantics for virtio-blk would have been an
> advantage... I had hoped that SCSI was already sane, but if it doesn't
> distinguish between "I don't care about this any more" and "I want to
> have zeros here", then I'm afraid I can't call it sane any more.

It does make the distinction. "I don't care" is UNMAP (or WRITE
SAME(16) with the UNMAP bit set); "I want to have zeroes" is WRITE
SAME(10) or WRITE SAME(16) with an all-zero payload.

> We can make the conditions even stricter, i.e. allow it only if the
> protocol can pass through discards for unaligned requests. This
> wouldn't free clusters on an image format level, but at least on a
> file system level.
>
>> Also, Linux for example will only round the number of sectors down to
>> the granularity, not the start sector. Rereading the code, for SCSI we
>> want to advertise a zero granularity (aka do whatever you want),
>> otherwise we may get only misaligned discard requests and end up writing
>> zeroes inefficiently all the time.
>
> Does this make sense with real hardware or is it a Linux bug?

It's a bug; SCSI defines the "optimal unmap request starting LBA" to
be "(n × optimal unmap granularity) + unmap granularity alignment".

>> The problem is that advertising discard_zeroes_data based on the backend
>> calls for trouble as soon as you migrate between storage formats,
>> filesystems or disks.
>
> True. You would have to emulate if you migrate from a source that can
> discard to zeros efficiently to a destination that can't.
>
> In the end, I guess we'll just have to accept that we can't fix bad
> semantics of ATA and SCSI, and just need to decide whether "I don't
> care" or "I want to have zeros" is more common. My feeling is that "I
> don't care" is the more useful operation because it can't be expressed
> otherwise, but I haven't checked what guests really do.

Yeah, guests right now only use it for unused filesystem pieces, so
the "do not care" semantics are fine. I also hoped to use discard to
avoid blowing up thin-provisioned images when streaming. Perhaps we
can use bdrv_has_zero_init instead, and/or pass down the copy-on-read
flag to the block driver.

Anyhow, there are some patches from this series that are relatively
independent and ready for inclusion; I'll extract them and post them
separately.

Paolo
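The SBC wording quoted above can be expressed as a simple alignment
check. This is an illustrative sketch, not QEMU or kernel code; the
names are invented here. The point is that an initiator honoring the
hint should align the start sector too, not only round the sector
count down.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * An optimal unmap request starts at
 *     (n * optimal unmap granularity) + unmap granularity alignment
 * for some integer n >= 0.  A zero granularity means the target
 * expressed no preference.
 */
static bool unmap_start_is_optimal(uint64_t lba, uint64_t granularity,
                                   uint64_t alignment)
{
    if (granularity == 0) {
        return true;  /* no granularity advertised: anything goes */
    }
    return lba % granularity == alignment % granularity;
}
```

With a granularity of 8 sectors and an alignment of 0, LBA 16 is an
optimal starting point while LBA 3 is not; with an alignment of 3,
LBA 11 is.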
On 14.03.2012 13:49, Paolo Bonzini wrote:
> On 14/03/2012 13:37, Kevin Wolf wrote:
>> On 14.03.2012 13:14, Paolo Bonzini wrote:
>>>> Paolo mentioned a use case as a fast way for guests to write zeros, but
>>>> is it really faster than a normal write when we have to emulate it by a
>>>> bdrv_write with a temporary buffer of zeros?
>>>
>>> No, of course not.
>>>
>>>> On the other hand we have
>>>> the cases where discard really means "I don't care about the data any
>>>> more" and emulating it by writing zeros is just a waste of resources there.
>>>>
>>>> So I think we only want to advertise that discard zeroes data if we can
>>>> do it efficiently. This means that the format does support it, and that
>>>> the device is able to communicate the discard granularity (= cluster
>>>> size) to the guest OS.
>>>
>>> Note that the discard granularity is only a hint, so it's really more a
>>> maximum suggested value than a granularity. Outside of a cluster
>>> boundary the format would still have to write zeros manually.
>>
>> You're talking about SCSI here, I guess? Would be one case where being
>> able to define sane semantics for virtio-blk would have been an
>> advantage... I had hoped that SCSI was already sane, but if it doesn't
>> distinguish between "I don't care about this any more" and "I want to
>> have zeros here", then I'm afraid I can't call it sane any more.
>
> It does make the distinction. "I don't care" is UNMAP (or WRITE
> SAME(16) with the UNMAP bit set); "I want to have zeroes" is WRITE
> SAME(10) or WRITE SAME(16) with an all-zero payload.

Good. So why are we trying to make both the same in qemu? Keeping two
different callbacks in the block drivers for two different operations
seems to make much more sense to me.

>> We can make the conditions even stricter, i.e. allow it only if the
>> protocol can pass through discards for unaligned requests. This
>> wouldn't free clusters on an image format level, but at least on a
>> file system level.
>
>>> Also, Linux for example will only round the number of sectors down to
>>> the granularity, not the start sector. Rereading the code, for SCSI we
>>> want to advertise a zero granularity (aka do whatever you want),
>>> otherwise we may get only misaligned discard requests and end up writing
>>> zeroes inefficiently all the time.
>>
>> Does this make sense with real hardware or is it a Linux bug?
>
> It's a bug; SCSI defines the "optimal unmap request starting LBA" to be
> "(n × optimal unmap granularity) + unmap granularity alignment".

Okay. Needs to get fixed in Linux then, not in qemu. Means that qemu
might ignore some discard requests by broken Linux versions, but
ignoring is a valid action for "don't care".

>>> The problem is that advertising discard_zeroes_data based on the backend
>>> calls for trouble as soon as you migrate between storage formats,
>>> filesystems or disks.
>>
>> True. You would have to emulate if you migrate from a source that can
>> discard to zeros efficiently to a destination that can't.
>>
>> In the end, I guess we'll just have to accept that we can't fix bad
>> semantics of ATA and SCSI, and just need to decide whether "I don't
>> care" or "I want to have zeros" is more common. My feeling is that "I
>> don't care" is the more useful operation because it can't be expressed
>> otherwise, but I haven't checked what guests really do.
>
> Yeah, guests right now only use it for unused filesystem pieces, so the
> "do not care" semantics are fine. I also hoped to use discard to avoid
> blowing up thin-provisioned images when streaming. Perhaps we can use
> bdrv_has_zero_init instead, and/or pass down the copy-on-read flag to
> the block driver.

You mean for image formats that don't support explicit zero clusters?
It only works without backing files, and I'm not sure how common this
case is.

> Anyhow, there are some patches from this series that are relatively
> independent and ready for inclusion; I'll extract them and post them
> separately.

Okay.

Kevin
On Sat, Mar 10, 2012 at 12:02:40PM -0600, Richard Laager wrote:
> If we could probe for FALLOC_FL_PUNCH_HOLE support, then we could avoid
> advertising discard support based on FALLOC_FL_PUNCH_HOLE when it is not
> going to work. This would sidestep these problems. You said it wasn't
> possible to probe for FALLOC_FL_PUNCH_HOLE. Have you considered probing
> by extending the file by one byte and then punching that:
>
>     char buf = 0;
>     fstat(s->fd, &st);
>     pwrite(s->fd, &buf, 1, st.st_size + 1);
>     has_discard = !fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,

There is no point in using FALLOC_FL_KEEP_SIZE together with
FALLOC_FL_PUNCH_HOLE.
On Wed, Mar 14, 2012 at 01:01:35PM +0100, Kevin Wolf wrote:
> Paolo mentioned a use case as a fast way for guests to write zeros, but
> is it really faster than a normal write when we have to emulate it by a
> bdrv_write with a temporary buffer of zeros? On the other hand we have
> the cases where discard really means "I don't care about the data any
> more" and emulating it by writing zeros is just a waste of resources there.

On raw a real discard also will destroy the preallocation, which might
absolutely kill your performance on filesystems.

XFS provides a XFS_IOC_ZERO_RANGE ioctl (same calling convention as
XFS_IOC_RESVSP), which would provide a good backend for WRITE SAME
emulation in this context.
On Wed, Mar 14, 2012 at 01:14:18PM +0100, Paolo Bonzini wrote:
>
> Note that the discard granularity is only a hint, so it's really more a
> maximum suggested value than a granularity.  Outside of a cluster
> boundary the format would still have to write zeros manually.
>
> Also, Linux for example will only round the number of sectors down to
> the granularity, not the start sector.

Which really is more of a bug than a feature.  The current discard
implementation in Linux is still very poor.
On Wed, Mar 14, 2012 at 01:49:48PM +0100, Paolo Bonzini wrote:
> It does make the distinction.  "I don't care" is UNMAP (or WRITE
> SAME(16) with the UNMAP bit set); "I want to have zeroes" is WRITE
> SAME(10) or WRITE SAME(16) with an all-zero payload.

But once the target sets the unmap zeroes data bit even UNMAP / WRITE
SAME with unmap have to zero all blocks touched.

Also strictly speaking even the WRITE SAME with unmap requires a payload
of zeroes, even if most targets don't bother to check for this.
On Sat, Mar 24, 2012 at 04:29:26PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 14, 2012 at 01:01:35PM +0100, Kevin Wolf wrote:
> > Paolo mentioned a use case as a fast way for guests to write zeros, but
> > is it really faster than a normal write when we have to emulate it by a
> > bdrv_write with a temporary buffer of zeros? On the other hand we have
> > the cases where discard really means "I don't care about the data any
> > more" and emulating it by writing zeros is just a waste of resources there.
>
> On raw a real discard also will destroy the preallocation, which might
> absolutely kill your performance on filesystems.

This suggests that there be a new command line param to '-drive' to turn
discard support on/off, since QEMU can't reliably know if the raw file
it is given is intended to be fully pre-allocated by the mgmt app.

Regards,
Daniel
On Mon, Mar 26, 2012 at 10:44:07AM +0100, Daniel P. Berrange wrote:
> This suggests that there be a new command line param to '-drive' to turn
> discard support on/off, since QEMU can't reliably know if the raw file
> it is given is intended to be fully pre-allocated by the mgmt app.

Yes.
On Sat, 2012-03-24 at 16:27 +0100, Christoph Hellwig wrote:
> > has_discard = !fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>
> There is no point in using FALLOC_FL_KEEP_SIZE together with
> FALLOC_FL_PUNCH_HOLE.

It's *required*. From the man page [0], "The FALLOC_FL_PUNCH_HOLE flag
must be ORed with FALLOC_FL_KEEP_SIZE in mode;"

[0] http://man7.org/linux/man-pages/man2/fallocate.2.html
On Sat, 2012-03-24 at 16:30 +0100, Christoph Hellwig wrote:
> On Wed, Mar 14, 2012 at 01:14:18PM +0100, Paolo Bonzini wrote:
> >
> > Note that the discard granularity is only a hint, so it's really more a
> > maximum suggested value than a granularity.  Outside of a cluster
> > boundary the format would still have to write zeros manually.
> >
> > Also, Linux for example will only round the number of sectors down to
> > the granularity, not the start sector.
>
> Which really is more of a bug than a feature.  The current discard
> implementation in Linux is still very poor.

The disk protocols do not require the granularity to be respected. It is
*only a hint*. Therefore, QEMU has to handle non-aligned discards. It
doesn't really matter what we wish was the case; this is the reality.
On Mon, Mar 26, 2012 at 02:40:47PM -0500, Richard Laager wrote:
> On Sat, 2012-03-24 at 16:27 +0100, Christoph Hellwig wrote:
> > > has_discard = !fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> >
> > There is no point in using FALLOC_FL_KEEP_SIZE together with
> > FALLOC_FL_PUNCH_HOLE.
>
> It's *required*. From the man page [0], "The FALLOC_FL_PUNCH_HOLE flag
> must be ORed with FALLOC_FL_KEEP_SIZE in mode;"
>
> [0] http://man7.org/linux/man-pages/man2/fallocate.2.html

You're right - we check that it is set in common code before entering
the filesystem.  It's utterly pointless, but too late to change now -
I should have reviewed the patch to add it better.
Am 26.03.2012 21:40, schrieb Richard Laager:
> On Sat, 2012-03-24 at 16:30 +0100, Christoph Hellwig wrote:
>> On Wed, Mar 14, 2012 at 01:14:18PM +0100, Paolo Bonzini wrote:
>>>
>>> Note that the discard granularity is only a hint, so it's really more a
>>> maximum suggested value than a granularity.  Outside of a cluster
>>> boundary the format would still have to write zeros manually.
>>>
>>> Also, Linux for example will only round the number of sectors down to
>>> the granularity, not the start sector.
>>
>> Which really is more of a bug than a feature.  The current discard
>> implementation in Linux is still very poor.
>
> The disk protocols do not require the granularity to be respected. It is
> *only a hint*. Therefore, QEMU has to handle non-aligned discards. It
> doesn't really matter what we wish was the case; this is the reality.

But we can do suboptimal things (like just ignoring the request if we
didn't promise to zero data, or turn it into a non-discarding zero
buffer write otherwise) if the OS knowingly uses misaligned discards.
Not our bug then.

Kevin
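The clamping Kevin describes amounts to shrinking a misaligned request to its largest granularity-aligned inner subrange; the head and tail left over are either ignored ("don't care") or zero-written by hand. A sketch, assuming the granularity is a power of two in sectors (names are illustrative, not qemu API):

```c
#include <stdint.h>

/* Shrink [sector_num, sector_num + nb_sectors) to the largest inner
 * subrange aligned to 'granularity' (a power of two, in sectors).
 * *aligned_nb may come out as 0 if no full unit is covered. */
static void clamp_discard(uint64_t sector_num, uint64_t nb_sectors,
                          uint64_t granularity,
                          uint64_t *aligned_num, uint64_t *aligned_nb)
{
    /* Round the start up and the end down to unit boundaries. */
    uint64_t start = (sector_num + granularity - 1) & ~(granularity - 1);
    uint64_t end = (sector_num + nb_sectors) & ~(granularity - 1);

    *aligned_num = start;
    *aligned_nb = end > start ? end - start : 0;
}
```

For example, with granularity 8 a request for sectors [3, 23) clamps to [8, 16); sectors 3..7 and 16..22 would be ignored (or zeroed, if zeroing was promised).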
diff --git a/block.c b/block.c
index e4f7782..34db914 100644
--- a/block.c
+++ b/block.c
@@ -1656,6 +1656,7 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BlockDriverState *bs,
     int cluster_nb_sectors;
     size_t skip_bytes;
     int ret;
+    BlockDriverInfo bdi;

     /* Cover entire cluster so no additional backing file I/O is required when
      * allocating cluster in the image file.
@@ -1676,7 +1677,8 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BlockDriverState *bs,
         goto err;
     }

-    if (drv->bdrv_co_write_zeroes &&
+    /* If it is worthless, do not check if the buffer is zero. */
+    if (bdrv_get_info(bs, &bdi) >= 0 && bdi.discard_zeroes_data &&
         buffer_is_zero(bounce_buffer, iov.iov_len)) {
         ret = bdrv_co_do_write_zeroes(bs, cluster_sector_num,
                                       cluster_nb_sectors, &bounce_qiov);
@@ -1785,13 +1787,17 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov)
 {
     BlockDriver *drv = bs->drv;
+    BlockDriverInfo bdi;
     QEMUIOVector my_qiov;
     struct iovec iov;
     int ret;

     /* First try the efficient write zeroes operation */
-    if (drv->bdrv_co_write_zeroes) {
-        return drv->bdrv_co_write_zeroes(bs, sector_num, nb_sectors);
+    if (bdrv_get_info(bs, &bdi) >= 0 && bdi.discard_zeroes_data &&
+        bdi.discard_granularity &&
+        (sector_num & (bdi.discard_granularity - 1)) == 0 &&
+        (nb_sectors & (bdi.discard_granularity - 1)) == 0) {
+        return bdrv_co_discard(bs, sector_num, nb_sectors);
     }

     if (qiov) {
diff --git a/block/qed.c b/block/qed.c
index 9944be5..33fe03f 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -1395,13 +1395,6 @@ static BlockDriverAIOCB *bdrv_qed_aio_flush(BlockDriverState *bs,
     return bdrv_aio_flush(bs->file, cb, opaque);
 }

-static int coroutine_fn bdrv_qed_co_write_zeroes(BlockDriverState *bs,
-                                                 int64_t sector_num,
-                                                 int nb_sectors)
-{
-    return bdrv_co_discard(bs, sector_num, nb_sectors);
-}
-
 static int bdrv_qed_truncate(BlockDriverState *bs, int64_t offset)
 {
     BDRVQEDState *s = bs->opaque;
@@ -1563,7 +1556,6 @@ static BlockDriver bdrv_qed = {
     .bdrv_aio_writev = bdrv_qed_aio_writev,
     .bdrv_aio_flush = bdrv_qed_aio_flush,
     .bdrv_aio_discard = bdrv_qed_aio_discard,
-    .bdrv_co_write_zeroes = bdrv_qed_co_write_zeroes,
     .bdrv_truncate = bdrv_qed_truncate,
     .bdrv_getlength = bdrv_qed_getlength,
     .bdrv_get_info = bdrv_qed_get_info,
diff --git a/block_int.h b/block_int.h
index b460c36..2e8cdfe 100644
--- a/block_int.h
+++ b/block_int.h
@@ -131,14 +131,6 @@ struct BlockDriver {
         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
     int coroutine_fn (*bdrv_co_writev)(BlockDriverState *bs,
         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov);
-    /*
-     * Efficiently zero a region of the disk image.  Typically an image format
-     * would use a compact metadata representation to implement this.  This
-     * function pointer may be NULL and .bdrv_co_writev() will be called
-     * instead.
-     */
-    int coroutine_fn (*bdrv_co_write_zeroes)(BlockDriverState *bs,
-        int64_t sector_num, int nb_sectors);
     int coroutine_fn (*bdrv_co_discard)(BlockDriverState *bs,
         int64_t sector_num, int nb_sectors);
     int coroutine_fn (*bdrv_co_is_allocated)(BlockDriverState *bs,
Remove the bdrv_co_write_zeroes callback.  Instead use the discard
information from bdrv_get_info to choose between bdrv_co_discard
and a normal write.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block.c     |   12 +++++++++---
 block/qed.c |    8 --------
 block_int.h |    8 --------
 3 files changed, 9 insertions(+), 19 deletions(-)