diff mbox series

[for-3.0] file-posix: Fix write_zeroes with unmap on block devices

Message ID 20180726113337.23144-1-kwolf@redhat.com
State New
Headers show
Series [for-3.0] file-posix: Fix write_zeroes with unmap on block devices | expand

Commit Message

Kevin Wolf July 26, 2018, 11:33 a.m. UTC
The BLKDISCARD ioctl doesn't guarantee that the discarded blocks read as
all-zero afterwards, so don't try to abuse it for zero writing. We try
to only use this if BLKDISCARDZEROES tells us that it is safe, but this
is unreliable on older kernels and a constant 0 in newer kernels. In
other words, this code path is never actually used with newer kernels,
so we don't even try to unmap while writing zeros.

This patch removes the abuse of discard for writing zeroes from
file-posix and instead adds a new function that uses interfaces that are
actually meant to deallocate and zero out at the same time. Only if
those fail, it falls back to zeroing out without unmap. We never fall
back to a discard operation any more that may or may not result in
zeros.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/file-posix.c | 62 +++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 47 insertions(+), 15 deletions(-)

Comments

Eric Blake July 26, 2018, 2:28 p.m. UTC | #1
On 07/26/2018 06:33 AM, Kevin Wolf wrote:
> The BLKDISCARD ioctl doesn't guarantee that the discarded blocks read as
> all-zero afterwards, so don't try to abuse it for zero writing. We try
> to only use this if BLKDISCARDZEROES tells us that it is safe, but this
> is unreliable on older kernels and a constant 0 in newer kernels. In

For my own curiosity, which kernel commit switched to constant 0, so I 
can read the rationale in that commit message?

> other words, this code path is never actually used with newer kernels,
> so we don't even try to unmap while writing zeros.
> 
> This patch removes the abuse of discard for writing zeroes from
> file-posix and instead adds a new function that uses interfaces that are
> actually meant to deallocate and zero out at the same time. Only if
> those fail, it falls back to zeroing out without unmap. We never fall
> back to a discard operation any more that may or may not result in
> zeros.

Makes sense.

> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block/file-posix.c | 62 +++++++++++++++++++++++++++++++++++++++++-------------
>   1 file changed, 47 insertions(+), 15 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 60af4b3d51..9c66cd87d1 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -648,7 +648,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>       }
>   #endif
>   
> -    bs->supported_zero_flags = s->discard_zeroes ? BDRV_REQ_MAY_UNMAP : 0;
> +    bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
>       ret = 0;
>   fail:
>       if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
> @@ -1487,6 +1487,38 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>       return -ENOTSUP;
>   }
>   
> +static ssize_t handle_aiocb_write_zeroes_unmap(RawPosixAIOData *aiocb)
> +{
> +    BDRVRawState *s = aiocb->bs->opaque;
> +    int ret;
> +
> +    /* First try to write zeros and unmap at the same time */
> +
> +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> +    ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> +                       aiocb->aio_offset, aiocb->aio_nbytes);

Umm, doesn't this have to use FALLOC_FL_ZERO_RANGE? FALLOC_FL_PUNCH_HOLE 
deallocs, but is not required to write zeroes.

> +    if (ret != -ENOTSUP) {
> +        return ret;
> +    }
> +#endif
> +
> +#ifdef CONFIG_XFS
> +    if (s->is_xfs) {
> +        /* xfs_discard() guarantees that the discarded area reads as all-zero
> +         * afterwards, so we can use it here. */
> +        return xfs_discard(s, aiocb->aio_offset, aiocb->aio_nbytes);
> +    }
> +#endif
> +
> +    /* Make the compiler happy if neither of the above is compiled in */
> +    (void) s;

Could also be done in fewer lines by use of:

    BDRVRawState *s G_GNUC_UNUSED = aiocb->bs->opaque;

since that attribute means "might be unused, don't warn if it is 
actually unused" (and not the stricter "must be unused, warn if it got 
used anyway")

> +
> +    /* If we couldn't manage to unmap while guaranteed that the area reads as
> +     * all-zero afterwards, just write zeroes without unmapping */
> +    ret = handle_aiocb_write_zeroes(aiocb);
> +    return ret;
> +}
> +
>   #ifndef HAVE_COPY_FILE_RANGE
>   static off_t copy_file_range(int in_fd, off_t *in_off, int out_fd,
>                                off_t *out_off, size_t len, unsigned int flags)
Nir Soffer July 26, 2018, 3:06 p.m. UTC | #2
On Thu, Jul 26, 2018 at 5:28 PM Eric Blake <eblake@redhat.com> wrote:

> On 07/26/2018 06:33 AM, Kevin Wolf wrote:
> > The BLKDISCARD ioctl doesn't guarantee that the discarded blocks read as
> > all-zero afterwards, so don't try to abuse it for zero writing. We try
> > to only use this if BLKDISCARDZEROES tells us that it is safe, but this
> > is unreliable on older kernels and a constant 0 in newer kernels. In
>
> For my own curiosity, which kernel commit switched to constant 0, so I
> can read the rationale in that commit message?
>

This should help
https://patchwork.kernel.org/patch/9903757/

Nir
Kevin Wolf July 26, 2018, 3:06 p.m. UTC | #3
Am 26.07.2018 um 16:28 hat Eric Blake geschrieben:
> On 07/26/2018 06:33 AM, Kevin Wolf wrote:
> > The BLKDISCARD ioctl doesn't guarantee that the discarded blocks read as
> > all-zero afterwards, so don't try to abuse it for zero writing. We try
> > to only use this if BLKDISCARDZEROES tells us that it is safe, but this
> > is unreliable on older kernels and a constant 0 in newer kernels. In
> 
> For my own curiosity, which kernel commit switched to constant 0, so I can
> read the rationale in that commit message?

I think 48920ff2a5a9 is the commit that actually changes the value.

Essentially the kernel abused discard for zero writes previously (so it
needed something like BLKDISCARDZEROES) and then switched to a separate
write_zeroes operation like we do in QEMU. Once this was done, second
guessing the behaviour of discard wasn't necessary any more because you
just tell what you really want to do (discard or zero out).

> > other words, this code path is never actually used with newer kernels,
> > so we don't even try to unmap while writing zeros.
> > 
> > This patch removes the abuse of discard for writing zeroes from
> > file-posix and instead adds a new function that uses interfaces that are
> > actually meant to deallocate and zero out at the same time. Only if
> > those fail, it falls back to zeroing out without unmap. We never fall
> > back to a discard operation any more that may or may not result in
> > zeros.
> 
> Makes sense.
> 
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >   block/file-posix.c | 62 +++++++++++++++++++++++++++++++++++++++++-------------
> >   1 file changed, 47 insertions(+), 15 deletions(-)
> > 
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 60af4b3d51..9c66cd87d1 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -648,7 +648,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
> >       }
> >   #endif
> > -    bs->supported_zero_flags = s->discard_zeroes ? BDRV_REQ_MAY_UNMAP : 0;
> > +    bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
> >       ret = 0;
> >   fail:
> >       if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
> > @@ -1487,6 +1487,38 @@ static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
> >       return -ENOTSUP;
> >   }
> > +static ssize_t handle_aiocb_write_zeroes_unmap(RawPosixAIOData *aiocb)
> > +{
> > +    BDRVRawState *s = aiocb->bs->opaque;
> > +    int ret;
> > +
> > +    /* First try to write zeros and unmap at the same time */
> > +
> > +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> > +    ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> > +                       aiocb->aio_offset, aiocb->aio_nbytes);
> 
> Umm, doesn't this have to use FALLOC_FL_ZERO_RANGE? FALLOC_FL_PUNCH_HOLE
> deallocs, but is not required to write zeroes.

Yes, it is. See the man page:

    Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
    2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
    range starting at offset and continuing for len bytes. Within the
    specified range, partial filesystem blocks are zeroed, and whole
    filesystem blocks are removed from the file. After a successful
    call, subsequent reads from this range will return zeroes.

FALLOC_FL_ZERO_RANGE in contrast implements write_zeroes without unmap.

> > +    if (ret != -ENOTSUP) {
> > +        return ret;
> > +    }
> > +#endif
> > +
> > +#ifdef CONFIG_XFS
> > +    if (s->is_xfs) {
> > +        /* xfs_discard() guarantees that the discarded area reads as all-zero
> > +         * afterwards, so we can use it here. */
> > +        return xfs_discard(s, aiocb->aio_offset, aiocb->aio_nbytes);
> > +    }
> > +#endif
> > +
> > +    /* Make the compiler happy if neither of the above is compiled in */
> > +    (void) s;
> 
> Could also be done in fewer lines by use of:
> 
>    BDRVRawState *s G_GNUC_UNUSED = aiocb->bs->opaque;
> 
> since that attribute means "might be unused, don't warn if it is actually
> unused" (and not the stricter "must be unused, warn if it got used anyway")

Thanks, I wasn't aware of G_GNUC_UNUSED (and didn't want to add a
__attribute__ here without a macro that wraps it).

Kevin
Eric Blake July 26, 2018, 3:23 p.m. UTC | #4
On 07/26/2018 10:06 AM, Kevin Wolf wrote:

>>> +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>>> +    ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>>> +                       aiocb->aio_offset, aiocb->aio_nbytes);
>>
>> Umm, doesn't this have to use FALLOC_FL_ZERO_RANGE? FALLOC_FL_PUNCH_HOLE
>> deallocs, but is not required to write zeroes.
> 
> Yes, it is. See the man page:
> 
>      Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
>      2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
>      range starting at offset and continuing for len bytes. Within the
>      specified range, partial filesystem blocks are zeroed, and whole
>      filesystem blocks are removed from the file. After a successful
>      call, subsequent reads from this range will return zeroes.

That's true for file-system fds, but not for block device fds.

As pointed out by Nir,

 > https://patchwork.kernel.org/patch/9903757/
Which says, among other things:

 >> Do we also know that the blocks were discarded as we do with
 >> BLKDISCARD ?
 >
 > There never was a way to know for sure.
 >
 > ATA DSM TRIM and SCSI UNMAP are hints by definition. We attempted to
 > bend their semantics towards getting predictable behavior but ultimately
 > failed. Too many corner cases.
 >
 >> As I mentioned before. We relied on discard_zeroes_data in mkfs.ext4
 >> to make sure that inode tables are zeroed after discard.
 >
 > The point is that you shouldn't have an if (discard_zeroes_data)
 > conditional in the first place.
 >
 >  - If you need to dellocate a block range and you don't care about its
 >    contents in the future, use BLKDISCARD / FL_PUNCH_HOLE.
 >
 >  - If you need to zero a block range, use BLKZEROOUT / FL_ZERO_RANGE.

PUNCH_HOLE deallocates; but can only guarantee a read back of zero on 
file systems.

Hmm - that thread also mentions FALLOC_FL_NO_HIDE_STALE, which is a new 
flag not present/documented on Fedora 28. I wonder if it helps, too.

> 
> FALLOC_FL_ZERO_RANGE in contrast implements write_zeroes without unmap.

I thought the opposite: FALLOC_FL_ZERO_RANGE guarantees that you read 
back 0, using whatever is most efficient under the hood (in the case of 
block devices, unmapping that reliably reads back as zero is favored).
Kevin Wolf July 26, 2018, 3:33 p.m. UTC | #5
Am 26.07.2018 um 17:23 hat Eric Blake geschrieben:
> On 07/26/2018 10:06 AM, Kevin Wolf wrote:
> 
> > > > +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> > > > +    ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> > > > +                       aiocb->aio_offset, aiocb->aio_nbytes);
> > > 
> > > Umm, doesn't this have to use FALLOC_FL_ZERO_RANGE? FALLOC_FL_PUNCH_HOLE
> > > deallocs, but is not required to write zeroes.
> > 
> > Yes, it is. See the man page:
> > 
> >      Specifying the FALLOC_FL_PUNCH_HOLE flag (available since Linux
> >      2.6.38) in mode deallocates space (i.e., creates a hole) in the byte
> >      range starting at offset and continuing for len bytes. Within the
> >      specified range, partial filesystem blocks are zeroed, and whole
> >      filesystem blocks are removed from the file. After a successful
> >      call, subsequent reads from this range will return zeroes.
> 
> That's true for file-system fds, but not for block device fds.

It is true for block device fds, too. Look at fs/block_dev.c,
specifically blkdev_fallocate():

        switch (mode) {
        case FALLOC_FL_ZERO_RANGE:
        case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE: 
                error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
                                            GFP_KERNEL, BLKDEV_ZERO_NOUNMAP);
                break;
        case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE: 
                error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
                                             GFP_KERNEL, BLKDEV_ZERO_NOFALLBACK);
                break;
        case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE:
                error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
                                             GFP_KERNEL, 0);
                break;
        default:
                return -EOPNOTSUPP;
        }

> As pointed out by Nir,
> 
> > https://patchwork.kernel.org/patch/9903757/
> Which says, among other things:
> 
> >> Do we also know that the blocks were discarded as we do with
> >> BLKDISCARD ?
> >
> > There never was a way to know for sure.
> >
> > ATA DSM TRIM and SCSI UNMAP are hints by definition. We attempted to
> > bend their semantics towards getting predictable behavior but ultimately
> > failed. Too many corner cases.
> >
> >> As I mentioned before. We relied on discard_zeroes_data in mkfs.ext4
> >> to make sure that inode tables are zeroed after discard.
> >
> > The point is that you shouldn't have an if (discard_zeroes_data)
> > conditional in the first place.
> >
> >  - If you need to dellocate a block range and you don't care about its
> >    contents in the future, use BLKDISCARD / FL_PUNCH_HOLE.
> >
> >  - If you need to zero a block range, use BLKZEROOUT / FL_ZERO_RANGE.
> 
> PUNCH_HOLE deallocates; but can only guarantee a read back of zero on file
> systems.

As far as I know, the comment you quoted is accurate for BLKDISCARD and
BLKZEROOUT, but not for the fallocate() flags.

> Hmm - that thread also mentions FALLOC_FL_NO_HIDE_STALE, which is a new flag
> not present/documented on Fedora 28. I wonder if it helps, too.
> 
> > 
> > FALLOC_FL_ZERO_RANGE in contrast implements write_zeroes without unmap.
> 
> I thought the opposite: FALLOC_FL_ZERO_RANGE guarantees that you read back
> 0, using whatever is most efficient under the hood (in the case of block
> devices, unmapping that reliably reads back as zero is favored).

See the code I quoted above, FALLOC_FL_ZERO_RANGE calls
blkdev_issue_zeroout() with BLKDEV_ZERO_NOUNMAP internally.

Kevin
Eric Blake July 26, 2018, 4:10 p.m. UTC | #6
On 07/26/2018 10:33 AM, Kevin Wolf wrote:

> 
> As far as I know, the comment you quoted is accurate for BLKDISCARD and
> BLKZEROOUT, but not for the fallocate() flags.
> 

I sure wish the man pages were more explicit on what guarantees each 
flag offers.

>> Hmm - that thread also mentions FALLOC_FL_NO_HIDE_STALE, which is a new flag
>> not present/documented on Fedora 28. I wonder if it helps, too.
>>
>>>
>>> FALLOC_FL_ZERO_RANGE in contrast implements write_zeroes without unmap.
>>
>> I thought the opposite: FALLOC_FL_ZERO_RANGE guarantees that you read back
>> 0, using whatever is most efficient under the hood (in the case of block
>> devices, unmapping that reliably reads back as zero is favored).
> 
> See the code I quoted above, FALLOC_FL_ZERO_RANGE calls
> blkdev_issue_zeroout() with BLKDEV_ZERO_NOUNMAP internally.

Having to resort to reading the kernel code to learn what the guarantees 
are is annoying (it's nice that GPL guarantees that we CAN do that, but 
it means the man pages could use some TLC so that we don't have to). 
Especially since the kernel code has changed over time.

But your paste of current kernel code is rather hard to argue against.
Nir Soffer July 26, 2018, 4:22 p.m. UTC | #7
On Thu, Jul 26, 2018 at 2:33 PM Kevin Wolf <kwolf@redhat.com> wrote:

> The BLKDISCARD ioctl doesn't guarantee that the discarded blocks read as
> all-zero afterwards, so don't try to abuse it for zero writing. We try
> to only use this if BLKDISCARDZEROES tells us that it is safe, but this
> is unreliable on older kernels and a constant 0 in newer kernels. In
> other words, this code path is never actually used with newer kernels,
> so we don't even try to unmap while writing zeros.
>
> This patch removes the abuse of discard for writing zeroes from
> file-posix and instead adds a new function that uses interfaces that are
> actually meant to deallocate and zero out at the same time. Only if
> those fail, it falls back to zeroing out without unmap. We never fall
> back to a discard operation any more that may or may not result in
> zeros.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/file-posix.c | 62
> +++++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 47 insertions(+), 15 deletions(-)
>
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 60af4b3d51..9c66cd87d1 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -648,7 +648,7 @@ static int raw_open_common(BlockDriverState *bs, QDict
> *options,
>      }
>  #endif
>
> -    bs->supported_zero_flags = s->discard_zeroes ? BDRV_REQ_MAY_UNMAP : 0;
> +    bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
>      ret = 0;
>  fail:
>      if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
> @@ -1487,6 +1487,38 @@ static ssize_t
> handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>      return -ENOTSUP;
>  }
>
> +static ssize_t handle_aiocb_write_zeroes_unmap(RawPosixAIOData *aiocb)
> +{
> +    BDRVRawState *s = aiocb->bs->opaque;
> +    int ret;
> +
> +    /* First try to write zeros and unmap at the same time */
> +
> +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> +    ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> +                       aiocb->aio_offset, aiocb->aio_nbytes);
> +    if (ret != -ENOTSUP) {
>

This fails with ENODEV on RHEL 7.5 when fd is a block device.

The manual says:

       ENODEV fd does not refer to a regular file or a directory.  (If fd
is a pipe or FIFO, a different error results.)

Same issue exists in this patch:
http://lists.gnu.org/archive/html/qemu-devel/2018-03/msg07194.html

+        return ret;
> +    }
> +#endif
> +
> +#ifdef CONFIG_XFS
> +    if (s->is_xfs) {
> +        /* xfs_discard() guarantees that the discarded area reads as
> all-zero
> +         * afterwards, so we can use it here. */
> +        return xfs_discard(s, aiocb->aio_offset, aiocb->aio_nbytes);
> +    }
> +#endif
> +
> +    /* Make the compiler happy if neither of the above is compiled in */
> +    (void) s;
> +
> +    /* If we couldn't manage to unmap while guaranteed that the area
> reads as
> +     * all-zero afterwards, just write zeroes without unmapping */
> +    ret = handle_aiocb_write_zeroes(aiocb);
> +    return ret;
> +}
> +
>  #ifndef HAVE_COPY_FILE_RANGE
>  static off_t copy_file_range(int in_fd, off_t *in_off, int out_fd,
>                               off_t *out_off, size_t len, unsigned int
> flags)
> @@ -1729,6 +1761,9 @@ static int aio_worker(void *arg)
>      case QEMU_AIO_WRITE_ZEROES:
>          ret = handle_aiocb_write_zeroes(aiocb);
>          break;
> +    case QEMU_AIO_WRITE_ZEROES | QEMU_AIO_DISCARD:
> +        ret = handle_aiocb_write_zeroes_unmap(aiocb);
> +        break;
>      case QEMU_AIO_COPY_RANGE:
>          ret = handle_aiocb_copy_range(aiocb);
>          break;
> @@ -2553,15 +2588,13 @@ static int coroutine_fn raw_co_pwrite_zeroes(
>      int bytes, BdrvRequestFlags flags)
>  {
>      BDRVRawState *s = bs->opaque;
> +    int operation = QEMU_AIO_WRITE_ZEROES;
>
> -    if (!(flags & BDRV_REQ_MAY_UNMAP)) {
> -        return paio_submit_co(bs, s->fd, offset, NULL, bytes,
> -                              QEMU_AIO_WRITE_ZEROES);
> -    } else if (s->discard_zeroes) {
> -        return paio_submit_co(bs, s->fd, offset, NULL, bytes,
> -                              QEMU_AIO_DISCARD);
> +    if (flags & BDRV_REQ_MAY_UNMAP) {
> +        operation |= QEMU_AIO_DISCARD;
>      }
> -    return -ENOTSUP;
> +
> +    return paio_submit_co(bs, s->fd, offset, NULL, bytes, operation);
>  }
>
>  static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
> @@ -3054,20 +3087,19 @@ static coroutine_fn int
> hdev_co_pwrite_zeroes(BlockDriverState *bs,
>      int64_t offset, int bytes, BdrvRequestFlags flags)
>  {
>      BDRVRawState *s = bs->opaque;
> +    int operation = QEMU_AIO_WRITE_ZEROES | QEMU_AIO_BLKDEV;
>      int rc;
>
>      rc = fd_open(bs);
>      if (rc < 0) {
>          return rc;
>      }
> -    if (!(flags & BDRV_REQ_MAY_UNMAP)) {
> -        return paio_submit_co(bs, s->fd, offset, NULL, bytes,
> -                              QEMU_AIO_WRITE_ZEROES|QEMU_AIO_BLKDEV);
> -    } else if (s->discard_zeroes) {
> -        return paio_submit_co(bs, s->fd, offset, NULL, bytes,
> -                              QEMU_AIO_DISCARD|QEMU_AIO_BLKDEV);
> +
> +    if (flags & BDRV_REQ_MAY_UNMAP) {
> +        operation |= QEMU_AIO_DISCARD;
>      }
> -    return -ENOTSUP;
> +
> +    return paio_submit_co(bs, s->fd, offset, NULL, bytes, operation);
>  }
>
>  static int coroutine_fn hdev_co_create_opts(const char *filename,
> QemuOpts *opts,
> --
> 2.13.6
>
>
>
Kevin Wolf July 26, 2018, 4:41 p.m. UTC | #8
Am 26.07.2018 um 18:22 hat Nir Soffer geschrieben:
> On Thu, Jul 26, 2018 at 2:33 PM Kevin Wolf <kwolf@redhat.com> wrote:
> 
> > The BLKDISCARD ioctl doesn't guarantee that the discarded blocks read as
> > all-zero afterwards, so don't try to abuse it for zero writing. We try
> > to only use this if BLKDISCARDZEROES tells us that it is safe, but this
> > is unreliable on older kernels and a constant 0 in newer kernels. In
> > other words, this code path is never actually used with newer kernels,
> > so we don't even try to unmap while writing zeros.
> >
> > This patch removes the abuse of discard for writing zeroes from
> > file-posix and instead adds a new function that uses interfaces that are
> > actually meant to deallocate and zero out at the same time. Only if
> > those fail, it falls back to zeroing out without unmap. We never fall
> > back to a discard operation any more that may or may not result in
> > zeros.
> >
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >  block/file-posix.c | 62
> > +++++++++++++++++++++++++++++++++++++++++-------------
> >  1 file changed, 47 insertions(+), 15 deletions(-)
> >
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index 60af4b3d51..9c66cd87d1 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -648,7 +648,7 @@ static int raw_open_common(BlockDriverState *bs, QDict
> > *options,
> >      }
> >  #endif
> >
> > -    bs->supported_zero_flags = s->discard_zeroes ? BDRV_REQ_MAY_UNMAP : 0;
> > +    bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
> >      ret = 0;
> >  fail:
> >      if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
> > @@ -1487,6 +1487,38 @@ static ssize_t
> > handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
> >      return -ENOTSUP;
> >  }
> >
> > +static ssize_t handle_aiocb_write_zeroes_unmap(RawPosixAIOData *aiocb)
> > +{
> > +    BDRVRawState *s = aiocb->bs->opaque;
> > +    int ret;
> > +
> > +    /* First try to write zeros and unmap at the same time */
> > +
> > +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> > +    ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> > +                       aiocb->aio_offset, aiocb->aio_nbytes);
> > +    if (ret != -ENOTSUP) {
> >
> 
> This fails with ENODEV on RHEL 7.5 when fd is a block device.
> 
> The manual says:
> 
>        ENODEV fd does not refer to a regular file or a directory.  (If fd
> is a pipe or FIFO, a different error results.)

do_fallocate() is a wrapper around fallocate() that handles EINTR and
translates a few errno values (including ENODEV) into a -ENOTSUP return
code.

Kevin
Nir Soffer July 26, 2018, 4:46 p.m. UTC | #9
On Thu, Jul 26, 2018 at 7:10 PM Eric Blake <eblake@redhat.com> wrote:

> On 07/26/2018 10:33 AM, Kevin Wolf wrote:
>
> >
> > As far as I know, the comment you quoted is accurate for BLKDISCARD and
> > BLKZEROOUT, but not for the fallocate() flags.
> >
>
> I sure wish the man pages were more explicit on what guarantees each
> flag offers.
>
> >> Hmm - that thread also mentions FALLOC_FL_NO_HIDE_STALE, which is a new
> flag
> >> not present/documented on Fedora 28. I wonder if it helps, too.
> >>
> >>>
> >>> FALLOC_FL_ZERO_RANGE in contrast implements write_zeroes without unmap.
> >>
> >> I thought the opposite: FALLOC_FL_ZERO_RANGE guarantees that you read
> back
> >> 0, using whatever is most efficient under the hood (in the case of block
> >> devices, unmapping that reliably reads back as zero is favored).
> >
> > See the code I quoted above, FALLOC_FL_ZERO_RANGE calls
> > blkdev_issue_zeroout() with BLKDEV_ZERO_NOUNMAP internally.
>
> Having to resort to reading the kernel code to learn what the guarantees
> are is annoying (it's nice that GPL guarantees that we CAN do that, but
> it means the man pages could use some TLC so that we don't have to).
> Especially since the kernel code has changed over time.
>
> But your paste of current kernel code is rather hard to argue against.
>

I don't think we should depend on undocumented kernel code.

Nir
Nir Soffer July 26, 2018, 4:54 p.m. UTC | #10
On Thu, Jul 26, 2018 at 7:41 PM Kevin Wolf <kwolf@redhat.com> wrote:
...

> > > +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> > > +    ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE |
> FALLOC_FL_KEEP_SIZE,
> > > +                       aiocb->aio_offset, aiocb->aio_nbytes);
> > > +    if (ret != -ENOTSUP) {
> > >
> >
> > This fails with ENODEV on RHEL 7.5 when fd is a block device.
> >
> > The manual says:
> >
> >        ENODEV fd does not refer to a regular file or a directory.  (If fd
> > is a pipe or FIFO, a different error results.)
>
> do_fallocate() is a wrapper around fallocate() that handles EINTR and
> translates a few errno values (including ENODEV) into a -ENOTSUP return
> code.
>

I missed that, so this looks correct.

Nir
Kevin Wolf July 26, 2018, 4:58 p.m. UTC | #11
Am 26.07.2018 um 18:46 hat Nir Soffer geschrieben:
> On Thu, Jul 26, 2018 at 7:10 PM Eric Blake <eblake@redhat.com> wrote:
> 
> > On 07/26/2018 10:33 AM, Kevin Wolf wrote:
> >
> > >
> > > As far as I know, the comment you quoted is accurate for BLKDISCARD and
> > > BLKZEROOUT, but not for the fallocate() flags.
> > >
> >
> > I sure wish the man pages were more explicit on what guarantees each
> > flag offers.
> >
> > >> Hmm - that thread also mentions FALLOC_FL_NO_HIDE_STALE, which is a new
> > flag
> > >> not present/documented on Fedora 28. I wonder if it helps, too.
> > >>
> > >>>
> > >>> FALLOC_FL_ZERO_RANGE in contrast implements write_zeroes without unmap.
> > >>
> > >> I thought the opposite: FALLOC_FL_ZERO_RANGE guarantees that you read
> > back
> > >> 0, using whatever is most efficient under the hood (in the case of block
> > >> devices, unmapping that reliably reads back as zero is favored).
> > >
> > > See the code I quoted above, FALLOC_FL_ZERO_RANGE calls
> > > blkdev_issue_zeroout() with BLKDEV_ZERO_NOUNMAP internally.
> >
> > Having to resort to reading the kernel code to learn what the guarantees
> > are is annoying (it's nice that GPL guarantees that we CAN do that, but
> > it means the man pages could use some TLC so that we don't have to).
> > Especially since the kernel code has changed over time.
> >
> > But your paste of current kernel code is rather hard to argue against.
> 
> I don't think we should depend on undocumented kernel code.

Yes, the man page says "filesystem blocks" instead of just "blocks". If
you're in a nit-picking mood, that excludes block devices. But I think
it's pretty obvious what the intended meaning is and how it consistently
extends to block devices, and the code confirms it.

That's by far not "undocumented kernel code" in my book.

Kevin
Nir Soffer July 26, 2018, 5:34 p.m. UTC | #12
On Thu, Jul 26, 2018 at 7:58 PM Kevin Wolf <kwolf@redhat.com> wrote:

> Am 26.07.2018 um 18:46 hat Nir Soffer geschrieben:
> > On Thu, Jul 26, 2018 at 7:10 PM Eric Blake <eblake@redhat.com> wrote:
> >
> > > On 07/26/2018 10:33 AM, Kevin Wolf wrote:
> > >
> > > >
> > > > As far as I know, the comment you quoted is accurate for BLKDISCARD
> and
> > > > BLKZEROOUT, but not for the fallocate() flags.
> > > >
> > >
> > > I sure wish the man pages were more explicit on what guarantees each
> > > flag offers.
> > >
> > > >> Hmm - that thread also mentions FALLOC_FL_NO_HIDE_STALE, which is a
> new
> > > flag
> > > >> not present/documented on Fedora 28. I wonder if it helps, too.
> > > >>
> > > >>>
> > > >>> FALLOC_FL_ZERO_RANGE in contrast implements write_zeroes without
> unmap.
> > > >>
> > > >> I thought the opposite: FALLOC_FL_ZERO_RANGE guarantees that you
> read
> > > back
> > > >> 0, using whatever is most efficient under the hood (in the case of
> block
> > > >> devices, unmapping that reliably reads back as zero is favored).
> > > >
> > > > See the code I quoted above, FALLOC_FL_ZERO_RANGE calls
> > > > blkdev_issue_zeroout() with BLKDEV_ZERO_NOUNMAP internally.
> > >
> > > Having to resort to reading the kernel code to learn what the
> guarantees
> > > are is annoying (it's nice that GPL guarantees that we CAN do that, but
> > > it means the man pages could use some TLC so that we don't have to).
> > > Especially since the kernel code has changed over time.
> > >
> > > But your paste of current kernel code is rather hard to argue against.
> >
> > I don't think we should depend on undocumented kernel code.
>
> Yes, the man page says "filesystem blocks" instead of just "blocks". If
> you're in a nit-picking mood, that excludes block devices. But I think
> it's pretty obvious what the intended meaning is and how it consistently
> extends to block devices, and the code confirms it.
>
> That's by far not "undocumented kernel code" in my book.
>

Maybe it is more correct to say this is poorly documented.

The manual is very specific about supported filesystems, but there nothing
about
block devices:

   Deallocating file space


       Specifying  the  FALLOC_FL_PUNCH_HOLE flag (available since Linux
       2.6.38) in mode deallocates space (i.e., creates a hole) in the
       byte range starting at offset and continuing for len bytes.  Within
the
       specified range, partial filesystem blocks are zeroed, and whole
       filesystem blocks are removed from the file.  After a successful
call,
       subsequent  reads from this range will return zeroes.

       The  FALLOC_FL_PUNCH_HOLE  flag  must be ORed with
FALLOC_FL_KEEP_SIZE
       in mode; in other words, even when punching off the end of the file,
the
       file size (as reported by stat(2)) does not change.

       Not all filesystems support FALLOC_FL_PUNCH_HOLE; if a filesystem
       doesn't support the operation, an error is returned.  The operation
is
       supported  on  at  least  the  following filesystems:

       *  XFS (since Linux 2.6.38)
       *  ext4 (since Linux 3.0)
       *  Btrfs (since Linux 3.7)
       *  tmpfs(5) (since Linux 3.5)

I expect to see something like:

    * block devices (since Linux 4.9)

Anyway this looks the same level of documentation as FALLOC_FL_ZERO_RANGE,
so if we are ok with using it, FALLOC_FL_PUNCH_HOLE is fine.

Does this change mean that oVirt can start to enable discard for VM when
disks are using wipe-after-delete?

Currently we disable discard if a disk is marked for wiping, since discard
does not
guarantee that discarded ranges are zeroed.

Nir
Eric Blake July 26, 2018, 6 p.m. UTC | #13
On 07/26/2018 12:34 PM, Nir Soffer wrote:

> 
> Anyway this looks the same level of documentation as FALLOC_FL_ZERO_RANGE,
> so if we are ok with using it, FALLOC_FL_PUNCH_HOLE is fine.
> 
> Does this change mean that oVirt can start to enable discard for VM when
> disks are using wipe-after-delete?
> 
> Currently we disable discard if a disk is marked for wiping, since discard
> does not
> guarantee that discarded ranges are zeroed.

Discard does not ever guarantee anything (it is ALWAYS advisory, and 
thus may be ignored at any point in the stack).  If you NEED to 
guarantee that something reads as zeros, you have to either explicitly 
write zeroes, or do a combination of discard first and write zeroes 
second (and hopefully, the combination of discard+write zeroes is 
sufficiently optimized in the case where you were lucky that discard 
does result in a reads-as-zeroes to not do any extra work on the write 
zeroes call).

Oddly enough, write zeroes first and discard second is NOT guaranteed to 
read zeroes; the most obvious case would be if discard across an 
encrypted volume has some bulk-erase mechanism that writes all 0s to the 
hardware but decrypting all 0s now sees garbage, which is different from 
writing 0s which gets encrypted to something not all 0s on hardware 
(that said, such a scenario might still be considered reasonably clean 
for a wipe-after-delete disk - although with discard being advisory, 
it's hard to make guarantees).

The current kernel implementation that Kevin posted makes it look like 
for block devices, you now get:

FL_PUNCH_HOLE: guarantee that you read zeroes, using the fastest means 
possible (possibly by discarding - but only when discard reads as zeroes)

FL_PUNCH_HOLE | FL_NO_HIDE_STALE: deallocate, what you read is no longer 
guaranteed (a true discard, regardless of whether discard reads as zeroes)

FL_ZERO_RANGE: guarantee that you read zeroes, without using discard
Kevin Wolf July 26, 2018, 6:09 p.m. UTC | #14
Am 26.07.2018 um 19:34 hat Nir Soffer geschrieben:
> On Thu, Jul 26, 2018 at 7:58 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > Am 26.07.2018 um 18:46 hat Nir Soffer geschrieben:
> > > I don't think we should depend on undocumented kernel code.
> >
> > Yes, the man page says "filesystem blocks" instead of just "blocks". If
> > you're in a nit-picking mood, that excludes block devices. But I think
> > it's pretty obvious what the intended meaning is and how it consistently
> > extends to block devices, and the code confirms it.
> >
> > That's by far not "undocumented kernel code" in my book.
> 
> Maybe it is more correct to say this is poorly documented.
> 
> The manual is very specific about supported filesystems, but there
> nothing about block devices:

Well, as specific as "_at least_ the following filesystems" can be.

>        Not all filesystems support FALLOC_FL_PUNCH_HOLE; if a filesystem
>        doesn't support the operation, an error is returned.  The operation
> is
>        supported  on  at  least  the  following filesystems:
> 
>        *  XFS (since Linux 2.6.38)
>        *  ext4 (since Linux 3.0)
>        *  Btrfs (since Linux 3.7)
>        *  tmpfs(5) (since Linux 3.5)
> 
> I expect to see something like:
> 
>     * block devices (since Linux 4.9)

Yup, the man page does need an update. But it says "at least", so
technically it's not wrong.

> Anyway this looks the same level of documentation as FALLOC_FL_ZERO_RANGE,
> so if we are ok with using it, FALLOC_FL_PUNCH_HOLE is fine.
> 
> Does this change mean that oVirt can start to enable discard for VM when
> disks are using wipe-after-delete?
> 
> Currently we disable discard if a disk is marked for wiping, since
> discard does not guarantee that discarded ranges are zeroed.

How do you wipe the disk? Does this really involve a write_zeroes
operation in qemu? But even if so, we always implemented that correctly
(modulo (mostly theoretical?) inaccuracy of BLKDISCARDZEROS in older
kernels). This patch fixes that we fall back to inefficient explicit
zero writing instead of offloading with recent kernels, but we didn't
just discard.

The explanation I've heard before from your team was that you're wiping
the disk with regular writes, but they feared that this doesn't take
effect for blocks that have previously been discarded. This is bullshit
without any basis in reality and has never been correct. When you zero
out data regions, they read as zeros, period. It doens't make any
difference whether the block had been discarded previously.

So either way, I feel like I already explained it to your team a
thousand times, but once again: Get rid of that behaviour yesterday. It
doesn't make sense and it never made sense.

Kevin
Eric Blake July 26, 2018, 6:24 p.m. UTC | #15
On 07/26/2018 10:23 AM, Eric Blake wrote:

> 
> Hmm - that thread also mentions FALLOC_FL_NO_HIDE_STALE, which is a new 
> flag not present/documented on Fedora 28. I wonder if it helps, too.

Even more odd - that flag has been proposed to the kernel since 2012 but 
still not mentioned in the F28 man page:

https://lwn.net/Articles/492920/

and that its use on block devices is much more recent, just last month:

https://webcache.googleusercontent.com/search?q=cache:1LV2Y06XnjkJ:https://patchwork.kernel.org/patch/10466183/+&cd=5&hl=en&ct=clnk&gl=us&client=firefox-b-1

but now that I've read that, it makes total sense that the combination 
PUNCH_HOLE|NO_HIDE_STALE is explicitly asking for semantics of "discard 
as fast as possible regardless of whatever later reads will see", while 
plain PUNCH_HOLE means "guarantee that I can read zeroes, using the 
fastest means possible [discard, if that works]"
diff mbox series

Patch

diff --git a/block/file-posix.c b/block/file-posix.c
index 60af4b3d51..9c66cd87d1 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -648,7 +648,7 @@  static int raw_open_common(BlockDriverState *bs, QDict *options,
     }
 #endif
 
-    bs->supported_zero_flags = s->discard_zeroes ? BDRV_REQ_MAY_UNMAP : 0;
+    bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
     ret = 0;
 fail:
     if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
@@ -1487,6 +1487,38 @@  static ssize_t handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
     return -ENOTSUP;
 }
 
+static ssize_t handle_aiocb_write_zeroes_unmap(RawPosixAIOData *aiocb)
+{
+    BDRVRawState *s = aiocb->bs->opaque;
+    int ret;
+
+    /* First try to write zeros and unmap at the same time */
+
+#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+    ret = do_fallocate(s->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+                       aiocb->aio_offset, aiocb->aio_nbytes);
+    if (ret != -ENOTSUP) {
+        return ret;
+    }
+#endif
+
+#ifdef CONFIG_XFS
+    if (s->is_xfs) {
+        /* xfs_discard() guarantees that the discarded area reads as all-zero
+         * afterwards, so we can use it here. */
+        return xfs_discard(s, aiocb->aio_offset, aiocb->aio_nbytes);
+    }
+#endif
+
+    /* Make the compiler happy if neither of the above is compiled in */
+    (void) s;
+
+    /* If we couldn't manage to unmap while guaranteed that the area reads as
+     * all-zero afterwards, just write zeroes without unmapping */
+    ret = handle_aiocb_write_zeroes(aiocb);
+    return ret;
+}
+
 #ifndef HAVE_COPY_FILE_RANGE
 static off_t copy_file_range(int in_fd, off_t *in_off, int out_fd,
                              off_t *out_off, size_t len, unsigned int flags)
@@ -1729,6 +1761,9 @@  static int aio_worker(void *arg)
     case QEMU_AIO_WRITE_ZEROES:
         ret = handle_aiocb_write_zeroes(aiocb);
         break;
+    case QEMU_AIO_WRITE_ZEROES | QEMU_AIO_DISCARD:
+        ret = handle_aiocb_write_zeroes_unmap(aiocb);
+        break;
     case QEMU_AIO_COPY_RANGE:
         ret = handle_aiocb_copy_range(aiocb);
         break;
@@ -2553,15 +2588,13 @@  static int coroutine_fn raw_co_pwrite_zeroes(
     int bytes, BdrvRequestFlags flags)
 {
     BDRVRawState *s = bs->opaque;
+    int operation = QEMU_AIO_WRITE_ZEROES;
 
-    if (!(flags & BDRV_REQ_MAY_UNMAP)) {
-        return paio_submit_co(bs, s->fd, offset, NULL, bytes,
-                              QEMU_AIO_WRITE_ZEROES);
-    } else if (s->discard_zeroes) {
-        return paio_submit_co(bs, s->fd, offset, NULL, bytes,
-                              QEMU_AIO_DISCARD);
+    if (flags & BDRV_REQ_MAY_UNMAP) {
+        operation |= QEMU_AIO_DISCARD;
     }
-    return -ENOTSUP;
+
+    return paio_submit_co(bs, s->fd, offset, NULL, bytes, operation);
 }
 
 static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
@@ -3054,20 +3087,19 @@  static coroutine_fn int hdev_co_pwrite_zeroes(BlockDriverState *bs,
     int64_t offset, int bytes, BdrvRequestFlags flags)
 {
     BDRVRawState *s = bs->opaque;
+    int operation = QEMU_AIO_WRITE_ZEROES | QEMU_AIO_BLKDEV;
     int rc;
 
     rc = fd_open(bs);
     if (rc < 0) {
         return rc;
     }
-    if (!(flags & BDRV_REQ_MAY_UNMAP)) {
-        return paio_submit_co(bs, s->fd, offset, NULL, bytes,
-                              QEMU_AIO_WRITE_ZEROES|QEMU_AIO_BLKDEV);
-    } else if (s->discard_zeroes) {
-        return paio_submit_co(bs, s->fd, offset, NULL, bytes,
-                              QEMU_AIO_DISCARD|QEMU_AIO_BLKDEV);
+
+    if (flags & BDRV_REQ_MAY_UNMAP) {
+        operation |= QEMU_AIO_DISCARD;
     }
-    return -ENOTSUP;
+
+    return paio_submit_co(bs, s->fd, offset, NULL, bytes, operation);
 }
 
 static int coroutine_fn hdev_co_create_opts(const char *filename, QemuOpts *opts,