diff mbox

[PATCHv4] block: optimize zero writes with bdrv_write_zeroes

Message ID 1399566139-24140-1-git-send-email-pl@kamp.de
State New
Headers show

Commit Message

Peter Lieven May 8, 2014, 4:22 p.m. UTC
this patch tries to optimize zero write requests
by automatically using bdrv_write_zeroes if it is
supported by the format.

This significantly speeds up file system initialization and
should speed zero write test used to test backend storage
performance.

I ran the following 2 tests on my internal SSD with a
50G QCOW2 container and on an attached iSCSI storage.

a) mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/vdX

QCOW2         [off]     [on]     [unmap]
-----
runtime:       14secs    1.1secs  1.1secs
filesize:      937M      18M      18M

iSCSI         [off]     [on]     [unmap]
----
runtime:       9.3s      0.9s     0.9s

b) dd if=/dev/zero of=/dev/vdX bs=1M oflag=direct

QCOW2         [off]     [on]     [unmap]
-----
runtime:       246secs   18secs   18secs
filesize:      51G       192K     192K
throughput:    203M/s    2.3G/s   2.3G/s

iSCSI*        [off]     [on]     [unmap]
----
runtime:       8mins     45secs   33secs
throughput:    106M/s    1.2G/s   1.6G/s
allocated:     100%      100%     0%

* The storage was connected via an 1Gbit interface.
  It seems to internally handle writing zeroes
  via WRITESAME16 very fast.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
v3->v4: - use QAPI generated enum and lookup table [Kevin]
        - added more details about the options in the comments
          of the qapi-schema [Eric]
        - changed the type of detect_zeroes from str to
          BlockdevDetectZeroesOptions. I left the name
          as is because it is consistent with e.g.
          BlockdevDiscardOptions or BlockdevAioOptions [Eric]
        - changed the parse function in blockdev_init to
          be generic usable for other enum parameters
        
v2->v3: - moved parameter parsing to blockdev_init
        - added per device detect_zeroes status to 
          hmp (info block -v) and qmp (query-block) [Eric]
        - added support to enable detect-zeroes also
          for hot added devices [Eric]
        - added missing entry to qemu_common_drive_opts
        - fixed description of qemu_iovec_is_zero [Fam]

v1->v2: - added tests to commit message (Markus)
RFCv2->v1: - fixed paramter parsing strncmp -> strcmp (Eric)
           - fixed typo (choosen->chosen) (Eric)
           - added second example to commit msg

RFCv1->RFCv2: - add detect-zeroes=off|on|unmap knob to drive cmdline parameter
              - call zero detection only for format (bs->file != NULL)

 block.c                   |   11 ++++++++++
 block/qapi.c              |    1 +
 blockdev.c                |   34 +++++++++++++++++++++++++++++
 hmp.c                     |    5 +++++
 include/block/block_int.h |    1 +
 include/qemu-common.h     |    1 +
 qapi-schema.json          |   52 ++++++++++++++++++++++++++++++++-------------
 qemu-options.hx           |    6 ++++++
 qmp-commands.hx           |    3 +++
 util/iov.c                |   21 ++++++++++++++++++
 10 files changed, 120 insertions(+), 15 deletions(-)

Comments

Eric Blake May 12, 2014, 8:28 p.m. UTC | #1
On 05/08/2014 10:22 AM, Peter Lieven wrote:
> this patch tries to optimize zero write requests
> by automatically using bdrv_write_zeroes if it is
> supported by the format.
> 
> This significantly speeds up file system initialization and
> should speed zero write test used to test backend storage
> performance.
> 

> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
> v3->v4: - use QAPI generated enum and lookup table [Kevin]
>         - added more details about the options in the comments
>           of the qapi-schema [Eric]
>         - changed the type of detect_zeroes from str to
>           BlockdevDetectZeroesOptions. I left the name
>           as is because it is consistent with e.g.
>           BlockdevDiscardOptions or BlockdevAioOptions [Eric]
>         - changed the parse function in blockdev_init to
>           be generic usable for other enum parameters

If you wouldn't mind, I think the generic function is useful enough that
people might want to backport it independently from this optimization.
It would be better to split this into a two-patch series, one for the
new parse_enum_option, the other for bdrv_write_zeroes utilizing it.


> +        },{
> +            .name = "detect-zeroes",
> +            .type = QEMU_OPT_STRING,
> +            .help = "try to optimize zero writes",

Might be worth listing (off, on, unmap) in the text.

Everything else looked okay, but I'll wait for R-b until I see a
response about the idea of splitting the patch (even if that response is
justification for keeping it as one)
Peter Lieven May 13, 2014, 12:06 p.m. UTC | #2
On 12.05.2014 22:28, Eric Blake wrote:
> On 05/08/2014 10:22 AM, Peter Lieven wrote:
>> this patch tries to optimize zero write requests
>> by automatically using bdrv_write_zeroes if it is
>> supported by the format.
>>
>> This significantly speeds up file system initialization and
>> should speed zero write test used to test backend storage
>> performance.
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> v3->v4: - use QAPI generated enum and lookup table [Kevin]
>>          - added more details about the options in the comments
>>            of the qapi-schema [Eric]
>>          - changed the type of detect_zeroes from str to
>>            BlockdevDetectZeroesOptions. I left the name
>>            as is because it is consistent with e.g.
>>            BlockdevDiscardOptions or BlockdevAioOptions [Eric]
>>          - changed the parse function in blockdev_init to
>>            be generic usable for other enum parameters
> If you wouldn't mind, I think the generic function is useful enough that
> people might want to backport it independently from this optimization.
> It would be better to split this into a two-patch series, one for the
> new parse_enum_option, the other for bdrv_write_zeroes utilizing it.
>
>
>> +        },{
>> +            .name = "detect-zeroes",
>> +            .type = QEMU_OPT_STRING,
>> +            .help = "try to optimize zero writes",
> Might be worth listing (off, on, unmap) in the text.
>
> Everything else looked okay, but I'll wait for R-b until I see a
> response about the idea of splitting the patch (even if that response is
> justification for keeping it as one)
>

I did not split because currently there is no other possible
user in the function. The on_error settings and discard settings
would be possible users, but for on_error there is a hardcoded
difference between read and write which is not reflected in the
qapi and for discard settings we have ignore and unmap, but
we have also off and on which are not in qapi as well.

Peter
Kevin Wolf May 14, 2014, 11:41 a.m. UTC | #3
Am 08.05.2014 um 18:22 hat Peter Lieven geschrieben:
> this patch tries to optimize zero write requests
> by automatically using bdrv_write_zeroes if it is
> supported by the format.
> 
> This significantly speeds up file system initialization and
> should speed zero write test used to test backend storage
> performance.
> 
> I ran the following 2 tests on my internal SSD with a
> 50G QCOW2 container and on an attached iSCSI storage.
> 
> a) mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/vdX
> 
> QCOW2         [off]     [on]     [unmap]
> -----
> runtime:       14secs    1.1secs  1.1secs
> filesize:      937M      18M      18M
> 
> iSCSI         [off]     [on]     [unmap]
> ----
> runtime:       9.3s      0.9s     0.9s
> 
> b) dd if=/dev/zero of=/dev/vdX bs=1M oflag=direct
> 
> QCOW2         [off]     [on]     [unmap]
> -----
> runtime:       246secs   18secs   18secs
> filesize:      51G       192K     192K
> throughput:    203M/s    2.3G/s   2.3G/s
> 
> iSCSI*        [off]     [on]     [unmap]
> ----
> runtime:       8mins     45secs   33secs
> throughput:    106M/s    1.2G/s   1.6G/s
> allocated:     100%      100%     0%
> 
> * The storage was connected via an 1Gbit interface.
>   It seems to internally handle writing zeroes
>   via WRITESAME16 very fast.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
> v3->v4: - use QAPI generated enum and lookup table [Kevin]
>         - added more details about the options in the comments
>           of the qapi-schema [Eric]
>         - changed the type of detect_zeroes from str to
>           BlockdevDetectZeroesOptions. I left the name
>           as is because it is consistent with e.g.
>           BlockdevDiscardOptions or BlockdevAioOptions [Eric]
>         - changed the parse function in blockdev_init to
>           be generic usable for other enum parameters
>         
> v2->v3: - moved parameter parsing to blockdev_init
>         - added per device detect_zeroes status to 
>           hmp (info block -v) and qmp (query-block) [Eric]
>         - added support to enable detect-zeroes also
>           for hot added devices [Eric]
>         - added missing entry to qemu_common_drive_opts
>         - fixed description of qemu_iovec_is_zero [Fam]
> 
> v1->v2: - added tests to commit message (Markus)
> RFCv2->v1: - fixed paramter parsing strncmp -> strcmp (Eric)
>            - fixed typo (choosen->chosen) (Eric)
>            - added second example to commit msg
> 
> RFCv1->RFCv2: - add detect-zeroes=off|on|unmap knob to drive cmdline parameter
>               - call zero detection only for format (bs->file != NULL)
> 
>  block.c                   |   11 ++++++++++
>  block/qapi.c              |    1 +
>  blockdev.c                |   34 +++++++++++++++++++++++++++++
>  hmp.c                     |    5 +++++
>  include/block/block_int.h |    1 +
>  include/qemu-common.h     |    1 +
>  qapi-schema.json          |   52 ++++++++++++++++++++++++++++++++-------------
>  qemu-options.hx           |    6 ++++++
>  qmp-commands.hx           |    3 +++
>  util/iov.c                |   21 ++++++++++++++++++
>  10 files changed, 120 insertions(+), 15 deletions(-)
> 
> diff --git a/block.c b/block.c
> index b749d31..aea4c77 100644
> --- a/block.c
> +++ b/block.c
> @@ -3244,6 +3244,17 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>  
>      ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
>  
> +    if (!ret && bs->detect_zeroes != BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF &&
> +        !(flags & BDRV_REQ_ZERO_WRITE) && bs->file &&
> +        drv->bdrv_co_write_zeroes && qemu_iovec_is_zero(qiov)) {

Pretty long condition. :-)

Looks like most is obviously needed, but I wonder what the bs->file part
is good for? That looks rather arbitrary.

> +        flags |= BDRV_REQ_ZERO_WRITE;
> +        /* if the device was not opened with discard=on the below flag
> +         * is immediately cleared again in bdrv_co_do_write_zeroes */

Is it? I only see it being cleared in bdrv_co_write_zeroes(), but that's
not a function that seems to be called from here.

> +        if (bs->detect_zeroes == BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP) {
> +            flags |= BDRV_REQ_MAY_UNMAP;
> +        }
> +    }
> +
>      if (ret < 0) {
>          /* Do nothing, write notifier decided to fail this request */
>      } else if (flags & BDRV_REQ_ZERO_WRITE) {
> diff --git a/block/qapi.c b/block/qapi.c
> index af11445..75f44f1 100644
> --- a/block/qapi.c
> +++ b/block/qapi.c
> @@ -50,6 +50,7 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState *bs)
>      }
>  
>      info->backing_file_depth = bdrv_get_backing_file_depth(bs);
> +    info->detect_zeroes = bs->detect_zeroes;
>  
>      if (bs->io_limits_enabled) {
>          ThrottleConfig cfg;
> diff --git a/blockdev.c b/blockdev.c
> index 7810e9f..955bd49 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -288,6 +288,23 @@ static int parse_block_error_action(const char *buf, bool is_read, Error **errp)
>      }
>  }
>  
> +static int parse_enum_option(const char *lookup[], const char *buf, int max,
> +                             int def, Error **errp)
> +{
> +    int i;
> +    if (!buf) {
> +        return def;
> +    }
> +    for (i = 0; i < max; i++) {
> +        if (!strcmp(buf, lookup[i])) {
> +            return i;
> +        }
> +    }
> +    error_setg(errp, "invalid parameter value: %s",
> +               buf);
> +    return def;
> +}
> +
>  static bool check_throttle_config(ThrottleConfig *cfg, Error **errp)
>  {
>      if (throttle_conflicting(cfg)) {

This hunk could have been a patch of its own (not a reason for a respin,
but if you need to respin to address one of the other comments, please
split the patch).

> @@ -324,6 +341,7 @@ static DriveInfo *blockdev_init(const char *file, QDict *bs_opts,
>      QemuOpts *opts;
>      const char *id;
>      bool has_driver_specific_opts;
> +    BlockdevDetectZeroesOptions detect_zeroes;
>      BlockDriver *drv = NULL;
>  
>      /* Check common options by copying from bs_opts to opts, all other options
> @@ -452,6 +470,17 @@ static DriveInfo *blockdev_init(const char *file, QDict *bs_opts,
>          }
>      }
>  
> +    detect_zeroes =
> +        parse_enum_option(BlockdevDetectZeroesOptions_lookup,
> +                          qemu_opt_get(opts, "detect-zeroes"),
> +                          BLOCKDEV_DETECT_ZEROES_OPTIONS_MAX,
> +                          BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF,
> +                          &error);
> +    if (error) {
> +        error_propagate(errp, error);
> +        goto early_err;
> +    }
> +
>      /* init */
>      dinfo = g_malloc0(sizeof(*dinfo));
>      dinfo->id = g_strdup(qemu_opts_id(opts));
> @@ -462,6 +491,7 @@ static DriveInfo *blockdev_init(const char *file, QDict *bs_opts,
>      }
>      dinfo->bdrv->open_flags = snapshot ? BDRV_O_SNAPSHOT : 0;
>      dinfo->bdrv->read_only = ro;
> +    dinfo->bdrv->detect_zeroes = detect_zeroes;
>      dinfo->refcount = 1;
>      if (serial != NULL) {
>          dinfo->serial = g_strdup(serial);
> @@ -2455,6 +2485,10 @@ QemuOptsList qemu_common_drive_opts = {
>              .name = "copy-on-read",
>              .type = QEMU_OPT_BOOL,
>              .help = "copy read data from backing file into image file",
> +        },{
> +            .name = "detect-zeroes",
> +            .type = QEMU_OPT_STRING,
> +            .help = "try to optimize zero writes",
>          },
>          { /* end of list */ }
>      },
> diff --git a/hmp.c b/hmp.c
> index ca869ba..a541e7d 100644
> --- a/hmp.c
> +++ b/hmp.c
> @@ -336,6 +336,11 @@ void hmp_info_block(Monitor *mon, const QDict *qdict)
>                             info->value->inserted->backing_file_depth);
>          }
>  
> +        if (verbose) {
> +            monitor_printf(mon, "    Detect zeroes:    %s\n",
> +                           BlockdevDetectZeroesOptions_lookup[info->value->inserted->detect_zeroes]);
> +        }

Perhaps we should always display the information if zero detection is
enabled. This would be similar to things like I/O throttling limits,
which are only printed if throttling is enabled.

>          if (info->value->inserted->bps
>              || info->value->inserted->bps_rd
>              || info->value->inserted->bps_wr
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 9ffcb69..b8cc926 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -364,6 +364,7 @@ struct BlockDriverState {
>      BlockJob *job;
>  
>      QDict *options;
> +    BlockdevDetectZeroesOptions detect_zeroes;
>  };
>  
>  int get_tmp_filename(char *filename, int size);
> diff --git a/include/qemu-common.h b/include/qemu-common.h
> index a998e8d..8e3d6eb 100644
> --- a/include/qemu-common.h
> +++ b/include/qemu-common.h
> @@ -330,6 +330,7 @@ void qemu_iovec_concat(QEMUIOVector *dst,
>  void qemu_iovec_concat_iov(QEMUIOVector *dst,
>                             struct iovec *src_iov, unsigned int src_cnt,
>                             size_t soffset, size_t sbytes);
> +bool qemu_iovec_is_zero(QEMUIOVector *qiov);
>  void qemu_iovec_destroy(QEMUIOVector *qiov);
>  void qemu_iovec_reset(QEMUIOVector *qiov);
>  size_t qemu_iovec_to_buf(QEMUIOVector *qiov, size_t offset,
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 0b00427..d1620b1 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -937,6 +937,8 @@
>  # @encryption_key_missing: true if the backing device is encrypted but an
>  #                          valid encryption key is missing
>  #
> +# @detect_zeroes: detect and optimize zero writes (Since 2.1)
> +#
>  # @bps: total throughput limit in bytes per second is specified
>  #
>  # @bps_rd: read throughput limit in bytes per second is specified
> @@ -972,6 +974,7 @@
>    'data': { 'file': 'str', '*node-name': 'str', 'ro': 'bool', 'drv': 'str',
>              '*backing_file': 'str', 'backing_file_depth': 'int',
>              'encrypted': 'bool', 'encryption_key_missing': 'bool',
> +            'detect_zeroes': 'BlockdevDetectZeroesOptions',
>              'bps': 'int', 'bps_rd': 'int', 'bps_wr': 'int',
>              'iops': 'int', 'iops_rd': 'int', 'iops_wr': 'int',
>              'image': 'ImageInfo',
> @@ -4250,6 +4253,22 @@
>    'data': [ 'ignore', 'unmap' ] }
>  
>  ##
> +# @BlockdevDetectZeroesOptions
> +#
> +# Describes the operation mode for the automatic conversion of plain
> +# zero writes by the OS to driver specific optimized zero write commands.
> +#
> +# @off:      Disabled (default)
> +# @on:       Enabled
> +# @unmap:    Enabled and even try to unmap blocks if possible. This requires
> +#            also that @BlockdevDiscardOptions is set to unmap for this device.
> +#
> +# Since: 2.1
> +##
> +{ 'enum': 'BlockdevDetectZeroesOptions',
> +  'data': [ 'off', 'on', 'unmap' ] }
> +
> +##
>  # @BlockdevAioOptions
>  #
>  # Selects the AIO backend to handle I/O requests
> @@ -4301,20 +4320,22 @@
>  # Options that are available for all block devices, independent of the block
>  # driver.
>  #
> -# @driver:      block driver name
> -# @id:          #optional id by which the new block device can be referred to.
> -#               This is a required option on the top level of blockdev-add, and
> -#               currently not allowed on any other level.
> -# @node-name:   #optional the name of a block driver state node (Since 2.0)
> -# @discard:     #optional discard-related options (default: ignore)
> -# @cache:       #optional cache-related options
> -# @aio:         #optional AIO backend (default: threads)
> -# @rerror:      #optional how to handle read errors on the device
> -#               (default: report)
> -# @werror:      #optional how to handle write errors on the device
> -#               (default: enospc)
> -# @read-only:   #optional whether the block device should be read-only
> -#               (default: false)
> +# @driver:        block driver name
> +# @id:            #optional id by which the new block device can be referred to.
> +#                 This is a required option on the top level of blockdev-add, and
> +#                 currently not allowed on any other level.
> +# @node-name:     #optional the name of a block driver state node (Since 2.0)
> +# @discard:       #optional discard-related options (default: ignore)
> +# @cache:         #optional cache-related options
> +# @aio:           #optional AIO backend (default: threads)
> +# @rerror:        #optional how to handle read errors on the device
> +#                 (default: report)
> +# @werror:        #optional how to handle write errors on the device
> +#                 (default: enospc)
> +# @read-only:     #optional whether the block device should be read-only
> +#                 (default: false)
> +# @detect-zeroes: #optional detect and optimize zero writes (Since 2.1)
> +#                 (default: off)
>  #
>  # Since: 1.7
>  ##
> @@ -4327,7 +4348,8 @@
>              '*aio': 'BlockdevAioOptions',
>              '*rerror': 'BlockdevOnError',
>              '*werror': 'BlockdevOnError',
> -            '*read-only': 'bool' } }
> +            '*read-only': 'bool',
> +            '*detect-zeroes': 'BlockdevDetectZeroesOptions' } }
>  
>  ##
>  # @BlockdevOptionsFile
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 781af14..5ee94ea 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -414,6 +414,7 @@ DEF("drive", HAS_ARG, QEMU_OPTION_drive,
>      "       [,serial=s][,addr=A][,rerror=ignore|stop|report]\n"
>      "       [,werror=ignore|stop|report|enospc][,id=name][,aio=threads|native]\n"
>      "       [,readonly=on|off][,copy-on-read=on|off]\n"
> +    "       [,detect-zeroes=on|off|unmap]\n"
>      "       [[,bps=b]|[[,bps_rd=r][,bps_wr=w]]]\n"
>      "       [[,iops=i]|[[,iops_rd=r][,iops_wr=w]]]\n"
>      "       [[,bps_max=bm]|[[,bps_rd_max=rm][,bps_wr_max=wm]]]\n"
> @@ -475,6 +476,11 @@ Open drive @option{file} as read-only. Guest write attempts will fail.
>  @item copy-on-read=@var{copy-on-read}
>  @var{copy-on-read} is "on" or "off" and enables whether to copy read backing
>  file sectors into the image file.
> +@item detect-zeroes=@var{detect-zeroes}
> +@var{detect-zeroes} is "off", "on" or "unmap" and enables the automatic
> +conversion of plain zero writes by the OS to driver specific optimized
> +zero write commands. If "unmap" is chosen and @var{discard} is "on"
> +a zero write may even be converted to an UNMAP operation.
>  @end table
>  
>  By default, the @option{cache=writeback} mode is used. It will report data
> diff --git a/qmp-commands.hx b/qmp-commands.hx
> index ed3ab92..a535955 100644
> --- a/qmp-commands.hx
> +++ b/qmp-commands.hx
> @@ -2032,6 +2032,8 @@ Each json-object contain the following:
>           - "iops_rd_max":  read I/O operations max (json-int)
>           - "iops_wr_max":  write I/O operations max (json-int)
>           - "iops_size": I/O size when limiting by iops (json-int)
> +         - "detect_zeroes": detect and optimize zero writing (json-string)
> +             - Possible values: "off", "on", "unmap"
>           - "image": the detail of the image, it is a json-object containing
>              the following:
>               - "filename": image file name (json-string)
> @@ -2108,6 +2110,7 @@ Example:
>                 "iops_rd_max": 0,
>                 "iops_wr_max": 0,
>                 "iops_size": 0,
> +               "detect_zeroes": "on",
>                 "image":{
>                    "filename":"disks/test.qcow2",
>                    "format":"qcow2",
> diff --git a/util/iov.c b/util/iov.c
> index 6569b5a..f8c49a1 100644
> --- a/util/iov.c
> +++ b/util/iov.c
> @@ -335,6 +335,27 @@ void qemu_iovec_concat(QEMUIOVector *dst,
>      qemu_iovec_concat_iov(dst, src->iov, src->niov, soffset, sbytes);
>  }
>  
> +/*
> + *  check if the contents of the iovecs are all zero
> + */
> +bool qemu_iovec_is_zero(QEMUIOVector *qiov)
> +{
> +    int i;
> +    for (i = 0; i < qiov->niov; i++) {
> +        size_t offs = qiov->iov[i].iov_len & ~(4 * sizeof(long) - 1);

I think it gets a bit more readable like this:

size_t offs = QEMU_ALIGN_DOWN(qiov->iov[i].iov_len, 4 * sizeof(long));

> +        uint8_t *ptr = qiov->iov[i].iov_base;
> +        if (offs && !buffer_is_zero(qiov->iov[i].iov_base, offs)) {
> +            return false;
> +        }
> +        for (; offs < qiov->iov[i].iov_len; offs++) {
> +            if (ptr[offs]) {
> +                return false;
> +            }
> +        }
> +    }
> +    return true;
> +}
> +

This function could be a separate patch as well.

Kevin
Eric Blake May 14, 2014, 1:15 p.m. UTC | #4
On 05/13/2014 06:06 AM, Peter Lieven wrote:

>>>          - changed the parse function in blockdev_init to
>>>            be generic usable for other enum parameters
>> If you wouldn't mind, I think the generic function is useful enough that
>> people might want to backport it independently from this optimization.
>> It would be better to split this into a two-patch series, one for the
>> new parse_enum_option, the other for bdrv_write_zeroes utilizing it.
>>
>>
>>> +        },{
>>> +            .name = "detect-zeroes",
>>> +            .type = QEMU_OPT_STRING,
>>> +            .help = "try to optimize zero writes",
>> Might be worth listing (off, on, unmap) in the text.
>>
>> Everything else looked okay, but I'll wait for R-b until I see a
>> response about the idea of splitting the patch (even if that response is
>> justification for keeping it as one)
>>
> 
> I did not split because currently there is no other possible
> user in the function. The on_error settings and discard settings
> would be possible users, but for on_error there is a hardcoded
> difference between read and write which is not reflected in the
> qapi and for discard settings we have ignore and unmap, but
> we have also off and on which are not in qapi as well.

Even if there is currently only one caller of the new function, it's
STILL better to split patches into manageable pieces - psychologically,
it's easier to review a new function in isolation.  Besides, just
because there's currently only one user (the rest of your patch) does
not rule out that someone else may start using the function, and then a
backport would target that later commit plus your function.
Peter Lieven May 15, 2014, 5:16 a.m. UTC | #5
Am 14.05.2014 13:41, schrieb Kevin Wolf:
> Am 08.05.2014 um 18:22 hat Peter Lieven geschrieben:
>> this patch tries to optimize zero write requests
>> by automatically using bdrv_write_zeroes if it is
>> supported by the format.
>>
>> This significantly speeds up file system initialization and
>> should speed zero write test used to test backend storage
>> performance.
>>
>> I ran the following 2 tests on my internal SSD with a
>> 50G QCOW2 container and on an attached iSCSI storage.
>>
>> a) mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/vdX
>>
>> QCOW2         [off]     [on]     [unmap]
>> -----
>> runtime:       14secs    1.1secs  1.1secs
>> filesize:      937M      18M      18M
>>
>> iSCSI         [off]     [on]     [unmap]
>> ----
>> runtime:       9.3s      0.9s     0.9s
>>
>> b) dd if=/dev/zero of=/dev/vdX bs=1M oflag=direct
>>
>> QCOW2         [off]     [on]     [unmap]
>> -----
>> runtime:       246secs   18secs   18secs
>> filesize:      51G       192K     192K
>> throughput:    203M/s    2.3G/s   2.3G/s
>>
>> iSCSI*        [off]     [on]     [unmap]
>> ----
>> runtime:       8mins     45secs   33secs
>> throughput:    106M/s    1.2G/s   1.6G/s
>> allocated:     100%      100%     0%
>>
>> * The storage was connected via an 1Gbit interface.
>>   It seems to internally handle writing zeroes
>>   via WRITESAME16 very fast.
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> v3->v4: - use QAPI generated enum and lookup table [Kevin]
>>         - added more details about the options in the comments
>>           of the qapi-schema [Eric]
>>         - changed the type of detect_zeroes from str to
>>           BlockdevDetectZeroesOptions. I left the name
>>           as is because it is consistent with e.g.
>>           BlockdevDiscardOptions or BlockdevAioOptions [Eric]
>>         - changed the parse function in blockdev_init to
>>           be generic usable for other enum parameters
>>         
>> v2->v3: - moved parameter parsing to blockdev_init
>>         - added per device detect_zeroes status to 
>>           hmp (info block -v) and qmp (query-block) [Eric]
>>         - added support to enable detect-zeroes also
>>           for hot added devices [Eric]
>>         - added missing entry to qemu_common_drive_opts
>>         - fixed description of qemu_iovec_is_zero [Fam]
>>
>> v1->v2: - added tests to commit message (Markus)
>> RFCv2->v1: - fixed paramter parsing strncmp -> strcmp (Eric)
>>            - fixed typo (choosen->chosen) (Eric)
>>            - added second example to commit msg
>>
>> RFCv1->RFCv2: - add detect-zeroes=off|on|unmap knob to drive cmdline parameter
>>               - call zero detection only for format (bs->file != NULL)
>>
>>  block.c                   |   11 ++++++++++
>>  block/qapi.c              |    1 +
>>  blockdev.c                |   34 +++++++++++++++++++++++++++++
>>  hmp.c                     |    5 +++++
>>  include/block/block_int.h |    1 +
>>  include/qemu-common.h     |    1 +
>>  qapi-schema.json          |   52 ++++++++++++++++++++++++++++++++-------------
>>  qemu-options.hx           |    6 ++++++
>>  qmp-commands.hx           |    3 +++
>>  util/iov.c                |   21 ++++++++++++++++++
>>  10 files changed, 120 insertions(+), 15 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index b749d31..aea4c77 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -3244,6 +3244,17 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>>  
>>      ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
>>  
>> +    if (!ret && bs->detect_zeroes != BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF &&
>> +        !(flags & BDRV_REQ_ZERO_WRITE) && bs->file &&
>> +        drv->bdrv_co_write_zeroes && qemu_iovec_is_zero(qiov)) {
> Pretty long condition. :-)
>
> Looks like most is obviously needed, but I wonder what the bs->file part
> is good for? That looks rather arbitrary.

What I wanted to achieve is that this condition is only true if we handle
the format (e.g. raw, qcow2, vmdk etc.). If e.g. qcow2 then sends a
zero write this should always end on the disk and should not be optimizable.


>
>> +        flags |= BDRV_REQ_ZERO_WRITE;
>> +        /* if the device was not opened with discard=on the below flag
>> +         * is immediately cleared again in bdrv_co_do_write_zeroes */
> Is it? I only see it being cleared in bdrv_co_write_zeroes(), but that's
> not a function that seems to be called from here.

You are right. Question, do we want to support detect_zeroes = unmap
if discard = ignore? If yes, I have to update the docs. Otherwise
I have to check for BDRV_O_DISCARD before setting BDRV_REQ_MAY_UNMAP.

>
>> +        if (bs->detect_zeroes == BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP) {
>> +            flags |= BDRV_REQ_MAY_UNMAP;
>> +        }
>> +    }
>> +
>>      if (ret < 0) {
>>          /* Do nothing, write notifier decided to fail this request */
>>      } else if (flags & BDRV_REQ_ZERO_WRITE) {
>> diff --git a/block/qapi.c b/block/qapi.c
>> index af11445..75f44f1 100644
>> --- a/block/qapi.c
>> +++ b/block/qapi.c
>> @@ -50,6 +50,7 @@ BlockDeviceInfo *bdrv_block_device_info(BlockDriverState *bs)
>>      }
>>  
>>      info->backing_file_depth = bdrv_get_backing_file_depth(bs);
>> +    info->detect_zeroes = bs->detect_zeroes;
>>  
>>      if (bs->io_limits_enabled) {
>>          ThrottleConfig cfg;
>> diff --git a/blockdev.c b/blockdev.c
>> index 7810e9f..955bd49 100644
>> --- a/blockdev.c
>> +++ b/blockdev.c
>> @@ -288,6 +288,23 @@ static int parse_block_error_action(const char *buf, bool is_read, Error **errp)
>>      }
>>  }
>>  
>> +static int parse_enum_option(const char *lookup[], const char *buf, int max,
>> +                             int def, Error **errp)
>> +{
>> +    int i;
>> +    if (!buf) {
>> +        return def;
>> +    }
>> +    for (i = 0; i < max; i++) {
>> +        if (!strcmp(buf, lookup[i])) {
>> +            return i;
>> +        }
>> +    }
>> +    error_setg(errp, "invalid parameter value: %s",
>> +               buf);
>> +    return def;
>> +}
>> +
>>  static bool check_throttle_config(ThrottleConfig *cfg, Error **errp)
>>  {
>>      if (throttle_conflicting(cfg)) {
> This hunk could have been a patch of its own (not a reason for a respin,
> but if you need to respin to address one of the other comments, please
> split the patch).

Yes, will do. Erik also already suggested this.

>
>> @@ -324,6 +341,7 @@ static DriveInfo *blockdev_init(const char *file, QDict *bs_opts,
>>      QemuOpts *opts;
>>      const char *id;
>>      bool has_driver_specific_opts;
>> +    BlockdevDetectZeroesOptions detect_zeroes;
>>      BlockDriver *drv = NULL;
>>  
>>      /* Check common options by copying from bs_opts to opts, all other options
>> @@ -452,6 +470,17 @@ static DriveInfo *blockdev_init(const char *file, QDict *bs_opts,
>>          }
>>      }
>>  
>> +    detect_zeroes =
>> +        parse_enum_option(BlockdevDetectZeroesOptions_lookup,
>> +                          qemu_opt_get(opts, "detect-zeroes"),
>> +                          BLOCKDEV_DETECT_ZEROES_OPTIONS_MAX,
>> +                          BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF,
>> +                          &error);
>> +    if (error) {
>> +        error_propagate(errp, error);
>> +        goto early_err;
>> +    }
>> +
>>      /* init */
>>      dinfo = g_malloc0(sizeof(*dinfo));
>>      dinfo->id = g_strdup(qemu_opts_id(opts));
>> @@ -462,6 +491,7 @@ static DriveInfo *blockdev_init(const char *file, QDict *bs_opts,
>>      }
>>      dinfo->bdrv->open_flags = snapshot ? BDRV_O_SNAPSHOT : 0;
>>      dinfo->bdrv->read_only = ro;
>> +    dinfo->bdrv->detect_zeroes = detect_zeroes;
>>      dinfo->refcount = 1;
>>      if (serial != NULL) {
>>          dinfo->serial = g_strdup(serial);
>> @@ -2455,6 +2485,10 @@ QemuOptsList qemu_common_drive_opts = {
>>              .name = "copy-on-read",
>>              .type = QEMU_OPT_BOOL,
>>              .help = "copy read data from backing file into image file",
>> +        },{
>> +            .name = "detect-zeroes",
>> +            .type = QEMU_OPT_STRING,
>> +            .help = "try to optimize zero writes",
>>          },
>>          { /* end of list */ }
>>      },
>> diff --git a/hmp.c b/hmp.c
>> index ca869ba..a541e7d 100644
>> --- a/hmp.c
>> +++ b/hmp.c
>> @@ -336,6 +336,11 @@ void hmp_info_block(Monitor *mon, const QDict *qdict)
>>                             info->value->inserted->backing_file_depth);
>>          }
>>  
>> +        if (verbose) {
>> +            monitor_printf(mon, "    Detect zeroes:    %s\n",
>> +                           BlockdevDetectZeroesOptions_lookup[info->value->inserted->detect_zeroes]);
>> +        }
> Perhaps we should always display the information if zero detection is
> enabled. This would be similar to things like I/O throttling limits,
> which are only printed if throttling is enabled.

Ok

>
>>          if (info->value->inserted->bps
>>              || info->value->inserted->bps_rd
>>              || info->value->inserted->bps_wr
>> diff --git a/include/block/block_int.h b/include/block/block_int.h
>> index 9ffcb69..b8cc926 100644
>> --- a/include/block/block_int.h
>> +++ b/include/block/block_int.h
>> @@ -364,6 +364,7 @@ struct BlockDriverState {
>>      BlockJob *job;
>>  
>>      QDict *options;
>> +    BlockdevDetectZeroesOptions detect_zeroes;
>>  };
>>  
>>  int get_tmp_filename(char *filename, int size);
>> diff --git a/include/qemu-common.h b/include/qemu-common.h
>> index a998e8d..8e3d6eb 100644
>> --- a/include/qemu-common.h
>> +++ b/include/qemu-common.h
>> @@ -330,6 +330,7 @@ void qemu_iovec_concat(QEMUIOVector *dst,
>>  void qemu_iovec_concat_iov(QEMUIOVector *dst,
>>                             struct iovec *src_iov, unsigned int src_cnt,
>>                             size_t soffset, size_t sbytes);
>> +bool qemu_iovec_is_zero(QEMUIOVector *qiov);
>>  void qemu_iovec_destroy(QEMUIOVector *qiov);
>>  void qemu_iovec_reset(QEMUIOVector *qiov);
>>  size_t qemu_iovec_to_buf(QEMUIOVector *qiov, size_t offset,
>> diff --git a/qapi-schema.json b/qapi-schema.json
>> index 0b00427..d1620b1 100644
>> --- a/qapi-schema.json
>> +++ b/qapi-schema.json
>> @@ -937,6 +937,8 @@
>>  # @encryption_key_missing: true if the backing device is encrypted but an
>>  #                          valid encryption key is missing
>>  #
>> +# @detect_zeroes: detect and optimize zero writes (Since 2.1)
>> +#
>>  # @bps: total throughput limit in bytes per second is specified
>>  #
>>  # @bps_rd: read throughput limit in bytes per second is specified
>> @@ -972,6 +974,7 @@
>>    'data': { 'file': 'str', '*node-name': 'str', 'ro': 'bool', 'drv': 'str',
>>              '*backing_file': 'str', 'backing_file_depth': 'int',
>>              'encrypted': 'bool', 'encryption_key_missing': 'bool',
>> +            'detect_zeroes': 'BlockdevDetectZeroesOptions',
>>              'bps': 'int', 'bps_rd': 'int', 'bps_wr': 'int',
>>              'iops': 'int', 'iops_rd': 'int', 'iops_wr': 'int',
>>              'image': 'ImageInfo',
>> @@ -4250,6 +4253,22 @@
>>    'data': [ 'ignore', 'unmap' ] }
>>  
>>  ##
>> +# @BlockdevDetectZeroesOptions
>> +#
>> +# Describes the operation mode for the automatic conversion of plain
>> +# zero writes by the OS to driver specific optimized zero write commands.
>> +#
>> +# @off:      Disabled (default)
>> +# @on:       Enabled
>> +# @unmap:    Enabled and even try to unmap blocks if possible. This requires
>> +#            also that @BlockdevDiscardOptions is set to unmap for this device.
>> +#
>> +# Since: 2.1
>> +##
>> +{ 'enum': 'BlockdevDetectZeroesOptions',
>> +  'data': [ 'off', 'on', 'unmap' ] }
>> +
>> +##
>>  # @BlockdevAioOptions
>>  #
>>  # Selects the AIO backend to handle I/O requests
>> @@ -4301,20 +4320,22 @@
>>  # Options that are available for all block devices, independent of the block
>>  # driver.
>>  #
>> -# @driver:      block driver name
>> -# @id:          #optional id by which the new block device can be referred to.
>> -#               This is a required option on the top level of blockdev-add, and
>> -#               currently not allowed on any other level.
>> -# @node-name:   #optional the name of a block driver state node (Since 2.0)
>> -# @discard:     #optional discard-related options (default: ignore)
>> -# @cache:       #optional cache-related options
>> -# @aio:         #optional AIO backend (default: threads)
>> -# @rerror:      #optional how to handle read errors on the device
>> -#               (default: report)
>> -# @werror:      #optional how to handle write errors on the device
>> -#               (default: enospc)
>> -# @read-only:   #optional whether the block device should be read-only
>> -#               (default: false)
>> +# @driver:        block driver name
>> +# @id:            #optional id by which the new block device can be referred to.
>> +#                 This is a required option on the top level of blockdev-add, and
>> +#                 currently not allowed on any other level.
>> +# @node-name:     #optional the name of a block driver state node (Since 2.0)
>> +# @discard:       #optional discard-related options (default: ignore)
>> +# @cache:         #optional cache-related options
>> +# @aio:           #optional AIO backend (default: threads)
>> +# @rerror:        #optional how to handle read errors on the device
>> +#                 (default: report)
>> +# @werror:        #optional how to handle write errors on the device
>> +#                 (default: enospc)
>> +# @read-only:     #optional whether the block device should be read-only
>> +#                 (default: false)
>> +# @detect-zeroes: #optional detect and optimize zero writes (Since 2.1)
>> +#                 (default: off)
>>  #
>>  # Since: 1.7
>>  ##
>> @@ -4327,7 +4348,8 @@
>>              '*aio': 'BlockdevAioOptions',
>>              '*rerror': 'BlockdevOnError',
>>              '*werror': 'BlockdevOnError',
>> -            '*read-only': 'bool' } }
>> +            '*read-only': 'bool',
>> +            '*detect-zeroes': 'BlockdevDetectZeroesOptions' } }
>>  
>>  ##
>>  # @BlockdevOptionsFile
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index 781af14..5ee94ea 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -414,6 +414,7 @@ DEF("drive", HAS_ARG, QEMU_OPTION_drive,
>>      "       [,serial=s][,addr=A][,rerror=ignore|stop|report]\n"
>>      "       [,werror=ignore|stop|report|enospc][,id=name][,aio=threads|native]\n"
>>      "       [,readonly=on|off][,copy-on-read=on|off]\n"
>> +    "       [,detect-zeroes=on|off|unmap]\n"
>>      "       [[,bps=b]|[[,bps_rd=r][,bps_wr=w]]]\n"
>>      "       [[,iops=i]|[[,iops_rd=r][,iops_wr=w]]]\n"
>>      "       [[,bps_max=bm]|[[,bps_rd_max=rm][,bps_wr_max=wm]]]\n"
>> @@ -475,6 +476,11 @@ Open drive @option{file} as read-only. Guest write attempts will fail.
>>  @item copy-on-read=@var{copy-on-read}
>>  @var{copy-on-read} is "on" or "off" and enables whether to copy read backing
>>  file sectors into the image file.
>> +@item detect-zeroes=@var{detect-zeroes}
>> +@var{detect-zeroes} is "off", "on" or "unmap" and enables the automatic
>> +conversion of plain zero writes by the OS to driver specific optimized
>> +zero write commands. If "unmap" is chosen and @var{discard} is "on"
>> +a zero write may even be converted to an UNMAP operation.
>>  @end table
>>  
>>  By default, the @option{cache=writeback} mode is used. It will report data
>> diff --git a/qmp-commands.hx b/qmp-commands.hx
>> index ed3ab92..a535955 100644
>> --- a/qmp-commands.hx
>> +++ b/qmp-commands.hx
>> @@ -2032,6 +2032,8 @@ Each json-object contain the following:
>>           - "iops_rd_max":  read I/O operations max (json-int)
>>           - "iops_wr_max":  write I/O operations max (json-int)
>>           - "iops_size": I/O size when limiting by iops (json-int)
>> +         - "detect_zeroes": detect and optimize zero writing (json-string)
>> +             - Possible values: "off", "on", "unmap"
>>           - "image": the detail of the image, it is a json-object containing
>>              the following:
>>               - "filename": image file name (json-string)
>> @@ -2108,6 +2110,7 @@ Example:
>>                 "iops_rd_max": 0,
>>                 "iops_wr_max": 0,
>>                 "iops_size": 0,
>> +               "detect_zeroes": "on",
>>                 "image":{
>>                    "filename":"disks/test.qcow2",
>>                    "format":"qcow2",
>> diff --git a/util/iov.c b/util/iov.c
>> index 6569b5a..f8c49a1 100644
>> --- a/util/iov.c
>> +++ b/util/iov.c
>> @@ -335,6 +335,27 @@ void qemu_iovec_concat(QEMUIOVector *dst,
>>      qemu_iovec_concat_iov(dst, src->iov, src->niov, soffset, sbytes);
>>  }
>>  
>> +/*
>> + *  check if the contents of the iovecs are all zero
>> + */
>> +bool qemu_iovec_is_zero(QEMUIOVector *qiov)
>> +{
>> +    int i;
>> +    for (i = 0; i < qiov->niov; i++) {
>> +        size_t offs = qiov->iov[i].iov_len & ~(4 * sizeof(long) - 1);
> I think it gets a bit more readable like this:
>
> size_t offs = QEMU_ALIGN_DOWN(qiov->iov[i].iov_len, 4 * sizeof(long));

Yes, indeed.

>
>> +        uint8_t *ptr = qiov->iov[i].iov_base;
>> +        if (offs && !buffer_is_zero(qiov->iov[i].iov_base, offs)) {
>> +            return false;
>> +        }
>> +        for (; offs < qiov->iov[i].iov_len; offs++) {
>> +            if (ptr[offs]) {
>> +                return false;
>> +            }
>> +        }
>> +    }
>> +    return true;
>> +}
>> +
> This function could be a separate patch as well.

Ok, will split as well.

Thank you for your and Erik for his comments,
Peter
Kevin Wolf May 15, 2014, 9:54 a.m. UTC | #6
Am 15.05.2014 um 07:16 hat Peter Lieven geschrieben:
> Am 14.05.2014 13:41, schrieb Kevin Wolf:
> > Am 08.05.2014 um 18:22 hat Peter Lieven geschrieben:
> >> this patch tries to optimize zero write requests
> >> by automatically using bdrv_write_zeroes if it is
> >> supported by the format.
> >>
> >> This significantly speeds up file system initialization and
> >> should speed zero write test used to test backend storage
> >> performance.
> >>
> >> I ran the following 2 tests on my internal SSD with a
> >> 50G QCOW2 container and on an attached iSCSI storage.
> >>
> >> a) mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/vdX
> >>
> >> QCOW2         [off]     [on]     [unmap]
> >> -----
> >> runtime:       14secs    1.1secs  1.1secs
> >> filesize:      937M      18M      18M
> >>
> >> iSCSI         [off]     [on]     [unmap]
> >> ----
> >> runtime:       9.3s      0.9s     0.9s
> >>
> >> b) dd if=/dev/zero of=/dev/vdX bs=1M oflag=direct
> >>
> >> QCOW2         [off]     [on]     [unmap]
> >> -----
> >> runtime:       246secs   18secs   18secs
> >> filesize:      51G       192K     192K
> >> throughput:    203M/s    2.3G/s   2.3G/s
> >>
> >> iSCSI*        [off]     [on]     [unmap]
> >> ----
> >> runtime:       8mins     45secs   33secs
> >> throughput:    106M/s    1.2G/s   1.6G/s
> >> allocated:     100%      100%     0%
> >>
> >> * The storage was connected via an 1Gbit interface.
> >>   It seems to internally handle writing zeroes
> >>   via WRITESAME16 very fast.
> >>
> >> Signed-off-by: Peter Lieven <pl@kamp.de>
> >> ---
> >> v3->v4: - use QAPI generated enum and lookup table [Kevin]
> >>         - added more details about the options in the comments
> >>           of the qapi-schema [Eric]
> >>         - changed the type of detect_zeroes from str to
> >>           BlockdevDetectZeroesOptions. I left the name
> >>           as is because it is consistent with e.g.
> >>           BlockdevDiscardOptions or BlockdevAioOptions [Eric]
> >>         - changed the parse function in blockdev_init to
> >>           be generic usable for other enum parameters
> >>         
> >> v2->v3: - moved parameter parsing to blockdev_init
> >>         - added per device detect_zeroes status to 
> >>           hmp (info block -v) and qmp (query-block) [Eric]
> >>         - added support to enable detect-zeroes also
> >>           for hot added devices [Eric]
> >>         - added missing entry to qemu_common_drive_opts
> >>         - fixed description of qemu_iovec_is_zero [Fam]
> >>
> >> v1->v2: - added tests to commit message (Markus)
> >> RFCv2->v1: - fixed paramter parsing strncmp -> strcmp (Eric)
> >>            - fixed typo (choosen->chosen) (Eric)
> >>            - added second example to commit msg
> >>
> >> RFCv1->RFCv2: - add detect-zeroes=off|on|unmap knob to drive cmdline parameter
> >>               - call zero detection only for format (bs->file != NULL)
> >>
> >>  block.c                   |   11 ++++++++++
> >>  block/qapi.c              |    1 +
> >>  blockdev.c                |   34 +++++++++++++++++++++++++++++
> >>  hmp.c                     |    5 +++++
> >>  include/block/block_int.h |    1 +
> >>  include/qemu-common.h     |    1 +
> >>  qapi-schema.json          |   52 ++++++++++++++++++++++++++++++++-------------
> >>  qemu-options.hx           |    6 ++++++
> >>  qmp-commands.hx           |    3 +++
> >>  util/iov.c                |   21 ++++++++++++++++++
> >>  10 files changed, 120 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/block.c b/block.c
> >> index b749d31..aea4c77 100644
> >> --- a/block.c
> >> +++ b/block.c
> >> @@ -3244,6 +3244,17 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
> >>  
> >>      ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
> >>  
> >> +    if (!ret && bs->detect_zeroes != BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF &&
> >> +        !(flags & BDRV_REQ_ZERO_WRITE) && bs->file &&
> >> +        drv->bdrv_co_write_zeroes && qemu_iovec_is_zero(qiov)) {
> > Pretty long condition. :-)
> >
> > Looks like most is obviously needed, but I wonder what the bs->file part
> > is good for? That looks rather arbitrary.
> 
> What I wanted to achieve is that this condition is only true if we handle
> the format (e.g. raw, qcow2, vmdk etc.). If e.g. qcow2 then sends a
> zero write this should always end on the disk and should not be optimizable.

But why? This means setting an arbitrary policy for no good reason. You
already have an option, and it already defaults to off, so unless
someone specifically enables it for bs->file, we don't do the
optimisation. But if someone wants to have it on bs->file, what reason
is there to ignore that request?

> >> +        flags |= BDRV_REQ_ZERO_WRITE;
> >> +        /* if the device was not opened with discard=on the below flag
> >> +         * is immediately cleared again in bdrv_co_do_write_zeroes */
> > Is it? I only see it being cleared in bdrv_co_write_zeroes(), but that's
> > not a function that seems to be called from here.
> 
> You are right. Question, do we want to support detect_zeroes = unmap
> if discard = ignore? If yes, I have to update the docs. Otherwise
> I have to check for BDRV_O_DISCARD before setting BDRV_REQ_MAY_UNMAP.

I think it would be reasonable enough to just error out when you try to
open an image with detect_zeroes=unmap,discard=ignore.

Can these flags be changed during runtime? If so, we need to check there
as well.

> >> +        if (bs->detect_zeroes == BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP) {
> >> +            flags |= BDRV_REQ_MAY_UNMAP;
> >> +        }
> >> +    }
> >> +
> >>      if (ret < 0) {
> >>          /* Do nothing, write notifier decided to fail this request */
> >>      } else if (flags & BDRV_REQ_ZERO_WRITE) {

Kevin
Peter Lieven May 15, 2014, 9:20 p.m. UTC | #7
Am 15.05.2014 11:54, schrieb Kevin Wolf:
> Am 15.05.2014 um 07:16 hat Peter Lieven geschrieben:
>> Am 14.05.2014 13:41, schrieb Kevin Wolf:
>>> Am 08.05.2014 um 18:22 hat Peter Lieven geschrieben:
>>>> this patch tries to optimize zero write requests
>>>> by automatically using bdrv_write_zeroes if it is
>>>> supported by the format.
>>>>
>>>> This significantly speeds up file system initialization and
>>>> should speed zero write test used to test backend storage
>>>> performance.
>>>>
>>>> I ran the following 2 tests on my internal SSD with a
>>>> 50G QCOW2 container and on an attached iSCSI storage.
>>>>
>>>> a) mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/vdX
>>>>
>>>> QCOW2         [off]     [on]     [unmap]
>>>> -----
>>>> runtime:       14secs    1.1secs  1.1secs
>>>> filesize:      937M      18M      18M
>>>>
>>>> iSCSI         [off]     [on]     [unmap]
>>>> ----
>>>> runtime:       9.3s      0.9s     0.9s
>>>>
>>>> b) dd if=/dev/zero of=/dev/vdX bs=1M oflag=direct
>>>>
>>>> QCOW2         [off]     [on]     [unmap]
>>>> -----
>>>> runtime:       246secs   18secs   18secs
>>>> filesize:      51G       192K     192K
>>>> throughput:    203M/s    2.3G/s   2.3G/s
>>>>
>>>> iSCSI*        [off]     [on]     [unmap]
>>>> ----
>>>> runtime:       8mins     45secs   33secs
>>>> throughput:    106M/s    1.2G/s   1.6G/s
>>>> allocated:     100%      100%     0%
>>>>
>>>> * The storage was connected via an 1Gbit interface.
>>>>   It seems to internally handle writing zeroes
>>>>   via WRITESAME16 very fast.
>>>>
>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>>> ---
>>>> v3->v4: - use QAPI generated enum and lookup table [Kevin]
>>>>         - added more details about the options in the comments
>>>>           of the qapi-schema [Eric]
>>>>         - changed the type of detect_zeroes from str to
>>>>           BlockdevDetectZeroesOptions. I left the name
>>>>           as is because it is consistent with e.g.
>>>>           BlockdevDiscardOptions or BlockdevAioOptions [Eric]
>>>>         - changed the parse function in blockdev_init to
>>>>           be generic usable for other enum parameters
>>>>         
>>>> v2->v3: - moved parameter parsing to blockdev_init
>>>>         - added per device detect_zeroes status to 
>>>>           hmp (info block -v) and qmp (query-block) [Eric]
>>>>         - added support to enable detect-zeroes also
>>>>           for hot added devices [Eric]
>>>>         - added missing entry to qemu_common_drive_opts
>>>>         - fixed description of qemu_iovec_is_zero [Fam]
>>>>
>>>> v1->v2: - added tests to commit message (Markus)
>>>> RFCv2->v1: - fixed paramter parsing strncmp -> strcmp (Eric)
>>>>            - fixed typo (choosen->chosen) (Eric)
>>>>            - added second example to commit msg
>>>>
>>>> RFCv1->RFCv2: - add detect-zeroes=off|on|unmap knob to drive cmdline parameter
>>>>               - call zero detection only for format (bs->file != NULL)
>>>>
>>>>  block.c                   |   11 ++++++++++
>>>>  block/qapi.c              |    1 +
>>>>  blockdev.c                |   34 +++++++++++++++++++++++++++++
>>>>  hmp.c                     |    5 +++++
>>>>  include/block/block_int.h |    1 +
>>>>  include/qemu-common.h     |    1 +
>>>>  qapi-schema.json          |   52 ++++++++++++++++++++++++++++++++-------------
>>>>  qemu-options.hx           |    6 ++++++
>>>>  qmp-commands.hx           |    3 +++
>>>>  util/iov.c                |   21 ++++++++++++++++++
>>>>  10 files changed, 120 insertions(+), 15 deletions(-)
>>>>
>>>> diff --git a/block.c b/block.c
>>>> index b749d31..aea4c77 100644
>>>> --- a/block.c
>>>> +++ b/block.c
>>>> @@ -3244,6 +3244,17 @@ static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
>>>>  
>>>>      ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
>>>>  
>>>> +    if (!ret && bs->detect_zeroes != BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF &&
>>>> +        !(flags & BDRV_REQ_ZERO_WRITE) && bs->file &&
>>>> +        drv->bdrv_co_write_zeroes && qemu_iovec_is_zero(qiov)) {
>>> Pretty long condition. :-)
>>>
>>> Looks like most is obviously needed, but I wonder what the bs->file part
>>> is good for? That looks rather arbitrary.
>> What I wanted to achieve is that this condition is only true if we handle
>> the format (e.g. raw, qcow2, vmdk etc.). If e.g. qcow2 then sends a
>> zero write this should always end on the disk and should not be optimizable.
> But why? This means setting an arbitrary policy for no good reason. You
> already have an option, and it already defaults to off, so unless
> someone specifically enables it for bs->file, we don't do the
> optimisation. But if someone wants to have it on bs->file, what reason
> is there to ignore that request?

I was erroneously thinking that the setting would be inherited to bs->file.
I will remove the bs->file from the condition.
>
>>>> +        flags |= BDRV_REQ_ZERO_WRITE;
>>>> +        /* if the device was not opened with discard=on the below flag
>>>> +         * is immediately cleared again in bdrv_co_do_write_zeroes */
>>> Is it? I only see it being cleared in bdrv_co_write_zeroes(), but that's
>>> not a function that seems to be called from here.
>> You are right. Question, do we want to support detect_zeroes = unmap
>> if discard = ignore? If yes, I have to update the docs. Otherwise
>> I have to check for BDRV_O_DISCARD before setting BDRV_REQ_MAY_UNMAP.
> I think it would be reasonable enough to just error out when you try to
> open an image with detect_zeroes=unmap,discard=ignore.
>
> Can these flags be changed during runtime? If so, we need to check there
> as well.

No, but you can specify them during hot add. Same as with discard.

Peter

>
>>>> +        if (bs->detect_zeroes == BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP) {
>>>> +            flags |= BDRV_REQ_MAY_UNMAP;
>>>> +        }
>>>> +    }
>>>> +
>>>>      if (ret < 0) {
>>>>          /* Do nothing, write notifier decided to fail this request */
>>>>      } else if (flags & BDRV_REQ_ZERO_WRITE) {
> Kevin
diff mbox

Patch

diff --git a/block.c b/block.c
index b749d31..aea4c77 100644
--- a/block.c
+++ b/block.c
@@ -3244,6 +3244,17 @@  static int coroutine_fn bdrv_aligned_pwritev(BlockDriverState *bs,
 
     ret = notifier_with_return_list_notify(&bs->before_write_notifiers, req);
 
+    if (!ret && bs->detect_zeroes != BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF &&
+        !(flags & BDRV_REQ_ZERO_WRITE) && bs->file &&
+        drv->bdrv_co_write_zeroes && qemu_iovec_is_zero(qiov)) {
+        flags |= BDRV_REQ_ZERO_WRITE;
+        /* if the device was not opened with discard=on the below flag
+         * is immediately cleared again in bdrv_co_do_write_zeroes */
+        if (bs->detect_zeroes == BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP) {
+            flags |= BDRV_REQ_MAY_UNMAP;
+        }
+    }
+
     if (ret < 0) {
         /* Do nothing, write notifier decided to fail this request */
     } else if (flags & BDRV_REQ_ZERO_WRITE) {
diff --git a/block/qapi.c b/block/qapi.c
index af11445..75f44f1 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -50,6 +50,7 @@  BlockDeviceInfo *bdrv_block_device_info(BlockDriverState *bs)
     }
 
     info->backing_file_depth = bdrv_get_backing_file_depth(bs);
+    info->detect_zeroes = bs->detect_zeroes;
 
     if (bs->io_limits_enabled) {
         ThrottleConfig cfg;
diff --git a/blockdev.c b/blockdev.c
index 7810e9f..955bd49 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -288,6 +288,23 @@  static int parse_block_error_action(const char *buf, bool is_read, Error **errp)
     }
 }
 
+static int parse_enum_option(const char *lookup[], const char *buf, int max,
+                             int def, Error **errp)
+{
+    int i;
+    if (!buf) {
+        return def;
+    }
+    for (i = 0; i < max; i++) {
+        if (!strcmp(buf, lookup[i])) {
+            return i;
+        }
+    }
+    error_setg(errp, "invalid parameter value: %s",
+               buf);
+    return def;
+}
+
 static bool check_throttle_config(ThrottleConfig *cfg, Error **errp)
 {
     if (throttle_conflicting(cfg)) {
@@ -324,6 +341,7 @@  static DriveInfo *blockdev_init(const char *file, QDict *bs_opts,
     QemuOpts *opts;
     const char *id;
     bool has_driver_specific_opts;
+    BlockdevDetectZeroesOptions detect_zeroes;
     BlockDriver *drv = NULL;
 
     /* Check common options by copying from bs_opts to opts, all other options
@@ -452,6 +470,17 @@  static DriveInfo *blockdev_init(const char *file, QDict *bs_opts,
         }
     }
 
+    detect_zeroes =
+        parse_enum_option(BlockdevDetectZeroesOptions_lookup,
+                          qemu_opt_get(opts, "detect-zeroes"),
+                          BLOCKDEV_DETECT_ZEROES_OPTIONS_MAX,
+                          BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF,
+                          &error);
+    if (error) {
+        error_propagate(errp, error);
+        goto early_err;
+    }
+
     /* init */
     dinfo = g_malloc0(sizeof(*dinfo));
     dinfo->id = g_strdup(qemu_opts_id(opts));
@@ -462,6 +491,7 @@  static DriveInfo *blockdev_init(const char *file, QDict *bs_opts,
     }
     dinfo->bdrv->open_flags = snapshot ? BDRV_O_SNAPSHOT : 0;
     dinfo->bdrv->read_only = ro;
+    dinfo->bdrv->detect_zeroes = detect_zeroes;
     dinfo->refcount = 1;
     if (serial != NULL) {
         dinfo->serial = g_strdup(serial);
@@ -2455,6 +2485,10 @@  QemuOptsList qemu_common_drive_opts = {
             .name = "copy-on-read",
             .type = QEMU_OPT_BOOL,
             .help = "copy read data from backing file into image file",
+        },{
+            .name = "detect-zeroes",
+            .type = QEMU_OPT_STRING,
+            .help = "try to optimize zero writes",
         },
         { /* end of list */ }
     },
diff --git a/hmp.c b/hmp.c
index ca869ba..a541e7d 100644
--- a/hmp.c
+++ b/hmp.c
@@ -336,6 +336,11 @@  void hmp_info_block(Monitor *mon, const QDict *qdict)
                            info->value->inserted->backing_file_depth);
         }
 
+        if (verbose) {
+            monitor_printf(mon, "    Detect zeroes:    %s\n",
+                           BlockdevDetectZeroesOptions_lookup[info->value->inserted->detect_zeroes]);
+        }
+
         if (info->value->inserted->bps
             || info->value->inserted->bps_rd
             || info->value->inserted->bps_wr
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 9ffcb69..b8cc926 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -364,6 +364,7 @@  struct BlockDriverState {
     BlockJob *job;
 
     QDict *options;
+    BlockdevDetectZeroesOptions detect_zeroes;
 };
 
 int get_tmp_filename(char *filename, int size);
diff --git a/include/qemu-common.h b/include/qemu-common.h
index a998e8d..8e3d6eb 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -330,6 +330,7 @@  void qemu_iovec_concat(QEMUIOVector *dst,
 void qemu_iovec_concat_iov(QEMUIOVector *dst,
                            struct iovec *src_iov, unsigned int src_cnt,
                            size_t soffset, size_t sbytes);
+bool qemu_iovec_is_zero(QEMUIOVector *qiov);
 void qemu_iovec_destroy(QEMUIOVector *qiov);
 void qemu_iovec_reset(QEMUIOVector *qiov);
 size_t qemu_iovec_to_buf(QEMUIOVector *qiov, size_t offset,
diff --git a/qapi-schema.json b/qapi-schema.json
index 0b00427..d1620b1 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -937,6 +937,8 @@ 
 # @encryption_key_missing: true if the backing device is encrypted but an
 #                          valid encryption key is missing
 #
+# @detect_zeroes: detect and optimize zero writes (Since 2.1)
+#
 # @bps: total throughput limit in bytes per second is specified
 #
 # @bps_rd: read throughput limit in bytes per second is specified
@@ -972,6 +974,7 @@ 
   'data': { 'file': 'str', '*node-name': 'str', 'ro': 'bool', 'drv': 'str',
             '*backing_file': 'str', 'backing_file_depth': 'int',
             'encrypted': 'bool', 'encryption_key_missing': 'bool',
+            'detect_zeroes': 'BlockdevDetectZeroesOptions',
             'bps': 'int', 'bps_rd': 'int', 'bps_wr': 'int',
             'iops': 'int', 'iops_rd': 'int', 'iops_wr': 'int',
             'image': 'ImageInfo',
@@ -4250,6 +4253,22 @@ 
   'data': [ 'ignore', 'unmap' ] }
 
 ##
+# @BlockdevDetectZeroesOptions
+#
+# Describes the operation mode for the automatic conversion of plain
+# zero writes by the OS to driver specific optimized zero write commands.
+#
+# @off:      Disabled (default)
+# @on:       Enabled
+# @unmap:    Enabled and even try to unmap blocks if possible. This requires
+#            also that @BlockdevDiscardOptions is set to unmap for this device.
+#
+# Since: 2.1
+##
+{ 'enum': 'BlockdevDetectZeroesOptions',
+  'data': [ 'off', 'on', 'unmap' ] }
+
+##
 # @BlockdevAioOptions
 #
 # Selects the AIO backend to handle I/O requests
@@ -4301,20 +4320,22 @@ 
 # Options that are available for all block devices, independent of the block
 # driver.
 #
-# @driver:      block driver name
-# @id:          #optional id by which the new block device can be referred to.
-#               This is a required option on the top level of blockdev-add, and
-#               currently not allowed on any other level.
-# @node-name:   #optional the name of a block driver state node (Since 2.0)
-# @discard:     #optional discard-related options (default: ignore)
-# @cache:       #optional cache-related options
-# @aio:         #optional AIO backend (default: threads)
-# @rerror:      #optional how to handle read errors on the device
-#               (default: report)
-# @werror:      #optional how to handle write errors on the device
-#               (default: enospc)
-# @read-only:   #optional whether the block device should be read-only
-#               (default: false)
+# @driver:        block driver name
+# @id:            #optional id by which the new block device can be referred to.
+#                 This is a required option on the top level of blockdev-add, and
+#                 currently not allowed on any other level.
+# @node-name:     #optional the name of a block driver state node (Since 2.0)
+# @discard:       #optional discard-related options (default: ignore)
+# @cache:         #optional cache-related options
+# @aio:           #optional AIO backend (default: threads)
+# @rerror:        #optional how to handle read errors on the device
+#                 (default: report)
+# @werror:        #optional how to handle write errors on the device
+#                 (default: enospc)
+# @read-only:     #optional whether the block device should be read-only
+#                 (default: false)
+# @detect-zeroes: #optional detect and optimize zero writes (Since 2.1)
+#                 (default: off)
 #
 # Since: 1.7
 ##
@@ -4327,7 +4348,8 @@ 
             '*aio': 'BlockdevAioOptions',
             '*rerror': 'BlockdevOnError',
             '*werror': 'BlockdevOnError',
-            '*read-only': 'bool' } }
+            '*read-only': 'bool',
+            '*detect-zeroes': 'BlockdevDetectZeroesOptions' } }
 
 ##
 # @BlockdevOptionsFile
diff --git a/qemu-options.hx b/qemu-options.hx
index 781af14..5ee94ea 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -414,6 +414,7 @@  DEF("drive", HAS_ARG, QEMU_OPTION_drive,
     "       [,serial=s][,addr=A][,rerror=ignore|stop|report]\n"
     "       [,werror=ignore|stop|report|enospc][,id=name][,aio=threads|native]\n"
     "       [,readonly=on|off][,copy-on-read=on|off]\n"
+    "       [,detect-zeroes=on|off|unmap]\n"
     "       [[,bps=b]|[[,bps_rd=r][,bps_wr=w]]]\n"
     "       [[,iops=i]|[[,iops_rd=r][,iops_wr=w]]]\n"
     "       [[,bps_max=bm]|[[,bps_rd_max=rm][,bps_wr_max=wm]]]\n"
@@ -475,6 +476,11 @@  Open drive @option{file} as read-only. Guest write attempts will fail.
 @item copy-on-read=@var{copy-on-read}
 @var{copy-on-read} is "on" or "off" and enables whether to copy read backing
 file sectors into the image file.
+@item detect-zeroes=@var{detect-zeroes}
+@var{detect-zeroes} is "off", "on" or "unmap" and enables the automatic
+conversion of plain zero writes by the OS to driver specific optimized
+zero write commands. If "unmap" is chosen and @var{discard} is "on"
+a zero write may even be converted to an UNMAP operation.
 @end table
 
 By default, the @option{cache=writeback} mode is used. It will report data
diff --git a/qmp-commands.hx b/qmp-commands.hx
index ed3ab92..a535955 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -2032,6 +2032,8 @@  Each json-object contain the following:
          - "iops_rd_max":  read I/O operations max (json-int)
          - "iops_wr_max":  write I/O operations max (json-int)
          - "iops_size": I/O size when limiting by iops (json-int)
+         - "detect_zeroes": detect and optimize zero writing (json-string)
+             - Possible values: "off", "on", "unmap"
          - "image": the detail of the image, it is a json-object containing
             the following:
              - "filename": image file name (json-string)
@@ -2108,6 +2110,7 @@  Example:
                "iops_rd_max": 0,
                "iops_wr_max": 0,
                "iops_size": 0,
+               "detect_zeroes": "on",
                "image":{
                   "filename":"disks/test.qcow2",
                   "format":"qcow2",
diff --git a/util/iov.c b/util/iov.c
index 6569b5a..f8c49a1 100644
--- a/util/iov.c
+++ b/util/iov.c
@@ -335,6 +335,27 @@  void qemu_iovec_concat(QEMUIOVector *dst,
     qemu_iovec_concat_iov(dst, src->iov, src->niov, soffset, sbytes);
 }
 
+/*
+ *  check if the contents of the iovecs are all zero
+ */
+bool qemu_iovec_is_zero(QEMUIOVector *qiov)
+{
+    int i;
+    for (i = 0; i < qiov->niov; i++) {
+        size_t offs = qiov->iov[i].iov_len & ~(4 * sizeof(long) - 1);
+        uint8_t *ptr = qiov->iov[i].iov_base;
+        if (offs && !buffer_is_zero(qiov->iov[i].iov_base, offs)) {
+            return false;
+        }
+        for (; offs < qiov->iov[i].iov_len; offs++) {
+            if (ptr[offs]) {
+                return false;
+            }
+        }
+    }
+    return true;
+}
+
 void qemu_iovec_destroy(QEMUIOVector *qiov)
 {
     assert(qiov->nalloc != -1);