
[PATCHv2] block: add native support for NFS

Message ID 1387271725-17060-1-git-send-email-pl@kamp.de
State New

Commit Message

Peter Lieven Dec. 17, 2013, 9:15 a.m. UTC
This patch adds native support for accessing images on NFS shares without
the requirement to actually mount the entire NFS share on the host.

NFS images can simply be specified by a URL of the form:
nfs://<host>/<export>/<filename>

For example:
qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2

You need libnfs from Ronnie Sahlberg available at:
   git://github.com/sahlberg/libnfs.git
for this to work.

During configure it is automatically probed for libnfs and support
is enabled on-the-fly. You can forbid or enforce libnfs support
with --disable-libnfs or --enable-libnfs respectively.

Due to NFS restrictions you might need to execute your binaries
as root, allow them to open privileged ports (<1024), or specify
the insecure option on the NFS server.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
v1->v2:
 - fixed block/Makefile.objs [Ronnie]
 - do not always register a read handler [Ronnie]
 - add support for reading beyond EOF [Fam]
 - fixed struct and parameter naming [Fam]
 - fixed overlong lines and whitespace errors [Fam]
 - return the return status from libnfs wherever possible [Fam]
 - added comment why we set allocated_file_size to -ENOTSUP after write [Fam]
 - avoid segfault when parsing filename [Fam]
 - remove unused close_bh from NFSClient [Fam]
 - avoid dividing and multiplying total_size by BDRV_SECTOR_SIZE in nfs_file_create [Fam]

 MAINTAINERS         |    5 +
 block/Makefile.objs |    1 +
 block/nfs.c         |  419 +++++++++++++++++++++++++++++++++++++++++++++++++++
 configure           |   38 +++++
 4 files changed, 463 insertions(+)
 create mode 100644 block/nfs.c

Comments

Stefan Hajnoczi Dec. 17, 2013, 4:47 p.m. UTC | #1
On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
> This patch adds native support for accessing images on NFS shares without
> the requirement to actually mount the entire NFS share on the host.
> 
> NFS images can simply be specified by a URL of the form:
> nfs://<host>/<export>/<filename>
> 
> For example:
> qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
> 
> You need libnfs from Ronnie Sahlberg available at:
>    git://github.com/sahlberg/libnfs.git
> for this to work.
> 
> During configure it is automatically probed for libnfs and support
> is enabled on-the-fly. You can forbid or enforce libnfs support
> with --disable-libnfs or --enable-libnfs respectively.
> 
> Due to NFS restrictions you might need to execute your binaries
> as root, allow them to open privileged ports (<1024), or specify
> the insecure option on the NFS server.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
> v1->v2:
>  - fixed block/Makefile.objs [Ronnie]
>  - do not always register a read handler [Ronnie]
>  - add support for reading beyond EOF [Fam]
>  - fixed struct and parameter naming [Fam]
>  - fixed overlong lines and whitespace errors [Fam]
>  - return the return status from libnfs wherever possible [Fam]
>  - added comment why we set allocated_file_size to -ENOTSUP after write [Fam]
>  - avoid segfault when parsing filename [Fam]
>  - remove unused close_bh from NFSClient [Fam]
>  - avoid dividing and multiplying total_size by BDRV_SECTOR_SIZE in nfs_file_create [Fam]
> 
>  MAINTAINERS         |    5 +
>  block/Makefile.objs |    1 +
>  block/nfs.c         |  419 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  configure           |   38 +++++
>  4 files changed, 463 insertions(+)
>  create mode 100644 block/nfs.c

Which NFS protocol versions are supported by current libnfs?

> +#include <poll.h>

Why is this header included?

> +typedef struct nfsclient {

Please either drop the struct tag or use "NFSClient".

> +static void
> +nfs_co_generic_cb(int status, struct nfs_context *nfs, void *data,
> +                  void *private_data)
> +{
> +    NFSTask *Task = private_data;

lowercase "task" local variable name please.

> +static int coroutine_fn nfs_co_writev(BlockDriverState *bs,
> +                                        int64_t sector_num, int nb_sectors,
> +                                        QEMUIOVector *iov)
> +{
> +    NFSClient *client = bs->opaque;
> +    NFSTask task;
> +    char *buf = NULL;
> +
> +    nfs_co_init_task(client, &task);
> +
> +    buf = g_malloc(nb_sectors * BDRV_SECTOR_SIZE);
> +    qemu_iovec_to_buf(iov, 0, buf, nb_sectors * BDRV_SECTOR_SIZE);
> +
> +    if (nfs_pwrite_async(client->context, client->fh,
> +                         sector_num * BDRV_SECTOR_SIZE,
> +                         nb_sectors * BDRV_SECTOR_SIZE,
> +                         buf, nfs_co_generic_cb, &task) != 0) {
> +        g_free(buf);
> +        return -EIO;

Can we get a more detailed errno here?  (e.g. ENOSPC)

> +    }
> +
> +    while (!task.complete) {
> +        nfs_set_events(client);
> +        qemu_coroutine_yield();
> +    }
> +
> +    g_free(buf);
> +
> +    if (task.status != nb_sectors * BDRV_SECTOR_SIZE) {
> +        return task.status < 0 ? task.status : -EIO;
> +    }
> +
> +    bs->total_sectors = MAX(bs->total_sectors, sector_num + nb_sectors);

Why is this necessary?  block.c will update bs->total_sectors if the
file is growable.

> +    /* set to -ENOTSUP since bdrv_allocated_file_size is only used
> +     * in qemu-img open. So we can use the cached value for allocate
> +     * filesize obtained from fstat at open time */
> +    client->allocated_file_size = -ENOTSUP;

Can you implement this fully?  By stubbing it out like this we won't be
able to call get_allocated_file_size() at runtime in the future without
updating the nfs block driver code.  It's just an fstat call, shouldn't
be too hard to implement properly :).
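
For reference, a runtime version could look roughly like this (untested
sketch; it assumes libnfs' nfs_fstat_async(), and that NFSTask grows a
"struct stat *st" member which nfs_co_generic_cb() fills from the
callback's data pointer on success):

    static int64_t nfs_get_allocated_file_size(BlockDriverState *bs)
    {
        NFSClient *client = bs->opaque;
        NFSTask task = {0};
        struct stat st;

        task.st = &st;   /* assumed new member, copied in the generic cb */
        if (nfs_fstat_async(client->context, client->fh, nfs_co_generic_cb,
                            &task) != 0) {
            return -ENOMEM;
        }

        /* not a coroutine_fn, so poll the fd until the RPC completes;
         * task.co stays NULL, so the callback does not schedule a BH */
        while (!task.complete) {
            nfs_set_events(client);
            qemu_aio_wait();
        }

        return task.status < 0 ? task.status : st.st_blocks * st.st_blksize;
    }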

> +    if (client->context == NULL) {
> +        error_setg(errp, "Failed to init NFS context");
> +        ret = -EINVAL;
> +        goto fail;
> +    }
> +
> +    if (strlen(filename) <= 6) {
> +        error_setg(errp, "Invalid server in URL");
> +        ret = -EINVAL;
> +        goto fail;
> +    }
> +
> +    server = g_strdup(filename + 6);
> +    if (server[0] == '/' || server[0] == '\0') {
> +        error_setg(errp, "Invalid server in URL");
> +        ret = -EINVAL;
> +        goto fail;
> +    }
> +    strp = strchr(server, '/');
> +    if (strp == NULL) {
> +        error_setg(errp, "Invalid URL specified.\n");
> +        ret = -EINVAL;
> +        goto fail;
> +    }
> +    path = g_strdup(strp);
> +    *strp = 0;
> +    strp = strrchr(path, '/');
> +    if (strp == NULL) {
> +        error_setg(errp, "Invalid URL specified.\n");
> +        ret = -EINVAL;
> +        goto fail;
> +    }
> +    file = g_strdup(strp);
> +    *strp = 0;

Can you use util/uri.c to avoid the string manipulation?  It can extract
the different parts for validation: scheme, server, port, and path.
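
For illustration, a sketch of how that could look (untested; the helper
name nfs_parse_filename() is made up, and it assumes the scheme/server/path
fields of the URI struct in util/uri.c):

    #include "qemu/uri.h"

    static int nfs_parse_filename(const char *filename, char **server,
                                  char **path, char **file, Error **errp)
    {
        URI *uri = uri_parse(filename);
        char *strp;

        if (!uri) {
            error_setg(errp, "Invalid URL specified");
            return -EINVAL;
        }
        if (!uri->scheme || strcmp(uri->scheme, "nfs") != 0 ||
            !uri->server || !uri->path || uri->path[0] != '/' ||
            !strrchr(uri->path + 1, '/')) {
            error_setg(errp, "Invalid URL specified");
            uri_free(uri);
            return -EINVAL;
        }

        *server = g_strdup(uri->server);
        *path = g_strdup(uri->path);
        strp = strrchr(*path, '/');
        *file = g_strdup(strp);   /* "/filename", as in the current code */
        *strp = 0;                /* *path keeps only the export */

        uri_free(uri);
        return 0;
    }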

> +static int nfs_file_create(const char *filename, QEMUOptionParameter *options,
> +                         Error **errp)
> +{
> +    int ret = 0;
> +    int64_t total_size = 0;
> +    BlockDriverState *bs;
> +    NFSClient *client = NULL;
> +    QDict *bs_options;
> +
> +    bs = bdrv_new("");

This approach seems a little risky to me.  bs is a fake
BlockDriverState with many fields not initialized.  bdrv_*() calls would
crash.

How about a static nfs_open() function that operates on NFSClient and is
shared by nfs_file_create() and nfs_file_open()?
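
Roughly along these lines, for example (sketch only; nfs_client_open() and
nfs_client_close() are hypothetical helpers that would take over the URL
parsing, nfs_mount(), nfs_open()/nfs_creat() and nfs_fstat() logic from
nfs_file_open_common(), operating only on NFSClient and returning the file
size or a negative errno):

    static int64_t nfs_client_open(NFSClient *client, const char *filename,
                                   int flags, Error **errp);
    static void nfs_client_close(NFSClient *client);

    static int nfs_file_create(const char *url, QEMUOptionParameter *options,
                               Error **errp)
    {
        NFSClient *client = g_malloc0(sizeof(NFSClient));
        int64_t total_size = 0;
        int64_t ret;

        /* read out options */
        while (options && options->name) {
            if (!strcmp(options->name, "size")) {
                total_size = options->value.n;
            }
            options++;
        }

        ret = nfs_client_open(client, url, O_CREAT, errp);
        if (ret < 0) {
            goto out;
        }
        ret = nfs_ftruncate(client->context, client->fh, total_size);
    out:
        nfs_client_close(client);
        g_free(client);
        return ret;
    }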

> +##########################################
> +# Do we have libnfs
> +if test "$libnfs" != "no" ; then
> +  cat > $TMPC << EOF
> +#include <nfsc/libnfs-zdr.h>
> +#include <nfsc/libnfs.h>
> +#include <nfsc/libnfs-raw.h>
> +#include <nfsc/libnfs-raw-mount.h>
> +int main(void) {
> +    nfs_init_context();
> +    nfs_pread_async(0,0,0,0,0,0);
> +    nfs_pwrite_async(0,0,0,0,0,0,0);
> +    nfs_fstat(0,0,0);
> +    return 0;
> +    }
> +EOF
> +  if compile_prog "-Werror" "-lnfs" ; then
> +    libnfs="yes"
> +    LIBS="$LIBS -lnfs"

pkg-config is usually better than hardcoding names.  Is pkg-config
available for libnfs?
ronnie sahlberg Dec. 17, 2013, 4:53 p.m. UTC | #2
NFSTask

Task is a very SCSI-ish term. Maybe RPC is better?

NFSrpc ?



On Tue, Dec 17, 2013 at 1:15 AM, Peter Lieven <pl@kamp.de> wrote:
> This patch adds native support for accessing images on NFS shares without
> the requirement to actually mount the entire NFS share on the host.
>
> NFS images can simply be specified by a URL of the form:
> nfs://<host>/<export>/<filename>
>
> For example:
> qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
>
> You need libnfs from Ronnie Sahlberg available at:
>    git://github.com/sahlberg/libnfs.git
> for this to work.
>
> During configure it is automatically probed for libnfs and support
> is enabled on-the-fly. You can forbid or enforce libnfs support
> with --disable-libnfs or --enable-libnfs respectively.
>
> Due to NFS restrictions you might need to execute your binaries
> as root, allow them to open privileged ports (<1024), or specify
> the insecure option on the NFS server.
>
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
> v1->v2:
>  - fixed block/Makefile.objs [Ronnie]
>  - do not always register a read handler [Ronnie]
>  - add support for reading beyond EOF [Fam]
>  - fixed struct and parameter naming [Fam]
>  - fixed overlong lines and whitespace errors [Fam]
>  - return the return status from libnfs wherever possible [Fam]
>  - added comment why we set allocated_file_size to -ENOTSUP after write [Fam]
>  - avoid segfault when parsing filename [Fam]
>  - remove unused close_bh from NFSClient [Fam]
>  - avoid dividing and multiplying total_size by BDRV_SECTOR_SIZE in nfs_file_create [Fam]
>
>  MAINTAINERS         |    5 +
>  block/Makefile.objs |    1 +
>  block/nfs.c         |  419 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  configure           |   38 +++++
>  4 files changed, 463 insertions(+)
>  create mode 100644 block/nfs.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index c19133f..f53d184 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -899,6 +899,11 @@ M: Peter Lieven <pl@kamp.de>
>  S: Supported
>  F: block/iscsi.c
>
> +NFS
> +M: Peter Lieven <pl@kamp.de>
> +S: Maintained
> +F: block/nfs.c
> +
>  SSH
>  M: Richard W.M. Jones <rjones@redhat.com>
>  S: Supported
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index f43ecbc..aa8eaf9 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -12,6 +12,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
>  ifeq ($(CONFIG_POSIX),y)
>  block-obj-y += nbd.o sheepdog.o
>  block-obj-$(CONFIG_LIBISCSI) += iscsi.o
> +block-obj-$(CONFIG_LIBNFS) += nfs.o
>  block-obj-$(CONFIG_CURL) += curl.o
>  block-obj-$(CONFIG_RBD) += rbd.o
>  block-obj-$(CONFIG_GLUSTERFS) += gluster.o
> diff --git a/block/nfs.c b/block/nfs.c
> new file mode 100644
> index 0000000..006b8cc
> --- /dev/null
> +++ b/block/nfs.c
> @@ -0,0 +1,419 @@
> +/*
> + * QEMU Block driver for native access to files on NFS shares
> + *
> + * Copyright (c) 2013 Peter Lieven <pl@kamp.de>
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + */
> +
> +#include "config-host.h"
> +
> +#include <poll.h>
> +#include "qemu-common.h"
> +#include "qemu/config-file.h"
> +#include "qemu/error-report.h"
> +#include "block/block_int.h"
> +#include "trace.h"
> +#include "qemu/iov.h"
> +#include "sysemu/sysemu.h"
> +
> +#include <nfsc/libnfs-zdr.h>
> +#include <nfsc/libnfs.h>
> +#include <nfsc/libnfs-raw.h>
> +#include <nfsc/libnfs-raw-mount.h>
> +
> +typedef struct nfsclient {
> +    struct nfs_context *context;
> +    struct nfsfh *fh;
> +    int events;
> +    bool has_zero_init;
> +    int64_t allocated_file_size;
> +} NFSClient;
> +
> +typedef struct NFSTask {
> +    int status;
> +    int complete;
> +    QEMUIOVector *iov;
> +    Coroutine *co;
> +    QEMUBH *bh;
> +} NFSTask;
> +
> +static void nfs_process_read(void *arg);
> +static void nfs_process_write(void *arg);
> +
> +static void nfs_set_events(NFSClient *client)
> +{
> +    int ev = nfs_which_events(client->context);
> +    if (ev != client->events) {
> +        qemu_aio_set_fd_handler(nfs_get_fd(client->context),
> +                      (ev & POLLIN) ? nfs_process_read : NULL,
> +                      (ev & POLLOUT) ? nfs_process_write : NULL,
> +                      client);
> +
> +    }
> +    client->events = ev;
> +}
> +
> +static void nfs_process_read(void *arg)
> +{
> +    NFSClient *client = arg;
> +    nfs_service(client->context, POLLIN);
> +    nfs_set_events(client);
> +}
> +
> +static void nfs_process_write(void *arg)
> +{
> +    NFSClient *client = arg;
> +    nfs_service(client->context, POLLOUT);
> +    nfs_set_events(client);
> +}
> +
> +static void nfs_co_init_task(NFSClient *client, NFSTask *Task)
> +{
> +    *Task = (NFSTask) {
> +        .co         = qemu_coroutine_self(),
> +    };
> +}
> +
> +static void nfs_co_generic_bh_cb(void *opaque)
> +{
> +    NFSTask *Task = opaque;
> +    qemu_bh_delete(Task->bh);
> +    qemu_coroutine_enter(Task->co, NULL);
> +}
> +
> +static void
> +nfs_co_generic_cb(int status, struct nfs_context *nfs, void *data,
> +                  void *private_data)
> +{
> +    NFSTask *Task = private_data;
> +    Task->complete = 1;
> +    Task->status = status;
> +    if (Task->status > 0 && Task->iov) {
> +        if (Task->status <= Task->iov->size) {
> +            qemu_iovec_from_buf(Task->iov, 0, data, Task->status);
> +        } else {
> +            Task->status = -EIO;
> +        }
> +    }
> +    if (Task->co) {
> +        Task->bh = qemu_bh_new(nfs_co_generic_bh_cb, Task);
> +        qemu_bh_schedule(Task->bh);
> +    }
> +}
> +
> +static int coroutine_fn nfs_co_readv(BlockDriverState *bs,
> +                                     int64_t sector_num, int nb_sectors,
> +                                     QEMUIOVector *iov)
> +{
> +    NFSClient *client = bs->opaque;
> +    NFSTask task;
> +
> +    nfs_co_init_task(client, &task);
> +    task.iov = iov;
> +
> +    if (nfs_pread_async(client->context, client->fh,
> +                        sector_num * BDRV_SECTOR_SIZE,
> +                        nb_sectors * BDRV_SECTOR_SIZE,
> +                        nfs_co_generic_cb, &task) != 0) {
> +        return -EIO;
> +    }
> +
> +    while (!task.complete) {
> +        nfs_set_events(client);
> +        qemu_coroutine_yield();
> +    }
> +
> +    if (task.status < 0) {
> +        return task.status;
> +    }
> +
> +    return 0;
> +}
> +
> +static int coroutine_fn nfs_co_writev(BlockDriverState *bs,
> +                                        int64_t sector_num, int nb_sectors,
> +                                        QEMUIOVector *iov)
> +{
> +    NFSClient *client = bs->opaque;
> +    NFSTask task;
> +    char *buf = NULL;
> +
> +    nfs_co_init_task(client, &task);
> +
> +    buf = g_malloc(nb_sectors * BDRV_SECTOR_SIZE);
> +    qemu_iovec_to_buf(iov, 0, buf, nb_sectors * BDRV_SECTOR_SIZE);
> +
> +    if (nfs_pwrite_async(client->context, client->fh,
> +                         sector_num * BDRV_SECTOR_SIZE,
> +                         nb_sectors * BDRV_SECTOR_SIZE,
> +                         buf, nfs_co_generic_cb, &task) != 0) {
> +        g_free(buf);
> +        return -EIO;
> +    }
> +
> +    while (!task.complete) {
> +        nfs_set_events(client);
> +        qemu_coroutine_yield();
> +    }
> +
> +    g_free(buf);
> +
> +    if (task.status != nb_sectors * BDRV_SECTOR_SIZE) {
> +        return task.status < 0 ? task.status : -EIO;
> +    }
> +
> +    bs->total_sectors = MAX(bs->total_sectors, sector_num + nb_sectors);
> +    /* set to -ENOTSUP since bdrv_allocated_file_size is only used
> +     * in qemu-img open. So we can use the cached value for allocate
> +     * filesize obtained from fstat at open time */
> +    client->allocated_file_size = -ENOTSUP;
> +    return 0;
> +}
> +
> +static int coroutine_fn nfs_co_flush(BlockDriverState *bs)
> +{
> +    NFSClient *client = bs->opaque;
> +    NFSTask task;
> +
> +    nfs_co_init_task(client, &task);
> +
> +    if (nfs_fsync_async(client->context, client->fh, nfs_co_generic_cb,
> +                        &task) != 0) {
> +        return -EIO;
> +    }
> +
> +    while (!task.complete) {
> +        nfs_set_events(client);
> +        qemu_coroutine_yield();
> +    }
> +
> +    return task.status;
> +}
> +
> +static QemuOptsList runtime_opts = {
> +    .name = "nfs",
> +    .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
> +    .desc = {
> +        {
> +            .name = "filename",
> +            .type = QEMU_OPT_STRING,
> +            .help = "URL to the NFS file",
> +        },
> +        { /* end of list */ }
> +    },
> +};
> +
> +static void nfs_file_close(BlockDriverState *bs)
> +{
> +    NFSClient *client = bs->opaque;
> +    if (client->context) {
> +        if (client->fh) {
> +            nfs_close(client->context, client->fh);
> +        }
> +        qemu_aio_set_fd_handler(nfs_get_fd(client->context), NULL, NULL, NULL);
> +        nfs_destroy_context(client->context);
> +    }
> +    memset(client, 0, sizeof(NFSClient));
> +}
> +
> +
> +static int nfs_file_open_common(BlockDriverState *bs, QDict *options, int flags,
> +                         int open_flags, Error **errp)
> +{
> +    NFSClient *client = bs->opaque;
> +    const char *filename;
> +    int ret = 0;
> +    QemuOpts *opts;
> +    Error *local_err = NULL;
> +    char *server = NULL, *path = NULL, *file = NULL, *strp;
> +    struct stat st;
> +
> +    opts = qemu_opts_create_nofail(&runtime_opts);
> +    qemu_opts_absorb_qdict(opts, options, &local_err);
> +    if (error_is_set(&local_err)) {
> +        qerror_report_err(local_err);
> +        error_free(local_err);
> +        ret = -EINVAL;
> +        goto fail;
> +    }
> +
> +    filename = qemu_opt_get(opts, "filename");
> +
> +    client->context = nfs_init_context();
> +
> +    if (client->context == NULL) {
> +        error_setg(errp, "Failed to init NFS context");
> +        ret = -EINVAL;
> +        goto fail;
> +    }
> +
> +    if (strlen(filename) <= 6) {
> +        error_setg(errp, "Invalid server in URL");
> +        ret = -EINVAL;
> +        goto fail;
> +    }
> +
> +    server = g_strdup(filename + 6);
> +    if (server[0] == '/' || server[0] == '\0') {
> +        error_setg(errp, "Invalid server in URL");
> +        ret = -EINVAL;
> +        goto fail;
> +    }
> +    strp = strchr(server, '/');
> +    if (strp == NULL) {
> +        error_setg(errp, "Invalid URL specified.\n");
> +        ret = -EINVAL;
> +        goto fail;
> +    }
> +    path = g_strdup(strp);
> +    *strp = 0;
> +    strp = strrchr(path, '/');
> +    if (strp == NULL) {
> +        error_setg(errp, "Invalid URL specified.\n");
> +        ret = -EINVAL;
> +        goto fail;
> +    }
> +    file = g_strdup(strp);
> +    *strp = 0;
> +
> +    ret = nfs_mount(client->context, server, path);
> +    if (ret < 0) {
> +        error_setg(errp, "Failed to mount nfs share: %s",
> +                    nfs_get_error(client->context));
> +        goto fail;
> +    }
> +
> +    if (open_flags & O_CREAT) {
> +        ret = nfs_creat(client->context, file, 0600, &client->fh);
> +        if (ret < 0) {
> +            error_setg(errp, "Failed to create file: %s",
> +                             nfs_get_error(client->context));
> +            goto fail;
> +        }
> +    } else {
> +        open_flags = (flags & BDRV_O_RDWR) ? O_RDWR : O_RDONLY;
> +        ret = nfs_open(client->context, file, open_flags, &client->fh);
> +        if (ret < 0) {
> +            error_setg(errp, "Failed to open file : %s",
> +                       nfs_get_error(client->context));
> +            goto fail;
> +        }
> +    }
> +
> +    ret = nfs_fstat(client->context, client->fh, &st);
> +    if (ret < 0) {
> +        error_setg(errp, "Failed to fstat file: %s",
> +                   nfs_get_error(client->context));
> +        goto fail;
> +    }
> +
> +    /* TODO allow reading beyond EOF */
> +    bs->total_sectors = DIV_ROUND_UP(st.st_size, BDRV_SECTOR_SIZE);
> +    client->has_zero_init = S_ISREG(st.st_mode);
> +    client->allocated_file_size = st.st_blocks * st.st_blksize;
> +    goto out;
> +fail:
> +    nfs_file_close(bs);
> +out:
> +    g_free(server);
> +    g_free(path);
> +    g_free(file);
> +    return ret;
> +}
> +
> +static int nfs_file_open(BlockDriverState *bs, QDict *options, int flags,
> +                         Error **errp) {
> +    return nfs_file_open_common(bs, options, flags, 0, errp);
> +}
> +
> +static int nfs_file_create(const char *filename, QEMUOptionParameter *options,
> +                         Error **errp)
> +{
> +    int ret = 0;
> +    int64_t total_size = 0;
> +    BlockDriverState *bs;
> +    NFSClient *client = NULL;
> +    QDict *bs_options;
> +
> +    bs = bdrv_new("");
> +
> +    /* Read out options */
> +    while (options && options->name) {
> +        if (!strcmp(options->name, "size")) {
> +            total_size = options->value.n;
> +        }
> +        options++;
> +    }
> +
> +    bs->opaque = g_malloc0(sizeof(NFSClient));
> +    client = bs->opaque;
> +
> +    bs_options = qdict_new();
> +    qdict_put(bs_options, "filename", qstring_from_str(filename));
> +    ret = nfs_file_open_common(bs, bs_options, 0, O_CREAT, NULL);
> +    QDECREF(bs_options);
> +    if (ret < 0) {
> +        goto out;
> +    }
> +    ret = nfs_ftruncate(client->context, client->fh, total_size);
> +out:
> +    nfs_file_close(bs);
> +    g_free(bs->opaque);
> +    bs->opaque = NULL;
> +    bdrv_unref(bs);
> +    return ret;
> +}
> +
> +static int nfs_has_zero_init(BlockDriverState *bs)
> +{
> +    NFSClient *client = bs->opaque;
> +    return client->has_zero_init;
> +}
> +
> +static int64_t nfs_get_allocated_file_size(BlockDriverState *bs)
> +{
> +    NFSClient *client = bs->opaque;
> +    return client->allocated_file_size;
> +}
> +
> +static BlockDriver bdrv_nfs = {
> +    .format_name     = "nfs",
> +    .protocol_name   = "nfs",
> +
> +    .instance_size   = sizeof(NFSClient),
> +    .bdrv_needs_filename = true,
> +    .bdrv_has_zero_init = nfs_has_zero_init,
> +    .bdrv_get_allocated_file_size = nfs_get_allocated_file_size,
> +
> +    .bdrv_file_open  = nfs_file_open,
> +    .bdrv_close      = nfs_file_close,
> +    .bdrv_create     = nfs_file_create,
> +
> +    .bdrv_co_readv         = nfs_co_readv,
> +    .bdrv_co_writev        = nfs_co_writev,
> +    .bdrv_co_flush_to_disk = nfs_co_flush,
> +};
> +
> +static void nfs_block_init(void)
> +{
> +    bdrv_register(&bdrv_nfs);
> +}
> +
> +block_init(nfs_block_init);
> diff --git a/configure b/configure
> index 8144d9f..6c9e47a 100755
> --- a/configure
> +++ b/configure
> @@ -250,6 +250,7 @@ vss_win32_sdk=""
>  win_sdk="no"
>  want_tools="yes"
>  libiscsi=""
> +libnfs=""
>  coroutine=""
>  coroutine_pool=""
>  seccomp=""
> @@ -833,6 +834,10 @@ for opt do
>    ;;
>    --enable-libiscsi) libiscsi="yes"
>    ;;
> +  --disable-libnfs) libnfs="no"
> +  ;;
> +  --enable-libnfs) libnfs="yes"
> +  ;;
>    --enable-profiler) profiler="yes"
>    ;;
>    --disable-cocoa) cocoa="no"
> @@ -1202,6 +1207,8 @@ echo "  --enable-spice           enable spice"
>  echo "  --enable-rbd             enable building the rados block device (rbd)"
>  echo "  --disable-libiscsi       disable iscsi support"
>  echo "  --enable-libiscsi        enable iscsi support"
> +echo "  --disable-libnfs         disable nfs support"
> +echo "  --enable-libnfs          enable nfs support"
>  echo "  --disable-smartcard-nss  disable smartcard nss support"
>  echo "  --enable-smartcard-nss   enable smartcard nss support"
>  echo "  --disable-libusb         disable libusb (for usb passthrough)"
> @@ -3050,6 +3057,32 @@ EOF
>    fi
>  fi
>
> +##########################################
> +# Do we have libnfs
> +if test "$libnfs" != "no" ; then
> +  cat > $TMPC << EOF
> +#include <nfsc/libnfs-zdr.h>
> +#include <nfsc/libnfs.h>
> +#include <nfsc/libnfs-raw.h>
> +#include <nfsc/libnfs-raw-mount.h>
> +int main(void) {
> +    nfs_init_context();
> +    nfs_pread_async(0,0,0,0,0,0);
> +    nfs_pwrite_async(0,0,0,0,0,0,0);
> +    nfs_fstat(0,0,0);
> +    return 0;
> +    }
> +EOF
> +  if compile_prog "-Werror" "-lnfs" ; then
> +    libnfs="yes"
> +    LIBS="$LIBS -lnfs"
> +  else
> +    if test "$libnfs" = "yes" ; then
> +      feature_not_found "libnfs"
> +    fi
> +    libnfs="no"
> +  fi
> +fi
>
>  ##########################################
>  # Do we need libm
> @@ -3777,6 +3810,7 @@ echo "libusb            $libusb"
>  echo "usb net redir     $usb_redir"
>  echo "GLX support       $glx"
>  echo "libiscsi support  $libiscsi"
> +echo "libnfs support    $libnfs"
>  echo "build guest agent $guest_agent"
>  echo "QGA VSS support   $guest_agent_with_vss"
>  echo "seccomp support   $seccomp"
> @@ -4107,6 +4141,10 @@ if test "$libiscsi" = "yes" ; then
>    echo "CONFIG_LIBISCSI=y" >> $config_host_mak
>  fi
>
> +if test "$libnfs" = "yes" ; then
> +  echo "CONFIG_LIBNFS=y" >> $config_host_mak
> +fi
> +
>  if test "$seccomp" = "yes"; then
>    echo "CONFIG_SECCOMP=y" >> $config_host_mak
>  fi
> --
> 1.7.9.5
>
>
Peter Lieven Dec. 17, 2013, 5:03 p.m. UTC | #3
On 17.12.2013 17:47, Stefan Hajnoczi wrote:
> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>> This patch adds native support for accessing images on NFS shares without
>> the requirement to actually mount the entire NFS share on the host.
>>
>> NFS images can simply be specified by a URL of the form:
>> nfs://<host>/<export>/<filename>
>>
>> For example:
>> qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
>>
>> You need libnfs from Ronnie Sahlberg available at:
>>     git://github.com/sahlberg/libnfs.git
>> for this to work.
>>
>> During configure it is automatically probed for libnfs and support
>> is enabled on-the-fly. You can forbid or enforce libnfs support
>> with --disable-libnfs or --enable-libnfs respectively.
>>
>> Due to NFS restrictions you might need to execute your binaries
>> as root, allow them to open privileged ports (<1024), or specify
>> the insecure option on the NFS server.
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> v1->v2:
>>   - fixed block/Makefile.objs [Ronnie]
>>   - do not always register a read handler [Ronnie]
>>   - add support for reading beyond EOF [Fam]
>>   - fixed struct and parameter naming [Fam]
>>   - fixed overlong lines and whitespace errors [Fam]
>>   - return the return status from libnfs wherever possible [Fam]
>>   - added comment why we set allocated_file_size to -ENOTSUP after write [Fam]
>>   - avoid segfault when parsing filename [Fam]
>>   - remove unused close_bh from NFSClient [Fam]
>>   - avoid dividing and multiplying total_size by BDRV_SECTOR_SIZE in nfs_file_create [Fam]
>>
>>   MAINTAINERS         |    5 +
>>   block/Makefile.objs |    1 +
>>   block/nfs.c         |  419 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>   configure           |   38 +++++
>>   4 files changed, 463 insertions(+)
>>   create mode 100644 block/nfs.c
> Which NFS protocol versions are supported by current libnfs?
Will check that out. Ronnie?
>
>> +#include <poll.h>
> Why is this header included?
leftover.
>
>> +typedef struct nfsclient {
> Please either drop the struct tag or use "NFSClient".
ok
>
>> +static void
>> +nfs_co_generic_cb(int status, struct nfs_context *nfs, void *data,
>> +                  void *private_data)
>> +{
>> +    NFSTask *Task = private_data;
> lowercase "task" local variable name please.
ok
>
>> +static int coroutine_fn nfs_co_writev(BlockDriverState *bs,
>> +                                        int64_t sector_num, int nb_sectors,
>> +                                        QEMUIOVector *iov)
>> +{
>> +    NFSClient *client = bs->opaque;
>> +    NFSTask task;
>> +    char *buf = NULL;
>> +
>> +    nfs_co_init_task(client, &task);
>> +
>> +    buf = g_malloc(nb_sectors * BDRV_SECTOR_SIZE);
>> +    qemu_iovec_to_buf(iov, 0, buf, nb_sectors * BDRV_SECTOR_SIZE);
>> +
>> +    if (nfs_pwrite_async(client->context, client->fh,
>> +                         sector_num * BDRV_SECTOR_SIZE,
>> +                         nb_sectors * BDRV_SECTOR_SIZE,
>> +                         buf, nfs_co_generic_cb, &task) != 0) {
>> +        g_free(buf);
>> +        return -EIO;
> Can we get a more detailed errno here?  (e.g. ENOSPC)
libnfs only returns 0 or -1 if the setup of the call
fails. the status code from the RPC is more detailed
and available in task.status.
>
>> +    }
>> +
>> +    while (!task.complete) {
>> +        nfs_set_events(client);
>> +        qemu_coroutine_yield();
>> +    }
>> +
>> +    g_free(buf);
>> +
>> +    if (task.status != nb_sectors * BDRV_SECTOR_SIZE) {
>> +        return task.status < 0 ? task.status : -EIO;
>> +    }
>> +
>> +    bs->total_sectors = MAX(bs->total_sectors, sector_num + nb_sectors);
> Why is this necessary?  block.c will update bs->total_sectors if the
> file is growable.
Ok, didn't know ;-)
>
>> +    /* set to -ENOTSUP since bdrv_allocated_file_size is only used
>> +     * in qemu-img open. So we can use the cached value for allocate
>> +     * filesize obtained from fstat at open time */
>> +    client->allocated_file_size = -ENOTSUP;
> Can you implement this fully?  By stubbing it out like this we won't be
> able to call get_allocated_file_size() at runtime in the future without
> updating the nfs block driver code.  It's just an fstat call, shouldn't
> be too hard to implement properly :).
Will do. I will also add bdrv_truncate as it's needed for VMDK create.
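
That should just be a thin wrapper around the synchronous nfs_ftruncate()
already used in nfs_file_create(), e.g. (untested):

    static int nfs_file_truncate(BlockDriverState *bs, int64_t offset)
    {
        NFSClient *client = bs->opaque;
        return nfs_ftruncate(client->context, client->fh, offset);
    }

plus a ".bdrv_truncate = nfs_file_truncate," entry in bdrv_nfs.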
>
>> +    if (client->context == NULL) {
>> +        error_setg(errp, "Failed to init NFS context");
>> +        ret = -EINVAL;
>> +        goto fail;
>> +    }
>> +
>> +    if (strlen(filename) <= 6) {
>> +        error_setg(errp, "Invalid server in URL");
>> +        ret = -EINVAL;
>> +        goto fail;
>> +    }
>> +
>> +    server = g_strdup(filename + 6);
>> +    if (server[0] == '/' || server[0] == '\0') {
>> +        error_setg(errp, "Invalid server in URL");
>> +        ret = -EINVAL;
>> +        goto fail;
>> +    }
>> +    strp = strchr(server, '/');
>> +    if (strp == NULL) {
>> +        error_setg(errp, "Invalid URL specified.\n");
>> +        ret = -EINVAL;
>> +        goto fail;
>> +    }
>> +    path = g_strdup(strp);
>> +    *strp = 0;
>> +    strp = strrchr(path, '/');
>> +    if (strp == NULL) {
>> +        error_setg(errp, "Invalid URL specified.\n");
>> +        ret = -EINVAL;
>> +        goto fail;
>> +    }
>> +    file = g_strdup(strp);
>> +    *strp = 0;
> Can you use util/uri.c to avoid the string manipulation?  It can extract
> the different parts for validation: scheme, server, port, and path.
Thanks for the pointer. This is actually the parsing code from
libnfs examples.
>
>> +static int nfs_file_create(const char *filename, QEMUOptionParameter *options,
>> +                         Error **errp)
>> +{
>> +    int ret = 0;
>> +    int64_t total_size = 0;
>> +    BlockDriverState *bs;
>> +    NFSClient *client = NULL;
>> +    QDict *bs_options;
>> +
>> +    bs = bdrv_new("");
> This approach seems a little risky to me.  bs is a fake
> BlockDriverState with many fields not initialized.  bdrv_*() calls would
> crash.
>
> How about a static nfs_open() function that operates on NFSClient and is
> shared by nfs_file_create() and nfs_file_open()?
Good idea, will look into this.
>
>> +##########################################
>> +# Do we have libnfs
>> +if test "$libnfs" != "no" ; then
>> +  cat > $TMPC << EOF
>> +#include <nfsc/libnfs-zdr.h>
>> +#include <nfsc/libnfs.h>
>> +#include <nfsc/libnfs-raw.h>
>> +#include <nfsc/libnfs-raw-mount.h>
>> +int main(void) {
>> +    nfs_init_context();
>> +    nfs_pread_async(0,0,0,0,0,0);
>> +    nfs_pwrite_async(0,0,0,0,0,0,0);
>> +    nfs_fstat(0,0,0);
>> +    return 0;
>> +    }
>> +EOF
>> +  if compile_prog "-Werror" "-lnfs" ; then
>> +    libnfs="yes"
>> +    LIBS="$LIBS -lnfs"
> pkg-config is usually better than hardcoding names.  Is pkg-config
> available for libnfs?
It is, but until today it was buggy. We can use it from libnfs 1.9.0
onwards.

Thanks for reviewing.

Peter
ronnie sahlberg Dec. 17, 2013, 5:13 p.m. UTC | #4
On Tue, Dec 17, 2013 at 9:03 AM, Peter Lieven <pl@kamp.de> wrote:
> On 17.12.2013 17:47, Stefan Hajnoczi wrote:
...
>> Which NFS protocol versions are supported by current libnfs?
>
> Will check that out. Ronnie?
>

It uses NFS v3 only.
ronnie sahlberg Dec. 17, 2013, 5:28 p.m. UTC | #5
On Tue, Dec 17, 2013 at 9:03 AM, Peter Lieven <pl@kamp.de> wrote:
> On 17.12.2013 17:47, Stefan Hajnoczi wrote:
>>
>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:

...
>>> +    if (nfs_pwrite_async(client->context, client->fh,
>>> +                         sector_num * BDRV_SECTOR_SIZE,
>>> +                         nb_sectors * BDRV_SECTOR_SIZE,
>>> +                         buf, nfs_co_generic_cb, &task) != 0) {
>>> +        g_free(buf);
>>> +        return -EIO;
>>
>> Can we get a more detailed errno here?  (e.g. ENOSPC)
>
> libnfs only returns 0 or -1 if the setup of the call
> fails. the status code from the RPC is more detailed
> and available in task.status.
>

The *_async() functions only allocate memory and marshall the
arguments to the buffer.
So barring marshalling bugs, they will only fail due to OOM.

So -ENOMEM is perhaps a better error for when *_async() returns an error.
That is really the only condition where these functions will fail.


If *_async() returns success you are guaranteed that
nfs_co_generic_cb() will be invoked,
and there you can inspect the status argument for a more detailed reason why.
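
So the setup error paths in the driver could simply become, e.g. (same
pattern for the read and fsync variants):

    if (nfs_pwrite_async(client->context, client->fh,
                         sector_num * BDRV_SECTOR_SIZE,
                         nb_sectors * BDRV_SECTOR_SIZE,
                         buf, nfs_co_generic_cb, &task) != 0) {
        g_free(buf);
        /* *_async() setup only fails on allocation/marshalling */
        return -ENOMEM;
    }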
Daniel P. Berrangé Dec. 17, 2013, 5:32 p.m. UTC | #6
On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
> This patch adds native support for accessing images on NFS shares without
> the requirement to actually mount the entire NFS share on the host.
> 
> NFS Images can simply be specified by an url of the form:
> nfs://<host>/<export>/<filename>
> 
> For example:
> qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2

Does it support other config tunables, eg specifying which
NFS version to use 2/3/4 ? If so will they be available as
URI parameters in the obvious manner ?

Daniel
Peter Lieven Dec. 17, 2013, 10:36 p.m. UTC | #7
> Am 17.12.2013 um 18:13 schrieb ronnie sahlberg <ronniesahlberg@gmail.com>:
> 
>> On Tue, Dec 17, 2013 at 9:03 AM, Peter Lieven <pl@kamp.de> wrote:
>> On 17.12.2013 17:47, Stefan Hajnoczi wrote:
> ...
>>> Which NFS protocol versions are supported by current libnfs?
>> 
>> Will check that out. Ronnie?
> 
> It uses NFS v3 only.

should we use nfs3:// for the urls then?
Eric Blake Dec. 17, 2013, 10:44 p.m. UTC | #8
On 12/17/2013 03:36 PM, Peter Lieven wrote:
> 
> 
>> Am 17.12.2013 um 18:13 schrieb ronnie sahlberg <ronniesahlberg@gmail.com>:
>>
>>> On Tue, Dec 17, 2013 at 9:03 AM, Peter Lieven <pl@kamp.de> wrote:
>>> On 17.12.2013 17:47, Stefan Hajnoczi wrote:
>> ...
>>>> Which NFS protocol versions are supported by current libnfs?
>>>
>>> Will check that out. Ronnie?
>>
>> It uses NFS v3 only.
> 
> should we use nfs3:// for the urls then?

Or maybe nfs://10.0.0.1/qemu-images/test.qcow2?protocol=3 (where
?protocol= is optional and defaults to 3, but could be expanded to
include 4 in the future, and that way the single nfs:// URI covers both
protocols).
ronnie sahlberg Dec. 17, 2013, 10:51 p.m. UTC | #9
On Tue, Dec 17, 2013 at 2:36 PM, Peter Lieven <pl@kamp.de> wrote:
>
>
>> Am 17.12.2013 um 18:13 schrieb ronnie sahlberg <ronniesahlberg@gmail.com>:
>>
>>> On Tue, Dec 17, 2013 at 9:03 AM, Peter Lieven <pl@kamp.de> wrote:
>>> On 17.12.2013 17:47, Stefan Hajnoczi wrote:
>> ...
>>>> Which NFS protocol versions are supported by current libnfs?
>>>
>>> Will check that out. Ronnie?
>>
>> It uses NFS v3 only.
>
> should we use nfs3:// for the urls then?

No, I think we should leave it as nfs://... so that we are compatible
with RFC 2224.

Once/if/when I add support for v2 and v4 we can force a protocol
version using ?version=2

Then
nfs://server/foo/bar   would be "use whatever versions the server offers"
but
nfs://server/foo/bar?version=2 would become "use version 2 only"
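
If that is added, the query part could be handled on the QEMU side with the
query string helpers in util/uri.c, roughly like this (illustrative only;
nfs_parse_version() is a made-up name and assumes the QueryParams API):

    static int nfs_parse_version(URI *uri, int default_version)
    {
        int version = default_version;
        int i;

        if (uri->query) {
            QueryParams *qp = query_params_parse(uri->query);

            for (i = 0; i < qp->n; i++) {
                if (!strcmp(qp->p[i].name, "version")) {
                    version = atoi(qp->p[i].value);
                }
            }
            query_params_free(qp);
        }
        return version;
    }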
Peter Lieven Dec. 17, 2013, 10:56 p.m. UTC | #10
> Am 17.12.2013 um 23:51 schrieb ronnie sahlberg <ronniesahlberg@gmail.com>:
> 
>> On Tue, Dec 17, 2013 at 2:36 PM, Peter Lieven <pl@kamp.de> wrote:
>> 
>> 
>>>> Am 17.12.2013 um 18:13 schrieb ronnie sahlberg <ronniesahlberg@gmail.com>:
>>>> 
>>>> On Tue, Dec 17, 2013 at 9:03 AM, Peter Lieven <pl@kamp.de> wrote:
>>>> On 17.12.2013 17:47, Stefan Hajnoczi wrote:
>>> ...
>>>>> Which NFS protocol versions are supported by current libnfs?
>>>> 
>>>> Will check that out. Ronnie?
>>> 
>>> It uses NFS v3 only.
>> 
>> should we use nfs3:// for the urls then?
> 
> No, I think we should leave it as nfs://... so that we are compatible
> with RFC 2224.
> 
> Once/if/when I add support for v2 and v4 we can force a protocol
> version using ?version=2
> 
> Then
> nfs://server/foo/bar   would be "use whatever versions the server offers"
> but
> nfs://server/foo/bar?version=2 would become "use version 2 only"

Then I would leave it as is and add a comment to the commit message that only v3 is supported atm.
Peter Lieven Dec. 17, 2013, 10:57 p.m. UTC | #11
> Am 17.12.2013 um 17:53 schrieb ronnie sahlberg <ronniesahlberg@gmail.com>:
> 
> NFSTask
> 
> Task is a very SCSI-ish term. Maybe RPC is better?
> 
> NFSrpc ?

Will change it in v3.

> 
> 
> 
>> On Tue, Dec 17, 2013 at 1:15 AM, Peter Lieven <pl@kamp.de> wrote:
>> This patch adds native support for accessing images on NFS shares without
>> the requirement to actually mount the entire NFS share on the host.
>> 
>> NFS images can simply be specified by a URL of the form:
>> nfs://<host>/<export>/<filename>
>> 
>> For example:
>> qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
>> 
>> You need libnfs from Ronnie Sahlberg available at:
>>   git://github.com/sahlberg/libnfs.git
>> for this to work.
>> 
>> During configure it is automatically probed for libnfs and support
>> is enabled on-the-fly. You can forbid or enforce libnfs support
>> with --disable-libnfs or --enable-libnfs respectively.
>> 
>> Due to NFS restrictions you might need to execute your binaries
>> as root, allow them to open privileged ports (<1024), or specify
>> the insecure option on the NFS server.
>> 
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> v1->v2:
>> - fixed block/Makefile.objs [Ronnie]
>> - do not always register a read handler [Ronnie]
>> - add support for reading beyond EOF [Fam]
>> - fixed struct and parameter naming [Fam]
>> - fixed overlong lines and whitespace errors [Fam]
>> - return the return status from libnfs wherever possible [Fam]
>> - added comment why we set allocated_file_size to -ENOTSUP after write [Fam]
>> - avoid segfault when parsing filename [Fam]
>> - remove unused close_bh from NFSClient [Fam]
>> - avoid dividing and multiplying total_size by BDRV_SECTOR_SIZE in nfs_file_create [Fam]
>> 
>> MAINTAINERS         |    5 +
>> block/Makefile.objs |    1 +
>> block/nfs.c         |  419 +++++++++++++++++++++++++++++++++++++++++++++++++++
>> configure           |   38 +++++
>> 4 files changed, 463 insertions(+)
>> create mode 100644 block/nfs.c
>> 
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index c19133f..f53d184 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -899,6 +899,11 @@ M: Peter Lieven <pl@kamp.de>
>> S: Supported
>> F: block/iscsi.c
>> 
>> +NFS
>> +M: Peter Lieven <pl@kamp.de>
>> +S: Maintained
>> +F: block/nfs.c
>> +
>> SSH
>> M: Richard W.M. Jones <rjones@redhat.com>
>> S: Supported
>> diff --git a/block/Makefile.objs b/block/Makefile.objs
>> index f43ecbc..aa8eaf9 100644
>> --- a/block/Makefile.objs
>> +++ b/block/Makefile.objs
>> @@ -12,6 +12,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
>> ifeq ($(CONFIG_POSIX),y)
>> block-obj-y += nbd.o sheepdog.o
>> block-obj-$(CONFIG_LIBISCSI) += iscsi.o
>> +block-obj-$(CONFIG_LIBNFS) += nfs.o
>> block-obj-$(CONFIG_CURL) += curl.o
>> block-obj-$(CONFIG_RBD) += rbd.o
>> block-obj-$(CONFIG_GLUSTERFS) += gluster.o
>> diff --git a/block/nfs.c b/block/nfs.c
>> new file mode 100644
>> index 0000000..006b8cc
>> --- /dev/null
>> +++ b/block/nfs.c
>> @@ -0,0 +1,419 @@
>> +/*
>> + * QEMU Block driver for native access to files on NFS shares
>> + *
>> + * Copyright (c) 2013 Peter Lieven <pl@kamp.de>
>> + *
>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
>> + * of this software and associated documentation files (the "Software"), to deal
>> + * in the Software without restriction, including without limitation the rights
>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
>> + * copies of the Software, and to permit persons to whom the Software is
>> + * furnished to do so, subject to the following conditions:
>> + *
>> + * The above copyright notice and this permission notice shall be included in
>> + * all copies or substantial portions of the Software.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>> + * THE SOFTWARE.
>> + */
>> +
>> +#include "config-host.h"
>> +
>> +#include <poll.h>
>> +#include "qemu-common.h"
>> +#include "qemu/config-file.h"
>> +#include "qemu/error-report.h"
>> +#include "block/block_int.h"
>> +#include "trace.h"
>> +#include "qemu/iov.h"
>> +#include "sysemu/sysemu.h"
>> +
>> +#include <nfsc/libnfs-zdr.h>
>> +#include <nfsc/libnfs.h>
>> +#include <nfsc/libnfs-raw.h>
>> +#include <nfsc/libnfs-raw-mount.h>
>> +
>> +typedef struct nfsclient {
>> +    struct nfs_context *context;
>> +    struct nfsfh *fh;
>> +    int events;
>> +    bool has_zero_init;
>> +    int64_t allocated_file_size;
>> +} NFSClient;
>> +
>> +typedef struct NFSTask {
>> +    int status;
>> +    int complete;
>> +    QEMUIOVector *iov;
>> +    Coroutine *co;
>> +    QEMUBH *bh;
>> +} NFSTask;
>> +
>> +static void nfs_process_read(void *arg);
>> +static void nfs_process_write(void *arg);
>> +
>> +static void nfs_set_events(NFSClient *client)
>> +{
>> +    int ev = nfs_which_events(client->context);
>> +    if (ev != client->events) {
>> +        qemu_aio_set_fd_handler(nfs_get_fd(client->context),
>> +                      (ev & POLLIN) ? nfs_process_read : NULL,
>> +                      (ev & POLLOUT) ? nfs_process_write : NULL,
>> +                      client);
>> +
>> +    }
>> +    client->events = ev;
>> +}
>> +
>> +static void nfs_process_read(void *arg)
>> +{
>> +    NFSClient *client = arg;
>> +    nfs_service(client->context, POLLIN);
>> +    nfs_set_events(client);
>> +}
>> +
>> +static void nfs_process_write(void *arg)
>> +{
>> +    NFSClient *client = arg;
>> +    nfs_service(client->context, POLLOUT);
>> +    nfs_set_events(client);
>> +}
>> +
>> +static void nfs_co_init_task(NFSClient *client, NFSTask *Task)
>> +{
>> +    *Task = (NFSTask) {
>> +        .co         = qemu_coroutine_self(),
>> +    };
>> +}
>> +
>> +static void nfs_co_generic_bh_cb(void *opaque)
>> +{
>> +    NFSTask *Task = opaque;
>> +    qemu_bh_delete(Task->bh);
>> +    qemu_coroutine_enter(Task->co, NULL);
>> +}
>> +
>> +static void
>> +nfs_co_generic_cb(int status, struct nfs_context *nfs, void *data,
>> +                  void *private_data)
>> +{
>> +    NFSTask *Task = private_data;
>> +    Task->complete = 1;
>> +    Task->status = status;
>> +    if (Task->status > 0 && Task->iov) {
>> +        if (Task->status <= Task->iov->size) {
>> +            qemu_iovec_from_buf(Task->iov, 0, data, Task->status);
>> +        } else {
>> +            Task->status = -EIO;
>> +        }
>> +    }
>> +    if (Task->co) {
>> +        Task->bh = qemu_bh_new(nfs_co_generic_bh_cb, Task);
>> +        qemu_bh_schedule(Task->bh);
>> +    }
>> +}
>> +
>> +static int coroutine_fn nfs_co_readv(BlockDriverState *bs,
>> +                                     int64_t sector_num, int nb_sectors,
>> +                                     QEMUIOVector *iov)
>> +{
>> +    NFSClient *client = bs->opaque;
>> +    NFSTask task;
>> +
>> +    nfs_co_init_task(client, &task);
>> +    task.iov = iov;
>> +
>> +    if (nfs_pread_async(client->context, client->fh,
>> +                        sector_num * BDRV_SECTOR_SIZE,
>> +                        nb_sectors * BDRV_SECTOR_SIZE,
>> +                        nfs_co_generic_cb, &task) != 0) {
>> +        return -EIO;
>> +    }
>> +
>> +    while (!task.complete) {
>> +        nfs_set_events(client);
>> +        qemu_coroutine_yield();
>> +    }
>> +
>> +    if (task.status < 0) {
>> +        return task.status;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static int coroutine_fn nfs_co_writev(BlockDriverState *bs,
>> +                                        int64_t sector_num, int nb_sectors,
>> +                                        QEMUIOVector *iov)
>> +{
>> +    NFSClient *client = bs->opaque;
>> +    NFSTask task;
>> +    char *buf = NULL;
>> +
>> +    nfs_co_init_task(client, &task);
>> +
>> +    buf = g_malloc(nb_sectors * BDRV_SECTOR_SIZE);
>> +    qemu_iovec_to_buf(iov, 0, buf, nb_sectors * BDRV_SECTOR_SIZE);
>> +
>> +    if (nfs_pwrite_async(client->context, client->fh,
>> +                         sector_num * BDRV_SECTOR_SIZE,
>> +                         nb_sectors * BDRV_SECTOR_SIZE,
>> +                         buf, nfs_co_generic_cb, &task) != 0) {
>> +        g_free(buf);
>> +        return -EIO;
>> +    }
>> +
>> +    while (!task.complete) {
>> +        nfs_set_events(client);
>> +        qemu_coroutine_yield();
>> +    }
>> +
>> +    g_free(buf);
>> +
>> +    if (task.status != nb_sectors * BDRV_SECTOR_SIZE) {
>> +        return task.status < 0 ? task.status : -EIO;
>> +    }
>> +
>> +    bs->total_sectors = MAX(bs->total_sectors, sector_num + nb_sectors);
>> +    /* set to -ENOTSUP since bdrv_allocated_file_size is only used
>> +     * in qemu-img open. So we can use the cached value for allocate
>> +     * filesize obtained from fstat at open time */
>> +    client->allocated_file_size = -ENOTSUP;
>> +    return 0;
>> +}
>> +
>> +static int coroutine_fn nfs_co_flush(BlockDriverState *bs)
>> +{
>> +    NFSClient *client = bs->opaque;
>> +    NFSTask task;
>> +
>> +    nfs_co_init_task(client, &task);
>> +
>> +    if (nfs_fsync_async(client->context, client->fh, nfs_co_generic_cb,
>> +                        &task) != 0) {
>> +        return -EIO;
>> +    }
>> +
>> +    while (!task.complete) {
>> +        nfs_set_events(client);
>> +        qemu_coroutine_yield();
>> +    }
>> +
>> +    return task.status;
>> +}
>> +
>> +static QemuOptsList runtime_opts = {
>> +    .name = "nfs",
>> +    .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
>> +    .desc = {
>> +        {
>> +            .name = "filename",
>> +            .type = QEMU_OPT_STRING,
>> +            .help = "URL to the NFS file",
>> +        },
>> +        { /* end of list */ }
>> +    },
>> +};
>> +
>> +static void nfs_file_close(BlockDriverState *bs)
>> +{
>> +    NFSClient *client = bs->opaque;
>> +    if (client->context) {
>> +        if (client->fh) {
>> +            nfs_close(client->context, client->fh);
>> +        }
>> +        qemu_aio_set_fd_handler(nfs_get_fd(client->context), NULL, NULL, NULL);
>> +        nfs_destroy_context(client->context);
>> +    }
>> +    memset(client, 0, sizeof(NFSClient));
>> +}
>> +
>> +
>> +static int nfs_file_open_common(BlockDriverState *bs, QDict *options, int flags,
>> +                         int open_flags, Error **errp)
>> +{
>> +    NFSClient *client = bs->opaque;
>> +    const char *filename;
>> +    int ret = 0;
>> +    QemuOpts *opts;
>> +    Error *local_err = NULL;
>> +    char *server = NULL, *path = NULL, *file = NULL, *strp;
>> +    struct stat st;
>> +
>> +    opts = qemu_opts_create_nofail(&runtime_opts);
>> +    qemu_opts_absorb_qdict(opts, options, &local_err);
>> +    if (error_is_set(&local_err)) {
>> +        qerror_report_err(local_err);
>> +        error_free(local_err);
>> +        ret = -EINVAL;
>> +        goto fail;
>> +    }
>> +
>> +    filename = qemu_opt_get(opts, "filename");
>> +
>> +    client->context = nfs_init_context();
>> +
>> +    if (client->context == NULL) {
>> +        error_setg(errp, "Failed to init NFS context");
>> +        ret = -EINVAL;
>> +        goto fail;
>> +    }
>> +
>> +    if (strlen(filename) <= 6) {
>> +        error_setg(errp, "Invalid server in URL");
>> +        ret = -EINVAL;
>> +        goto fail;
>> +    }
>> +
>> +    server = g_strdup(filename + 6);
>> +    if (server[0] == '/' || server[0] == '\0') {
>> +        error_setg(errp, "Invalid server in URL");
>> +        ret = -EINVAL;
>> +        goto fail;
>> +    }
>> +    strp = strchr(server, '/');
>> +    if (strp == NULL) {
>> +        error_setg(errp, "Invalid URL specified.\n");
>> +        ret = -EINVAL;
>> +        goto fail;
>> +    }
>> +    path = g_strdup(strp);
>> +    *strp = 0;
>> +    strp = strrchr(path, '/');
>> +    if (strp == NULL) {
>> +        error_setg(errp, "Invalid URL specified.\n");
>> +        ret = -EINVAL;
>> +        goto fail;
>> +    }
>> +    file = g_strdup(strp);
>> +    *strp = 0;
>> +
>> +    ret = nfs_mount(client->context, server, path);
>> +    if (ret < 0) {
>> +        error_setg(errp, "Failed to mount nfs share: %s",
>> +                    nfs_get_error(client->context));
>> +        goto fail;
>> +    }
>> +
>> +    if (open_flags & O_CREAT) {
>> +        ret = nfs_creat(client->context, file, 0600, &client->fh);
>> +        if (ret < 0) {
>> +            error_setg(errp, "Failed to create file: %s",
>> +                             nfs_get_error(client->context));
>> +            goto fail;
>> +        }
>> +    } else {
>> +        open_flags = (flags & BDRV_O_RDWR) ? O_RDWR : O_RDONLY;
>> +        ret = nfs_open(client->context, file, open_flags, &client->fh);
>> +        if (ret < 0) {
>> +            error_setg(errp, "Failed to open file : %s",
>> +                       nfs_get_error(client->context));
>> +            goto fail;
>> +        }
>> +    }
>> +
>> +    ret = nfs_fstat(client->context, client->fh, &st);
>> +    if (ret < 0) {
>> +        error_setg(errp, "Failed to fstat file: %s",
>> +                   nfs_get_error(client->context));
>> +        goto fail;
>> +    }
>> +
>> +    /* TODO allow reading beyond EOF */
>> +    bs->total_sectors = DIV_ROUND_UP(st.st_size, BDRV_SECTOR_SIZE);
>> +    client->has_zero_init = S_ISREG(st.st_mode);
>> +    client->allocated_file_size = st.st_blocks * st.st_blksize;
>> +    goto out;
>> +fail:
>> +    nfs_file_close(bs);
>> +out:
>> +    g_free(server);
>> +    g_free(path);
>> +    g_free(file);
>> +    return ret;
>> +}
>> +
>> +static int nfs_file_open(BlockDriverState *bs, QDict *options, int flags,
>> +                         Error **errp) {
>> +    return nfs_file_open_common(bs, options, flags, 0, errp);
>> +}
>> +
>> +static int nfs_file_create(const char *filename, QEMUOptionParameter *options,
>> +                         Error **errp)
>> +{
>> +    int ret = 0;
>> +    int64_t total_size = 0;
>> +    BlockDriverState *bs;
>> +    NFSClient *client = NULL;
>> +    QDict *bs_options;
>> +
>> +    bs = bdrv_new("");
>> +
>> +    /* Read out options */
>> +    while (options && options->name) {
>> +        if (!strcmp(options->name, "size")) {
>> +            total_size = options->value.n;
>> +        }
>> +        options++;
>> +    }
>> +
>> +    bs->opaque = g_malloc0(sizeof(NFSClient));
>> +    client = bs->opaque;
>> +
>> +    bs_options = qdict_new();
>> +    qdict_put(bs_options, "filename", qstring_from_str(filename));
>> +    ret = nfs_file_open_common(bs, bs_options, 0, O_CREAT, NULL);
>> +    QDECREF(bs_options);
>> +    if (ret < 0) {
>> +        goto out;
>> +    }
>> +    ret = nfs_ftruncate(client->context, client->fh, total_size);
>> +out:
>> +    nfs_file_close(bs);
>> +    g_free(bs->opaque);
>> +    bs->opaque = NULL;
>> +    bdrv_unref(bs);
>> +    return ret;
>> +}
>> +
>> +static int nfs_has_zero_init(BlockDriverState *bs)
>> +{
>> +    NFSClient *client = bs->opaque;
>> +    return client->has_zero_init;
>> +}
>> +
>> +static int64_t nfs_get_allocated_file_size(BlockDriverState *bs)
>> +{
>> +    NFSClient *client = bs->opaque;
>> +    return client->allocated_file_size;
>> +}
>> +
>> +static BlockDriver bdrv_nfs = {
>> +    .format_name     = "nfs",
>> +    .protocol_name   = "nfs",
>> +
>> +    .instance_size   = sizeof(NFSClient),
>> +    .bdrv_needs_filename = true,
>> +    .bdrv_has_zero_init = nfs_has_zero_init,
>> +    .bdrv_get_allocated_file_size = nfs_get_allocated_file_size,
>> +
>> +    .bdrv_file_open  = nfs_file_open,
>> +    .bdrv_close      = nfs_file_close,
>> +    .bdrv_create     = nfs_file_create,
>> +
>> +    .bdrv_co_readv         = nfs_co_readv,
>> +    .bdrv_co_writev        = nfs_co_writev,
>> +    .bdrv_co_flush_to_disk = nfs_co_flush,
>> +};
>> +
>> +static void nfs_block_init(void)
>> +{
>> +    bdrv_register(&bdrv_nfs);
>> +}
>> +
>> +block_init(nfs_block_init);
>> diff --git a/configure b/configure
>> index 8144d9f..6c9e47a 100755
>> --- a/configure
>> +++ b/configure
>> @@ -250,6 +250,7 @@ vss_win32_sdk=""
>> win_sdk="no"
>> want_tools="yes"
>> libiscsi=""
>> +libnfs=""
>> coroutine=""
>> coroutine_pool=""
>> seccomp=""
>> @@ -833,6 +834,10 @@ for opt do
>>   ;;
>>   --enable-libiscsi) libiscsi="yes"
>>   ;;
>> +  --disable-libnfs) libnfs="no"
>> +  ;;
>> +  --enable-libnfs) libnfs="yes"
>> +  ;;
>>   --enable-profiler) profiler="yes"
>>   ;;
>>   --disable-cocoa) cocoa="no"
>> @@ -1202,6 +1207,8 @@ echo "  --enable-spice           enable spice"
>> echo "  --enable-rbd             enable building the rados block device (rbd)"
>> echo "  --disable-libiscsi       disable iscsi support"
>> echo "  --enable-libiscsi        enable iscsi support"
>> +echo "  --disable-libnfs         disable nfs support"
>> +echo "  --enable-libnfs          enable nfs support"
>> echo "  --disable-smartcard-nss  disable smartcard nss support"
>> echo "  --enable-smartcard-nss   enable smartcard nss support"
>> echo "  --disable-libusb         disable libusb (for usb passthrough)"
>> @@ -3050,6 +3057,32 @@ EOF
>>   fi
>> fi
>> 
>> +##########################################
>> +# Do we have libnfs
>> +if test "$libnfs" != "no" ; then
>> +  cat > $TMPC << EOF
>> +#include <nfsc/libnfs-zdr.h>
>> +#include <nfsc/libnfs.h>
>> +#include <nfsc/libnfs-raw.h>
>> +#include <nfsc/libnfs-raw-mount.h>
>> +int main(void) {
>> +    nfs_init_context();
>> +    nfs_pread_async(0,0,0,0,0,0);
>> +    nfs_pwrite_async(0,0,0,0,0,0,0);
>> +    nfs_fstat(0,0,0);
>> +    return 0;
>> +    }
>> +EOF
>> +  if compile_prog "-Werror" "-lnfs" ; then
>> +    libnfs="yes"
>> +    LIBS="$LIBS -lnfs"
>> +  else
>> +    if test "$libnfs" = "yes" ; then
>> +      feature_not_found "libnfs"
>> +    fi
>> +    libnfs="no"
>> +  fi
>> +fi
>> 
>> ##########################################
>> # Do we need libm
>> @@ -3777,6 +3810,7 @@ echo "libusb            $libusb"
>> echo "usb net redir     $usb_redir"
>> echo "GLX support       $glx"
>> echo "libiscsi support  $libiscsi"
>> +echo "libnfs support    $libnfs"
>> echo "build guest agent $guest_agent"
>> echo "QGA VSS support   $guest_agent_with_vss"
>> echo "seccomp support   $seccomp"
>> @@ -4107,6 +4141,10 @@ if test "$libiscsi" = "yes" ; then
>>   echo "CONFIG_LIBISCSI=y" >> $config_host_mak
>> fi
>> 
>> +if test "$libnfs" = "yes" ; then
>> +  echo "CONFIG_LIBNFS=y" >> $config_host_mak
>> +fi
>> +
>> if test "$seccomp" = "yes"; then
>>   echo "CONFIG_SECCOMP=y" >> $config_host_mak
>> fi
>> --
>> 1.7.9.5
>> 
>>
Peter Lieven Dec. 17, 2013, 11 p.m. UTC | #12
> Am 17.12.2013 um 18:28 schrieb ronnie sahlberg <ronniesahlberg@gmail.com>:
> 
>> On Tue, Dec 17, 2013 at 9:03 AM, Peter Lieven <pl@kamp.de> wrote:
>>> On 17.12.2013 17:47, Stefan Hajnoczi wrote:
>>> 
>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
> 
> ...
>>>> +    if (nfs_pwrite_async(client->context, client->fh,
>>>> +                         sector_num * BDRV_SECTOR_SIZE,
>>>> +                         nb_sectors * BDRV_SECTOR_SIZE,
>>>> +                         buf, nfs_co_generic_cb, &task) != 0) {
>>>> +        g_free(buf);
>>>> +        return -EIO;
>>> 
>>> Can we get a more detailed errno here?  (e.g. ENOSPC)
>> 
>> libnfs only returns 0 or -1 if the setup of the call
>> fails. the status code from the RPC is more detailed
>> and available in task.status.
> 
> The *_async() functions only allocates memory and marshalls the
> arguments to the buffer.
> So barring marshalling bugs, it will only fail due to OOM.
> 
> So -ENOMEM is perhaps a better error for when *_async() returns an error.
> That is really the only condition where these functions will fail.

I guess the same applies to libiscsi?!
I will change it in v3 and make a patch for the iscsi driver.

> 
> 
> If *_async() returns success you are guaranteed that
> nfs_co_generic_cb() will be invoked
> and there you can inspect the status argument for more detailed reason why.
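
A minimal sketch of the v3 change discussed here, modelled on the v2 nfs_co_writev() quoted later in this thread (not the actual v3 patch): return -ENOMEM when the nfs_pwrite_async() setup call fails, and take the detailed error from the callback status otherwise.

static int coroutine_fn nfs_co_writev(BlockDriverState *bs,
                                      int64_t sector_num, int nb_sectors,
                                      QEMUIOVector *iov)
{
    NFSClient *client = bs->opaque;
    NFSTask task;
    char *buf;

    nfs_co_init_task(client, &task);

    buf = g_malloc(nb_sectors * BDRV_SECTOR_SIZE);
    qemu_iovec_to_buf(iov, 0, buf, nb_sectors * BDRV_SECTOR_SIZE);

    if (nfs_pwrite_async(client->context, client->fh,
                         sector_num * BDRV_SECTOR_SIZE,
                         nb_sectors * BDRV_SECTOR_SIZE,
                         buf, nfs_co_generic_cb, &task) != 0) {
        /* the async setup only marshalls arguments, so failure means OOM */
        g_free(buf);
        return -ENOMEM;
    }

    while (!task.complete) {
        nfs_set_events(client);
        qemu_coroutine_yield();
    }
    g_free(buf);

    if (task.status != nb_sectors * BDRV_SECTOR_SIZE) {
        /* task.status carries the detailed RPC error, e.g. -ENOSPC */
        return task.status < 0 ? task.status : -EIO;
    }
    return 0;
}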
Peter Lieven Dec. 17, 2013, 11:03 p.m. UTC | #13
> Am 17.12.2013 um 18:32 schrieb "Daniel P. Berrange" <berrange@redhat.com>:
> 
>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>> This patch adds native support for accessing images on NFS shares without
>> the requirement to actually mount the entire NFS share on the host.
>> 
>> NFS Images can simply be specified by an url of the form:
>> nfs://<host>/<export>/<filename>
>> 
>> For example:
>> qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
> 
> Does it support other config tunables, eg specifying which
> NFS version to use 2/3/4 ? If so will they be available as
> URI parameters in the obvious manner ?

currently only v3 is supported by libnfs. what other tunables would you like to see?

> 
> Daniel
> -- 
> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org              -o-             http://virt-manager.org :|
> |: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|
Daniel P. Berrangé Dec. 18, 2013, 9:30 a.m. UTC | #14
On Wed, Dec 18, 2013 at 12:03:24AM +0100, Peter Lieven wrote:
> 
> 
> > Am 17.12.2013 um 18:32 schrieb "Daniel P. Berrange" <berrange@redhat.com>:
> > 
> >> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
> >> This patch adds native support for accessing images on NFS shares without
> >> the requirement to actually mount the entire NFS share on the host.
> >> 
> >> NFS Images can simply be specified by an url of the form:
> >> nfs://<host>/<export>/<filename>
> >> 
> >> For example:
> >> qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
> > 
> > Does it support other config tunables, eg specifying which
> > NFS version to use 2/3/4 ? If so will they be available as
> > URI parameters in the obvious manner ?
> 
> currently only v3 is supported by libnfs. what other tunables would you like to see?

I didn't have any particular list in mind beyond just the protocol version for now.


Daniel
Orit Wasserman Dec. 18, 2013, 10 a.m. UTC | #15
On 12/18/2013 01:03 AM, Peter Lieven wrote:
>
>
>> Am 17.12.2013 um 18:32 schrieb "Daniel P. Berrange" <berrange@redhat.com>:
>>
>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>> This patch adds native support for accessing images on NFS shares without
>>> the requirement to actually mount the entire NFS share on the host.
>>>
>>> NFS Images can simply be specified by an url of the form:
>>> nfs://<host>/<export>/<filename>
>>>
>>> For example:
>>> qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
>>
>> Does it support other config tunables, eg specifying which
>> NFS version to use 2/3/4 ? If so will they be available as
>> URI parameters in the obvious manner ?
>
> currently only v3 is supported by libnfs. what other tunables would you like to see?
>

For live migration we need the sync option (async ignores O_SYNC and O_DIRECT sadly),
will it be supported? or will it be the default?

Orit

>>
>> Daniel
>> --
>> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
>> |: http://libvirt.org              -o-             http://virt-manager.org :|
>> |: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
>> |: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|
>
Daniel P. Berrangé Dec. 18, 2013, 10:18 a.m. UTC | #16
On Wed, Dec 18, 2013 at 12:00:03PM +0200, Orit Wasserman wrote:
> On 12/18/2013 01:03 AM, Peter Lieven wrote:
> >
> >
> >>Am 17.12.2013 um 18:32 schrieb "Daniel P. Berrange" <berrange@redhat.com>:
> >>
> >>>On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
> >>>This patch adds native support for accessing images on NFS shares without
> >>>the requirement to actually mount the entire NFS share on the host.
> >>>
> >>>NFS Images can simply be specified by an url of the form:
> >>>nfs://<host>/<export>/<filename>
> >>>
> >>>For example:
> >>>qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
> >>
> >>Does it support other config tunables, eg specifying which
> >>NFS version to use 2/3/4 ? If so will they be available as
> >>URI parameters in the obvious manner ?
> >
> >currently only v3 is supported by libnfs. what other tunables would you like to see?
> >
> 
> For live migration we need the sync option (async ignores O_SYNC and O_DIRECT sadly),
> will it be supported? or will it be the default?

Since this is bypassing the client kernel FS I/O layer, questions around
support of things like O_SYNC/O_DIRECT are not applicable.

Daniel
Orit Wasserman Dec. 18, 2013, 10:24 a.m. UTC | #17
On 12/18/2013 12:18 PM, Daniel P. Berrange wrote:
> On Wed, Dec 18, 2013 at 12:00:03PM +0200, Orit Wasserman wrote:
>> On 12/18/2013 01:03 AM, Peter Lieven wrote:
>>>
>>>
>>>> Am 17.12.2013 um 18:32 schrieb "Daniel P. Berrange" <berrange@redhat.com>:
>>>>
>>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>>> This patch adds native support for accessing images on NFS shares without
>>>>> the requirement to actually mount the entire NFS share on the host.
>>>>>
>>>>> NFS Images can simply be specified by an url of the form:
>>>>> nfs://<host>/<export>/<filename>
>>>>>
>>>>> For example:
>>>>> qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
>>>>
>>>> Does it support other config tunables, eg specifying which
>>>> NFS version to use 2/3/4 ? If so will they be available as
>>>> URI parameters in the obvious manner ?
>>>
>>> currently only v3 is supported by libnfs. what other tunables would you like to see?
>>>
>>
>> For live migration we need the sync option (async ignores O_SYNC and O_DIRECT sadly),
>> will it be supported? or will it be the default?
>
> Since this is bypassing the client kernel FS I/O layer question around
> support of things like O_SYNC/O_DIRECT are not applicable.
>

so no live migration support?
  
> Daniel
>
Paolo Bonzini Dec. 18, 2013, 10:38 a.m. UTC | #18
On 18/12/2013 11:24, Orit Wasserman wrote:
>>>
>>> For live migration we need the sync option (async ignores O_SYNC and
>>> O_DIRECT sadly),
>>> will it be supported? or will it be the default?
>>
>> Since this is bypassing the client kernel FS I/O layer question around
>> support of things like O_SYNC/O_DIRECT are not applicable.
>>
> 
> so no live migration support?

No, live migration just works.

O_SYNC is not used anymore in QEMU, we just issue a flush after every write.

And all protocol drivers except files/block devices _always_ bypass the
kernel page cache (libiscsi, NBD, ceph, and now libnfs) so they are
always behaving as if they had O_DIRECT.  For these protocols,
cache=writeback and cache=none are entirely the same.

Paolo
Peter Lieven Dec. 18, 2013, 11:11 a.m. UTC | #19
> Am 18.12.2013 um 11:00 schrieb Orit Wasserman <owasserm@redhat.com>:
> 
>> On 12/18/2013 01:03 AM, Peter Lieven wrote:
>> 
>> 
>>>> Am 17.12.2013 um 18:32 schrieb "Daniel P. Berrange" <berrange@redhat.com>:
>>>> 
>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>> This patch adds native support for accessing images on NFS shares without
>>>> the requirement to actually mount the entire NFS share on the host.
>>>> 
>>>> NFS Images can simply be specified by an url of the form:
>>>> nfs://<host>/<export>/<filename>
>>>> 
>>>> For example:
>>>> qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
>>> 
>>> Does it support other config tunables, eg specifying which
>>> NFS version to use 2/3/4 ? If so will they be available as
>>> URI parameters in the obvious manner ?
>> 
>> currently only v3 is supported by libnfs. what other tunables would you like to see?
> 
> For live migration we need the sync option (async ignores O_SYNC and O_DIRECT sadly),
> will it be supported? or will it be the default?

The async you see in the libnfs calls refers to the async API. bdrv_flush will not return before the NFS server completes the request.

Peter

> 
> Orit
> 
>>> 
>>> Daniel
>>> --
>>> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
>>> |: http://libvirt.org              -o-             http://virt-manager.org :|
>>> |: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
>>> |: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|
>
Orit Wasserman Dec. 18, 2013, 11:23 a.m. UTC | #20
On 12/18/2013 01:11 PM, Peter Lieven wrote:
>
>
>> Am 18.12.2013 um 11:00 schrieb Orit Wasserman <owasserm@redhat.com>:
>>
>>> On 12/18/2013 01:03 AM, Peter Lieven wrote:
>>>
>>>
>>>>> Am 17.12.2013 um 18:32 schrieb "Daniel P. Berrange" <berrange@redhat.com>:
>>>>>
>>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>>> This patch adds native support for accessing images on NFS shares without
>>>>> the requirement to actually mount the entire NFS share on the host.
>>>>>
>>>>> NFS Images can simply be specified by an url of the form:
>>>>> nfs://<host>/<export>/<filename>
>>>>>
>>>>> For example:
>>>>> qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
>>>>
>>>> Does it support other config tunables, eg specifying which
>>>> NFS version to use 2/3/4 ? If so will they be available as
>>>> URI parameters in the obvious manner ?
>>>
>>> currently only v3 is supported by libnfs. what other tunables would you like to see?
>>
>> For live migration we need the sync option (async ignores O_SYNC and O_DIRECT sadly),
>> will it be supported? or will it be the default?
>
> the async you see in the libnfs calls refer to the async API. bdrv_flush will not return before the nfs server completes the request.
>

That's great!

Thanks,
Orit

> Peter
>
>>
>> Orit
>>
>>>>
>>>> Daniel
>>>> --
>>>> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
>>>> |: http://libvirt.org              -o-             http://virt-manager.org :|
>>>> |: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
>>>> |: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|
>>
ronnie sahlberg Dec. 18, 2013, 2:42 p.m. UTC | #21
On Wed, Dec 18, 2013 at 2:00 AM, Orit Wasserman <owasserm@redhat.com> wrote:
> On 12/18/2013 01:03 AM, Peter Lieven wrote:
>>
>>
>>
>>> Am 17.12.2013 um 18:32 schrieb "Daniel P. Berrange"
>>> <berrange@redhat.com>:
>>>
>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>> This patch adds native support for accessing images on NFS shares
>>>> without
>>>> the requirement to actually mount the entire NFS share on the host.
>>>>
>>>> NFS Images can simply be specified by an url of the form:
>>>> nfs://<host>/<export>/<filename>
>>>>
>>>> For example:
>>>> qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
>>>
>>>
>>> Does it support other config tunables, eg specifying which
>>> NFS version to use 2/3/4 ? If so will they be available as
>>> URI parameters in the obvious manner ?
>>
>>
>> currently only v3 is supported by libnfs. what other tunables would you
>> like to see?
>>
>
> For live migration we need the sync option (async ignores O_SYNC and
> O_DIRECT sadly),
> will it be supported? or will it be the default?
>

If you use the high-level API that provides POSIX-like functions, such
as nfs_open(), then libnfs does support it.
nfs_open()/nfs_open_async() take a mode parameter and libnfs checks
for the O_SYNC flag in it.

By default libnfs will translate any nfs_write*() or nfs_pwrite*() to
NFS/WRITE3+UNSTABLE, which allows the server to just write to
cache/memory.

If you specify O_SYNC in the mode argument to nfs_open/nfs_open_async
then libnfs will flag this handle as sync and any calls to
nfs_write/nfs_pwrite will translate to NFS/WRITE3+FILE_SYNC.

Calls to nfs_fsync are translated to NFS/COMMIT3.



Of course, as for normal file I/O this is useful but not great, since
you can only control sync vs async per open filehandle.
libnfs also allows you to send raw RPC commands to the server, and
using this API you can control the sync behaviour for individual
writes.
This means you could do something like
* emulate SCSI to the guest.
* if guest sends SCSI/WRITE* without any FUA bits set, then for that
I/O you send a NFS3/WRITE+UNSTABLE
* if guest sends SCSI/WRITE* with FUA bits set, then for that I/O you
send NFS3/WRITE+FILE_SYNC
and then the guest kernel can control for each individual write
whether it is sync or not.

But that is probably something that can wait until later and doesn't
need to be part of the initial patch?
If Peter wants to do this in the future I can create a small write-up
on how to mix the two different APIs using the same context.
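
As a rough illustration of the two write modes described above, here is a minimal standalone sketch against the sync libnfs API (not the qemu driver); server, export and file name are placeholders and error handling is trimmed:

#include <fcntl.h>
#include <stdio.h>
#include <nfsc/libnfs.h>

int main(void)
{
    struct nfs_context *nfs = nfs_init_context();
    struct nfsfh *fh = NULL;

    if (nfs == NULL || nfs_mount(nfs, "10.0.0.1", "/qemu-images") != 0) {
        fprintf(stderr, "mount failed\n");
        return 1;
    }

    /* default mode: WRITE3+UNSTABLE, data may stay in server memory
     * until an explicit COMMIT3 */
    if (nfs_open(nfs, "/test.qcow2", O_RDWR, &fh) == 0) {
        /* ... nfs_pwrite() calls here go out as UNSTABLE writes ... */
        nfs_fsync(nfs, fh);   /* NFS3 COMMIT: flush to stable storage */
        nfs_close(nfs, fh);
    }

    /* O_SYNC mode: every WRITE3 carries FILE_SYNC, no COMMIT3 needed */
    if (nfs_open(nfs, "/test.qcow2", O_RDWR | O_SYNC, &fh) == 0) {
        /* ... nfs_pwrite() calls here hit stable storage directly ... */
        nfs_close(nfs, fh);
    }

    nfs_destroy_context(nfs);
    return 0;
}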
Peter Lieven Dec. 18, 2013, 4:59 p.m. UTC | #22
On 18.12.2013 at 15:42, ronnie sahlberg <ronniesahlberg@gmail.com> wrote:

> On Wed, Dec 18, 2013 at 2:00 AM, Orit Wasserman <owasserm@redhat.com> wrote:
>> On 12/18/2013 01:03 AM, Peter Lieven wrote:
>>> 
>>> 
>>> 
>>>> Am 17.12.2013 um 18:32 schrieb "Daniel P. Berrange"
>>>> <berrange@redhat.com>:
>>>> 
>>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>>> This patch adds native support for accessing images on NFS shares
>>>>> without
>>>>> the requirement to actually mount the entire NFS share on the host.
>>>>> 
>>>>> NFS Images can simply be specified by an url of the form:
>>>>> nfs://<host>/<export>/<filename>
>>>>> 
>>>>> For example:
>>>>> qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
>>>> 
>>>> 
>>>> Does it support other config tunables, eg specifying which
>>>> NFS version to use 2/3/4 ? If so will they be available as
>>>> URI parameters in the obvious manner ?
>>> 
>>> 
>>> currently only v3 is supported by libnfs. what other tunables would you
>>> like to see?
>>> 
>> 
>> For live migration we need the sync option (async ignores O_SYNC and
>> O_DIRECT sadly),
>> will it be supported? or will it be the default?
>> 
> 
> If you use the high-level API that provides posix like functions, such
> as nfs_open() then libnfs does.
> nfs_open()/nfs_open_async() takes a mode parameter and libnfs checks
> the O_SYNC flag in modes.
> 
> By default libnfs will translate any nfs_write*() or nfs_pwrite*() to
> NFS/WRITE3+UNSTABLE that allows the server to just write to
> cache/memory.
> 
> IF you specify O_SYNC in the mode argument to nfds_open/nfs_open_async
> then libnfs will flag this handle as sync and any calls to
> nfs_write/nfs_pwrite will translate to NFS/WRITE3+FILE_SYNC
> 
> Calls to nfs_fsync is translated to NFS/COMMIT3

If this NFS/COMMIT3 issues a sync on the server, that would be all we
actually need, and in this case migration should be safe. Even if we open
a file with cache=none, qemu would issue such a commit after every write.

This also allows for writeback caching, where the filesystem flushes
go right through to the server.


> 
> 
> 
> Of course, as for normal file I/O this is useful but not great since
> you can only control the sync vs async per open filehandle.
> Libnfs does also allow you to send raw rpc commands to the server and
> using this API you can control the sync behaviour for individual
> writes.
> This means you coould do something like
> * emulate SCSI to the guest.
> * if guest sends SCSI/WRITE* without any FUA bits set, then for that
> I/O you send a NFS3/WRITE+UNSTABLE
> * if guest sends SCSI/WRITE* with FUA bits set, then for that I/O you
> send NFS3/WRITE+FILE_SYNC
> and then the guest kernel can control for each individual write
> whether it is sync or not.
> 
> But that is probably something that can wait until later and don't
> need to be part of the initial patch?
> If peter wants to do this in the future I can create a small writeup
> on how to mixin the two different APIs using the same context.

We can do that, but I would like to focus on the basic functionality
first.

Peter
Peter Lieven Dec. 18, 2013, 5:21 p.m. UTC | #23
On 18.12.2013 at 11:38, Paolo Bonzini <pbonzini@redhat.com> wrote:

> Il 18/12/2013 11:24, Orit Wasserman ha scritto:
>>>> 
>>>> For live migration we need the sync option (async ignores O_SYNC and
>>>> O_DIRECT sadly),
>>>> will it be supported? or will it be the default?
>>> 
>>> Since this is bypassing the client kernel FS I/O layer question around
>>> support of things like O_SYNC/O_DIRECT are not applicable.
>>> 
>> 
>> so no live migration support?
> 
> No, live migration just works.
> 
> O_SYNC is not used anymore in QEMU, we just issue a flush after every write.
> 
> And all protocol drivers except files/block devices _always_ bypass the
> kernel page cache (libiscsi, NBD, ceph, and now libnfs) so they are
> always behaving as if they had O_DIRECT.  For these protocols,
> cache=writeback and cache=none are entirely the same.

Not completely I think, but please correct me if I am wrong.

If cache=writeback is set we issue just a write. In the libnfs or libiscsi case
that guarantees that the request has been successfully executed
on the target / server. This is enough to be consistent in case of
migration, because consecutive reads will be answered correctly.

But the target / server is still free to just keep this
data in memory, so we should only set this if we know that the target / server
e.g. has a battery backup, or we know that the filesystem uses barriers and
issues a flush at important points.

If cache=none is set we issue a flush after every single write. In the libiscsi
case we issue a SYNCHRONIZE CACHE, which should also guarantee that
data is flushed to disk. I hope that the NFS/COMMIT3 does the same in the
libnfs case?!

BR,
Peter
ronnie sahlberg Dec. 18, 2013, 5:33 p.m. UTC | #24
On Wed, Dec 18, 2013 at 8:59 AM, Peter Lieven <pl@kamp.de> wrote:
>
> Am 18.12.2013 um 15:42 schrieb ronnie sahlberg <ronniesahlberg@gmail.com>:
>
>> On Wed, Dec 18, 2013 at 2:00 AM, Orit Wasserman <owasserm@redhat.com> wrote:
>>> On 12/18/2013 01:03 AM, Peter Lieven wrote:
>>>>
>>>>
>>>>
>>>>> Am 17.12.2013 um 18:32 schrieb "Daniel P. Berrange"
>>>>> <berrange@redhat.com>:
>>>>>
>>>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>>>> This patch adds native support for accessing images on NFS shares
>>>>>> without
>>>>>> the requirement to actually mount the entire NFS share on the host.
>>>>>>
>>>>>> NFS Images can simply be specified by an url of the form:
>>>>>> nfs://<host>/<export>/<filename>
>>>>>>
>>>>>> For example:
>>>>>> qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
>>>>>
>>>>>
>>>>> Does it support other config tunables, eg specifying which
>>>>> NFS version to use 2/3/4 ? If so will they be available as
>>>>> URI parameters in the obvious manner ?
>>>>
>>>>
>>>> currently only v3 is supported by libnfs. what other tunables would you
>>>> like to see?
>>>>
>>>
>>> For live migration we need the sync option (async ignores O_SYNC and
>>> O_DIRECT sadly),
>>> will it be supported? or will it be the default?
>>>
>>
>> If you use the high-level API that provides posix like functions, such
>> as nfs_open() then libnfs does.
>> nfs_open()/nfs_open_async() takes a mode parameter and libnfs checks
>> the O_SYNC flag in modes.
>>
>> By default libnfs will translate any nfs_write*() or nfs_pwrite*() to
>> NFS/WRITE3+UNSTABLE that allows the server to just write to
>> cache/memory.
>>
>> IF you specify O_SYNC in the mode argument to nfds_open/nfs_open_async
>> then libnfs will flag this handle as sync and any calls to
>> nfs_write/nfs_pwrite will translate to NFS/WRITE3+FILE_SYNC
>>
>> Calls to nfs_fsync is translated to NFS/COMMIT3
>
> If this NFS/COMMIT3 would issue a sync on the server that would be all we
> actually need.

You have that guarantee in NFS/COMMIT3.
NFS/COMMIT3 will not return until the server has flushed the specified
range to disk.

However, while the NFS protocol allows you to specify a range for the
COMMIT3 call so that you can do things like
WRITE3 Offset:foo Length:bar
COMMIT3 Offset:foo Length:bar
many/most nfs servers will ignore the offset/length arguments to the
COMMIT3 call and always unconditionally make an fsync() for the whole
file.

This can make the COMMIT3 call very expensive for large files.


NFSv3 also supports FILE_SYNC write mode, which libnfs triggers if you
specify O_SYNC to nfs_open*().
In this mode every single NFS/WRITE3 is sent with the FILE_SYNC mode,
which means that the server will guarantee to write the data to stable
storage before responding back to the client.
In this mode there is no real need to do anything at all or even call
COMMIT3, since there is never any writeback data on the server that
needs to be destaged.


Since many servers treat COMMIT3 as "unconditionally walk all blocks
for the whole file and make sure they are destaged", it is not clear
how

WRITE3-normal Offset:foo Length:bar
COMMIT3 Offset:foo Length:bar

will compare to

WRITE3+O_SYNC Offset:foo Length:bar

I would not be surprised if the second mode would have higher
(potentially significantly) performance than the former.



> And in this case migration should be safe. Even if we open
> a file with cache = none qemu would issue such a commit after every write.
>
> This also allow for writeback caching where the filesystem flushes would
> go through right to the server.
>
>
>>
>>
>>
>> Of course, as for normal file I/O this is useful but not great since
>> you can only control the sync vs async per open filehandle.
>> Libnfs does also allow you to send raw rpc commands to the server and
>> using this API you can control the sync behaviour for individual
>> writes.
>> This means you coould do something like
>> * emulate SCSI to the guest.
>> * if guest sends SCSI/WRITE* without any FUA bits set, then for that
>> I/O you send a NFS3/WRITE+UNSTABLE
>> * if guest sends SCSI/WRITE* with FUA bits set, then for that I/O you
>> send NFS3/WRITE+FILE_SYNC
>> and then the guest kernel can control for each individual write
>> whether it is sync or not.
>>
>> But that is probably something that can wait until later and don't
>> need to be part of the initial patch?
>> If peter wants to do this in the future I can create a small writeup
>> on how to mixin the two different APIs using the same context.
>
> We can do that, but I would like to focus on the basic functionality
> first.

ACK

>
> Peter
>
Peter Lieven Dec. 18, 2013, 5:42 p.m. UTC | #25
On 18.12.2013 at 18:33, ronnie sahlberg <ronniesahlberg@gmail.com> wrote:

> On Wed, Dec 18, 2013 at 8:59 AM, Peter Lieven <pl@kamp.de> wrote:
>> 
>> Am 18.12.2013 um 15:42 schrieb ronnie sahlberg <ronniesahlberg@gmail.com>:
>> 
>>> On Wed, Dec 18, 2013 at 2:00 AM, Orit Wasserman <owasserm@redhat.com> wrote:
>>>> On 12/18/2013 01:03 AM, Peter Lieven wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>>> Am 17.12.2013 um 18:32 schrieb "Daniel P. Berrange"
>>>>>> <berrange@redhat.com>:
>>>>>> 
>>>>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>>>>> This patch adds native support for accessing images on NFS shares
>>>>>>> without
>>>>>>> the requirement to actually mount the entire NFS share on the host.
>>>>>>> 
>>>>>>> NFS Images can simply be specified by an url of the form:
>>>>>>> nfs://<host>/<export>/<filename>
>>>>>>> 
>>>>>>> For example:
>>>>>>> qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
>>>>>> 
>>>>>> 
>>>>>> Does it support other config tunables, eg specifying which
>>>>>> NFS version to use 2/3/4 ? If so will they be available as
>>>>>> URI parameters in the obvious manner ?
>>>>> 
>>>>> 
>>>>> currently only v3 is supported by libnfs. what other tunables would you
>>>>> like to see?
>>>>> 
>>>> 
>>>> For live migration we need the sync option (async ignores O_SYNC and
>>>> O_DIRECT sadly),
>>>> will it be supported? or will it be the default?
>>>> 
>>> 
>>> If you use the high-level API that provides posix like functions, such
>>> as nfs_open() then libnfs does.
>>> nfs_open()/nfs_open_async() takes a mode parameter and libnfs checks
>>> the O_SYNC flag in modes.
>>> 
>>> By default libnfs will translate any nfs_write*() or nfs_pwrite*() to
>>> NFS/WRITE3+UNSTABLE that allows the server to just write to
>>> cache/memory.
>>> 
>>> IF you specify O_SYNC in the mode argument to nfds_open/nfs_open_async
>>> then libnfs will flag this handle as sync and any calls to
>>> nfs_write/nfs_pwrite will translate to NFS/WRITE3+FILE_SYNC
>>> 
>>> Calls to nfs_fsync is translated to NFS/COMMIT3
>> 
>> If this NFS/COMMIT3 would issue a sync on the server that would be all we
>> actually need.
> 
> You have that guarantee in NFS/COMMIT3
> NFS/COMMIT3 will not return until the server has flushed the specified
> range to disk.
> 
> However, while the NFS protocol allows you to specify a range for the
> COMMIT3 call so that you can do things like
> WRITE3 Offset:foo Length:bar
> COMMIT3 Offset:foo Length:bar
> many/most nfs servers will ignore the offset/length arguments to the
> COMMIT3 call and always unconditionally make an fsync() for the whole
> file.
> 
> This can make the COMMIT3 call very expensive for large files.
> 
> 
> NFSv3 also supports FILE_SYNC write mode, which libnfs triggers if you
> specify O_SYNC to nfs_open*()
> In this mode every single NFS/WRITE3 is sent with the FILE_SYNC mode
> which means that the server will guarantee to write the data to stable
> storage before responding back to the client.
> In this mode there is no real need to do anything at all or even call
> COMMIT3  since there is never any writeback data on the server that
> needs to be destaged.
> 
> 
> Since many servers treat COMMIT3 as "unconditionally walk all blocks
> for the whole file and make sure they are destaged" it is not clear
> whether how
> 
> WRITE3-normal Offset:foo Length:bar
> COMMIT3 Offset:foo Length:bar
> 
> will compare to
> 
> WRITE3+O_SYNC Offset:foo Length:bar
> 
> I would not be surprised if the second mode would have higher
> (potentially significantly) performance than the former.

The qemu block layer is currently designed to send a bdrv_flush after every single
write if the write cache is not enabled. This means that the unwritten data is just
the data of the single write operation. However, changing this to issue a sync
write call would require changing the whole API. The major problem is that
the write cache setting can be changed while the device is open; otherwise
we could just ignore all calls to bdrv_flush if the device was opened with the
write cache disabled.

In the very popular case of using virtio as the driver, the device
is always opened with the write cache disabled, and the write cache is only
enabled after the host has negotiated with the guest that the guest is
able to send flushes.

We can keep in mind for a later version of the driver to manually craft
a write call with O_SYNC if the write cache is disabled and ignore bdrv_flush,
and to use async write + commit via bdrv_flush in the case of an enabled
write cache.

Peter
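
For reference, a hedged sketch of the commit-via-bdrv_flush path mentioned above, following the NFSTask pattern of the v2 code quoted in this thread (illustrative, not the literal patch text): the flush callback wired up as .bdrv_co_flush_to_disk issues an NFS3 COMMIT through nfs_fsync_async() and waits for the reply.

static int coroutine_fn nfs_co_flush(BlockDriverState *bs)
{
    NFSClient *client = bs->opaque;
    NFSTask task;

    nfs_co_init_task(client, &task);

    /* nfs_fsync_async() maps to an NFS3 COMMIT on the server */
    if (nfs_fsync_async(client->context, client->fh,
                        nfs_co_generic_cb, &task) != 0) {
        return -ENOMEM;
    }

    while (!task.complete) {
        nfs_set_events(client);
        qemu_coroutine_yield();
    }

    return task.status;
}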
ronnie sahlberg Dec. 18, 2013, 5:50 p.m. UTC | #26
On Wed, Dec 18, 2013 at 9:42 AM, Peter Lieven <pl@kamp.de> wrote:
>
> Am 18.12.2013 um 18:33 schrieb ronnie sahlberg <ronniesahlberg@gmail.com>:
>
>> On Wed, Dec 18, 2013 at 8:59 AM, Peter Lieven <pl@kamp.de> wrote:
>>>
>>> Am 18.12.2013 um 15:42 schrieb ronnie sahlberg <ronniesahlberg@gmail.com>:
>>>
>>>> On Wed, Dec 18, 2013 at 2:00 AM, Orit Wasserman <owasserm@redhat.com> wrote:
>>>>> On 12/18/2013 01:03 AM, Peter Lieven wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Am 17.12.2013 um 18:32 schrieb "Daniel P. Berrange"
>>>>>>> <berrange@redhat.com>:
>>>>>>>
>>>>>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>>>>>> This patch adds native support for accessing images on NFS shares
>>>>>>>> without
>>>>>>>> the requirement to actually mount the entire NFS share on the host.
>>>>>>>>
>>>>>>>> NFS Images can simply be specified by an url of the form:
>>>>>>>> nfs://<host>/<export>/<filename>
>>>>>>>>
>>>>>>>> For example:
>>>>>>>> qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
>>>>>>>
>>>>>>>
>>>>>>> Does it support other config tunables, eg specifying which
>>>>>>> NFS version to use 2/3/4 ? If so will they be available as
>>>>>>> URI parameters in the obvious manner ?
>>>>>>
>>>>>>
>>>>>> currently only v3 is supported by libnfs. what other tunables would you
>>>>>> like to see?
>>>>>>
>>>>>
>>>>> For live migration we need the sync option (async ignores O_SYNC and
>>>>> O_DIRECT sadly),
>>>>> will it be supported? or will it be the default?
>>>>>
>>>>
>>>> If you use the high-level API that provides posix like functions, such
>>>> as nfs_open() then libnfs does.
>>>> nfs_open()/nfs_open_async() takes a mode parameter and libnfs checks
>>>> the O_SYNC flag in modes.
>>>>
>>>> By default libnfs will translate any nfs_write*() or nfs_pwrite*() to
>>>> NFS/WRITE3+UNSTABLE that allows the server to just write to
>>>> cache/memory.
>>>>
>>>> IF you specify O_SYNC in the mode argument to nfds_open/nfs_open_async
>>>> then libnfs will flag this handle as sync and any calls to
>>>> nfs_write/nfs_pwrite will translate to NFS/WRITE3+FILE_SYNC
>>>>
>>>> Calls to nfs_fsync is translated to NFS/COMMIT3
>>>
>>> If this NFS/COMMIT3 would issue a sync on the server that would be all we
>>> actually need.
>>
>> You have that guarantee in NFS/COMMIT3
>> NFS/COMMIT3 will not return until the server has flushed the specified
>> range to disk.
>>
>> However, while the NFS protocol allows you to specify a range for the
>> COMMIT3 call so that you can do things like
>> WRITE3 Offset:foo Length:bar
>> COMMIT3 Offset:foo Length:bar
>> many/most nfs servers will ignore the offset/length arguments to the
>> COMMIT3 call and always unconditionally make an fsync() for the whole
>> file.
>>
>> This can make the COMMIT3 call very expensive for large files.
>>
>>
>> NFSv3 also supports FILE_SYNC write mode, which libnfs triggers if you
>> specify O_SYNC to nfs_open*()
>> In this mode every single NFS/WRITE3 is sent with the FILE_SYNC mode
>> which means that the server will guarantee to write the data to stable
>> storage before responding back to the client.
>> In this mode there is no real need to do anything at all or even call
>> COMMIT3  since there is never any writeback data on the server that
>> needs to be destaged.
>>
>>
>> Since many servers treat COMMIT3 as "unconditionally walk all blocks
>> for the whole file and make sure they are destaged" it is not clear
>> whether how
>>
>> WRITE3-normal Offset:foo Length:bar
>> COMMIT3 Offset:foo Length:bar
>>
>> will compare to
>>
>> WRITE3+O_SYNC Offset:foo Length:bar
>>
>> I would not be surprised if the second mode would have higher
>> (potentially significantly) performance than the former.
>
> The qemu block layer currently is designed to send a bdrv_flush after every single
> write if the write cache is not enabled. This means that the unwritten data is just
> the data of the single write operation.

I understand that; there is only a single WRITE3 worth of data to
actually destage each time.

But what I meant is that for a lot of servers, for large files, the
server might need to spend a non-trivial amount of time
crunching file metadata and checking every single page of the file in
order to discover the "I only need to destage pages x,y,z".

On many NFS servers this "figure out which blocks to flush" can take a
lot of time and affect performance greatly.



> However, changing this to issue a sync
> write call would require to change the whole API. The major problem is that
> the write cache setting can be changed while the device is open otherwise
> we could just ignore all calls to bdrv flush if the device was opened without
> enabled write cache.
>
> In the very popular case of using Virtio as Driver it is the case that the device
> is always opened with disabled write cache and the write cache is only
> enabled after the host has negotiated with the guest that the guest is
> able to send flushed.
>
> We can keep in mind for a later version of the driver that we manually craft
> a write call with O_SYNC if the write cache is disabled and ignore bdrv_flush.
> And we use async write + commit via bdrv_flush in the case of an enabled
> write cache.
>
> Peter
>
Peter Lieven Dec. 18, 2013, 5:55 p.m. UTC | #27
On 18.12.2013 at 18:50, ronnie sahlberg <ronniesahlberg@gmail.com> wrote:

> On Wed, Dec 18, 2013 at 9:42 AM, Peter Lieven <pl@kamp.de> wrote:
>> 
>> Am 18.12.2013 um 18:33 schrieb ronnie sahlberg <ronniesahlberg@gmail.com>:
>> 
>>> On Wed, Dec 18, 2013 at 8:59 AM, Peter Lieven <pl@kamp.de> wrote:
>>>> 
>>>> Am 18.12.2013 um 15:42 schrieb ronnie sahlberg <ronniesahlberg@gmail.com>:
>>>> 
>>>>> On Wed, Dec 18, 2013 at 2:00 AM, Orit Wasserman <owasserm@redhat.com> wrote:
>>>>>> On 12/18/2013 01:03 AM, Peter Lieven wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> Am 17.12.2013 um 18:32 schrieb "Daniel P. Berrange"
>>>>>>>> <berrange@redhat.com>:
>>>>>>>> 
>>>>>>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>>>>>>> This patch adds native support for accessing images on NFS shares
>>>>>>>>> without
>>>>>>>>> the requirement to actually mount the entire NFS share on the host.
>>>>>>>>> 
>>>>>>>>> NFS Images can simply be specified by an url of the form:
>>>>>>>>> nfs://<host>/<export>/<filename>
>>>>>>>>> 
>>>>>>>>> For example:
>>>>>>>>> qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Does it support other config tunables, eg specifying which
>>>>>>>> NFS version to use 2/3/4 ? If so will they be available as
>>>>>>>> URI parameters in the obvious manner ?
>>>>>>> 
>>>>>>> 
>>>>>>> currently only v3 is supported by libnfs. what other tunables would you
>>>>>>> like to see?
>>>>>>> 
>>>>>> 
>>>>>> For live migration we need the sync option (async ignores O_SYNC and
>>>>>> O_DIRECT sadly),
>>>>>> will it be supported? or will it be the default?
>>>>>> 
>>>>> 
>>>>> If you use the high-level API that provides posix like functions, such
>>>>> as nfs_open() then libnfs does.
>>>>> nfs_open()/nfs_open_async() takes a mode parameter and libnfs checks
>>>>> the O_SYNC flag in modes.
>>>>> 
>>>>> By default libnfs will translate any nfs_write*() or nfs_pwrite*() to
>>>>> NFS/WRITE3+UNSTABLE that allows the server to just write to
>>>>> cache/memory.
>>>>> 
>>>>> IF you specify O_SYNC in the mode argument to nfds_open/nfs_open_async
>>>>> then libnfs will flag this handle as sync and any calls to
>>>>> nfs_write/nfs_pwrite will translate to NFS/WRITE3+FILE_SYNC
>>>>> 
>>>>> Calls to nfs_fsync is translated to NFS/COMMIT3
>>>> 
>>>> If this NFS/COMMIT3 would issue a sync on the server that would be all we
>>>> actually need.
>>> 
>>> You have that guarantee in NFS/COMMIT3
>>> NFS/COMMIT3 will not return until the server has flushed the specified
>>> range to disk.
>>> 
>>> However, while the NFS protocol allows you to specify a range for the
>>> COMMIT3 call so that you can do things like
>>> WRITE3 Offset:foo Length:bar
>>> COMMIT3 Offset:foo Length:bar
>>> many/most nfs servers will ignore the offset/length arguments to the
>>> COMMIT3 call and always unconditionally make an fsync() for the whole
>>> file.
>>> 
>>> This can make the COMMIT3 call very expensive for large files.
>>> 
>>> 
>>> NFSv3 also supports FILE_SYNC write mode, which libnfs triggers if you
>>> specify O_SYNC to nfs_open*()
>>> In this mode every single NFS/WRITE3 is sent with the FILE_SYNC mode
>>> which means that the server will guarantee to write the data to stable
>>> storage before responding back to the client.
>>> In this mode there is no real need to do anything at all or even call
>>> COMMIT3  since there is never any writeback data on the server that
>>> needs to be destaged.
>>> 
>>> 
>>> Since many servers treat COMMIT3 as "unconditionally walk all blocks
>>> for the whole file and make sure they are destaged" it is not clear
>>> whether how
>>> 
>>> WRITE3-normal Offset:foo Length:bar
>>> COMMIT3 Offset:foo Length:bar
>>> 
>>> will compare to
>>> 
>>> WRITE3+O_SYNC Offset:foo Length:bar
>>> 
>>> I would not be surprised if the second mode would have higher
>>> (potentially significantly) performance than the former.
>> 
>> The qemu block layer currently is designed to send a bdrv_flush after every single
>> write if the write cache is not enabled. This means that the unwritten data is just
>> the data of the single write operation.
> 
> I understand that, there is only a single WRITE3 worth of data to
> actually destage each time.
> 
> But what I meant is that for a lot of servers, for large files,   the
> server might need to spend non-trivial amount of time
> crunching file metadata and check every single page for the file in
> order to discover the "I only need to destage pages x,y,z"
> 
> On many nfs servers this "figure out which blocks to flush" can take a
> lot of time and affect performance greatly.

But this is only the case when the write cache is disabled, and I would not expect
great write performance in that case anyway.

As a future improvement we could optimize the write path with a disabled
write cache by sending a write + FILE_SYNC call for every write and ignoring the bdrv_flush,
if this turns out to be faster.

Peter
Paolo Bonzini Dec. 19, 2013, 2:31 p.m. UTC | #28
On 18/12/2013 18:21, Peter Lieven wrote:
> Not completely I think, but please correct me if I am wrong.
> 
> If cache=writeback is set we issue just a write. In libnfs or libiscsi case
> that guarantees that the request has been successfully executed
> on the target / server. This is enough to be consistent in case of
> migration because consecutive reads will be answered correctly.
> 
> But the target / server is still free to just keep this
> data in memory so we should only set this if we know that the target / server
> e.g. has a battery backup or we know that the filesystem uses e.g. barriers and
> issues a flush at important points.

Yes.  However, the same holds for cache=none (both on
libnfs/libiscsi/NBD/others, where it is a nop, and on regular I/O, where
it uses O_DIRECT *but not O_SYNC or O_DSYNC*).

> If cache=none is set we issue a flush after every single write.

That's what you get with cache=writethrough or cache=directsync.  Here
the flush is added automatically by the block layer, so it works with
every driver.

Paolo
Peter Lieven Dec. 20, 2013, 9:48 a.m. UTC | #29
On 17.12.2013 17:47, Stefan Hajnoczi wrote:
> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>> This patch adds native support for accessing images on NFS shares without
>> the requirement to actually mount the entire NFS share on the host.
>>
>> NFS Images can simply be specified by an url of the form:
>> nfs://<host>/<export>/<filename>
>>
>> For example:
>> qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
>>
>> You need libnfs from Ronnie Sahlberg available at:
>>     git://github.com/sahlberg/libnfs.git
>> for this to work.
>>
>> During configure it is automatically probed for libnfs and support
>> is enabled on-the-fly. You can forbid or enforce libnfs support
>> with --disable-libnfs or --enable-libnfs respectively.
>>
>> Due to NFS restrictions you might need to execute your binaries
>> as root, allow them to open priviledged ports (<1024) or specify
>> insecure option on the NFS server.
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> v1->v2:
>>   - fixed block/Makefile.objs [Ronnie]
>>   - do not always register a read handler [Ronnie]
>>   - add support for reading beyond EOF [Fam]
>>   - fixed struct and paramter naming [Fam]
>>   - fixed overlong lines and whitespace errors [Fam]
>>   - return return status from libnfs whereever possible [Fam]
>>   - added comment why we set allocated_file_size to -ENOTSUP after write [Fam]
>>   - avoid segfault when parsing filname [Fam]
>>   - remove unused close_bh from NFSClient [Fam]
>>   - avoid dividing and mutliplying total_size by BDRV_SECTOR_SIZE in nfs_file_create [Fam]
>>
>>   MAINTAINERS         |    5 +
>>   block/Makefile.objs |    1 +
>>   block/nfs.c         |  419 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>   configure           |   38 +++++
>>   4 files changed, 463 insertions(+)
>>   create mode 100644 block/nfs.c
> Which NFS protocol versions are supported by current libnfs?
>
>> +#include <poll.h>
> Why is this header included?
>
>> +typedef struct nfsclient {
> Please either drop the struct tag or use "NFSClient".
>
>> +static void
>> +nfs_co_generic_cb(int status, struct nfs_context *nfs, void *data,
>> +                  void *private_data)
>> +{
>> +    NFSTask *Task = private_data;
> lowercase "task" local variable name please.
>
>> +static int coroutine_fn nfs_co_writev(BlockDriverState *bs,
>> +                                        int64_t sector_num, int nb_sectors,
>> +                                        QEMUIOVector *iov)
>> +{
>> +    NFSClient *client = bs->opaque;
>> +    NFSTask task;
>> +    char *buf = NULL;
>> +
>> +    nfs_co_init_task(client, &task);
>> +
>> +    buf = g_malloc(nb_sectors * BDRV_SECTOR_SIZE);
>> +    qemu_iovec_to_buf(iov, 0, buf, nb_sectors * BDRV_SECTOR_SIZE);
>> +
>> +    if (nfs_pwrite_async(client->context, client->fh,
>> +                         sector_num * BDRV_SECTOR_SIZE,
>> +                         nb_sectors * BDRV_SECTOR_SIZE,
>> +                         buf, nfs_co_generic_cb, &task) != 0) {
>> +        g_free(buf);
>> +        return -EIO;
> Can we get a more detailed errno here?  (e.g. ENOSPC)
>
>> +    }
>> +
>> +    while (!task.complete) {
>> +        nfs_set_events(client);
>> +        qemu_coroutine_yield();
>> +    }
>> +
>> +    g_free(buf);
>> +
>> +    if (task.status != nb_sectors * BDRV_SECTOR_SIZE) {
>> +        return task.status < 0 ? task.status : -EIO;
>> +    }
>> +
>> +    bs->total_sectors = MAX(bs->total_sectors, sector_num + nb_sectors);
> Why is this necessary?  block.c will update bs->total_sectors if the
> file is growable.
>
>> +    /* set to -ENOTSUP since bdrv_allocated_file_size is only used
>> +     * in qemu-img open. So we can use the cached value for allocate
>> +     * filesize obtained from fstat at open time */
>> +    client->allocated_file_size = -ENOTSUP;
> Can you implement this fully?  By stubbing it out like this we won't be
> able to call get_allocated_file_size() at runtime in the future without
> updating the nfs block driver code.  It's just an fstat call, shouldn't
> be too hard to implement properly :).

It seems I have to leave it as is for now. bdrv_get_allocated_file_size
is not called in coroutine context, so I get "coroutine yields to no one".

Peter
Stefan Hajnoczi Dec. 20, 2013, 12:19 p.m. UTC | #30
On Fri, Dec 20, 2013 at 10:48:41AM +0100, Peter Lieven wrote:
> On 17.12.2013 17:47, Stefan Hajnoczi wrote:
> >On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
> >>This patch adds native support for accessing images on NFS shares without
> >>the requirement to actually mount the entire NFS share on the host.
> >>
> >>NFS Images can simply be specified by an url of the form:
> >>nfs://<host>/<export>/<filename>
> >>
> >>For example:
> >>qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
> >>
> >>You need libnfs from Ronnie Sahlberg available at:
> >>    git://github.com/sahlberg/libnfs.git
> >>for this to work.
> >>
> >>During configure it is automatically probed for libnfs and support
> >>is enabled on-the-fly. You can forbid or enforce libnfs support
> >>with --disable-libnfs or --enable-libnfs respectively.
> >>
> >>Due to NFS restrictions you might need to execute your binaries
> >>as root, allow them to open priviledged ports (<1024) or specify
> >>insecure option on the NFS server.
> >>
> >>Signed-off-by: Peter Lieven <pl@kamp.de>
> >>---
> >>v1->v2:
> >>  - fixed block/Makefile.objs [Ronnie]
> >>  - do not always register a read handler [Ronnie]
> >>  - add support for reading beyond EOF [Fam]
> >>  - fixed struct and paramter naming [Fam]
> >>  - fixed overlong lines and whitespace errors [Fam]
> >>  - return return status from libnfs whereever possible [Fam]
> >>  - added comment why we set allocated_file_size to -ENOTSUP after write [Fam]
> >>  - avoid segfault when parsing filname [Fam]
> >>  - remove unused close_bh from NFSClient [Fam]
> >>  - avoid dividing and mutliplying total_size by BDRV_SECTOR_SIZE in nfs_file_create [Fam]
> >>
> >>  MAINTAINERS         |    5 +
> >>  block/Makefile.objs |    1 +
> >>  block/nfs.c         |  419 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  configure           |   38 +++++
> >>  4 files changed, 463 insertions(+)
> >>  create mode 100644 block/nfs.c
> >Which NFS protocol versions are supported by current libnfs?
> >
> >>+#include <poll.h>
> >Why is this header included?
> >
> >>+typedef struct nfsclient {
> >Please either drop the struct tag or use "NFSClient".
> >
> >>+static void
> >>+nfs_co_generic_cb(int status, struct nfs_context *nfs, void *data,
> >>+                  void *private_data)
> >>+{
> >>+    NFSTask *Task = private_data;
> >lowercase "task" local variable name please.
> >
> >>+static int coroutine_fn nfs_co_writev(BlockDriverState *bs,
> >>+                                        int64_t sector_num, int nb_sectors,
> >>+                                        QEMUIOVector *iov)
> >>+{
> >>+    NFSClient *client = bs->opaque;
> >>+    NFSTask task;
> >>+    char *buf = NULL;
> >>+
> >>+    nfs_co_init_task(client, &task);
> >>+
> >>+    buf = g_malloc(nb_sectors * BDRV_SECTOR_SIZE);
> >>+    qemu_iovec_to_buf(iov, 0, buf, nb_sectors * BDRV_SECTOR_SIZE);
> >>+
> >>+    if (nfs_pwrite_async(client->context, client->fh,
> >>+                         sector_num * BDRV_SECTOR_SIZE,
> >>+                         nb_sectors * BDRV_SECTOR_SIZE,
> >>+                         buf, nfs_co_generic_cb, &task) != 0) {
> >>+        g_free(buf);
> >>+        return -EIO;
> >Can we get a more detailed errno here?  (e.g. ENOSPC)
> >
> >>+    }
> >>+
> >>+    while (!task.complete) {
> >>+        nfs_set_events(client);
> >>+        qemu_coroutine_yield();
> >>+    }
> >>+
> >>+    g_free(buf);
> >>+
> >>+    if (task.status != nb_sectors * BDRV_SECTOR_SIZE) {
> >>+        return task.status < 0 ? task.status : -EIO;
> >>+    }
> >>+
> >>+    bs->total_sectors = MAX(bs->total_sectors, sector_num + nb_sectors);
> >Why is this necessary?  block.c will update bs->total_sectors if the
> >file is growable.
> >
> >>+    /* set to -ENOTSUP since bdrv_allocated_file_size is only used
> >>+     * in qemu-img open. So we can use the cached value for allocate
> >>+     * filesize obtained from fstat at open time */
> >>+    client->allocated_file_size = -ENOTSUP;
> >Can you implement this fully?  By stubbing it out like this we won't be
> >able to call get_allocated_file_size() at runtime in the future without
> >updating the nfs block driver code.  It's just an fstat call, shouldn't
> >be too hard to implement properly :).
> 
> It seems I have to leave it as is currently. bdrv_get_allocated_file_size
> is not in a coroutine context. I get coroutine yields to no one.

Create a coroutine and pump the event loop until it has reached
completion:

co = qemu_coroutine_create(my_coroutine_fn, ...);
qemu_coroutine_enter(co, foo);
while (!complete) {
    qemu_aio_wait();
}

See block.c for similar examples.

Stefan
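
A slightly fleshed-out version of the snippet above, assuming the coroutine API of this period (entry point passed to qemu_coroutine_create(), opaque pointer passed to qemu_coroutine_enter()); my_coroutine_fn and MyCoData are hypothetical names, not existing qemu code:

typedef struct MyCoData {
    BlockDriverState *bs;
    int64_t result;
    bool complete;
} MyCoData;

static void coroutine_fn my_coroutine_fn(void *opaque)
{
    MyCoData *data = opaque;

    /* ... coroutine-only work goes here, e.g. issue async I/O and
     * qemu_coroutine_yield() until the callback fires ... */
    data->result = 0;
    data->complete = true;
}

static int64_t run_in_coroutine_and_wait(BlockDriverState *bs)
{
    MyCoData data = { .bs = bs };
    Coroutine *co = qemu_coroutine_create(my_coroutine_fn);

    qemu_coroutine_enter(co, &data);
    while (!data.complete) {
        qemu_aio_wait();   /* pump the event loop until the coroutine finishes */
    }
    return data.result;
}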
Peter Lieven Dec. 20, 2013, 12:53 p.m. UTC | #31
On 20.12.2013 13:19, Stefan Hajnoczi wrote:
> On Fri, Dec 20, 2013 at 10:48:41AM +0100, Peter Lieven wrote:
>> On 17.12.2013 17:47, Stefan Hajnoczi wrote:
>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>> This patch adds native support for accessing images on NFS shares without
>>>> the requirement to actually mount the entire NFS share on the host.
>>>>
>>>> NFS Images can simply be specified by an url of the form:
>>>> nfs://<host>/<export>/<filename>
>>>>
>>>> For example:
>>>> qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
>>>>
>>>> You need libnfs from Ronnie Sahlberg available at:
>>>>     git://github.com/sahlberg/libnfs.git
>>>> for this to work.
>>>>
>>>> During configure it is automatically probed for libnfs and support
>>>> is enabled on-the-fly. You can forbid or enforce libnfs support
>>>> with --disable-libnfs or --enable-libnfs respectively.
>>>>
>>>> Due to NFS restrictions you might need to execute your binaries
>>>> as root, allow them to open priviledged ports (<1024) or specify
>>>> insecure option on the NFS server.
>>>>
>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>>> ---
>>>> v1->v2:
>>>>   - fixed block/Makefile.objs [Ronnie]
>>>>   - do not always register a read handler [Ronnie]
>>>>   - add support for reading beyond EOF [Fam]
>>>>   - fixed struct and paramter naming [Fam]
>>>>   - fixed overlong lines and whitespace errors [Fam]
>>>>   - return return status from libnfs whereever possible [Fam]
>>>>   - added comment why we set allocated_file_size to -ENOTSUP after write [Fam]
>>>>   - avoid segfault when parsing filname [Fam]
>>>>   - remove unused close_bh from NFSClient [Fam]
>>>>   - avoid dividing and mutliplying total_size by BDRV_SECTOR_SIZE in nfs_file_create [Fam]
>>>>
>>>>   MAINTAINERS         |    5 +
>>>>   block/Makefile.objs |    1 +
>>>>   block/nfs.c         |  419 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>   configure           |   38 +++++
>>>>   4 files changed, 463 insertions(+)
>>>>   create mode 100644 block/nfs.c
>>> Which NFS protocol versions are supported by current libnfs?
>>>
>>>> +#include <poll.h>
>>> Why is this header included?
>>>
>>>> +typedef struct nfsclient {
>>> Please either drop the struct tag or use "NFSClient".
>>>
>>>> +static void
>>>> +nfs_co_generic_cb(int status, struct nfs_context *nfs, void *data,
>>>> +                  void *private_data)
>>>> +{
>>>> +    NFSTask *Task = private_data;
>>> lowercase "task" local variable name please.
>>>
>>>> +static int coroutine_fn nfs_co_writev(BlockDriverState *bs,
>>>> +                                        int64_t sector_num, int nb_sectors,
>>>> +                                        QEMUIOVector *iov)
>>>> +{
>>>> +    NFSClient *client = bs->opaque;
>>>> +    NFSTask task;
>>>> +    char *buf = NULL;
>>>> +
>>>> +    nfs_co_init_task(client, &task);
>>>> +
>>>> +    buf = g_malloc(nb_sectors * BDRV_SECTOR_SIZE);
>>>> +    qemu_iovec_to_buf(iov, 0, buf, nb_sectors * BDRV_SECTOR_SIZE);
>>>> +
>>>> +    if (nfs_pwrite_async(client->context, client->fh,
>>>> +                         sector_num * BDRV_SECTOR_SIZE,
>>>> +                         nb_sectors * BDRV_SECTOR_SIZE,
>>>> +                         buf, nfs_co_generic_cb, &task) != 0) {
>>>> +        g_free(buf);
>>>> +        return -EIO;
>>> Can we get a more detailed errno here?  (e.g. ENOSPC)
>>>
>>>> +    }
>>>> +
>>>> +    while (!task.complete) {
>>>> +        nfs_set_events(client);
>>>> +        qemu_coroutine_yield();
>>>> +    }
>>>> +
>>>> +    g_free(buf);
>>>> +
>>>> +    if (task.status != nb_sectors * BDRV_SECTOR_SIZE) {
>>>> +        return task.status < 0 ? task.status : -EIO;
>>>> +    }
>>>> +
>>>> +    bs->total_sectors = MAX(bs->total_sectors, sector_num + nb_sectors);
>>> Why is this necessary?  block.c will update bs->total_sectors if the
>>> file is growable.
>>>
>>>> +    /* set to -ENOTSUP since bdrv_allocated_file_size is only used
>>>> +     * in qemu-img open. So we can use the cached value for allocate
>>>> +     * filesize obtained from fstat at open time */
>>>> +    client->allocated_file_size = -ENOTSUP;
>>> Can you implement this fully?  By stubbing it out like this we won't be
>>> able to call get_allocated_file_size() at runtime in the future without
>>> updating the nfs block driver code.  It's just an fstat call, shouldn't
>>> be too hard to implement properly :).
>> It seems I have to leave it as is currently. bdrv_get_allocated_file_size
>> is not in a coroutine context. I get coroutine yields to no one.
> Create a coroutine and pump the event loop until it has reached
> completion:
>
> co = qemu_coroutine_create(my_coroutine_fn, ...);
> qemu_coroutine_enter(co, foo);
> while (!complete) {
>      qemu_aio_wait();
> }
>
> See block.c for similar examples.
Wouldn't it make sense to make this modification to bdrv_get_allocated_file_size in
block.c rather than in block/nfs.c and, in the future, potentially other drivers?

If yes, I would ask you to take v3 of the NFS protocol patch and I promise to send
a follow-up early next year to make this modification to block.c and change block/nfs.c
and other implementations to be a coroutine_fn.

Thanks
Peter
Stefan Hajnoczi Dec. 20, 2013, 1:57 p.m. UTC | #32
On Fri, Dec 20, 2013 at 1:53 PM, Peter Lieven <pl@kamp.de> wrote:
> On 20.12.2013 13:19, Stefan Hajnoczi wrote:
>>
>> On Fri, Dec 20, 2013 at 10:48:41AM +0100, Peter Lieven wrote:
>>>
>>> On 17.12.2013 17:47, Stefan Hajnoczi wrote:
>>>>
>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>>> +    /* set to -ENOTSUP since bdrv_allocated_file_size is only used
>>>>> +     * in qemu-img open. So we can use the cached value for allocate
>>>>> +     * filesize obtained from fstat at open time */
>>>>> +    client->allocated_file_size = -ENOTSUP;
>>>>
>>>> Can you implement this fully?  By stubbing it out like this we won't be
>>>> able to call get_allocated_file_size() at runtime in the future without
>>>> updating the nfs block driver code.  It's just an fstat call, shouldn't
>>>> be too hard to implement properly :).
>>>
>>> It seems I have to leave it as is currently. bdrv_get_allocated_file_size
>>> is not in a coroutine context. I get coroutine yields to no one.
>>
>> Create a coroutine and pump the event loop until it has reached
>> completion:
>>
>> co = qemu_coroutine_create(my_coroutine_fn, ...);
>> qemu_coroutine_enter(co, foo);
>> while (!complete) {
>>      qemu_aio_wait();
>> }
>>
>> See block.c for similar examples.
>
> Wouldn't it make sense to make this modification to
> bdrv_get_allocated_file_size in
> block.c rather than in client/nfs.c and in the future potentially other
> drivers?
>
> If yes, I would ask you to take v3 of the NFS protocol patch and I promise
> to send
> a follow up early next year to make this modification to block.c and change
> block/nfs.c
> and other implementations to be a coroutine_fn.

.bdrv_get_allocated_file_size() implementations in other block drivers
are synchronous.  Making the block driver interface use coroutines
would be wrong unless all the block drivers were updated to use
coroutines too.

Can you just call nfs_fstat() (the sync libnfs interface)?

Stefan
Peter Lieven Dec. 20, 2013, 2:07 p.m. UTC | #33
On 20.12.2013 14:57, Stefan Hajnoczi wrote:
> On Fri, Dec 20, 2013 at 1:53 PM, Peter Lieven <pl@kamp.de> wrote:
>> On 20.12.2013 13:19, Stefan Hajnoczi wrote:
>>> On Fri, Dec 20, 2013 at 10:48:41AM +0100, Peter Lieven wrote:
>>>> On 17.12.2013 17:47, Stefan Hajnoczi wrote:
>>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>>>> +    /* set to -ENOTSUP since bdrv_allocated_file_size is only used
>>>>>> +     * in qemu-img open. So we can use the cached value for allocate
>>>>>> +     * filesize obtained from fstat at open time */
>>>>>> +    client->allocated_file_size = -ENOTSUP;
>>>>> Can you implement this fully?  By stubbing it out like this we won't be
>>>>> able to call get_allocated_file_size() at runtime in the future without
>>>>> updating the nfs block driver code.  It's just an fstat call, shouldn't
>>>>> be too hard to implement properly :).
>>>> It seems I have to leave it as is currently. bdrv_get_allocated_file_size
>>>> is not in a coroutine context. I get coroutine yields to no one.
>>> Create a coroutine and pump the event loop until it has reached
>>> completion:
>>>
>>> co = qemu_coroutine_create(my_coroutine_fn, ...);
>>> qemu_coroutine_enter(co, foo);
>>> while (!complete) {
>>>       qemu_aio_wait();
>>> }
>>>
>>> See block.c for similar examples.
>> Wouldn't it make sense to make this modification to
>> bdrv_get_allocated_file_size in
>> block.c rather than in client/nfs.c and in the future potentially other
>> drivers?
>>
>> If yes, I would ask you to take v3 of the NFS protocol patch and I promise
>> to send
>> a follow up early next year to make this modification to block.c and change
>> block/nfs.c
>> and other implementations to be a coroutine_fn.
> .bdrv_get_allocated_file_size() implementations in other block drivers
> are synchronous.  Making the block driver interface use coroutines
> would be wrong unless all the block drivers were updated to use
> coroutines too.
I can do that. I think it's not too complicated because all those
implementations do not rely on callbacks. It should be possible
to just rename the existing implementations to, let's say,
.bdrv_co_get_allocated_file_size and call them inside a coroutine.
>
> Can you just call nfs_fstat() (the sync libnfs interface)?
I can only do that if it's guaranteed that no other requests are in flight;
otherwise it will mess up.

Peter
Stefan Hajnoczi Dec. 20, 2013, 2:38 p.m. UTC | #34
On Fri, Dec 20, 2013 at 3:07 PM, Peter Lieven <pl@kamp.de> wrote:
> On 20.12.2013 14:57, Stefan Hajnoczi wrote:
>>
>> On Fri, Dec 20, 2013 at 1:53 PM, Peter Lieven <pl@kamp.de> wrote:
>>>
>>> On 20.12.2013 13:19, Stefan Hajnoczi wrote:
>>>>
>>>> On Fri, Dec 20, 2013 at 10:48:41AM +0100, Peter Lieven wrote:
>>>>>
>>>>> On 17.12.2013 17:47, Stefan Hajnoczi wrote:
>>>>>>
>>>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>>>>>
>>>>>>> +    /* set to -ENOTSUP since bdrv_allocated_file_size is only used
>>>>>>> +     * in qemu-img open. So we can use the cached value for allocate
>>>>>>> +     * filesize obtained from fstat at open time */
>>>>>>> +    client->allocated_file_size = -ENOTSUP;
>>>>>>
>>>>>> Can you implement this fully?  By stubbing it out like this we won't
>>>>>> be
>>>>>> able to call get_allocated_file_size() at runtime in the future
>>>>>> without
>>>>>> updating the nfs block driver code.  It's just an fstat call,
>>>>>> shouldn't
>>>>>> be too hard to implement properly :).
>>>>>
>>>>> It seems I have to leave it as is currently.
>>>>> bdrv_get_allocated_file_size
>>>>> is not in a coroutine context. I get coroutine yields to no one.
>>>>
>>>> Create a coroutine and pump the event loop until it has reached
>>>> completion:
>>>>
>>>> co = qemu_coroutine_create(my_coroutine_fn, ...);
>>>> qemu_coroutine_enter(co, foo);
>>>> while (!complete) {
>>>>       qemu_aio_wait();
>>>> }
>>>>
>>>> See block.c for similar examples.
>>>
>>> Wouldn't it make sense to make this modification to
>>> bdrv_get_allocated_file_size in
>>> block.c rather than in client/nfs.c and in the future potentially other
>>> drivers?
>>>
>>> If yes, I would ask you to take v3 of the NFS protocol patch and I
>>> promise
>>> to send
>>> a follow up early next year to make this modification to block.c and
>>> change
>>> block/nfs.c
>>> and other implementations to be a coroutine_fn.
>>
>> .bdrv_get_allocated_file_size() implementations in other block drivers
>> are synchronous.  Making the block driver interface use coroutines
>> would be wrong unless all the block drivers were updated to use
>> coroutines too.
>
> I can do that. I think its not too complicated because all those
> implementations do not rely on callbacks. It should be possible
> to just rename the existing implemenations to lets say
> .bdrv_co_get_allocated_file_size and call them inside a coroutine.

No, that would be wrong because coroutine functions should not block.
The point of coroutines is that if they cannot proceed, they must yield
so the event loop regains control.  If you simply rename the functions
to _co_, they will block the event loop and not be true coroutine
functions.
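
To make the distinction concrete, here is a minimal sketch of the shape a true
coroutine_fn takes. do_fstat_async() and StatTask are hypothetical stand-ins; the
QEMU calls are the same ones block/nfs.c already uses, and the headers are assumed
to be those the driver already includes. Instead of blocking, the request is
submitted asynchronously and the coroutine yields until the completion callback
re-enters it via a bottom half.

/* Hedged sketch, not code from this series: do_fstat_async() stands in for
 * any asynchronous stat call whose callback runs from the main event loop. */
typedef struct {
    Coroutine *co;
    QEMUBH *bh;
    int status;
    int complete;
} StatTask;

static void stat_bh_cb(void *opaque)
{
    StatTask *task = opaque;
    qemu_bh_delete(task->bh);
    qemu_coroutine_enter(task->co, NULL);      /* resume the waiting coroutine */
}

static void stat_done_cb(int status, void *opaque)
{
    StatTask *task = opaque;
    task->status = status;
    task->complete = 1;
    task->bh = qemu_bh_new(stat_bh_cb, task);  /* defer re-entry to a bottom half */
    qemu_bh_schedule(task->bh);
}

static int64_t coroutine_fn co_get_allocated_file_size(BlockDriverState *bs)
{
    StatTask task = { .co = qemu_coroutine_self() };
    struct stat st;

    do_fstat_async(bs, &st, stat_done_cb, &task);  /* hypothetical async call */
    while (!task.complete) {
        qemu_coroutine_yield();                    /* let the event loop run */
    }
    return task.status < 0 ? task.status : st.st_blocks * st.st_blksize;
}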

>>
>> Can you just call nfs_fstat() (the sync libnfs interface)?
>
> I can only do that if its guaranteed that no other requests are in flight
> otherwise it will mess up.

How will it mess up?

Stefan
Peter Lieven Dec. 20, 2013, 2:43 p.m. UTC | #35
On 20.12.2013 15:38, Stefan Hajnoczi wrote:
> On Fri, Dec 20, 2013 at 3:07 PM, Peter Lieven <pl@kamp.de> wrote:
>> On 20.12.2013 14:57, Stefan Hajnoczi wrote:
>>> On Fri, Dec 20, 2013 at 1:53 PM, Peter Lieven <pl@kamp.de> wrote:
>>>> On 20.12.2013 13:19, Stefan Hajnoczi wrote:
>>>>> On Fri, Dec 20, 2013 at 10:48:41AM +0100, Peter Lieven wrote:
>>>>>> On 17.12.2013 17:47, Stefan Hajnoczi wrote:
>>>>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>>>>>> +    /* set to -ENOTSUP since bdrv_allocated_file_size is only used
>>>>>>>> +     * in qemu-img open. So we can use the cached value for allocate
>>>>>>>> +     * filesize obtained from fstat at open time */
>>>>>>>> +    client->allocated_file_size = -ENOTSUP;
>>>>>>> Can you implement this fully?  By stubbing it out like this we won't
>>>>>>> be
>>>>>>> able to call get_allocated_file_size() at runtime in the future
>>>>>>> without
>>>>>>> updating the nfs block driver code.  It's just an fstat call,
>>>>>>> shouldn't
>>>>>>> be too hard to implement properly :).
>>>>>> It seems I have to leave it as is currently.
>>>>>> bdrv_get_allocated_file_size
>>>>>> is not in a coroutine context. I get coroutine yields to no one.
>>>>> Create a coroutine and pump the event loop until it has reached
>>>>> completion:
>>>>>
>>>>> co = qemu_coroutine_create(my_coroutine_fn, ...);
>>>>> qemu_coroutine_enter(co, foo);
>>>>> while (!complete) {
>>>>>        qemu_aio_wait();
>>>>> }
>>>>>
>>>>> See block.c for similar examples.
>>>> Wouldn't it make sense to make this modification to
>>>> bdrv_get_allocated_file_size in
>>>> block.c rather than in client/nfs.c and in the future potentially other
>>>> drivers?
>>>>
>>>> If yes, I would ask you to take v3 of the NFS protocol patch and I
>>>> promise
>>>> to send
>>>> a follow up early next year to make this modification to block.c and
>>>> change
>>>> block/nfs.c
>>>> and other implementations to be a coroutine_fn.
>>> .bdrv_get_allocated_file_size() implementations in other block drivers
>>> are synchronous.  Making the block driver interface use coroutines
>>> would be wrong unless all the block drivers were updated to use
>>> coroutines too.
>> I can do that. I think its not too complicated because all those
>> implementations do not rely on callbacks. It should be possible
>> to just rename the existing implemenations to lets say
>> .bdrv_co_get_allocated_file_size and call them inside a coroutine.
> No, that would be wrong because coroutine functions should not block.
> The point of coroutines is that if they cannot proceed they must yield
> so the event loop regains control.  If you simply rename the function
> to _co_ then they will block the event loop and not be true coroutine
> functions.
>
>>> Can you just call nfs_fstat() (the sync libnfs interface)?
>> I can only do that if its guaranteed that no other requests are in flight
>> otherwise it will mess up.
> How will it mess up?
The sync calls into libnfs are just wrappers around the async calls.
The problem is that such a wrapper will handle all the callbacks for the
in-flight requests, and they will never return.

Peter
ronnie sahlberg Dec. 20, 2013, 3:03 p.m. UTC | #36
The sync calls use a trivial event loop built into libnfs using poll().

Mixing the _async() and _sync() interfaces in libnfs means you risk
running nested event loops. Pain and tears lie behind that door.
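
The hazard is easiest to see in a self-contained toy model (deliberately not libnfs
code; every name below is made up). The "sync" call pumps its own loop until its
request is done, and in doing so it also fires the callback of an unrelated request
that was already in flight, from inside that nested loop:

#include <stdbool.h>
#include <stdio.h>

/* Toy async "context": a fixed-size table of pending requests. */
typedef void (*callback_fn)(int status, void *private_data);

typedef struct {
    callback_fn cb;
    void *private_data;
    bool pending;
} Request;

typedef struct {
    Request reqs[4];
} Context;

static int submit_async(Context *ctx, callback_fn cb, void *private_data)
{
    for (int i = 0; i < 4; i++) {
        if (!ctx->reqs[i].pending) {
            ctx->reqs[i] = (Request){ cb, private_data, true };
            return 0;
        }
    }
    return -1;
}

/* One pass of the event loop: completes every ready request and fires its callback. */
static void service(Context *ctx)
{
    for (int i = 0; i < 4; i++) {
        if (ctx->reqs[i].pending) {
            ctx->reqs[i].pending = false;
            ctx->reqs[i].cb(0, ctx->reqs[i].private_data);
        }
    }
}

static void sync_cb(int status, void *private_data)
{
    *(bool *)private_data = true;
}

/* "Sync" wrapper: submit, then pump the loop until *this* request is done.
 * Note that service() also dispatches callbacks of unrelated requests. */
static int sync_call(Context *ctx)
{
    bool done = false;

    if (submit_async(ctx, sync_cb, &done) != 0) {
        return -1;
    }
    while (!done) {
        service(ctx);
    }
    return 0;
}

static void unrelated_cb(int status, void *private_data)
{
    /* In QEMU this callback would try to wake a coroutine that expects to be
     * resumed from the main event loop, not from a loop nested inside it. */
    printf("unrelated in-flight request completed inside the nested loop\n");
}

int main(void)
{
    Context ctx = { 0 };

    submit_async(&ctx, unrelated_cb, NULL);   /* request already in flight */
    sync_call(&ctx);                          /* its private loop fires unrelated_cb too */
    return 0;
}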

On Fri, Dec 20, 2013 at 6:43 AM, Peter Lieven <pl@kamp.de> wrote:
> On 20.12.2013 15:38, Stefan Hajnoczi wrote:
>>
>> On Fri, Dec 20, 2013 at 3:07 PM, Peter Lieven <pl@kamp.de> wrote:
>>>
>>> On 20.12.2013 14:57, Stefan Hajnoczi wrote:
>>>>
>>>> On Fri, Dec 20, 2013 at 1:53 PM, Peter Lieven <pl@kamp.de> wrote:
>>>>>
>>>>> On 20.12.2013 13:19, Stefan Hajnoczi wrote:
>>>>>>
>>>>>> On Fri, Dec 20, 2013 at 10:48:41AM +0100, Peter Lieven wrote:
>>>>>>>
>>>>>>> On 17.12.2013 17:47, Stefan Hajnoczi wrote:
>>>>>>>>
>>>>>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>>>>>>>
>>>>>>>>> +    /* set to -ENOTSUP since bdrv_allocated_file_size is only used
>>>>>>>>> +     * in qemu-img open. So we can use the cached value for
>>>>>>>>> allocate
>>>>>>>>> +     * filesize obtained from fstat at open time */
>>>>>>>>> +    client->allocated_file_size = -ENOTSUP;
>>>>>>>>
>>>>>>>> Can you implement this fully?  By stubbing it out like this we won't
>>>>>>>> be
>>>>>>>> able to call get_allocated_file_size() at runtime in the future
>>>>>>>> without
>>>>>>>> updating the nfs block driver code.  It's just an fstat call,
>>>>>>>> shouldn't
>>>>>>>> be too hard to implement properly :).
>>>>>>>
>>>>>>> It seems I have to leave it as is currently.
>>>>>>> bdrv_get_allocated_file_size
>>>>>>> is not in a coroutine context. I get coroutine yields to no one.
>>>>>>
>>>>>> Create a coroutine and pump the event loop until it has reached
>>>>>> completion:
>>>>>>
>>>>>> co = qemu_coroutine_create(my_coroutine_fn, ...);
>>>>>> qemu_coroutine_enter(co, foo);
>>>>>> while (!complete) {
>>>>>>        qemu_aio_wait();
>>>>>> }
>>>>>>
>>>>>> See block.c for similar examples.
>>>>>
>>>>> Wouldn't it make sense to make this modification to
>>>>> bdrv_get_allocated_file_size in
>>>>> block.c rather than in client/nfs.c and in the future potentially other
>>>>> drivers?
>>>>>
>>>>> If yes, I would ask you to take v3 of the NFS protocol patch and I
>>>>> promise
>>>>> to send
>>>>> a follow up early next year to make this modification to block.c and
>>>>> change
>>>>> block/nfs.c
>>>>> and other implementations to be a coroutine_fn.
>>>>
>>>> .bdrv_get_allocated_file_size() implementations in other block drivers
>>>> are synchronous.  Making the block driver interface use coroutines
>>>> would be wrong unless all the block drivers were updated to use
>>>> coroutines too.
>>>
>>> I can do that. I think its not too complicated because all those
>>> implementations do not rely on callbacks. It should be possible
>>> to just rename the existing implemenations to lets say
>>> .bdrv_co_get_allocated_file_size and call them inside a coroutine.
>>
>> No, that would be wrong because coroutine functions should not block.
>> The point of coroutines is that if they cannot proceed they must yield
>> so the event loop regains control.  If you simply rename the function
>> to _co_ then they will block the event loop and not be true coroutine
>> functions.
>>
>>>> Can you just call nfs_fstat() (the sync libnfs interface)?
>>>
>>> I can only do that if its guaranteed that no other requests are in flight
>>> otherwise it will mess up.
>>
>> How will it mess up?
>
> The sync calls into libnfs are just wrappers around the async calls.
> The problem is that this wrapper will handle all the callbacks for the
> in-flight requests and they will never return.
>
> Peter
>
>
Stefan Hajnoczi Dec. 20, 2013, 3:30 p.m. UTC | #37
On Fri, Dec 20, 2013 at 3:43 PM, Peter Lieven <pl@kamp.de> wrote:
> On 20.12.2013 15:38, Stefan Hajnoczi wrote:
>>
>> On Fri, Dec 20, 2013 at 3:07 PM, Peter Lieven <pl@kamp.de> wrote:
>>>
>>> On 20.12.2013 14:57, Stefan Hajnoczi wrote:
>>>>
>>>> On Fri, Dec 20, 2013 at 1:53 PM, Peter Lieven <pl@kamp.de> wrote:
>>>>>
>>>>> On 20.12.2013 13:19, Stefan Hajnoczi wrote:
>>>>>>
>>>>>> On Fri, Dec 20, 2013 at 10:48:41AM +0100, Peter Lieven wrote:
>>>>>>>
>>>>>>> On 17.12.2013 17:47, Stefan Hajnoczi wrote:
>>>>>>>>
>>>>>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>>>>>>>
>>>>>>>>> +    /* set to -ENOTSUP since bdrv_allocated_file_size is only used
>>>>>>>>> +     * in qemu-img open. So we can use the cached value for
>>>>>>>>> allocate
>>>>>>>>> +     * filesize obtained from fstat at open time */
>>>>>>>>> +    client->allocated_file_size = -ENOTSUP;
>>>>>>>>
>>>>>>>> Can you implement this fully?  By stubbing it out like this we won't
>>>>>>>> be
>>>>>>>> able to call get_allocated_file_size() at runtime in the future
>>>>>>>> without
>>>>>>>> updating the nfs block driver code.  It's just an fstat call,
>>>>>>>> shouldn't
>>>>>>>> be too hard to implement properly :).
>>>>>>>
>>>>>>> It seems I have to leave it as is currently.
>>>>>>> bdrv_get_allocated_file_size
>>>>>>> is not in a coroutine context. I get coroutine yields to no one.
>>>>>>
>>>>>> Create a coroutine and pump the event loop until it has reached
>>>>>> completion:
>>>>>>
>>>>>> co = qemu_coroutine_create(my_coroutine_fn, ...);
>>>>>> qemu_coroutine_enter(co, foo);
>>>>>> while (!complete) {
>>>>>>        qemu_aio_wait();
>>>>>> }
>>>>>>
>>>>>> See block.c for similar examples.
>>>>>
>>>>> Wouldn't it make sense to make this modification to
>>>>> bdrv_get_allocated_file_size in
>>>>> block.c rather than in client/nfs.c and in the future potentially other
>>>>> drivers?
>>>>>
>>>>> If yes, I would ask you to take v3 of the NFS protocol patch and I
>>>>> promise
>>>>> to send
>>>>> a follow up early next year to make this modification to block.c and
>>>>> change
>>>>> block/nfs.c
>>>>> and other implementations to be a coroutine_fn.
>>>>
>>>> .bdrv_get_allocated_file_size() implementations in other block drivers
>>>> are synchronous.  Making the block driver interface use coroutines
>>>> would be wrong unless all the block drivers were updated to use
>>>> coroutines too.
>>>
>>> I can do that. I think its not too complicated because all those
>>> implementations do not rely on callbacks. It should be possible
>>> to just rename the existing implemenations to lets say
>>> .bdrv_co_get_allocated_file_size and call them inside a coroutine.
>>
>> No, that would be wrong because coroutine functions should not block.
>> The point of coroutines is that if they cannot proceed they must yield
>> so the event loop regains control.  If you simply rename the function
>> to _co_ then they will block the event loop and not be true coroutine
>> functions.
>>
>>>> Can you just call nfs_fstat() (the sync libnfs interface)?
>>>
>>> I can only do that if its guaranteed that no other requests are in flight
>>> otherwise it will mess up.
>>
>> How will it mess up?
>
> The sync calls into libnfs are just wrappers around the async calls.
> The problem is that this wrapper will handle all the callbacks for the
> in-flight requests and they will never return.

So back to my original suggestion to use a qemu_aio_wait() loop in block/nfs.c?

Stefan
Peter Lieven Dec. 20, 2013, 3:49 p.m. UTC | #38
On 20.12.2013 16:30, Stefan Hajnoczi wrote:
> On Fri, Dec 20, 2013 at 3:43 PM, Peter Lieven <pl@kamp.de> wrote:
>> On 20.12.2013 15:38, Stefan Hajnoczi wrote:
>>> On Fri, Dec 20, 2013 at 3:07 PM, Peter Lieven <pl@kamp.de> wrote:
>>>> On 20.12.2013 14:57, Stefan Hajnoczi wrote:
>>>>> On Fri, Dec 20, 2013 at 1:53 PM, Peter Lieven <pl@kamp.de> wrote:
>>>>>> On 20.12.2013 13:19, Stefan Hajnoczi wrote:
>>>>>>> On Fri, Dec 20, 2013 at 10:48:41AM +0100, Peter Lieven wrote:
>>>>>>>> On 17.12.2013 17:47, Stefan Hajnoczi wrote:
>>>>>>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>>>>>>>> +    /* set to -ENOTSUP since bdrv_allocated_file_size is only used
>>>>>>>>>> +     * in qemu-img open. So we can use the cached value for
>>>>>>>>>> allocate
>>>>>>>>>> +     * filesize obtained from fstat at open time */
>>>>>>>>>> +    client->allocated_file_size = -ENOTSUP;
>>>>>>>>> Can you implement this fully?  By stubbing it out like this we won't
>>>>>>>>> be
>>>>>>>>> able to call get_allocated_file_size() at runtime in the future
>>>>>>>>> without
>>>>>>>>> updating the nfs block driver code.  It's just an fstat call,
>>>>>>>>> shouldn't
>>>>>>>>> be too hard to implement properly :).
>>>>>>>> It seems I have to leave it as is currently.
>>>>>>>> bdrv_get_allocated_file_size
>>>>>>>> is not in a coroutine context. I get coroutine yields to no one.
>>>>>>> Create a coroutine and pump the event loop until it has reached
>>>>>>> completion:
>>>>>>>
>>>>>>> co = qemu_coroutine_create(my_coroutine_fn, ...);
>>>>>>> qemu_coroutine_enter(co, foo);
>>>>>>> while (!complete) {
>>>>>>>         qemu_aio_wait();
>>>>>>> }
>>>>>>>
>>>>>>> See block.c for similar examples.
>>>>>> Wouldn't it make sense to make this modification to
>>>>>> bdrv_get_allocated_file_size in
>>>>>> block.c rather than in client/nfs.c and in the future potentially other
>>>>>> drivers?
>>>>>>
>>>>>> If yes, I would ask you to take v3 of the NFS protocol patch and I
>>>>>> promise
>>>>>> to send
>>>>>> a follow up early next year to make this modification to block.c and
>>>>>> change
>>>>>> block/nfs.c
>>>>>> and other implementations to be a coroutine_fn.
>>>>> .bdrv_get_allocated_file_size() implementations in other block drivers
>>>>> are synchronous.  Making the block driver interface use coroutines
>>>>> would be wrong unless all the block drivers were updated to use
>>>>> coroutines too.
>>>> I can do that. I think its not too complicated because all those
>>>> implementations do not rely on callbacks. It should be possible
>>>> to just rename the existing implemenations to lets say
>>>> .bdrv_co_get_allocated_file_size and call them inside a coroutine.
>>> No, that would be wrong because coroutine functions should not block.
>>> The point of coroutines is that if they cannot proceed they must yield
>>> so the event loop regains control.  If you simply rename the function
>>> to _co_ then they will block the event loop and not be true coroutine
>>> functions.
>>>
>>>>> Can you just call nfs_fstat() (the sync libnfs interface)?
>>>> I can only do that if its guaranteed that no other requests are in flight
>>>> otherwise it will mess up.
>>> How will it mess up?
>> The sync calls into libnfs are just wrappers around the async calls.
>> The problem is that this wrapper will handle all the callbacks for the
>> in-flight requests and they will never return.
> So back to my original suggestion to use a qemu_aio_wait() loop in block/nfs.c?
Sorry, I cannot follow you, but maybe the following is a short solution. The question
is: what will happen when there are pending requests which invoke callbacks?
Will we end up returning from qemu_aio_wait()? In the qemu-img info case
the following works:

static int64_t nfs_get_allocated_file_size(BlockDriverState *bs)
{
     NFSClient *client = bs->opaque;
     NFSRPC task = {0};
     struct stat st;

     task.st = &st;
     if (nfs_fstat_async(client->context, client->fh, nfs_co_generic_cb,
                         &task) != 0) {
         return -ENOMEM;
     }

     while (!task.complete) {
         nfs_set_events(client);
         qemu_aio_wait();
     }

     return (task.status < 0 ? task.status : st.st_blocks * st.st_blksize);
}

In theory we could also leave .bdrv_get_allocated_file_size completely out. It's just a
nice-to-have so that qemu-img info does not show "unavailable" for disk size.

Peter
Stefan Hajnoczi Dec. 20, 2013, 3:54 p.m. UTC | #39
On Fri, Dec 20, 2013 at 4:49 PM, Peter Lieven <pl@kamp.de> wrote:
> On 20.12.2013 16:30, Stefan Hajnoczi wrote:
>>
>> On Fri, Dec 20, 2013 at 3:43 PM, Peter Lieven <pl@kamp.de> wrote:
>>>
>>> On 20.12.2013 15:38, Stefan Hajnoczi wrote:
>>>>
>>>> On Fri, Dec 20, 2013 at 3:07 PM, Peter Lieven <pl@kamp.de> wrote:
>>>>>
>>>>> On 20.12.2013 14:57, Stefan Hajnoczi wrote:
>>>>>>
>>>>>> On Fri, Dec 20, 2013 at 1:53 PM, Peter Lieven <pl@kamp.de> wrote:
>>>>>>>
>>>>>>> On 20.12.2013 13:19, Stefan Hajnoczi wrote:
>>>>>>>>
>>>>>>>> On Fri, Dec 20, 2013 at 10:48:41AM +0100, Peter Lieven wrote:
>>>>>>>>>
>>>>>>>>> On 17.12.2013 17:47, Stefan Hajnoczi wrote:
>>>>>>>>>>
>>>>>>>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>>>>>>>>>
>>>>>>>>>>> +    /* set to -ENOTSUP since bdrv_allocated_file_size is only
>>>>>>>>>>> used
>>>>>>>>>>> +     * in qemu-img open. So we can use the cached value for
>>>>>>>>>>> allocate
>>>>>>>>>>> +     * filesize obtained from fstat at open time */
>>>>>>>>>>> +    client->allocated_file_size = -ENOTSUP;
>>>>>>>>>>
>>>>>>>>>> Can you implement this fully?  By stubbing it out like this we
>>>>>>>>>> won't
>>>>>>>>>> be
>>>>>>>>>> able to call get_allocated_file_size() at runtime in the future
>>>>>>>>>> without
>>>>>>>>>> updating the nfs block driver code.  It's just an fstat call,
>>>>>>>>>> shouldn't
>>>>>>>>>> be too hard to implement properly :).
>>>>>>>>>
>>>>>>>>> It seems I have to leave it as is currently.
>>>>>>>>> bdrv_get_allocated_file_size
>>>>>>>>> is not in a coroutine context. I get coroutine yields to no one.
>>>>>>>>
>>>>>>>> Create a coroutine and pump the event loop until it has reached
>>>>>>>> completion:
>>>>>>>>
>>>>>>>> co = qemu_coroutine_create(my_coroutine_fn, ...);
>>>>>>>> qemu_coroutine_enter(co, foo);
>>>>>>>> while (!complete) {
>>>>>>>>         qemu_aio_wait();
>>>>>>>> }
>>>>>>>>
>>>>>>>> See block.c for similar examples.
>>>>>>>
>>>>>>> Wouldn't it make sense to make this modification to
>>>>>>> bdrv_get_allocated_file_size in
>>>>>>> block.c rather than in client/nfs.c and in the future potentially
>>>>>>> other
>>>>>>> drivers?
>>>>>>>
>>>>>>> If yes, I would ask you to take v3 of the NFS protocol patch and I
>>>>>>> promise
>>>>>>> to send
>>>>>>> a follow up early next year to make this modification to block.c and
>>>>>>> change
>>>>>>> block/nfs.c
>>>>>>> and other implementations to be a coroutine_fn.
>>>>>>
>>>>>> .bdrv_get_allocated_file_size() implementations in other block drivers
>>>>>> are synchronous.  Making the block driver interface use coroutines
>>>>>> would be wrong unless all the block drivers were updated to use
>>>>>> coroutines too.
>>>>>
>>>>> I can do that. I think its not too complicated because all those
>>>>> implementations do not rely on callbacks. It should be possible
>>>>> to just rename the existing implemenations to lets say
>>>>> .bdrv_co_get_allocated_file_size and call them inside a coroutine.
>>>>
>>>> No, that would be wrong because coroutine functions should not block.
>>>> The point of coroutines is that if they cannot proceed they must yield
>>>> so the event loop regains control.  If you simply rename the function
>>>> to _co_ then they will block the event loop and not be true coroutine
>>>> functions.
>>>>
>>>>>> Can you just call nfs_fstat() (the sync libnfs interface)?
>>>>>
>>>>> I can only do that if its guaranteed that no other requests are in
>>>>> flight
>>>>> otherwise it will mess up.
>>>>
>>>> How will it mess up?
>>>
>>> The sync calls into libnfs are just wrappers around the async calls.
>>> The problem is that this wrapper will handle all the callbacks for the
>>> in-flight requests and they will never return.
>>
>> So back to my original suggestion to use a qemu_aio_wait() loop in
>> block/nfs.c?
>
> sorry, I cannot follow you. but maybe this here is a short solution.
> question
> is, what will happen when there are pending requests which invoke callbacks.
> will we end up returning from qemu_aio_wait? in the qemu-img info case
> this here works:
>
> static int64_t nfs_get_allocated_file_size(BlockDriverState *bs)
> {
>
>     NFSClient *client = bs->opaque;
>     NFSRPC task = {0};
>     struct stat st;
>
>     task.st = &st;
>     if (nfs_fstat_async(client->context, client->fh, nfs_co_generic_cb,
>                         &task) != 0) {
>         return -ENOMEM;
>     }
>
>     while (!task.complete) {
>         nfs_set_events(client);
>         qemu_aio_wait();
>     }
>
>     return (task.status < 0 ? task.status : st.st_blocks * st.st_blksize);
> }

Yes, that's exactly what I had in mind.

Yes, other callbacks will get called and requests will complete in
this event loop.  That can be a problem in some scenarios but should
be okay with bdrv_get_allocated_file_size().

Stefan
Peter Lieven Dec. 20, 2013, 3:57 p.m. UTC | #40
On 20.12.2013 16:54, Stefan Hajnoczi wrote:
> On Fri, Dec 20, 2013 at 4:49 PM, Peter Lieven <pl@kamp.de> wrote:
>> On 20.12.2013 16:30, Stefan Hajnoczi wrote:
>>> On Fri, Dec 20, 2013 at 3:43 PM, Peter Lieven <pl@kamp.de> wrote:
>>>> On 20.12.2013 15:38, Stefan Hajnoczi wrote:
>>>>> On Fri, Dec 20, 2013 at 3:07 PM, Peter Lieven <pl@kamp.de> wrote:
>>>>>> On 20.12.2013 14:57, Stefan Hajnoczi wrote:
>>>>>>> On Fri, Dec 20, 2013 at 1:53 PM, Peter Lieven <pl@kamp.de> wrote:
>>>>>>>> On 20.12.2013 13:19, Stefan Hajnoczi wrote:
>>>>>>>>> On Fri, Dec 20, 2013 at 10:48:41AM +0100, Peter Lieven wrote:
>>>>>>>>>> On 17.12.2013 17:47, Stefan Hajnoczi wrote:
>>>>>>>>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>>>>>>>>>> +    /* set to -ENOTSUP since bdrv_allocated_file_size is only
>>>>>>>>>>>> used
>>>>>>>>>>>> +     * in qemu-img open. So we can use the cached value for
>>>>>>>>>>>> allocate
>>>>>>>>>>>> +     * filesize obtained from fstat at open time */
>>>>>>>>>>>> +    client->allocated_file_size = -ENOTSUP;
>>>>>>>>>>> Can you implement this fully?  By stubbing it out like this we
>>>>>>>>>>> won't
>>>>>>>>>>> be
>>>>>>>>>>> able to call get_allocated_file_size() at runtime in the future
>>>>>>>>>>> without
>>>>>>>>>>> updating the nfs block driver code.  It's just an fstat call,
>>>>>>>>>>> shouldn't
>>>>>>>>>>> be too hard to implement properly :).
>>>>>>>>>> It seems I have to leave it as is currently.
>>>>>>>>>> bdrv_get_allocated_file_size
>>>>>>>>>> is not in a coroutine context. I get coroutine yields to no one.
>>>>>>>>> Create a coroutine and pump the event loop until it has reached
>>>>>>>>> completion:
>>>>>>>>>
>>>>>>>>> co = qemu_coroutine_create(my_coroutine_fn, ...);
>>>>>>>>> qemu_coroutine_enter(co, foo);
>>>>>>>>> while (!complete) {
>>>>>>>>>          qemu_aio_wait();
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> See block.c for similar examples.
>>>>>>>> Wouldn't it make sense to make this modification to
>>>>>>>> bdrv_get_allocated_file_size in
>>>>>>>> block.c rather than in client/nfs.c and in the future potentially
>>>>>>>> other
>>>>>>>> drivers?
>>>>>>>>
>>>>>>>> If yes, I would ask you to take v3 of the NFS protocol patch and I
>>>>>>>> promise
>>>>>>>> to send
>>>>>>>> a follow up early next year to make this modification to block.c and
>>>>>>>> change
>>>>>>>> block/nfs.c
>>>>>>>> and other implementations to be a coroutine_fn.
>>>>>>> .bdrv_get_allocated_file_size() implementations in other block drivers
>>>>>>> are synchronous.  Making the block driver interface use coroutines
>>>>>>> would be wrong unless all the block drivers were updated to use
>>>>>>> coroutines too.
>>>>>> I can do that. I think its not too complicated because all those
>>>>>> implementations do not rely on callbacks. It should be possible
>>>>>> to just rename the existing implemenations to lets say
>>>>>> .bdrv_co_get_allocated_file_size and call them inside a coroutine.
>>>>> No, that would be wrong because coroutine functions should not block.
>>>>> The point of coroutines is that if they cannot proceed they must yield
>>>>> so the event loop regains control.  If you simply rename the function
>>>>> to _co_ then they will block the event loop and not be true coroutine
>>>>> functions.
>>>>>
>>>>>>> Can you just call nfs_fstat() (the sync libnfs interface)?
>>>>>> I can only do that if its guaranteed that no other requests are in
>>>>>> flight
>>>>>> otherwise it will mess up.
>>>>> How will it mess up?
>>>> The sync calls into libnfs are just wrappers around the async calls.
>>>> The problem is that this wrapper will handle all the callbacks for the
>>>> in-flight requests and they will never return.
>>> So back to my original suggestion to use a qemu_aio_wait() loop in
>>> block/nfs.c?
>> sorry, I cannot follow you. but maybe this here is a short solution.
>> question
>> is, what will happen when there are pending requests which invoke callbacks.
>> will we end up returning from qemu_aio_wait? in the qemu-img info case
>> this here works:
>>
>> static int64_t nfs_get_allocated_file_size(BlockDriverState *bs)
>> {
>>
>>      NFSClient *client = bs->opaque;
>>      NFSRPC task = {0};
>>      struct stat st;
>>
>>      task.st = &st;
>>      if (nfs_fstat_async(client->context, client->fh, nfs_co_generic_cb,
>>                          &task) != 0) {
>>          return -ENOMEM;
>>      }
>>
>>      while (!task.complete) {
>>          nfs_set_events(client);
>>          qemu_aio_wait();
>>      }
>>
>>      return (task.status < 0 ? task.status : st.st_blocks * st.st_blksize);
>> }
> Yes, that's exactly what I had in mind.
>
> Yes, other callbacks will get called and requests will complete in
> this event loop.  That can be a problem in some scenarios but should
> be okay with bdrv_get_allocated_file_size().
OK, I will send v4 and then prepare for the holidays ;-)

Peter
Stefan Hajnoczi Dec. 20, 2013, 4:27 p.m. UTC | #41
On Fri, Dec 20, 2013 at 4:57 PM, Peter Lieven <pl@kamp.de> wrote:
> On 20.12.2013 16:54, Stefan Hajnoczi wrote:
>>
>> On Fri, Dec 20, 2013 at 4:49 PM, Peter Lieven <pl@kamp.de> wrote:
>>>
>>> On 20.12.2013 16:30, Stefan Hajnoczi wrote:
>>>>
>>>> On Fri, Dec 20, 2013 at 3:43 PM, Peter Lieven <pl@kamp.de> wrote:
>>>>>
>>>>> On 20.12.2013 15:38, Stefan Hajnoczi wrote:
>>>>>>
>>>>>> On Fri, Dec 20, 2013 at 3:07 PM, Peter Lieven <pl@kamp.de> wrote:
>>>>>>>
>>>>>>> On 20.12.2013 14:57, Stefan Hajnoczi wrote:
>>>>>>>>
>>>>>>>> On Fri, Dec 20, 2013 at 1:53 PM, Peter Lieven <pl@kamp.de> wrote:
>>>>>>>>>
>>>>>>>>> On 20.12.2013 13:19, Stefan Hajnoczi wrote:
>>>>>>>>>>
>>>>>>>>>> On Fri, Dec 20, 2013 at 10:48:41AM +0100, Peter Lieven wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 17.12.2013 17:47, Stefan Hajnoczi wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> +    /* set to -ENOTSUP since bdrv_allocated_file_size is only
>>>>>>>>>>>>> used
>>>>>>>>>>>>> +     * in qemu-img open. So we can use the cached value for
>>>>>>>>>>>>> allocate
>>>>>>>>>>>>> +     * filesize obtained from fstat at open time */
>>>>>>>>>>>>> +    client->allocated_file_size = -ENOTSUP;
>>>>>>>>>>>>
>>>>>>>>>>>> Can you implement this fully?  By stubbing it out like this we
>>>>>>>>>>>> won't
>>>>>>>>>>>> be
>>>>>>>>>>>> able to call get_allocated_file_size() at runtime in the future
>>>>>>>>>>>> without
>>>>>>>>>>>> updating the nfs block driver code.  It's just an fstat call,
>>>>>>>>>>>> shouldn't
>>>>>>>>>>>> be too hard to implement properly :).
>>>>>>>>>>>
>>>>>>>>>>> It seems I have to leave it as is currently.
>>>>>>>>>>> bdrv_get_allocated_file_size
>>>>>>>>>>> is not in a coroutine context. I get coroutine yields to no one.
>>>>>>>>>>
>>>>>>>>>> Create a coroutine and pump the event loop until it has reached
>>>>>>>>>> completion:
>>>>>>>>>>
>>>>>>>>>> co = qemu_coroutine_create(my_coroutine_fn, ...);
>>>>>>>>>> qemu_coroutine_enter(co, foo);
>>>>>>>>>> while (!complete) {
>>>>>>>>>>          qemu_aio_wait();
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> See block.c for similar examples.
>>>>>>>>>
>>>>>>>>> Wouldn't it make sense to make this modification to
>>>>>>>>> bdrv_get_allocated_file_size in
>>>>>>>>> block.c rather than in client/nfs.c and in the future potentially
>>>>>>>>> other
>>>>>>>>> drivers?
>>>>>>>>>
>>>>>>>>> If yes, I would ask you to take v3 of the NFS protocol patch and I
>>>>>>>>> promise
>>>>>>>>> to send
>>>>>>>>> a follow up early next year to make this modification to block.c
>>>>>>>>> and
>>>>>>>>> change
>>>>>>>>> block/nfs.c
>>>>>>>>> and other implementations to be a coroutine_fn.
>>>>>>>>
>>>>>>>> .bdrv_get_allocated_file_size() implementations in other block
>>>>>>>> drivers
>>>>>>>> are synchronous.  Making the block driver interface use coroutines
>>>>>>>> would be wrong unless all the block drivers were updated to use
>>>>>>>> coroutines too.
>>>>>>>
>>>>>>> I can do that. I think its not too complicated because all those
>>>>>>> implementations do not rely on callbacks. It should be possible
>>>>>>> to just rename the existing implemenations to lets say
>>>>>>> .bdrv_co_get_allocated_file_size and call them inside a coroutine.
>>>>>>
>>>>>> No, that would be wrong because coroutine functions should not block.
>>>>>> The point of coroutines is that if they cannot proceed they must yield
>>>>>> so the event loop regains control.  If you simply rename the function
>>>>>> to _co_ then they will block the event loop and not be true coroutine
>>>>>> functions.
>>>>>>
>>>>>>>> Can you just call nfs_fstat() (the sync libnfs interface)?
>>>>>>>
>>>>>>> I can only do that if its guaranteed that no other requests are in
>>>>>>> flight
>>>>>>> otherwise it will mess up.
>>>>>>
>>>>>> How will it mess up?
>>>>>
>>>>> The sync calls into libnfs are just wrappers around the async calls.
>>>>> The problem is that this wrapper will handle all the callbacks for the
>>>>> in-flight requests and they will never return.
>>>>
>>>> So back to my original suggestion to use a qemu_aio_wait() loop in
>>>> block/nfs.c?
>>>
>>> sorry, I cannot follow you. but maybe this here is a short solution.
>>> question
>>> is, what will happen when there are pending requests which invoke
>>> callbacks.
>>> will we end up returning from qemu_aio_wait? in the qemu-img info case
>>> this here works:
>>>
>>> static int64_t nfs_get_allocated_file_size(BlockDriverState *bs)
>>> {
>>>
>>>      NFSClient *client = bs->opaque;
>>>      NFSRPC task = {0};
>>>      struct stat st;
>>>
>>>      task.st = &st;
>>>      if (nfs_fstat_async(client->context, client->fh, nfs_co_generic_cb,
>>>                          &task) != 0) {
>>>          return -ENOMEM;
>>>      }
>>>
>>>      while (!task.complete) {
>>>          nfs_set_events(client);
>>>          qemu_aio_wait();
>>>      }
>>>
>>>      return (task.status < 0 ? task.status : st.st_blocks *
>>> st.st_blksize);
>>> }
>>
>> Yes, that's exactly what I had in mind.
>>
>> Yes, other callbacks will get called and requests will complete in
>> this event loop.  That can be a problem in some scenarios but should
>> be okay with bdrv_get_allocated_file_size().
>
> ok I will send v4 and then prepare for the holidays ;-)

Happy holidays!  I already sent out the block pull request earlier
today but your patch will be included in the first January pull
request.
Peter Lieven Jan. 3, 2014, 10:35 a.m. UTC | #42
On 20.12.2013 17:27, Stefan Hajnoczi wrote:
> On Fri, Dec 20, 2013 at 4:57 PM, Peter Lieven <pl@kamp.de> wrote:
>> On 20.12.2013 16:54, Stefan Hajnoczi wrote:
>>> On Fri, Dec 20, 2013 at 4:49 PM, Peter Lieven <pl@kamp.de> wrote:
>>>> On 20.12.2013 16:30, Stefan Hajnoczi wrote:
>>>>> On Fri, Dec 20, 2013 at 3:43 PM, Peter Lieven <pl@kamp.de> wrote:
>>>>>> On 20.12.2013 15:38, Stefan Hajnoczi wrote:
>>>>>>> On Fri, Dec 20, 2013 at 3:07 PM, Peter Lieven <pl@kamp.de> wrote:
>>>>>>>> On 20.12.2013 14:57, Stefan Hajnoczi wrote:
>>>>>>>>> On Fri, Dec 20, 2013 at 1:53 PM, Peter Lieven <pl@kamp.de> wrote:
>>>>>>>>>> On 20.12.2013 13:19, Stefan Hajnoczi wrote:
>>>>>>>>>>> On Fri, Dec 20, 2013 at 10:48:41AM +0100, Peter Lieven wrote:
>>>>>>>>>>>> On 17.12.2013 17:47, Stefan Hajnoczi wrote:
>>>>>>>>>>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>>>>>>>>>>>> +    /* set to -ENOTSUP since bdrv_allocated_file_size is only
>>>>>>>>>>>>>> used
>>>>>>>>>>>>>> +     * in qemu-img open. So we can use the cached value for
>>>>>>>>>>>>>> allocate
>>>>>>>>>>>>>> +     * filesize obtained from fstat at open time */
>>>>>>>>>>>>>> +    client->allocated_file_size = -ENOTSUP;
>>>>>>>>>>>>> Can you implement this fully?  By stubbing it out like this we
>>>>>>>>>>>>> won't
>>>>>>>>>>>>> be
>>>>>>>>>>>>> able to call get_allocated_file_size() at runtime in the future
>>>>>>>>>>>>> without
>>>>>>>>>>>>> updating the nfs block driver code.  It's just an fstat call,
>>>>>>>>>>>>> shouldn't
>>>>>>>>>>>>> be too hard to implement properly :).
>>>>>>>>>>>> It seems I have to leave it as is currently.
>>>>>>>>>>>> bdrv_get_allocated_file_size
>>>>>>>>>>>> is not in a coroutine context. I get coroutine yields to no one.
>>>>>>>>>>> Create a coroutine and pump the event loop until it has reached
>>>>>>>>>>> completion:
>>>>>>>>>>>
>>>>>>>>>>> co = qemu_coroutine_create(my_coroutine_fn, ...);
>>>>>>>>>>> qemu_coroutine_enter(co, foo);
>>>>>>>>>>> while (!complete) {
>>>>>>>>>>>           qemu_aio_wait();
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> See block.c for similar examples.
>>>>>>>>>> Wouldn't it make sense to make this modification to
>>>>>>>>>> bdrv_get_allocated_file_size in
>>>>>>>>>> block.c rather than in client/nfs.c and in the future potentially
>>>>>>>>>> other
>>>>>>>>>> drivers?
>>>>>>>>>>
>>>>>>>>>> If yes, I would ask you to take v3 of the NFS protocol patch and I
>>>>>>>>>> promise
>>>>>>>>>> to send
>>>>>>>>>> a follow up early next year to make this modification to block.c
>>>>>>>>>> and
>>>>>>>>>> change
>>>>>>>>>> block/nfs.c
>>>>>>>>>> and other implementations to be a coroutine_fn.
>>>>>>>>> .bdrv_get_allocated_file_size() implementations in other block
>>>>>>>>> drivers
>>>>>>>>> are synchronous.  Making the block driver interface use coroutines
>>>>>>>>> would be wrong unless all the block drivers were updated to use
>>>>>>>>> coroutines too.
>>>>>>>> I can do that. I think its not too complicated because all those
>>>>>>>> implementations do not rely on callbacks. It should be possible
>>>>>>>> to just rename the existing implemenations to lets say
>>>>>>>> .bdrv_co_get_allocated_file_size and call them inside a coroutine.
>>>>>>> No, that would be wrong because coroutine functions should not block.
>>>>>>> The point of coroutines is that if they cannot proceed they must yield
>>>>>>> so the event loop regains control.  If you simply rename the function
>>>>>>> to _co_ then they will block the event loop and not be true coroutine
>>>>>>> functions.
>>>>>>>
>>>>>>>>> Can you just call nfs_fstat() (the sync libnfs interface)?
>>>>>>>> I can only do that if its guaranteed that no other requests are in
>>>>>>>> flight
>>>>>>>> otherwise it will mess up.
>>>>>>> How will it mess up?
>>>>>> The sync calls into libnfs are just wrappers around the async calls.
>>>>>> The problem is that this wrapper will handle all the callbacks for the
>>>>>> in-flight requests and they will never return.
>>>>> So back to my original suggestion to use a qemu_aio_wait() loop in
>>>>> block/nfs.c?
>>>> sorry, I cannot follow you. but maybe this here is a short solution.
>>>> question
>>>> is, what will happen when there are pending requests which invoke
>>>> callbacks.
>>>> will we end up returning from qemu_aio_wait? in the qemu-img info case
>>>> this here works:
>>>>
>>>> static int64_t nfs_get_allocated_file_size(BlockDriverState *bs)
>>>> {
>>>>
>>>>       NFSClient *client = bs->opaque;
>>>>       NFSRPC task = {0};
>>>>       struct stat st;
>>>>
>>>>       task.st = &st;
>>>>       if (nfs_fstat_async(client->context, client->fh, nfs_co_generic_cb,
>>>>                           &task) != 0) {
>>>>           return -ENOMEM;
>>>>       }
>>>>
>>>>       while (!task.complete) {
>>>>           nfs_set_events(client);
>>>>           qemu_aio_wait();
>>>>       }
>>>>
>>>>       return (task.status < 0 ? task.status : st.st_blocks *
>>>> st.st_blksize);
>>>> }
>>> Yes, that's exactly what I had in mind.
>>>
>>> Yes, other callbacks will get called and requests will complete in
>>> this event loop.  That can be a problem in some scenarios but should
>>> be okay with bdrv_get_allocated_file_size().
>> ok I will send v4 and then prepare for the holidays ;-)
> Happy holidays!  I already sent out the block pull request earlier
> today but your patch will be included in the first January pull
> request.
If you pick this up, please take v5.

Peter

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index c19133f..f53d184 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -899,6 +899,11 @@  M: Peter Lieven <pl@kamp.de>
 S: Supported
 F: block/iscsi.c
 
+NFS
+M: Peter Lieven <pl@kamp.de>
+S: Maintained
+F: block/nfs.c
+
 SSH
 M: Richard W.M. Jones <rjones@redhat.com>
 S: Supported
diff --git a/block/Makefile.objs b/block/Makefile.objs
index f43ecbc..aa8eaf9 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -12,6 +12,7 @@  block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 ifeq ($(CONFIG_POSIX),y)
 block-obj-y += nbd.o sheepdog.o
 block-obj-$(CONFIG_LIBISCSI) += iscsi.o
+block-obj-$(CONFIG_LIBNFS) += nfs.o
 block-obj-$(CONFIG_CURL) += curl.o
 block-obj-$(CONFIG_RBD) += rbd.o
 block-obj-$(CONFIG_GLUSTERFS) += gluster.o
diff --git a/block/nfs.c b/block/nfs.c
new file mode 100644
index 0000000..006b8cc
--- /dev/null
+++ b/block/nfs.c
@@ -0,0 +1,419 @@ 
+/*
+ * QEMU Block driver for native access to files on NFS shares
+ *
+ * Copyright (c) 2013 Peter Lieven <pl@kamp.de>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "config-host.h"
+
+#include <poll.h>
+#include "qemu-common.h"
+#include "qemu/config-file.h"
+#include "qemu/error-report.h"
+#include "block/block_int.h"
+#include "trace.h"
+#include "qemu/iov.h"
+#include "sysemu/sysemu.h"
+
+#include <nfsc/libnfs-zdr.h>
+#include <nfsc/libnfs.h>
+#include <nfsc/libnfs-raw.h>
+#include <nfsc/libnfs-raw-mount.h>
+
+typedef struct nfsclient {
+    struct nfs_context *context;
+    struct nfsfh *fh;
+    int events;
+    bool has_zero_init;
+    int64_t allocated_file_size;
+} NFSClient;
+
+typedef struct NFSTask {
+    int status;
+    int complete;
+    QEMUIOVector *iov;
+    Coroutine *co;
+    QEMUBH *bh;
+} NFSTask;
+
+static void nfs_process_read(void *arg);
+static void nfs_process_write(void *arg);
+
+static void nfs_set_events(NFSClient *client)
+{
+    int ev = nfs_which_events(client->context);
+    if (ev != client->events) {
+        qemu_aio_set_fd_handler(nfs_get_fd(client->context),
+                      (ev & POLLIN) ? nfs_process_read : NULL,
+                      (ev & POLLOUT) ? nfs_process_write : NULL,
+                      client);
+
+    }
+    client->events = ev;
+}
+
+static void nfs_process_read(void *arg)
+{
+    NFSClient *client = arg;
+    nfs_service(client->context, POLLIN);
+    nfs_set_events(client);
+}
+
+static void nfs_process_write(void *arg)
+{
+    NFSClient *client = arg;
+    nfs_service(client->context, POLLOUT);
+    nfs_set_events(client);
+}
+
+static void nfs_co_init_task(NFSClient *client, NFSTask *Task)
+{
+    *Task = (NFSTask) {
+        .co         = qemu_coroutine_self(),
+    };
+}
+
+static void nfs_co_generic_bh_cb(void *opaque)
+{
+    NFSTask *Task = opaque;
+    qemu_bh_delete(Task->bh);
+    qemu_coroutine_enter(Task->co, NULL);
+}
+
+static void
+nfs_co_generic_cb(int status, struct nfs_context *nfs, void *data,
+                  void *private_data)
+{
+    NFSTask *Task = private_data;
+    Task->complete = 1;
+    Task->status = status;
+    if (Task->status > 0 && Task->iov) {
+        if (Task->status <= Task->iov->size) {
+            qemu_iovec_from_buf(Task->iov, 0, data, Task->status);
+        } else {
+            Task->status = -EIO;
+        }
+    }
+    if (Task->co) {
+        Task->bh = qemu_bh_new(nfs_co_generic_bh_cb, Task);
+        qemu_bh_schedule(Task->bh);
+    }
+}
+
+static int coroutine_fn nfs_co_readv(BlockDriverState *bs,
+                                     int64_t sector_num, int nb_sectors,
+                                     QEMUIOVector *iov)
+{
+    NFSClient *client = bs->opaque;
+    NFSTask task;
+
+    nfs_co_init_task(client, &task);
+    task.iov = iov;
+
+    if (nfs_pread_async(client->context, client->fh,
+                        sector_num * BDRV_SECTOR_SIZE,
+                        nb_sectors * BDRV_SECTOR_SIZE,
+                        nfs_co_generic_cb, &task) != 0) {
+        return -EIO;
+    }
+
+    while (!task.complete) {
+        nfs_set_events(client);
+        qemu_coroutine_yield();
+    }
+
+    if (task.status < 0) {
+        return task.status;
+    }
+
+    return 0;
+}
+
+static int coroutine_fn nfs_co_writev(BlockDriverState *bs,
+                                        int64_t sector_num, int nb_sectors,
+                                        QEMUIOVector *iov)
+{
+    NFSClient *client = bs->opaque;
+    NFSTask task;
+    char *buf = NULL;
+
+    nfs_co_init_task(client, &task);
+
+    buf = g_malloc(nb_sectors * BDRV_SECTOR_SIZE);
+    qemu_iovec_to_buf(iov, 0, buf, nb_sectors * BDRV_SECTOR_SIZE);
+
+    if (nfs_pwrite_async(client->context, client->fh,
+                         sector_num * BDRV_SECTOR_SIZE,
+                         nb_sectors * BDRV_SECTOR_SIZE,
+                         buf, nfs_co_generic_cb, &task) != 0) {
+        g_free(buf);
+        return -EIO;
+    }
+
+    while (!task.complete) {
+        nfs_set_events(client);
+        qemu_coroutine_yield();
+    }
+
+    g_free(buf);
+
+    if (task.status != nb_sectors * BDRV_SECTOR_SIZE) {
+        return task.status < 0 ? task.status : -EIO;
+    }
+
+    bs->total_sectors = MAX(bs->total_sectors, sector_num + nb_sectors);
+    /* set to -ENOTSUP since bdrv_allocated_file_size is only used
+     * in qemu-img open. So we can use the cached value for allocate
+     * filesize obtained from fstat at open time */
+    client->allocated_file_size = -ENOTSUP;
+    return 0;
+}
+
+static int coroutine_fn nfs_co_flush(BlockDriverState *bs)
+{
+    NFSClient *client = bs->opaque;
+    NFSTask task;
+
+    nfs_co_init_task(client, &task);
+
+    if (nfs_fsync_async(client->context, client->fh, nfs_co_generic_cb,
+                        &task) != 0) {
+        return -EIO;
+    }
+
+    while (!task.complete) {
+        nfs_set_events(client);
+        qemu_coroutine_yield();
+    }
+
+    return task.status;
+}
+
+static QemuOptsList runtime_opts = {
+    .name = "nfs",
+    .head = QTAILQ_HEAD_INITIALIZER(runtime_opts.head),
+    .desc = {
+        {
+            .name = "filename",
+            .type = QEMU_OPT_STRING,
+            .help = "URL to the NFS file",
+        },
+        { /* end of list */ }
+    },
+};
+
+static void nfs_file_close(BlockDriverState *bs)
+{
+    NFSClient *client = bs->opaque;
+    if (client->context) {
+        if (client->fh) {
+            nfs_close(client->context, client->fh);
+        }
+        qemu_aio_set_fd_handler(nfs_get_fd(client->context), NULL, NULL, NULL);
+        nfs_destroy_context(client->context);
+    }
+    memset(client, 0, sizeof(NFSClient));
+}
+
+
+static int nfs_file_open_common(BlockDriverState *bs, QDict *options, int flags,
+                         int open_flags, Error **errp)
+{
+    NFSClient *client = bs->opaque;
+    const char *filename;
+    int ret = 0;
+    QemuOpts *opts;
+    Error *local_err = NULL;
+    char *server = NULL, *path = NULL, *file = NULL, *strp;
+    struct stat st;
+
+    opts = qemu_opts_create_nofail(&runtime_opts);
+    qemu_opts_absorb_qdict(opts, options, &local_err);
+    if (error_is_set(&local_err)) {
+        qerror_report_err(local_err);
+        error_free(local_err);
+        ret = -EINVAL;
+        goto fail;
+    }
+
+    filename = qemu_opt_get(opts, "filename");
+
+    client->context = nfs_init_context();
+
+    if (client->context == NULL) {
+        error_setg(errp, "Failed to init NFS context");
+        ret = -EINVAL;
+        goto fail;
+    }
+
+    if (strlen(filename) <= 6) {
+        error_setg(errp, "Invalid server in URL");
+        ret = -EINVAL;
+        goto fail;
+    }
+
+    server = g_strdup(filename + 6);
+    if (server[0] == '/' || server[0] == '\0') {
+        error_setg(errp, "Invalid server in URL");
+        ret = -EINVAL;
+        goto fail;
+    }
+    strp = strchr(server, '/');
+    if (strp == NULL) {
+        error_setg(errp, "Invalid URL specified.\n");
+        ret = -EINVAL;
+        goto fail;
+    }
+    path = g_strdup(strp);
+    *strp = 0;
+    strp = strrchr(path, '/');
+    if (strp == NULL) {
+        error_setg(errp, "Invalid URL specified.\n");
+        ret = -EINVAL;
+        goto fail;
+    }
+    file = g_strdup(strp);
+    *strp = 0;
+
+    ret = nfs_mount(client->context, server, path);
+    if (ret < 0) {
+        error_setg(errp, "Failed to mount nfs share: %s",
+                    nfs_get_error(client->context));
+        goto fail;
+    }
+
+    if (open_flags & O_CREAT) {
+        ret = nfs_creat(client->context, file, 0600, &client->fh);
+        if (ret < 0) {
+            error_setg(errp, "Failed to create file: %s",
+                             nfs_get_error(client->context));
+            goto fail;
+        }
+    } else {
+        open_flags = (flags & BDRV_O_RDWR) ? O_RDWR : O_RDONLY;
+        ret = nfs_open(client->context, file, open_flags, &client->fh);
+        if (ret < 0) {
+            error_setg(errp, "Failed to open file : %s",
+                       nfs_get_error(client->context));
+            goto fail;
+        }
+    }
+
+    ret = nfs_fstat(client->context, client->fh, &st);
+    if (ret < 0) {
+        error_setg(errp, "Failed to fstat file: %s",
+                   nfs_get_error(client->context));
+        goto fail;
+    }
+
+    /* TODO allow reading beyond EOF */
+    bs->total_sectors = DIV_ROUND_UP(st.st_size, BDRV_SECTOR_SIZE);
+    client->has_zero_init = S_ISREG(st.st_mode);
+    client->allocated_file_size = st.st_blocks * st.st_blksize;
+    goto out;
+fail:
+    nfs_file_close(bs);
+out:
+    g_free(server);
+    g_free(path);
+    g_free(file);
+    return ret;
+}
+
+static int nfs_file_open(BlockDriverState *bs, QDict *options, int flags,
+                         Error **errp) {
+    return nfs_file_open_common(bs, options, flags, 0, errp);
+}
+
+static int nfs_file_create(const char *filename, QEMUOptionParameter *options,
+                         Error **errp)
+{
+    int ret = 0;
+    int64_t total_size = 0;
+    BlockDriverState *bs;
+    NFSClient *client = NULL;
+    QDict *bs_options;
+
+    bs = bdrv_new("");
+
+    /* Read out options */
+    while (options && options->name) {
+        if (!strcmp(options->name, "size")) {
+            total_size = options->value.n;
+        }
+        options++;
+    }
+
+    bs->opaque = g_malloc0(sizeof(NFSClient));
+    client = bs->opaque;
+
+    bs_options = qdict_new();
+    qdict_put(bs_options, "filename", qstring_from_str(filename));
+    ret = nfs_file_open_common(bs, bs_options, 0, O_CREAT, NULL);
+    QDECREF(bs_options);
+    if (ret < 0) {
+        goto out;
+    }
+    ret = nfs_ftruncate(client->context, client->fh, total_size);
+out:
+    nfs_file_close(bs);
+    g_free(bs->opaque);
+    bs->opaque = NULL;
+    bdrv_unref(bs);
+    return ret;
+}
+
+static int nfs_has_zero_init(BlockDriverState *bs)
+{
+    NFSClient *client = bs->opaque;
+    return client->has_zero_init;
+}
+
+static int64_t nfs_get_allocated_file_size(BlockDriverState *bs)
+{
+    NFSClient *client = bs->opaque;
+    return client->allocated_file_size;
+}
+
+static BlockDriver bdrv_nfs = {
+    .format_name     = "nfs",
+    .protocol_name   = "nfs",
+
+    .instance_size   = sizeof(NFSClient),
+    .bdrv_needs_filename = true,
+    .bdrv_has_zero_init = nfs_has_zero_init,
+    .bdrv_get_allocated_file_size = nfs_get_allocated_file_size,
+
+    .bdrv_file_open  = nfs_file_open,
+    .bdrv_close      = nfs_file_close,
+    .bdrv_create     = nfs_file_create,
+
+    .bdrv_co_readv         = nfs_co_readv,
+    .bdrv_co_writev        = nfs_co_writev,
+    .bdrv_co_flush_to_disk = nfs_co_flush,
+};
+
+static void nfs_block_init(void)
+{
+    bdrv_register(&bdrv_nfs);
+}
+
+block_init(nfs_block_init);
diff --git a/configure b/configure
index 8144d9f..6c9e47a 100755
--- a/configure
+++ b/configure
@@ -250,6 +250,7 @@  vss_win32_sdk=""
 win_sdk="no"
 want_tools="yes"
 libiscsi=""
+libnfs=""
 coroutine=""
 coroutine_pool=""
 seccomp=""
@@ -833,6 +834,10 @@  for opt do
   ;;
   --enable-libiscsi) libiscsi="yes"
   ;;
+  --disable-libnfs) libnfs="no"
+  ;;
+  --enable-libnfs) libnfs="yes"
+  ;;
   --enable-profiler) profiler="yes"
   ;;
   --disable-cocoa) cocoa="no"
@@ -1202,6 +1207,8 @@  echo "  --enable-spice           enable spice"
 echo "  --enable-rbd             enable building the rados block device (rbd)"
 echo "  --disable-libiscsi       disable iscsi support"
 echo "  --enable-libiscsi        enable iscsi support"
+echo "  --disable-libnfs         disable nfs support"
+echo "  --enable-libnfs          enable nfs support"
 echo "  --disable-smartcard-nss  disable smartcard nss support"
 echo "  --enable-smartcard-nss   enable smartcard nss support"
 echo "  --disable-libusb         disable libusb (for usb passthrough)"
@@ -3050,6 +3057,32 @@  EOF
   fi
 fi
 
+##########################################
+# Do we have libnfs
+if test "$libnfs" != "no" ; then
+  cat > $TMPC << EOF
+#include <nfsc/libnfs-zdr.h>
+#include <nfsc/libnfs.h>
+#include <nfsc/libnfs-raw.h>
+#include <nfsc/libnfs-raw-mount.h>
+int main(void) {
+    nfs_init_context();
+    nfs_pread_async(0,0,0,0,0,0);
+    nfs_pwrite_async(0,0,0,0,0,0,0);
+    nfs_fstat(0,0,0);
+    return 0;
+    }
+EOF
+  if compile_prog "-Werror" "-lnfs" ; then
+    libnfs="yes"
+    LIBS="$LIBS -lnfs"
+  else
+    if test "$libnfs" = "yes" ; then
+      feature_not_found "libnfs"
+    fi
+    libnfs="no"
+  fi
+fi
 
 ##########################################
 # Do we need libm
@@ -3777,6 +3810,7 @@  echo "libusb            $libusb"
 echo "usb net redir     $usb_redir"
 echo "GLX support       $glx"
 echo "libiscsi support  $libiscsi"
+echo "libnfs support    $libnfs"
 echo "build guest agent $guest_agent"
 echo "QGA VSS support   $guest_agent_with_vss"
 echo "seccomp support   $seccomp"
@@ -4107,6 +4141,10 @@  if test "$libiscsi" = "yes" ; then
   echo "CONFIG_LIBISCSI=y" >> $config_host_mak
 fi
 
+if test "$libnfs" = "yes" ; then
+  echo "CONFIG_LIBNFS=y" >> $config_host_mak
+fi
+
 if test "$seccomp" = "yes"; then
   echo "CONFIG_SECCOMP=y" >> $config_host_mak
 fi