Patchwork [1/2] hw/9pfs: Add new virtfs option cache=writethrough to skip host page cache

login
register
mail settings
Submitter Aneesh Kumar K.V
Date Oct. 10, 2011, 9:06 a.m.
Message ID <87obxp9pwm.fsf@skywalker.in.ibm.com>
Download mbox | patch
Permalink /patch/118656/
State New
Headers show

Comments

Aneesh Kumar K.V - Oct. 10, 2011, 9:06 a.m.
On Mon, 10 Oct 2011 05:54:56 +0100, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Sun, Oct 9, 2011 at 7:35 PM, Aneesh Kumar K.V
> <aneesh.kumar@linux.vnet.ibm.com> wrote:
> > On Sun, 9 Oct 2011 17:16:50 +0100, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >> On Sun, Oct 9, 2011 at 4:34 PM, Aneesh Kumar K.V
> >> <aneesh.kumar@linux.vnet.ibm.com> wrote:
> >> > On Sat, 8 Oct 2011 12:24:37 +0100, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >> >> On Fri, Oct 7, 2011 at 7:46 AM, Aneesh Kumar K.V
> >> >> <aneesh.kumar@linux.vnet.ibm.com> wrote:
> >> >> > cache=writethrough implies the file are opened in the host with O_SYNC open flag
> >> >> >
> >> >> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> >> >> > ---
> >> >> >  fsdev/file-op-9p.h         |    1 +
> >> >> >  fsdev/qemu-fsdev.c         |   10 ++++++++--
> >> >> >  fsdev/qemu-fsdev.h         |    2 ++
> >> >> >  hw/9pfs/virtio-9p-device.c |    5 +++++
> >> >> >  hw/9pfs/virtio-9p.c        |   24 ++++++++++++++++++------
> >> >> >  qemu-config.c              |    6 ++++++
> >> >> >  qemu-options.hx            |   17 ++++++++++++-----
> >> >> >  vl.c                       |    6 ++++++
> >> >> >  8 files changed, 58 insertions(+), 13 deletions(-)
> >> >>
> >> >> When would this be used?  For serving up vanilla 9P?
> >> >>
> >> >> I think 9P.u and 9P.l have support for fsync(2) while vanilla 9P does not.
> >> >>
> >> >
> >> > TFSYNC is added by 9p.L. So we would need this for 9p.u.
> >>
> >> I think 9p.u is covered by this wstat hack in
> >> http://ericvh.github.com/9p-rfc/rfc9p2000.html#anchor32:
> >>
> >> "if all the elements of the directory entry in a Twstat message are
> >> ``don't touch'' val- ues, the server may interpret it as a request to
> >> guarantee that the contents of the associated file are committed to
> >> stable storage before the Rwstat message is returned."
> >>
> >> A real TFSYNC operation is nicer though and could be mandatory (the
> >> 9P2000 RFC only says "the server *may*").
> >>
> >> > Another use
> >> > case is to ensure that we don't leave pages on host as dirty. That would
> >> > ensure that large writeback from a guest don't result in large number of
> >> > dirty pages on the host, thereby resulting in writeback in the host. It
> >> > would be needed for predictable I/O behavior in a setup where we have
> >> > multiple guest.
> >>
> >> I see.  I'm mostly curious about this change because the caching modes
> >> are a nightmare with block devices - a lot of time is spent discussing
> >> and benchmarking them, and they cause confusion when configuring KVM.
> >>
> >> It sounds like O_SYNC is being used in order to keep page cache clean.
> >>  But keeping the page cache clean is a side-effect of O_SYNC's
> >> behavior: writing out each page and synchronizing the disk write
> >> cache.  If you are attempting to bypass the page cache, just use
> >> O_DIRECT without O_SYNC.
> >
> > But how about reads. I want to make sure i get to use the page cache for
> > reads and also want to keep the page cache clean.
> >
> >>  O_SYNC is doing the additional disk write
> >> cache synchronization which will slow down I/O and prevent the server
> >> from using disk write cache.  O_SYNC is not the right flag to use.
> >
> > O_DIRECT have additional requirement on buffer alignment, and we don't
> > want to send every read to disk.  VirtFS also support zero copy
> > read/write, so that buffer alignment will always not be possible.
> > We want to make sure writes don't leave the page cache dirty so
> > that host doesn't spent much time in write back of data dirtied by the guest.
> 
> sync_file_range(2) kicks off the write-out but does not flush file
> system metadata or synchronize the disk write cache.
> 

something like below ?

commit bc7c28877b30155264f351d26692b3cceca853bb
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date:   Mon May 23 11:29:15 2011 +0530

    hw/9pfs: Add new virtfs option cache=writethrough to skip host page cache
    
    cache=writethrough implies the file are opened in the host with O_SYNC open flag
    
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Stefan Hajnoczi - Oct. 12, 2011, 9:55 a.m.
On Mon, Oct 10, 2011 at 10:06 AM, Aneesh Kumar K.V
<aneesh.kumar@linux.vnet.ibm.com> wrote:
> diff --git a/hw/9pfs/virtio-9p-handle.c b/hw/9pfs/virtio-9p-handle.c
> index 5c8b5ed..441a37f 100644
> --- a/hw/9pfs/virtio-9p-handle.c
> +++ b/hw/9pfs/virtio-9p-handle.c
> @@ -202,6 +202,15 @@ static ssize_t handle_pwritev(FsContext *ctx, int fd, const struct iovec *iov,
>         return writev(fd, iov, iovcnt);

The sync_file_range(2) call below is dead-code since we'll return
immediately after writev(2) completes.  The writev(2) return value
needs to be saved temporarily and then returned after
sync_file_range(2).

>     }
>  #endif
> +    if (ctx->cache_flags & V9FS_WRITETHROUGH_CACHE) {

-drive cache=writethrough means something different from 9pfs
"writethrough".  This is confusing so I wonder if there is a better
name like immediate write-out.

> +        /*
> +         * Initiate a writeback. This is not a data integrity sync.
> +         * We want to ensure that we don't leave dirty pages in the cache
> +         * after write when cache=writethrough is sepcified.
> +         */
> +        sync_file_range(fd, offset, 0,
> +                        SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE);
> +    }

I'm not sure whether SYNC_FILE_RANGE_WAIT_BEFORE is necessary.  As a
best-effort mechanism just SYNC_FILE_RANGE_WRITE does the job although
a client that rapidly rewrites may be able to leave dirty pages in the
host page cache.  SYNC_FILE_RANGE_WAIT_BEFORE ensures that dirty pages
get written out but it is no longer asynchronous because it blocks.

Stefan
Aneesh Kumar K.V - Oct. 12, 2011, 1:19 p.m.
On Wed, 12 Oct 2011 10:55:21 +0100, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Mon, Oct 10, 2011 at 10:06 AM, Aneesh Kumar K.V
> <aneesh.kumar@linux.vnet.ibm.com> wrote:
> > diff --git a/hw/9pfs/virtio-9p-handle.c b/hw/9pfs/virtio-9p-handle.c
> > index 5c8b5ed..441a37f 100644
> > --- a/hw/9pfs/virtio-9p-handle.c
> > +++ b/hw/9pfs/virtio-9p-handle.c
> > @@ -202,6 +202,15 @@ static ssize_t handle_pwritev(FsContext *ctx, int fd, const struct iovec *iov,
> >         return writev(fd, iov, iovcnt);
> 
> The sync_file_range(2) call below is dead-code since we'll return
> immediately after writev(2) completes.  The writev(2) return value
> needs to be saved temporarily and then returned after
> sync_file_range(2).

Missed that. Will fix in the next update

> 
> >     }
> >  #endif
> > +    if (ctx->cache_flags & V9FS_WRITETHROUGH_CACHE) {
> 
> -drive cache=writethrough means something different from 9pfs
> "writethrough".  This is confusing so I wonder if there is a better
> name like immediate write-out.
> 

cache=immediate-write-out ? 

> > +        /*
> > +         * Initiate a writeback. This is not a data integrity sync.
> > +         * We want to ensure that we don't leave dirty pages in the cache
> > +         * after write when cache=writethrough is sepcified.
> > +         */
> > +        sync_file_range(fd, offset, 0,
> > +                        SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE);
> > +    }
> 
> I'm not sure whether SYNC_FILE_RANGE_WAIT_BEFORE is necessary.  As a
> best-effort mechanism just SYNC_FILE_RANGE_WRITE does the job although
> a client that rapidly rewrites may be able to leave dirty pages in the
> host page cache

Shouldn't we prevent this ? That is the reason for me to use that
WAIT_BEFORE ?

>.  SYNC_FILE_RANGE_WAIT_BEFORE ensures that dirty pages
> get written out but it is no longer asynchronous because it blocks.

-aneesh
Stefan Hajnoczi - Oct. 12, 2011, 2:32 p.m.
On Wed, Oct 12, 2011 at 2:19 PM, Aneesh Kumar K.V
<aneesh.kumar@linux.vnet.ibm.com> wrote:
> On Wed, 12 Oct 2011 10:55:21 +0100, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> On Mon, Oct 10, 2011 at 10:06 AM, Aneesh Kumar K.V
>> <aneesh.kumar@linux.vnet.ibm.com> wrote:
>> > diff --git a/hw/9pfs/virtio-9p-handle.c b/hw/9pfs/virtio-9p-handle.c
>> > index 5c8b5ed..441a37f 100644
>> > --- a/hw/9pfs/virtio-9p-handle.c
>> > +++ b/hw/9pfs/virtio-9p-handle.c
>> > @@ -202,6 +202,15 @@ static ssize_t handle_pwritev(FsContext *ctx, int fd, const struct iovec *iov,
>> >         return writev(fd, iov, iovcnt);
>>
>> The sync_file_range(2) call below is dead-code since we'll return
>> immediately after writev(2) completes.  The writev(2) return value
>> needs to be saved temporarily and then returned after
>> sync_file_range(2).
>
> Missed that. Will fix in the next update
>
>>
>> >     }
>> >  #endif
>> > +    if (ctx->cache_flags & V9FS_WRITETHROUGH_CACHE) {
>>
>> -drive cache=writethrough means something different from 9pfs
>> "writethrough".  This is confusing so I wonder if there is a better
>> name like immediate write-out.
>>
>
> cache=immediate-write-out ?
>
>> > +        /*
>> > +         * Initiate a writeback. This is not a data integrity sync.
>> > +         * We want to ensure that we don't leave dirty pages in the cache
>> > +         * after write when cache=writethrough is sepcified.
>> > +         */
>> > +        sync_file_range(fd, offset, 0,
>> > +                        SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE);
>> > +    }
>>
>> I'm not sure whether SYNC_FILE_RANGE_WAIT_BEFORE is necessary.  As a
>> best-effort mechanism just SYNC_FILE_RANGE_WRITE does the job although
>> a client that rapidly rewrites may be able to leave dirty pages in the
>> host page cache
>
> Shouldn't we prevent this ? That is the reason for me to use that
> WAIT_BEFORE ?

The flag will cause sync_file_range(2) to wait on in-flight I/O.  The
guest will notice slow I/O.  You should at least specify a range
instead of nbytes=0 in the arguments.

Stefan

Patch

diff --git a/fsdev/file-op-9p.h b/fsdev/file-op-9p.h
index 8de8abf..84ec9e0 100644
--- a/fsdev/file-op-9p.h
+++ b/fsdev/file-op-9p.h
@@ -53,12 +53,16 @@  struct xattr_operations;
 /* FsContext flag values */
 #define PATHNAME_FSCONTEXT 0x1
 
+/* cache flags */
+#define V9FS_WRITETHROUGH_CACHE 0x1
+
 typedef struct FsContext
 {
     int flags;
     char *fs_root;
     SecModel fs_sm;
     uid_t uid;
+    int cache_flags;
     struct xattr_operations **xops;
     /* fs driver specific data */
     void *private;
diff --git a/fsdev/qemu-fsdev.c b/fsdev/qemu-fsdev.c
index 768819f..fce016b 100644
--- a/fsdev/qemu-fsdev.c
+++ b/fsdev/qemu-fsdev.c
@@ -34,6 +34,8 @@  int qemu_fsdev_add(QemuOpts *opts)
     const char *fstype = qemu_opt_get(opts, "fstype");
     const char *path = qemu_opt_get(opts, "path");
     const char *sec_model = qemu_opt_get(opts, "security_model");
+    const char *cache = qemu_opt_get(opts, "cache");
+
 
     if (!fsdev_id) {
         fprintf(stderr, "fsdev: No id specified\n");
@@ -72,10 +74,14 @@  int qemu_fsdev_add(QemuOpts *opts)
     fsle->fse.path = g_strdup(path);
     fsle->fse.security_model = g_strdup(sec_model);
     fsle->fse.ops = FsTypes[i].ops;
-
+    fsle->fse.cache_model = 0;
+    if (cache) {
+        if (!strcmp(cache, "writethrough")) {
+            fsle->fse.cache_model = V9FS_WRITETHROUGH_CACHE;
+        }
+    }
     QTAILQ_INSERT_TAIL(&fstype_entries, fsle, next);
     return 0;
-
 }
 
 FsTypeEntry *get_fsdev_fsentry(char *id)
diff --git a/fsdev/qemu-fsdev.h b/fsdev/qemu-fsdev.h
index e04931a..9c440f2 100644
--- a/fsdev/qemu-fsdev.h
+++ b/fsdev/qemu-fsdev.h
@@ -41,6 +41,7 @@  typedef struct FsTypeEntry {
     char *fsdev_id;
     char *path;
     char *security_model;
+    int cache_flags;
     FileOperations *ops;
 } FsTypeEntry;
 
diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
index e5b68da..b23e8a4 100644
--- a/hw/9pfs/virtio-9p-device.c
+++ b/hw/9pfs/virtio-9p-device.c
@@ -115,6 +115,7 @@  VirtIODevice *virtio_9p_init(DeviceState *dev, V9fsConf *conf)
         exit(1);
     }
 
+    s->ctx.cache_flags = fse->cache_flags;
     s->ctx.fs_root = g_strdup(fse->path);
     len = strlen(conf->tag);
     if (len > MAX_TAG_LEN) {
diff --git a/hw/9pfs/virtio-9p-handle.c b/hw/9pfs/virtio-9p-handle.c
index 5c8b5ed..441a37f 100644
--- a/hw/9pfs/virtio-9p-handle.c
+++ b/hw/9pfs/virtio-9p-handle.c
@@ -202,6 +202,15 @@  static ssize_t handle_pwritev(FsContext *ctx, int fd, const struct iovec *iov,
         return writev(fd, iov, iovcnt);
     }
 #endif
+    if (ctx->cache_flags & V9FS_WRITETHROUGH_CACHE) {
+        /*
+         * Initiate a writeback. This is not a data integrity sync.
+         * We want to ensure that we don't leave dirty pages in the cache
+         * after write when cache=writethrough is sepcified.
+         */
+        sync_file_range(fd, offset, 0,
+                        SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE);
+    }
 }
 
 static int handle_chmod(FsContext *fs_ctx, V9fsPath *fs_path, FsCred *credp)
diff --git a/hw/9pfs/virtio-9p-local.c b/hw/9pfs/virtio-9p-local.c
index 9559ff6..5745d14 100644
--- a/hw/9pfs/virtio-9p-local.c
+++ b/hw/9pfs/virtio-9p-local.c
@@ -213,6 +213,15 @@  static ssize_t local_pwritev(FsContext *ctx, int fd, const struct iovec *iov,
         return writev(fd, iov, iovcnt);
     }
 #endif
+    if (ctx->cache_flags & V9FS_WRITETHROUGH_CACHE) {
+        /*
+         * Initiate a writeback. This is not a data integrity sync.
+         * We want to ensure that we don't leave dirty pages in the cache
+         * after write when cache=writethrough is sepcified.
+         */
+        sync_file_range(fd, offset, 0,
+                        SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE);
+    }
 }
 
 static int local_chmod(FsContext *fs_ctx, V9fsPath *fs_path, FsCred *credp)
diff --git a/hw/9pfs/virtio-9p.c b/hw/9pfs/virtio-9p.c
index c01c31a..3958788 100644
--- a/hw/9pfs/virtio-9p.c
+++ b/hw/9pfs/virtio-9p.c
@@ -80,6 +80,20 @@  void cred_init(FsCred *credp)
     credp->fc_rdev = -1;
 }
 
+static int get_dotl_openflags(V9fsState *s, int oflags)
+{
+    int flags;
+    /*
+     * Filter the client open flags
+     */
+    flags = oflags & ~(O_NOCTTY | O_ASYNC | O_CREAT);
+    /*
+     * Ignore direct disk access hint until the server supports it.
+     */
+    flags &= ~O_DIRECT;
+    return flags;
+}
+
 void v9fs_string_init(V9fsString *str)
 {
     str->data = NULL;
@@ -1598,10 +1612,7 @@  static void v9fs_open(void *opaque)
         err = offset;
     } else {
         if (s->proto_version == V9FS_PROTO_2000L) {
-            flags = mode;
-            flags &= ~(O_NOCTTY | O_ASYNC | O_CREAT);
-            /* Ignore direct disk access hint until the server supports it. */
-            flags &= ~O_DIRECT;
+            flags = get_dotl_openflags(s, mode);
         } else {
             flags = omode_to_uflags(mode);
         }
@@ -1650,8 +1661,7 @@  static void v9fs_lcreate(void *opaque)
         goto out_nofid;
     }
 
-    /* Ignore direct disk access hint until the server supports it. */
-    flags &= ~O_DIRECT;
+    flags = get_dotl_openflags(pdu->s, flags);
     err = v9fs_co_open2(pdu, fidp, &name, gid,
                         flags | O_CREAT, mode, &stbuf);
     if (err < 0) {
diff --git a/qemu-config.c b/qemu-config.c
index 7a7854f..b2ab0b2 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -177,6 +177,9 @@  QemuOptsList qemu_fsdev_opts = {
         }, {
             .name = "security_model",
             .type = QEMU_OPT_STRING,
+        }, {
+            .name = "cache",
+            .type = QEMU_OPT_STRING,
         },
         { /*End of list */ }
     },
@@ -199,6 +202,9 @@  QemuOptsList qemu_virtfs_opts = {
         }, {
             .name = "security_model",
             .type = QEMU_OPT_STRING,
+        }, {
+            .name = "cache",
+            .type = QEMU_OPT_STRING,
         },
 
         { /*End of list */ }
diff --git a/qemu-options.hx b/qemu-options.hx
index dfbabd0..38f0aef 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -525,7 +525,8 @@  ETEXI
 DEFHEADING(File system options:)
 
 DEF("fsdev", HAS_ARG, QEMU_OPTION_fsdev,
-    "-fsdev local,id=id,path=path,security_model=[mapped|passthrough|none]\n",
+    "-fsdev local,id=id,path=path,security_model=[mapped|passthrough|none]\n"
+    "       [,cache=writethrough]\n",
     QEMU_ARCH_ALL)
 
 STEXI
@@ -541,7 +542,7 @@  The specific Fstype will determine the applicable options.
 
 Options to each backend are described below.
 
-@item -fsdev local ,id=@var{id} ,path=@var{path} ,security_model=@var{security_model}
+@item -fsdev local ,id=@var{id} ,path=@var{path} ,security_model=@var{security_model}[,cache=@var{cache}]
 
 Create a file-system-"device" for local-filesystem.
 
@@ -552,13 +553,17 @@  Create a file-system-"device" for local-filesystem.
 @option{security_model} specifies the security model to be followed.
 @option{security_model} is required.
 
+@option{cache} specifies whether to skip the host page cache.
+@option{cache} is an optional argument.
+
 @end table
 ETEXI
 
 DEFHEADING(Virtual File system pass-through options:)
 
 DEF("virtfs", HAS_ARG, QEMU_OPTION_virtfs,
-    "-virtfs local,path=path,mount_tag=tag,security_model=[mapped|passthrough|none]\n",
+    "-virtfs local,path=path,mount_tag=tag,security_model=[mapped|passthrough|none]\n"
+    "        [,cache=writethrough]\n",
     QEMU_ARCH_ALL)
 
 STEXI
@@ -574,7 +579,7 @@  The specific Fstype will determine the applicable options.
 
 Options to each backend are described below.
 
-@item -virtfs local ,path=@var{path} ,mount_tag=@var{mount_tag} ,security_model=@var{security_model}
+@item -virtfs local ,path=@var{path} ,mount_tag=@var{mount_tag} ,security_model=@var{security_model}[,cache=@var{cache}]
 
 Create a Virtual file-system-pass through for local-filesystem.
 
@@ -585,10 +590,12 @@  Create a Virtual file-system-pass through for local-filesystem.
 @option{security_model} specifies the security model to be followed.
 @option{security_model} is required.
 
-
 @option{mount_tag} specifies the tag with which the exported file is mounted.
 @option{mount_tag} is required.
 
+@option{cache} specifies whether to skip the host page cache.
+@option{cache} is an optional argument.
+
 @end table
 ETEXI
 
diff --git a/vl.c b/vl.c
index bd4a5ce..6760e39 100644
--- a/vl.c
+++ b/vl.c
@@ -2794,6 +2794,7 @@  int main(int argc, char **argv, char **envp)
             case QEMU_OPTION_virtfs: {
                 QemuOpts *fsdev;
                 QemuOpts *device;
+                const char *cache;
 
                 olist = qemu_find_opts("virtfs");
                 if (!olist) {
@@ -2823,6 +2824,11 @@  int main(int argc, char **argv, char **envp)
                             qemu_opt_get(opts, "mount_tag"));
                     exit(1);
                 }
+
+                cache = qemu_opt_get(opts, "cache");
+                if (cache) {
+                    qemu_opt_set(fsdev, "cache", cache);
+                }
                 qemu_opt_set(fsdev, "fstype", qemu_opt_get(opts, "fstype"));
                 qemu_opt_set(fsdev, "path", qemu_opt_get(opts, "path"));
                 qemu_opt_set(fsdev, "security_model",