
raw-posix: Linearize direct I/O on Linux NFS

Message ID 1302874855-14736-1-git-send-email-stefanha@linux.vnet.ibm.com
State New

Commit Message

Stefan Hajnoczi April 15, 2011, 1:40 p.m. UTC
The Linux NFS client issues separate NFS requests for vectored direct
I/O writes.  For example, a pwritev() with 8 elements results in 8 write
requests to the server.  This is very inefficient and a kernel-side fix
is not trivial or likely to be available soon.

This patch detects files on NFS and uses the QEMU_AIO_MISALIGNED flag to
force requests to bounce through a linear buffer.
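
For illustration only (the patch itself reuses QEMU's existing
QEMU_AIO_MISALIGNED bounce-buffer path rather than adding new I/O code),
the linearization idea boils down to a hypothetical helper along these
lines:

#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the elements of an I/O vector into one
 * aligned bounce buffer and issue a single pwrite() instead of the
 * pwritev() that the Linux NFS client would split into one NFS WRITE
 * per element.
 */
static ssize_t linearized_pwrite(int fd, const struct iovec *iov,
                                 int iovcnt, off_t offset)
{
    size_t total = 0, pos = 0;
    void *bounce;
    ssize_t ret;
    int i;

    for (i = 0; i < iovcnt; i++) {
        total += iov[i].iov_len;
    }

    /* O_DIRECT requires an aligned buffer; 4096 covers common sector sizes. */
    if (posix_memalign(&bounce, 4096, total) != 0) {
        return -1;
    }

    for (i = 0; i < iovcnt; i++) {
        memcpy((char *)bounce + pos, iov[i].iov_base, iov[i].iov_len);
        pos += iov[i].iov_len;
    }

    ret = pwrite(fd, bounce, total, offset);
    free(bounce);
    return ret;
}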

Khoa Huynh <khoa@us.ibm.com> reports the following ffsb benchmark
results over 1 Gbit Ethernet:

Test (threads=8)               unpatched patched
                                  (MB/s)  (MB/s)
Large File Creates (bs=256 KB)      20.5   112.0
Sequential Reads (bs=256 KB)        58.7   112.0
Large File Creates (bs=8 KB)         5.2     5.8
Sequential Reads (bs=8 KB)          46.7    80.9
Random Reads (bs=8 KB)               8.7    23.4
Random Writes (bs=8 KB)             39.6    44.0
Mail Server (bs=8 KB)               10.2    23.6

Test (threads=1)               unpatched patched
                                  (MB/s)  (MB/s)
Large File Creates (bs=256 KB)      14.5    49.8
Sequential Reads (bs=256 KB)        87.9    83.9
Large File Creates (bs=8 KB)         4.8     4.8
Sequential Reads (bs=8 KB)          23.2    23.1
Random Reads (bs=8 KB)               4.8     4.7
Random Writes (bs=8 KB)              9.4    12.8
Mail Server (bs=8 KB)                5.4     7.3

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 block/raw-posix.c |   55 ++++++++++++++++++++++++++++++++++++++++++++--------
 1 files changed, 46 insertions(+), 9 deletions(-)

Comments

Christoph Hellwig April 15, 2011, 3:05 p.m. UTC | #1
NAK.  Just wait for the bloody NFS client fix to get in instead of
adding crap like that.
Stefan Hajnoczi April 15, 2011, 3:26 p.m. UTC | #2
On Fri, Apr 15, 2011 at 4:05 PM, Christoph Hellwig <hch@lst.de> wrote:
> NAK.  Just wait for the bloody NFS client fix to get in instead of
> adding crap like that.

That's totally fine if the NFS client will be fixed in the near future, but
this doesn't seem likely:

http://www.spinics.net/lists/linux-nfs/msg20462.html

Stefan
Christoph Hellwig April 15, 2011, 3:34 p.m. UTC | #3
On Fri, Apr 15, 2011 at 04:26:41PM +0100, Stefan Hajnoczi wrote:
> On Fri, Apr 15, 2011 at 4:05 PM, Christoph Hellwig <hch@lst.de> wrote:
> > NAK.  Just wait for the bloody NFS client fix to get in instead of
> > adding crap like that.
> 
> That's totally fine if NFS client will be fixed in the near future but
> this doesn't seem likely:
> 
> http://www.spinics.net/lists/linux-nfs/msg20462.html

The code to use preadv/pwritev has been in qemu for over 2 years,
and it took people to notice the NFS slowdown until now, so don't
expect it to be fixed three days later.

I can't even see you in the relevant threads arguing for getting it
fixed, so don't complain.
Anthony Liguori April 15, 2011, 4:10 p.m. UTC | #4
On 04/15/2011 10:34 AM, Christoph Hellwig wrote:
> On Fri, Apr 15, 2011 at 04:26:41PM +0100, Stefan Hajnoczi wrote:
>> On Fri, Apr 15, 2011 at 4:05 PM, Christoph Hellwig<hch@lst.de>  wrote:
>>> NAK.  Just wait for the bloody NFS client fix to get in instead of
>>> adding crap like that.
>> That's totally fine if NFS client will be fixed in the near future but
>> this doesn't seem likely:
>>
>> http://www.spinics.net/lists/linux-nfs/msg20462.html
> The code to use preadv/pwritev has been in qemu for over 2 years,
> and it took people to notice the NFS slowdown until now, so don't
> expect it to be fixed three days later.
>
> I can't even see you in the relevant threads arguing for getting it
> fixed, so don't complain.

In general, since we are userspace, we should try to run well on 
whatever kernel we're on.

What I don't like about this patch is the likelihood of false
positives.  We check for Linux and for NFS, but that means an old
userspace is doing suboptimal things on newer kernels.  Even if we had a
kernel version check, if the fix gets backported to an older kernel,
we'll still get a false positive.

Ideally, we'd be able to query the kernel to see whether we should 
bounce or not.  But AFAIK there is nothing even close to an interface to 
do this today.

Regards,

Anthony Liguori
Stefan Hajnoczi April 15, 2011, 4:17 p.m. UTC | #5
On Fri, Apr 15, 2011 at 5:10 PM, Anthony Liguori <aliguori@us.ibm.com> wrote:
> On 04/15/2011 10:34 AM, Christoph Hellwig wrote:
>>
>> On Fri, Apr 15, 2011 at 04:26:41PM +0100, Stefan Hajnoczi wrote:
>>>
>>> On Fri, Apr 15, 2011 at 4:05 PM, Christoph Hellwig<hch@lst.de>  wrote:
>>>>
>>>> NAK.  Just wait for the bloody NFS client fix to get in instead of
>>>> adding crap like that.
>>>
>>> That's totally fine if NFS client will be fixed in the near future but
>>> this doesn't seem likely:
>>>
>>> http://www.spinics.net/lists/linux-nfs/msg20462.html
>>
>> The code to use preadv/pwritev has been in qemu for over 2 years,
>> and it took people to notice the NFS slowdown until now, so don't
>> expect it to be fixed three days later.
>>
>> I can't even see you in the relevant threads arguing for getting it
>> fixed, so don't complain.
>
> In general, since we are userspace, we should try to run well on whatever
> kernel we're on.
>
> What I don't like about this patch is the likelihood of false positives.
>  We check for Linux and for NFS, but that means an old userspace is doing
> suboptimal things on newer kernels.  Even if we had a kernel version check,
> if the fix gets backported to an older kernel, we'll still get a false
> positive.
>
> Ideally, we'd be able to query the kernel to see whether we should bounce or
> not.  But AFAIK there is nothing even close to an interface to do this
> today.

Bah, good point.  I was planning to sneak in a uname() later but that
really doesn't cut it due to backports/downstream.
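
For the record, the kind of check I had in mind would have looked roughly
like the sketch below (illustrative only; kernel_older_than() is a
hypothetical helper and the version to compare against is left as a
parameter) - and a backported fix makes it give exactly the wrong answer:

#include <stdbool.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Hypothetical helper: return true if the running kernel is older than
 * major.minor.  A fix backported to an older kernel defeats the check,
 * which is exactly the backports/downstream problem. */
static bool kernel_older_than(int major, int minor)
{
    struct utsname uts;
    int run_major, run_minor;

    if (uname(&uts) != 0 ||
        sscanf(uts.release, "%d.%d", &run_major, &run_minor) != 2) {
        return false;   /* assume new enough if we cannot tell */
    }
    return run_major < major ||
           (run_major == major && run_minor < minor);
}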

Stefan
Badari Pulavarty April 15, 2011, 4:23 p.m. UTC | #6
On Fri, 2011-04-15 at 17:34 +0200, Christoph Hellwig wrote:
> On Fri, Apr 15, 2011 at 04:26:41PM +0100, Stefan Hajnoczi wrote:
> > On Fri, Apr 15, 2011 at 4:05 PM, Christoph Hellwig <hch@lst.de> wrote:
> > > NAK.  Just wait for the bloody NFS client fix to get in instead of
> > > adding crap like that.
> > 
> > That's totally fine if NFS client will be fixed in the near future but
> > this doesn't seem likely:
> > 
> > http://www.spinics.net/lists/linux-nfs/msg20462.html
> 
> The code to use preadv/pwritev has been in qemu for over 2 years,
> and it took people to notice the NFS slowdown until now, so don't
> expect it to be fixed three days later.

True. That brings up a different question - whether we are doing
enough testing on mainline QEMU :(

> I can't even see you in the relevant threads arguing for getting it
> fixed, so don't complain. 

I asked Stefan to work on a QEMU patch - since my QEMU skills are
limited and Stefan is a more appropriate person to deal with QEMU :)

As mentioned earlier, we don't see the performance of any of the
temporary solutions suggested so far (RPC table slot + ASYNC) coming
anywhere close to the readv/writev support patch or a linearized QEMU.

Thanks,
Badari
Christoph Hellwig April 15, 2011, 5:27 p.m. UTC | #7
On Fri, Apr 15, 2011 at 11:10:37AM -0500, Anthony Liguori wrote:
> In general, since we are userspace, we should try to run well on whatever 
> kernel we're on.

It should run with the interfaces given, but hacking around performance
bugs in a gross way is not something qemu should do.
Christoph Hellwig April 15, 2011, 5:29 p.m. UTC | #8
On Fri, Apr 15, 2011 at 09:23:54AM -0700, Badari Pulavarty wrote:
> True. That brings up a different question - whether we are doing
> enough testing on mainline QEMU :(

It seems you're clearly not doing enough testing on any qemu.  Even
the RHEL6 qemu has had preadv/pwritev since the first beta.

> I asked Stefan to work on a QEMU patch - since my QEMU skills are
> limited and Stefan is more appropriate person to deal with QEMU :)
> 
> As mentioned earlier, we don't see the performance of any of the
> temporary solutions suggested so far (RPC table slot + ASYNC) coming
> anywhere close to the readv/writev support patch or a linearized QEMU.

So push harder for the real patch.  I haven't seen a clear NAK from
Trond, just that he'd like to avoid it.  For extra brownie points get
him to release his read/write rework and integrate it with that.
Anthony Liguori April 15, 2011, 6:09 p.m. UTC | #9
On 04/15/2011 11:23 AM, Badari Pulavarty wrote:
> On Fri, 2011-04-15 at 17:34 +0200, Christoph Hellwig wrote:
>> On Fri, Apr 15, 2011 at 04:26:41PM +0100, Stefan Hajnoczi wrote:
>>> On Fri, Apr 15, 2011 at 4:05 PM, Christoph Hellwig<hch@lst.de>  wrote:
>>>> NAK.  Just wait for the bloody NFS client fix to get in instead of
>>>> adding crap like that.
>>> That's totally fine if NFS client will be fixed in the near future but
>>> this doesn't seem likely:
>>>
>>> http://www.spinics.net/lists/linux-nfs/msg20462.html
>> The code to use preadv/pwritev has been in qemu for over 2 years,
>> and it took people to notice the NFS slowdown until now, so don't
>> expect it to be fixed three days later.
> True. That brings up a different question - whether we are doing
> enough testing on mainline QEMU :(

The issue here is NFS, not QEMU.  Moreover, the real problem is that 
we're using O_DIRECT.  O_DIRECT seems to result in nothing but problems 
and it never seems to be tested well on any file system.

I think the fundamental problem we keep running into really boils down 
to O_DIRECT being a second class interface within Linux.

Regards,

Anthony Liguori
Badari Pulavarty April 15, 2011, 6:25 p.m. UTC | #10
On Fri, 2011-04-15 at 13:09 -0500, Anthony Liguori wrote:
> On 04/15/2011 11:23 AM, Badari Pulavarty wrote:
> > On Fri, 2011-04-15 at 17:34 +0200, Christoph Hellwig wrote:
> >> On Fri, Apr 15, 2011 at 04:26:41PM +0100, Stefan Hajnoczi wrote:
> >>> On Fri, Apr 15, 2011 at 4:05 PM, Christoph Hellwig<hch@lst.de>  wrote:
> >>>> NAK.  Just wait for the bloody NFS client fix to get in instead of
> >>>> adding crap like that.
> >>> That's totally fine if NFS client will be fixed in the near future but
> >>> this doesn't seem likely:
> >>>
> >>> http://www.spinics.net/lists/linux-nfs/msg20462.html
> >> The code to use preadv/pwritev has been in qemu for over 2 years,
> >> and it took people to notice the NFS slowdown until now, so don't
> >> expect it to be fixed three days later.
> > True. That brings up a different question - whether we are doing
> > enough testing on mainline QEMU :(
> 
> The issue here is NFS, not QEMU.  

Sure. But we should have caught the regression on NFS when the
preadv/pwritev change went into QEMU, or before it went in -- shouldn't
we? Since that was 2 years ago (as hch indicated), we could have
fixed NFS long ago :)

> Moreover, the real problem is that 
> we're using O_DIRECT.  O_DIRECT seems to result in nothing but problems 
> and it never seems to be tested well on any file system.

O_DIRECT was added with a specific use-case in mind - and it's supposed
to handle only that case well (pre-allocated files with database usage -
where the DB has its own caching layer). That case is well tested
by various DB vendors on most (important) local filesystems.

You know very well why we are on the O_DIRECT path :)

> 
> I think the fundamental problem we keep running into really boils down 
> to O_DIRECT being a second class interface within Linux.

It's by design. It's a special case for a specific use.

Thanks,
Badari
Badari Pulavarty April 15, 2011, 10:21 p.m. UTC | #11
On 4/15/2011 10:29 AM, Christoph Hellwig wrote:
> On Fri, Apr 15, 2011 at 09:23:54AM -0700, Badari Pulavarty wrote:
>    
>> True. That brings up a different question - whether we are doing
>> enough testing on mainline QEMU :(
>>      
> It seems you're clearly not doing enough testing on any qemu.  Even
> the RHEL6 qemu has had preadv/pwritev since the first beta.
>    

Christoph,

When you say "you're" - you really mean RH, right? RH should have
caught this in their regression testing a year ago, as part of their
first beta. Correct?

Unfortunately, you are picking on the person who spent time finding &
analyzing the regression, narrowing the problem area, and suggesting
approaches to address the issue :(

BTW, I am not sure an RH discussion is appropriate in this forum.

Thanks,
Badari
Anthony Liguori April 15, 2011, 11 p.m. UTC | #12
On 04/15/2011 05:21 PM, pbadari@linux.vnet.ibm.com wrote:
> On 4/15/2011 10:29 AM, Christoph Hellwig wrote:
>> On Fri, Apr 15, 2011 at 09:23:54AM -0700, Badari Pulavarty wrote:
>>> True. That brings up a different question - whether we are doing
>>> enough testing on mainline QEMU :(
>> It seems you're clearly not doing enough testing on any qemu.  Even
>> the RHEL6 qemu has had preadv/pwritev since the first beta.
>
> Christoph,
>
> When you say "you're" - you really mean RH, right? RH should have
> caught this in their regression testing a year ago, as part of their
> first beta. Correct?
>
> Unfortunately, you are picking on the person who spent time finding &
> analyzing the regression, narrowing the problem area, and suggesting
> approaches to address the issue :(

This is a pretty silly discussion to be having.

The facts are:

1) NFS sucks with preadv/pwritev and O_DIRECT -- is anyone really surprised?

2) We could work around this in QEMU by doing something ugly

3) We have no way to detect when we no longer need a work around which 
makes (2) really unappealing.

4) That leaves us with:
     a) waiting for NFS to get fixed properly and just living with worse 
performance on older kernels

     b) having a user-tunable switch to enable bouncing

I really dislike the idea of (b) because we're stuck with it forever and 
it's yet another switch for people to mistakenly depend on.

I'm still waiting to see performance data without O_DIRECT.  I suspect 
that using cache=writethrough will make most of this problem go away,
in which case we can just recommend that as a work around until NFS is
properly fixed.
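
Something along these lines, i.e. an illustrative command line only (the
image path is a placeholder and the other options are whatever you
normally use):

  qemu-system-x86_64 -drive file=/mnt/nfs/guest.img,if=virtio,cache=writethrough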

Regards,

Anthony Liguori
Badari Pulavarty April 15, 2011, 11:33 p.m. UTC | #13
On 4/15/2011 4:00 PM, Anthony Liguori wrote:
> On 04/15/2011 05:21 PM, pbadari@linux.vnet.ibm.com wrote:
>> On 4/15/2011 10:29 AM, Christoph Hellwig wrote:
>>> On Fri, Apr 15, 2011 at 09:23:54AM -0700, Badari Pulavarty wrote:
>>>> True. That brings up a different question - whether we are doing
>>>> enough testing on mainline QEMU :(
>>> It seems you're clearly not doing enough testing on any qemu.  Even
>>> the RHEL6 qemu has had preadv/pwritev since the first beta.
>>
>> Christoph,
>>
>> When you say "you're" - you really mean RH, right? RH should have
>> caught this in their regression testing a year ago, as part of their
>> first beta. Correct?
>>
>> Unfortunately, you are picking on the person who spent time finding &
>> analyzing the regression, narrowing the problem area, and suggesting
>> approaches to address the issue :(
>
> This is a pretty silly discussion to be having.
>
> The facts are:
>
> 1) NFS sucks with preadv/pwritev and O_DIRECT -- is anyone really 
> surprised?
>
> 2) We could work around this in QEMU by doing something ugly
>
> 3) We have no way to detect when we no longer need a work around which 
> makes (2) really unappealing.
>
> 4) That leaves us with:
>     a) waiting for NFS to get fixed properly and just living with 
> worse performance on older kernels
>
>     b) having a user-tunable switch to enable bouncing
>
> I really dislike the idea of (b) because we're stuck with it forever 
> and it's yet another switch for people to mistakenly depend on.
>
> I'm still waiting to see performance data without O_DIRECT.  I suspect 
> that using cache=writethrough will make most of this problem go away 
> in which case, we can just recommend that as a work around until NFS 
> is properly fixed.

We need to run through all cases and analyze the performance of
cache=writethrough. Our initial (smaller setup) analysis indicates that
it's better than unpatched O_DIRECT - but ~5% slower for sequential
writes, and 30%+ slower for random read/writes and mixed I/O workloads.
(In the past, NFS O_SYNC performance was extremely poor compared to
O_DIRECT, with no scaling, on older kernels due to congestion control
issues.)

Khoa will collect the data over the next few days.

To be honest with you, we should kill cache=none and just optimize only
one case and live with it (like other commercial hypervisors). :(

Thanks,
Badari
Christoph Hellwig April 16, 2011, 2:03 a.m. UTC | #14
On Fri, Apr 15, 2011 at 03:21:36PM -0700, Badari Pulavarty wrote:
> When you say "you're" - you really mean RH, right? RH should have caught
> this in their regression testing a year ago, as part of their first
> beta. Correct?

With you I mean whoever cares.  Which apparently is no one but IBM.
Christoph Hellwig April 16, 2011, 2:05 a.m. UTC | #15
On Fri, Apr 15, 2011 at 04:33:51PM -0700, Badari Pulavarty wrote:
> To be honest with you, we should kill cache=none and just optimize only one 
> case and live with it (like other commercial
> hypervisors). :(

cache=none is the only sane mode for qemu, modulo bugs in nfs or similar weird
protocols that no sane person should use for VM storage.

Stefan Hajnoczi April 16, 2011, 8:46 a.m. UTC | #16
On Sat, Apr 16, 2011 at 12:00 AM, Anthony Liguori <aliguori@us.ibm.com> wrote:
> 3) We have no way to detect when we no longer need a work around which makes
> (2) really unappealing.

I agree.

> 4) That leaves us with:
>    a) waiting for NFS to get fixed properly and just living with worse
> performance on older kernels
>
>    b) having a user-tunable switch to enable bouncing
>
> I really dislike the idea of (b) because we're stuck with it forever and
> it's yet another switch for people to mistakenly depend on.

The user-tunable switch is potentially interesting for performance
troubleshooting.  We have seen another file system which has issues
with vectored direct I/O.  It would have been much easier to identify
the problem by telling the user "Try running it with linearize=on and
see if it makes a difference".

But let's try harder on linux-nfs.

Stefan

Patch

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 6b72470..40b7c61 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -49,8 +49,10 @@ 
 #ifdef __linux__
 #include <sys/ioctl.h>
 #include <sys/param.h>
+#include <sys/vfs.h>
 #include <linux/cdrom.h>
 #include <linux/fd.h>
+#include <linux/magic.h>
 #endif
 #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
 #include <signal.h>
@@ -124,6 +126,7 @@  typedef struct BDRVRawState {
 #endif
     uint8_t *aligned_buf;
     unsigned aligned_buf_size;
+    bool force_linearize;
 #ifdef CONFIG_XFS
     bool is_xfs : 1;
 #endif
@@ -136,6 +139,32 @@  static int64_t raw_getlength(BlockDriverState *bs);
 static int cdrom_reopen(BlockDriverState *bs);
 #endif
 
+#if defined(__linux__)
+static bool is_vectored_io_slow(int fd, int open_flags)
+{
+    struct statfs stfs;
+    int ret;
+
+    do {
+        ret = fstatfs(fd, &stfs);
+    } while (ret != 0 && errno == EINTR);
+
+    /*
+     * Linux NFS client splits vectored direct I/O requests into separate NFS
+     * requests so it is faster to submit a single buffer instead.
+     */
+    if (!ret && stfs.f_type == NFS_SUPER_MAGIC && (open_flags & O_DIRECT)) {
+        return true;
+    }
+    return false;
+}
+#else /* !defined(__linux__) */
+static bool is_vectored_io_slow(int fd, int open_flags)
+{
+    return false;
+}
+#endif
+
 static int raw_open_common(BlockDriverState *bs, const char *filename,
                            int bdrv_flags, int open_flags)
 {
@@ -167,6 +196,7 @@  static int raw_open_common(BlockDriverState *bs, const char *filename,
     }
     s->fd = fd;
     s->aligned_buf = NULL;
+    s->force_linearize = is_vectored_io_slow(fd, s->open_flags);
 
     if ((bdrv_flags & BDRV_O_NOCACHE)) {
         /*
@@ -536,20 +566,27 @@  static BlockDriverAIOCB *raw_aio_submit(BlockDriverState *bs,
         return NULL;
 
     /*
+     * Check if buffers need to be copied into a single linear buffer.
+     */
+    if (s->force_linearize && qiov->niov > 1) {
+        type |= QEMU_AIO_MISALIGNED;
+    }
+
+    /*
      * If O_DIRECT is used the buffer needs to be aligned on a sector
-     * boundary.  Check if this is the case or telll the low-level
+     * boundary.  Check if this is the case or tell the low-level
      * driver that it needs to copy the buffer.
      */
-    if (s->aligned_buf) {
-        if (!qiov_is_aligned(bs, qiov)) {
-            type |= QEMU_AIO_MISALIGNED;
+    if (s->aligned_buf && !qiov_is_aligned(bs, qiov)) {
+        type |= QEMU_AIO_MISALIGNED;
+    }
+
 #ifdef CONFIG_LINUX_AIO
-        } else if (s->use_aio) {
-            return laio_submit(bs, s->aio_ctx, s->fd, sector_num, qiov,
-                               nb_sectors, cb, opaque, type);
-#endif
-        }
+    if (s->use_aio && (type & QEMU_AIO_MISALIGNED) == 0) {
+        return laio_submit(bs, s->aio_ctx, s->fd, sector_num, qiov,
+                           nb_sectors, cb, opaque, type);
     }
+#endif
 
     return paio_submit(bs, s->fd, sector_num, qiov, nb_sectors,
                        cb, opaque, type);