diff mbox

[1/2] linux aio: support flush operation

Message ID 1311791126-11383-2-git-send-email-freddy77@gmail.com
State New
Headers show

Commit Message

Frediano Ziglio July 27, 2011, 6:25 p.m. UTC
Signed-off-by: Frediano Ziglio <freddy77@gmail.com>
---
 block/raw-posix.c |    7 +++++++
 linux-aio.c       |    3 +++
 2 files changed, 10 insertions(+), 0 deletions(-)

Comments

Christoph Hellwig July 27, 2011, 6:31 p.m. UTC | #1
Did you test this at all?

On Wed, Jul 27, 2011 at 08:25:25PM +0200, Frediano Ziglio wrote:
> +    case QEMU_AIO_FLUSH:
> +        io_prep_fdsync(iocbs, fd);
> +        break;

Looks great, but doesn't work as expected.

Hint: grep for aio_fsync in the linux kernel.
Frediano Ziglio July 27, 2011, 7:52 p.m. UTC | #2
Il giorno 27/lug/2011, alle ore 20:31, Christoph Hellwig <hch@lst.de> ha scritto:

> Did you test this at all?
> 

Yes! Not at kernel level :-)
Usually I trust documentation and man pages.

> On Wed, Jul 27, 2011 at 08:25:25PM +0200, Frediano Ziglio wrote:
>> +    case QEMU_AIO_FLUSH:
>> +        io_prep_fdsync(iocbs, fd);
>> +        break;
> 
> Looks great, but doesn't work as expected.
> 
> Hint: grep for aio_fsync in the linux kernel.
> 

Thanks. I'll try to port misaligned access to Linux AIO. Also I'll add some comments on code to avoid somebody do the same mistache I did.
Mainly however -k qemu-img and aio=native in blockdev options are silently ignored if nocache is not enabled. Also I notice that combining XFS, Linux AIO, O_DIRECT and O_DSYNC give impressive performance but currently there is no way to specify all that flags together cause nocache enable O_DIRECT while O_DSYNC is enabled with writethrough.

Frediano
Christoph Hellwig July 27, 2011, 7:57 p.m. UTC | #3
On Wed, Jul 27, 2011 at 09:52:51PM +0200, Frediano Ziglio wrote:
> > 
> 
> Yes! Not at kernel level :-)

In that case we have a bad error handling problem somewhere in qemu.
the IOCB_CMD_FDSYNC aio opcode will always return EINVAL currently,
and we really should have cought that somewhere in qemu.

> Thanks. I'll try to port misaligned access to Linux AIO. Also I'll add some comments on code to avoid somebody do the same mistache I did.

It's direct I/O code in general that doesn't handle misaligned access.
Given that we should never get misaligned I/O from guests I just didn't
bother duplicating the read-modify-write code for that code path as well.

> Mainly however -k qemu-img and aio=native in blockdev options are silently ignored if nocache is not enabled.

Maybe we should indeed error out instead.  Care to prepare a patch for that?

> Also I notice that combining XFS, Linux AIO, O_DIRECT and O_DSYNC give impressive performance but currently there is no way to specify all that flags together cause nocache enable O_DIRECT while O_DSYNC is enabled with writethrough.

Indeed.  This has come up a few times, and actually is a mostly trivial
task.  Maybe we should give up waiting for -blockdev and separate cache
mode settings and allow a nocache-writethrough or similar mode now?  It's
going to be around 10 lines of code + documentation.
Kevin Wolf July 28, 2011, 7:47 a.m. UTC | #4
Am 27.07.2011 21:57, schrieb Christoph Hellwig:
> On Wed, Jul 27, 2011 at 09:52:51PM +0200, Frediano Ziglio wrote:
>> Also I notice that combining XFS, Linux AIO, O_DIRECT and O_DSYNC give impressive performance but currently there is no way to specify all that flags together cause nocache enable O_DIRECT while O_DSYNC is enabled with writethrough.
> 
> Indeed.  This has come up a few times, and actually is a mostly trivial
> task.  Maybe we should give up waiting for -blockdev and separate cache
> mode settings and allow a nocache-writethrough or similar mode now?  It's
> going to be around 10 lines of code + documentation.

I understand that there may be reasons for using O_DIRECT | O_DSYNC, but
what is the explanation for O_DSYNC improving performance?

Christoph, on another note: Can we rely on Linux AIO never returning
short writes except on EOF? Currently we return -EINVAL in this case, so
I hope it's true or we wouldn't return the correct error code.

The reason why I'm asking is because I want to allow reads across EOF
for growable images and pad with zeros (the synchronous code does this
today in order to allow bdrv_pread/pwrite to work, and when we start
using coroutines in the block layer, these cases will hit the AIO paths).

Kevin
Christoph Hellwig July 28, 2011, 12:15 p.m. UTC | #5
On Thu, Jul 28, 2011 at 09:47:05AM +0200, Kevin Wolf wrote:
> > Indeed.  This has come up a few times, and actually is a mostly trivial
> > task.  Maybe we should give up waiting for -blockdev and separate cache
> > mode settings and allow a nocache-writethrough or similar mode now?  It's
> > going to be around 10 lines of code + documentation.
> 
> I understand that there may be reasons for using O_DIRECT | O_DSYNC, but
> what is the explanation for O_DSYNC improving performance?

There isn't any, at least for modern Linux.  O_DSYNC at this point is
equivalent to a range fdatasync for each write call, and given that we're
doing O_DIRECT the ranges flush doesn't matter.  If you do have a modern
host and an old guest it might end up beeing faster because the barrier
implementation in Linux used to suck so badly, but that's not inhrent
to the I/O model.  If you guest however doesn't support cache flushes
at all O_DIRECT | O_DSYNC is the only sane model to use for local filesystems
and block devices.
 

> Christoph, on another note: Can we rely on Linux AIO never returning
> short writes except on EOF? Currently we return -EINVAL in this case, so
> I hope it's true or we wouldn't return the correct error code.

More or less.  There's one corner case for all Linux I/O, and that is
only writes up to INT_MAX are supported, and larger writes (and reads)
get truncated to it.  It's pretty nasty, but Linux has been vocally
opposed to fixing this issue.
Kevin Wolf July 28, 2011, 12:41 p.m. UTC | #6
Am 28.07.2011 14:15, schrieb Christoph Hellwig:
>> Christoph, on another note: Can we rely on Linux AIO never returning
>> short writes except on EOF? Currently we return -EINVAL in this case, so

"short reads" I meant, of course.

>> I hope it's true or we wouldn't return the correct error code.
> 
> More or less.  There's one corner case for all Linux I/O, and that is
> only writes up to INT_MAX are supported, and larger writes (and reads)
> get truncated to it.  It's pretty nasty, but Linux has been vocally
> opposed to fixing this issue.

I think we can safely ignore this. So just replacing the current
ret = -EINVAL; by a memset(buf + ret, 0, len - ret); ret = 0; should be
okay, right? (Of course using the qiov versions, but you get the idea)

Kevin
Christoph Hellwig July 29, 2011, 2:24 p.m. UTC | #7
On Thu, Jul 28, 2011 at 02:41:02PM +0200, Kevin Wolf wrote:
> > More or less.  There's one corner case for all Linux I/O, and that is
> > only writes up to INT_MAX are supported, and larger writes (and reads)
> > get truncated to it.  It's pretty nasty, but Linux has been vocally
> > opposed to fixing this issue.
> 
> I think we can safely ignore this. So just replacing the current
> ret = -EINVAL; by a memset(buf + ret, 0, len - ret); ret = 0; should be
> okay, right? (Of course using the qiov versions, but you get the idea)

This should be safe.
Stefan Hajnoczi July 29, 2011, 3:33 p.m. UTC | #8
On Thu, Jul 28, 2011 at 1:15 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Jul 28, 2011 at 09:47:05AM +0200, Kevin Wolf wrote:
>> > Indeed.  This has come up a few times, and actually is a mostly trivial
>> > task.  Maybe we should give up waiting for -blockdev and separate cache
>> > mode settings and allow a nocache-writethrough or similar mode now?  It's
>> > going to be around 10 lines of code + documentation.
>>
>> I understand that there may be reasons for using O_DIRECT | O_DSYNC, but
>> what is the explanation for O_DSYNC improving performance?
>
> There isn't any, at least for modern Linux.  O_DSYNC at this point is
> equivalent to a range fdatasync for each write call, and given that we're
> doing O_DIRECT the ranges flush doesn't matter.  If you do have a modern
> host and an old guest it might end up beeing faster because the barrier
> implementation in Linux used to suck so badly, but that's not inhrent
> to the I/O model.  If you guest however doesn't support cache flushes
> at all O_DIRECT | O_DSYNC is the only sane model to use for local filesystems
> and block devices.

I can rebase this cache=directsync patch and send it:
http://repo.or.cz/w/qemu/stefanha.git/commitdiff/6756719a46ac9876ac6d0460a33ad98e96a3a923

The other weird caching-related option I was playing with is -drive
...,readahead=on|off.  It lets you disable the host kernel readahead
on buffered modes (cache=writeback|writethrough):
http://repo.or.cz/w/qemu/stefanha.git/commitdiff/f2fc2b297a2b2dd0cccd1dc2f7c519f3b0374e0d

Stefan
diff mbox

Patch

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 3c6bd4b..27ae81e 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -628,6 +628,13 @@  static BlockDriverAIOCB *raw_aio_flush(BlockDriverState *bs,
     if (fd_open(bs) < 0)
         return NULL;
 
+#ifdef CONFIG_LINUX_AIO
+    if (s->use_aio) {
+        return laio_submit(bs, s->aio_ctx, s->fd, 0, NULL,
+                           0, cb, opaque, QEMU_AIO_FLUSH);
+    }
+#endif
+
     return paio_submit(bs, s->fd, 0, NULL, 0, cb, opaque, QEMU_AIO_FLUSH);
 }
 
diff --git a/linux-aio.c b/linux-aio.c
index 68f4b3d..d07c435 100644
--- a/linux-aio.c
+++ b/linux-aio.c
@@ -215,6 +215,9 @@  BlockDriverAIOCB *laio_submit(BlockDriverState *bs, void *aio_ctx, int fd,
     case QEMU_AIO_READ:
         io_prep_preadv(iocbs, fd, qiov->iov, qiov->niov, offset);
 	break;
+    case QEMU_AIO_FLUSH:
+        io_prep_fdsync(iocbs, fd);
+        break;
     default:
         fprintf(stderr, "%s: invalid AIO request type 0x%x.\n",
                         __func__, type);