qemu and qemu-xen: support empty write barriers in xen_disk

Message ID alpine.DEB.2.00.1011241305080.2373@kaball-desktop
State New

Commit Message

Stefano Stabellini Nov. 24, 2010, 1:08 p.m. UTC
Hi all,
this patch can be applied to both qemu-xen and qemu and adds support
for empty write barriers to xen_disk.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

Comments

Gerd Hoffmann Nov. 24, 2010, 2:23 p.m. UTC | #1
On 11/24/10 14:08, Stefano Stabellini wrote:
> this patch can be applied to both qemu-xen and qemu and adds support
> for empty write barriers to xen_disk.

Looks good.

Acked-by: Gerd Hoffmann <kraxel@redhat.com>

cheers,
   Gerd
Kevin Wolf Nov. 24, 2010, 4:29 p.m. UTC | #2
Am 24.11.2010 15:23, schrieb Gerd Hoffmann:
> On 11/24/10 14:08, Stefano Stabellini wrote:
>> this patch can be applied to both qemu-xen and qemu and adds support
>> for empty write barriers to xen_disk.
> 
> Looks good.
> 
> Acked-by: Gerd Hoffmann <kraxel@redhat.com>

Thanks, applied to the block branch.

Kevin
Stefano Stabellini Nov. 24, 2010, 4:46 p.m. UTC | #3
On Wed, 24 Nov 2010, Kevin Wolf wrote:
> > On 11/24/10 14:08, Stefano Stabellini wrote:
> >> this patch can be applied to both qemu-xen and qemu and adds support
> >> for empty write barriers to xen_disk.
> > 
> > Looks good.
> > 
> > Acked-by: Gerd Hoffmann <kraxel@redhat.com>
> 
> Thanks, applied to the block branch.
 
thank you!
Christoph Hellwig Nov. 24, 2010, 4:58 p.m. UTC | #4
I had the discussion with Jeremy in Boston before, but let's repeat it
here:

 - is there actually any pre-existing xen backend that properly
   implements empty barriers?  Back then we couldn't find any.
 - if this is a new concept to Xen please do not define an empty
   barrier primitive, but a new flush cache primitive.  That one
   maps natively to the qemu I/O layer, and with recent Linux, NetBSD,
   Windows, or Solaris guests will be a lot faster than a barrier
   which drains the queue.

Note that what your patch implements is actually a rather inefficient
implementation of the latter.  You do none of the queue draining which
the in-kernel blkback implementation does by submitting the old-style
barrier bio.  While most filesystems do not care, you introduce a quite
subtle chance of data corruption for reiserfs, or ext4 with
asynchronous journal commits, on pre-2.6.37 kernels.
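A minimal sketch of what such a flush primitive could look like in xen_disk's
aio codepath, assuming a hypothetical BLKIF_OP_FLUSH_DISKCACHE opcode
(nothing below is part of the patch under discussion):

    /* Sketch only: a cache flush carries no data, so it needs no grant
     * mapping and implies no ordering; it simply becomes an async
     * flush of the backing image. */
    case BLKIF_OP_FLUSH_DISKCACHE:      /* opcode name assumed */
        ioreq->aio_inflight++;
        bdrv_aio_flush(blkdev->bs, qemu_aio_complete, ioreq);
        break;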
Jeremy Fitzhardinge Nov. 24, 2010, 6:18 p.m. UTC | #5
On 11/24/2010 08:58 AM, Christoph Hellwig wrote:
> I had the discussion with Jeremy in Boston before, but let's repeat it
> here:
>
>  - is there actually any pre-existing xen backend that properly
>    implements empty barriers?  Back then we couldn't find any.
>  - if this is a new concept to Xen please do not define an empty
>    barrier primitive, but a new flush cache primitive.  That one
>    maps natively to the qemu I/O layer, and with recent Linux, NetBSD,
>    Windows, or Solaris guests will be a lot faster than a barrier
>    which drains the queue.
>
> Note that what your patch implements is actually a rather inefficient
> implementation of the latter.  You do none of the queue draining which
> the in-kernel blkback implementation does by submitting the old-style
> barrier bio.  While most filesystems do not care, you introduce a quite
> subtle chance of data corruption for reiserfs, or ext4 with
> asynchronous journal commits, on pre-2.6.37 kernels.

Yeah, I agree.  I think semantically empty WRITE_BARRIERs are supposed
to work, as evidenced by the effort blkback makes in trying to
specifically support them.  I haven't traced through to find out why it
EIOs them regardless - it seems to be coming from deeper in the block
subsystem (is there a "no payload" flag or something missing?).

But in the future, I think defining an unordered FLUSH operator like
Linux wants is a useful thing to do and implement (especially since it
amounts to standardising the ?BSD extension).  I'm not sure of their
precise semantics (esp WRT ordering), but I think it's already OK.

(BTW, in case it wasn't clear, we're seriously considering - but not yet
committed to - using qemu as the primary PV block backend for Xen
instead of submitting the existing blkback code for upstream.  We still
need to do some proper testing and measuring to make sure it stacks up
OK, and work out how it would fit together with the rest of the
management stack.  But so far it looks promising.)

    J
Christoph Hellwig Nov. 24, 2010, 6:44 p.m. UTC | #6
On Wed, Nov 24, 2010 at 10:18:40AM -0800, Jeremy Fitzhardinge wrote:
> Linux wants is a useful thing to do and implement (especially since it
> amounts to standardising the ?BSD extension).  I'm not sure of their
> precise semantics (esp WRT ordering), but I think it's already OK.

The nice bit is that a pure flush does not imply any ordering at all.
Which is how the current qemu driver implements the barrier requests
anyway, so that needs some fixing.

> (BTW, in case it wasn't clear, we're seriously considering - but not yet
> committed to - using qemu as the primary PV block backend for Xen
> instead of submitting the existing blkback code for upstream.  We still
> need to do some proper testing and measuring to make sure it stacks up
> OK, and work out how it would fit together with the rest of the
> management stack.  But so far it looks promising.)

Good to know.  Besides the issue with barriers mentioned above, there are
a few things that need addressing in xen_disk, if you (or Stefano or
Daniel) are interested:

 - remove the syncwrite tunable, as this is handled by the underlying
   posix I/O code if needed by using O_DSYNC which is a lot more
   efficient.
 - check whatever the issue with the use_aio codepath is and make it
   the default.  It should help the performance a lot.
 - Make sure to use bdrv_aio_flush for cache flushes in the aio
   codepath, currently it still uses plain synchronous flushes.

> 
>     J
---end quoted text---
Stefano Stabellini Nov. 24, 2010, 7:55 p.m. UTC | #7
On Wed, 24 Nov 2010, Christoph Hellwig wrote:
> On Wed, Nov 24, 2010 at 10:18:40AM -0800, Jeremy Fitzhardinge wrote:
> > Linux wants is a useful thing to do and implement (especially since it
> > amounts to standardising the ?BSD extension).  I'm not sure of their
> > precise semantics (esp WRT ordering), but I think it's already OK.
> 
> The nice bit is that a pure flush does not imply any ordering at all.
> Which is how the current qemu driver implements the barrier requests
> anyway, so that needs some fixing.
> 
> > (BTW, in case it wasn't clear, we're seriously considering - but not yet
> > committed to - using qemu as the primary PV block backend for Xen
> > instead of submitting the existing blkback code for upstream.  We still
> > need to do some proper testing and measuring to make sure it stacks up
> > OK, and work out how it would fit together with the rest of the
> > management stack.  But so far it looks promising.)
> 
> Good to know.  Besides the issue with barriers mentioned above, there are
> a few things that need addressing in xen_disk, if you (or Stefano or
> Daniel) are interested:
> 
>  - remove the syncwrite tunable, as this is handled by the underlying
>    posix I/O code if needed by using O_DSYNC which is a lot more
>    efficient.
>  - check whatever the issue with the use_aio codepath is and make it
>    the default.  It should help the performance a lot.
>  - Make sure to use bdrv_aio_flush for cache flushes in the aio
>    codepath, currently it still uses plain synchronous flushes.
 
All very good suggestions; I am adding them to my todo list, but Daniel
is very welcome to contribute as well :)
Ian Jackson Nov. 25, 2010, 7:30 p.m. UTC | #8
Christoph Hellwig writes ("Re: [Xen-devel] Re: [Qemu-devel] [PATCH] qemu and qemu-xen: support empty write barriers in xen_disk"):
> On Wed, Nov 24, 2010 at 10:18:40AM -0800, Jeremy Fitzhardinge wrote:
> > Linux wants is a useful thing to do and implement (especially since it
> > amounts to standardising the ?BSD extension).  I'm not sure of their
> > precise semantics (esp WRT ordering), but I think it's already OK.
> 
> The nice bit is that a pure flush does not imply any ordering at all.
> Which is how the current qemu driver implements the barrier requests
> anyway, so that needs some fixing.

Thanks for your comments.  Does that mean, though, that Stefano's
patch is actually making the situation worse, or simply that it isn't
making it as good as it should be?

If it's an improvement it should be applied even if it's not
perfect, and it sounds to me like we don't want to wait for the more
complicated discussion to finish and produce code?

Ian.
Jeremy Fitzhardinge Nov. 25, 2010, 7:46 p.m. UTC | #9
On 11/25/2010 11:30 AM, Ian Jackson wrote:
> Christoph Hellwig writes ("Re: [Xen-devel] Re: [Qemu-devel] [PATCH] qemu and qemu-xen: support empty write barriers in xen_disk"):
>> On Wed, Nov 24, 2010 at 10:18:40AM -0800, Jeremy Fitzhardinge wrote:
>>> Linux wants is a useful thing to do and implement (especially since it
>>> amounts to standardising the ?BSD extension).  I'm not sure of their
>>> precise semantics (esp WRT ordering), but I think it's already OK.
>> The nice bit is that a pure flush does not imply any ordering at all.
>> Which is how the current qemu driver implements the barrier requests
>> anyway, so that needs some fixing.
> Thanks for your comments.  Does that mean, though, that Stefano's
> patch is actually making the situation worse, or simply that it isn't
> making it as good as it should be?

The latter.  There's a question over whether WRITE_BARRIER really
supports empty barriers, since it appears that none of the existing
backends implement it correctly - but on the other hand, the kernel
blkback code does *try* to implement it, even though it fails.  This
change makes empty WRITE_BARRIERS work in qemu, which is helpful because
the upstream blkfront tries to use them.

But WRITE_BARRIER is fundamentally suboptimal for Linux's needs because
it is a fully ordered barrier operation.  What Linux needs is a simple
FLUSH operation which just makes sure that previously completed writes
are fully flushed out of any caches and buffers and are really on
durable storage.  It has no ordering requirements, so it doesn't prevent
subsequent writes from being handled while the flush is going on.

Christoph is therefore recommending that we add a specific FLUSH
operation to the protocol with these properties so that we can achieve
the best performance.  But if the backend lacks FLUSH, we still need a
reliable WRITE_BARRIER.
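A minimal sketch of how the backend side could advertise such a primitive,
mirroring the way xen_disk's blk_init() already advertises feature-barrier;
the feature-flush-cache key is an assumption here, not something settled in
this thread:

    /* Sketch only: advertise the assumed flush feature next to the
     * existing barrier feature, so a guest that understands it can use
     * the cheaper, unordered flush instead of a full barrier. */
    xenstore_write_be_int(&blkdev->xendev, "feature-barrier", 1);
    xenstore_write_be_int(&blkdev->xendev, "feature-flush-cache", 1);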

(But it would be very sad if, in practice, most backends in the
field fail empty WRITE_BARRIER operations, leaving guests with no
mechanism to force writes to stable storage.  In that case I guess we'll
need to look at some other hacky thing to try and make it work...)

    J
Christoph Hellwig Nov. 25, 2010, 11:30 p.m. UTC | #10
On Thu, Nov 25, 2010 at 11:46:40AM -0800, Jeremy Fitzhardinge wrote:
> The latter.  There's a question over whether WRITE_BARRIER really
> supports empty barriers, since it appears that none of the existing
> backends implement it correctly - but on the other hand, the kernel
> blkback code does *try* to implement it, even though it fails.  This
> change makes empty WRITE_BARRIERS work in qemu, which is helpful because
> the upstream blkfront tries to use them.

So far so good.

> 
> But WRITE_BARRIER is fundamentally suboptimal for Linux's needs because
> it is a fully ordered barrier operation.  What Linux needs is a simple
> FLUSH operation which just makes sure that previously completed writes
> are fully flushed out of any caches and buffers and are really on
> durable storage.  It has no ordering requirements, so it doesn't prevent
> subsequent writes from being handled while the flush is going on.
> 
> Christoph is therefore recommending that we add a specific FLUSH
> operation to the protocol with these properties so that we can achieve
> the best performance.  But if the backend lacks FLUSH, we still need a
> reliable WRITE_BARRIER.

The problem is that qemu currently implements WRITE_BARRIER incorrectly,
empty or not.  The Linux barrier primitive, which appears to map 1:1
to Xen's, implies ordering semantics, which the blkback implementation
implements by translating the write barrier back to a bio with the
barrier bit set.  But the qemu backend does not impose any ordering,
so it gives you different behaviour from blkback.  Is there any formal
specification of the Xen block protocol?

In the end the empty WRITE_BARRIER after this patch is equivalent to a
flush, which is fine for that particular command.  The problem is that
a small minority of Linux filesystems actually relied on the ordering
semantics of a non-empty WRITE_BARRIER command, which qemu doesn't
respect.

But the patch doesn't make anything worse by also accepting empty
barrier writes, so I guess it's fine.  My initial reply was just
supposed to be a reminder about the big dragon lurking here.
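For illustration only, a rough sketch (untested, and not part of the applied
patch) of one way the aio path could recover that ordering: hold a non-empty
barrier back in ioreq_runio_qemu_aio() until everything submitted before it
has completed.

    /* Sketch: defer a non-empty barrier while earlier requests are
     * still in flight.  requests_inflight mirrors the existing counter
     * in struct XenBlkDev; the deferred list and its resubmission from
     * the completion path are assumptions, not existing code. */
    if (ioreq->req.operation == BLKIF_OP_WRITE_BARRIER &&
        ioreq->req.nr_segments &&
        blkdev->requests_inflight > 0) {
        QLIST_INSERT_HEAD(&blkdev->deferred, ioreq, list);
        return 0;   /* re-issued from qemu_aio_complete() once idle */
    }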
Kevin Wolf Nov. 26, 2010, 11:03 a.m. UTC | #11
Am 24.11.2010 19:44, schrieb Christoph Hellwig:
> On Wed, Nov 24, 2010 at 10:18:40AM -0800, Jeremy Fitzhardinge wrote:
>> Linux wants is a useful thing to do and implement (especially since it
>> amounts to standardising the ?BSD extension).  I'm not sure of their
>>> precise semantics (esp WRT ordering), but I think it's already OK.
> 
> The nice bit is that a pure flush does not imply any ordering at all.
> Which is how the current qemu driver implements the barrier requests
> anyway, so that needs some fixing.
> 
>> (BTW, in case it wasn't clear, we're seriously considering - but not yet
>> committed to - using qemu as the primary PV block backend for Xen
>> instead of submitting the existing blkback code for upstream.  We still
>> need to do some proper testing and measuring to make sure it stacks up
>> OK, and work out how it would fit together with the rest of the
>> management stack.  But so far it looks promising.)
> 
> Good to know.  Besides the issue with barriers mentioned above, there are
> a few things that need addressing in xen_disk, if you (or Stefano or
> Daniel) are interested:
> 
>  - remove the syncwrite tunable, as this is handled by the underlying
>    posix I/O code if needed by using O_DSYNC which is a lot more
>    efficient.
>  - check whatever the issue with the use_aio codepath is and make it
>    the default.  It should help the performance a lot.
>  - Make sure to use bdrv_aio_flush for cache flushes in the aio
>    codepath, currently it still uses plain synchronous flushes.

I don't think the synchronous flushes even do what they're supposed to
do. Shouldn't ioreq->postsync do the flush after the request has
completed instead of doing it as soon as it has been submitted?

Not to mention that, as you said above, it doesn't implement barrier
semantics at all.

Oh, and it would be very nice if the return values of
bdrv_aio_readv/writev and bdrv_flush were checked. They can return errors.

Kevin
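As a minimal sketch of both points (not the merged code; field names follow
the existing xen_disk structures, and the 2010-era bdrv_aio_flush(bs, cb,
opaque) signature is assumed), the postsync flush could be issued from the
completion callback once the data writes have finished, with errors accounted
for instead of dropped:

    static void qemu_aio_complete(void *opaque, int ret)
    {
        struct ioreq *ioreq = opaque;

        if (ret != 0) {
            ioreq->aio_errors++;    /* record the error instead of dropping it */
        }

        ioreq->aio_inflight--;
        if (ioreq->aio_inflight > 0) {
            return;                 /* other aio requests still outstanding */
        }

        if (ioreq->postsync) {
            ioreq->postsync = 0;
            ioreq->aio_inflight++;
            /* flush only after the data writes have actually completed */
            bdrv_aio_flush(ioreq->blkdev->bs, qemu_aio_complete, ioreq);
            return;
        }

        ioreq->status = ioreq->aio_errors ? BLKIF_RSP_ERROR : BLKIF_RSP_OKAY;
        /* unmap the grants and queue the response, as the existing code does */
    }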
Christoph Hellwig Nov. 26, 2010, 1:29 p.m. UTC | #12
On Fri, Nov 26, 2010 at 12:03:35PM +0100, Kevin Wolf wrote:
> I don't think the synchronous flushes even do what they're supposed to
> do. Shouldn't ioreq->postsync do the flush after the request has
> completed instead of doing it as soon as it has been submitted?

Indeed, that's yet another bug in the implementation.
Stefano Stabellini Nov. 26, 2010, 3:25 p.m. UTC | #13
On Fri, 26 Nov 2010, Kevin Wolf wrote:
> Am 24.11.2010 19:44, schrieb Christoph Hellwig:
> > On Wed, Nov 24, 2010 at 10:18:40AM -0800, Jeremy Fitzhardinge wrote:
> >> Linux wants is a useful thing to do and implement (especially since it
> >> amounts to standardising the ?BSD extension).  I'm not sure of their
> >> precise semantics (esp WRT ordering), but I think it's already OK.
> > 
> > The nice bit is that a pure flush does not imply any ordering at all.
> > Which is how the current qemu driver implements the barrier requests
> > anyway, so that needs some fixing.
> > 
> >> (BTW, in case it wasn't clear, we're seriously considering - but not yet
> >> committed to - using qemu as the primary PV block backend for Xen
> >> instead of submitting the existing blkback code for upstream.  We still
> >> need to do some proper testing and measuring to make sure it stacks up
> >> OK, and work out how it would fit together with the rest of the
> >> management stack.  But so far it looks promising.)
> > 
> > Good to know.  Besides the issue with barriers mentioned above, there are
> > a few things that need addressing in xen_disk, if you (or Stefano or
> > Daniel) are interested:
> > 
> >  - remove the syncwrite tunable, as this is handled by the underlying
> >    posix I/O code if needed by using O_DSYNC which is a lot more
> >    efficient.
> >  - check whatever the issue with the use_aio codepath is and make it
> >    the default.  It should help the performance a lot.
> >  - Make sure to use bdrv_aio_flush for cache flushes in the aio
> >    codepath, currently it still uses plain synchronous flushes.
> 
> I don't think the synchronous flushes even do what they're supposed to
> do. Shouldn't ioreq->postsync do the flush after the request has
> completed instead of doing it as soon as it has been submitted?
> 

I noticed that; it is one of the reasons I disabled aio for xen_disk in
qemu-xen, along with the fact that the aio implementation in qemu-xen is
still thread based. In fact I am waiting for Xen support to be in
upstream qemu before doing any benchmarks, because I expect the
performance to be better.

> Not to mention that, as you said above, it doesn't implement barrier
> semantics at all.
> 
> Oh, and it would be very nice if the return values of
> bdrv_aio_readv/writev and bdrv_flush were checked. They can return errors.
 
Indeed, my todo list is growing...
Ian Jackson Dec. 14, 2010, 6:43 p.m. UTC | #14
Gerd Hoffmann writes ("[Xen-devel] Re: [Qemu-devel] [PATCH] qemu and qemu-xen: support empty write barriers in xen_disk"):
> On 11/24/10 14:08, Stefano Stabellini wrote:
> > this patch can be applied to both qemu-xen and qemu and adds support
> > for empty write barriers to xen_disk.
> 
> Looks good.
> 
> Acked-by: Gerd Hoffmann <kraxel@redhat.com>

I have applied this patch to qemu-xen, thanks.

Ian.

Patch

diff --git a/hw/xen_disk.c b/hw/xen_disk.c
index 38b5fbf..94af001 100644
--- a/hw/xen_disk.c
+++ b/hw/xen_disk.c
@@ -182,6 +182,10 @@  static int ioreq_parse(struct ioreq *ioreq)
 	ioreq->prot = PROT_WRITE; /* to memory */
 	break;
     case BLKIF_OP_WRITE_BARRIER:
+        if (!ioreq->req.nr_segments) {
+            ioreq->presync = 1;
+            return 0;
+        }
 	if (!syncwrite)
 	    ioreq->presync = ioreq->postsync = 1;
 	/* fall through */
@@ -306,7 +310,7 @@  static int ioreq_runio_qemu_sync(struct ioreq *ioreq)
     int i, rc, len = 0;
     off_t pos;
 
-    if (ioreq_map(ioreq) == -1)
+    if (ioreq->req.nr_segments && ioreq_map(ioreq) == -1)
 	goto err;
     if (ioreq->presync)
 	bdrv_flush(blkdev->bs);
@@ -330,6 +334,8 @@  static int ioreq_runio_qemu_sync(struct ioreq *ioreq)
 	break;
     case BLKIF_OP_WRITE:
     case BLKIF_OP_WRITE_BARRIER:
+        if (!ioreq->req.nr_segments)
+            break;
 	pos = ioreq->start;
 	for (i = 0; i < ioreq->v.niov; i++) {
 	    rc = bdrv_write(blkdev->bs, pos / BLOCK_SIZE,
@@ -387,7 +393,7 @@  static int ioreq_runio_qemu_aio(struct ioreq *ioreq)
 {
     struct XenBlkDev *blkdev = ioreq->blkdev;
 
-    if (ioreq_map(ioreq) == -1)
+    if (ioreq->req.nr_segments && ioreq_map(ioreq) == -1)
 	goto err;
 
     ioreq->aio_inflight++;
@@ -404,6 +410,8 @@  static int ioreq_runio_qemu_aio(struct ioreq *ioreq)
     case BLKIF_OP_WRITE:
     case BLKIF_OP_WRITE_BARRIER:
         ioreq->aio_inflight++;
+        if (!ioreq->req.nr_segments)
+            break;
         bdrv_aio_writev(blkdev->bs, ioreq->start / BLOCK_SIZE,
                         &ioreq->v, ioreq->v.size / BLOCK_SIZE,
                         qemu_aio_complete, ioreq);