Patchwork xen_disk: implement BLKIF_OP_FLUSH_DISKCACHE, remove BLKIF_OP_WRITE_BARRIER

login
register
mail settings
Submitter Stefano Stabellini
Date April 16, 2012, 5:05 p.m.
Message ID <1334595957-12552-1-git-send-email-stefano.stabellini@eu.citrix.com>
Download mbox | patch
Permalink /patch/152951/
State New
Headers show

Comments

Stefano Stabellini - April 16, 2012, 5:05 p.m.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
---
 hw/xen_disk.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)
Konrad Rzeszutek Wilk - April 16, 2012, 5:10 p.m.
On Mon, Apr 16, 2012 at 06:05:57PM +0100, Stefano Stabellini wrote:
> Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  hw/xen_disk.c |    8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/xen_disk.c b/hw/xen_disk.c
> index 015d2af..3e4a47b 100644
> --- a/hw/xen_disk.c
> +++ b/hw/xen_disk.c
> @@ -183,12 +183,12 @@ static int ioreq_parse(struct ioreq *ioreq)
>      case BLKIF_OP_READ:
>          ioreq->prot = PROT_WRITE; /* to memory */
>          break;
> -    case BLKIF_OP_WRITE_BARRIER:
> +    case BLKIF_OP_FLUSH_DISKCACHE:
>          if (!ioreq->req.nr_segments) {
>              ioreq->presync = 1;
>              return 0;
>          }
> -        ioreq->presync = ioreq->postsync = 1;
> +        ioreq->postsync = 1;
>          /* fall through */
>      case BLKIF_OP_WRITE:
>          ioreq->prot = PROT_READ; /* from memory */
> @@ -354,7 +354,7 @@ static int ioreq_runio_qemu_aio(struct ioreq *ioreq)
>                         qemu_aio_complete, ioreq);
>          break;
>      case BLKIF_OP_WRITE:
> -    case BLKIF_OP_WRITE_BARRIER:
> +    case BLKIF_OP_FLUSH_DISKCACHE:
>          if (!ioreq->req.nr_segments) {
>              break;
>          }
> @@ -627,7 +627,7 @@ static int blk_init(struct XenDevice *xendev)
>                    blkdev->file_size, blkdev->file_size >> 20);
>  
>      /* fill info */
> -    xenstore_write_be_int(&blkdev->xendev, "feature-barrier", 1);
> +    xenstore_write_be_int(&blkdev->xendev, "feature-flush-cache", 1);
>      xenstore_write_be_int(&blkdev->xendev, "info",            info);
>      xenstore_write_be_int(&blkdev->xendev, "sector-size",     blkdev->file_blk);
>      xenstore_write_be_int(&blkdev->xendev, "sectors",
> -- 
> 1.7.2.5
Christoph Hellwig - April 25, 2012, 8:45 a.m.
> -    case BLKIF_OP_WRITE_BARRIER:
> +    case BLKIF_OP_FLUSH_DISKCACHE:
>          if (!ioreq->req.nr_segments) {
>              ioreq->presync = 1;
>              return 0;
>          }
> -        ioreq->presync = ioreq->postsync = 1;
> +        ioreq->postsync = 1;
>          /* fall through */

It might be worth documenting the semantics of BLKIF_OP_FLUSH_DISKCACHE
in a comment here.  I haven't found any spec for the xen_disk protocol,
but from looking at the Linux frontend it seems like the semantics
of REQ_FLUSH and REQ_FUA in the Linux block driver are overloaded into
BLKIF_OP_FLUSH_DISKCACHE, which is fairly confusing given that REQ_FLUSH
already overload functionality.

Even worse REQ_FLUSH with a payload implies a preflush, while REQ_FUA
implies a post flush, and it seems like Xen has no way to distinguish
the two, making thing like log writes very inefficient.

Independent of that the implementation should really use a state machine
around bdrv_aio_flush instead of doing guest-sychronous bdrv_flush calls.
Ian Campbell - April 25, 2012, 9:02 a.m.
On Wed, 2012-04-25 at 09:45 +0100, Christoph Hellwig wrote:
> > -    case BLKIF_OP_WRITE_BARRIER:
> > +    case BLKIF_OP_FLUSH_DISKCACHE:
> >          if (!ioreq->req.nr_segments) {
> >              ioreq->presync = 1;
> >              return 0;
> >          }
> > -        ioreq->presync = ioreq->postsync = 1;
> > +        ioreq->postsync = 1;
> >          /* fall through */
> 
> It might be worth documenting the semantics of BLKIF_OP_FLUSH_DISKCACHE
> in a comment here.  I haven't found any spec for the xen_disk protocol,

The blkif spec was recently much improved, you can find it at
http://xenbits.xen.org/docs/unstable/hypercall/include,public,io,blkif.h.html

TBH I'm not sure it actually answers your questions wrt
BLKIF_OP_FLUSH_DISKCACHE, if not please let us know and we can see about
improving it.

Ian
Christoph Hellwig - April 25, 2012, 10:20 a.m.
On Wed, Apr 25, 2012 at 10:02:45AM +0100, Ian Campbell wrote:
> The blkif spec was recently much improved, you can find it at
> http://xenbits.xen.org/docs/unstable/hypercall/include,public,io,blkif.h.html
> 
> TBH I'm not sure it actually answers your questions wrt
> BLKIF_OP_FLUSH_DISKCACHE, if not please let us know and we can see about
> improving it.

That description in there is overly simple, and does not match any of the
implementations known to me on either end.

Talking about those: the mainline Linux blkback backend also implements
different semantics from what mainline Linux blkfront seems to expect,
as well as different from qemu.  Looking at these three alone I can't see
how Xen ever managed to get data to disk reliably if using the paravirt
interface.
with the implementations in qemu and the Linux kernel frontend and backends,
which

> 
> Ian
> 
---end quoted text---
Ian Campbell - April 25, 2012, 11:17 a.m.
On Wed, 2012-04-25 at 12:21 +0100, Stefano Stabellini wrote:
> On Wed, 25 Apr 2012, Christoph Hellwig wrote:
> > On Wed, Apr 25, 2012 at 10:02:45AM +0100, Ian Campbell wrote:
> > > The blkif spec was recently much improved, you can find it at
> > > http://xenbits.xen.org/docs/unstable/hypercall/include,public,io,blkif.h.html
> > > 
> > > TBH I'm not sure it actually answers your questions wrt
> > > BLKIF_OP_FLUSH_DISKCACHE, if not please let us know and we can see about
> > > improving it.
> > 
> > That description in there is overly simple, and does not match any of the
> > implementations known to me on either end.
> 
> That is true, in fact I couldn't figure out what I had to implement just
> reading the comment. So I went through the blkback code and tried to
> understand what I had to do, but I got it wrong.
> 
> Reading the code again it seems to me that BLKIF_OP_FLUSH_DISKCACHE
> is supposed to have the same semantics as REQ_FLUSH, that implies a
> preflush if nr_segments > 0, not a postflush like I did.
> 
> Konrad, can you please confirm this?

... and then provide a patch to blkif.h.

Thanks,

Ian.
Stefano Stabellini - April 25, 2012, 11:21 a.m.
On Wed, 25 Apr 2012, Christoph Hellwig wrote:
> On Wed, Apr 25, 2012 at 10:02:45AM +0100, Ian Campbell wrote:
> > The blkif spec was recently much improved, you can find it at
> > http://xenbits.xen.org/docs/unstable/hypercall/include,public,io,blkif.h.html
> > 
> > TBH I'm not sure it actually answers your questions wrt
> > BLKIF_OP_FLUSH_DISKCACHE, if not please let us know and we can see about
> > improving it.
> 
> That description in there is overly simple, and does not match any of the
> implementations known to me on either end.

That is true, in fact I couldn't figure out what I had to implement just
reading the comment. So I went through the blkback code and tried to
understand what I had to do, but I got it wrong.

Reading the code again it seems to me that BLKIF_OP_FLUSH_DISKCACHE
is supposed to have the same semantics as REQ_FLUSH, that implies a
preflush if nr_segments > 0, not a postflush like I did.

Konrad, can you please confirm this?
Christoph Hellwig - April 25, 2012, 11:23 a.m.
On Wed, Apr 25, 2012 at 12:21:53PM +0100, Stefano Stabellini wrote:
> That is true, in fact I couldn't figure out what I had to implement just
> reading the comment. So I went through the blkback code and tried to
> understand what I had to do, but I got it wrong.
> 
> Reading the code again it seems to me that BLKIF_OP_FLUSH_DISKCACHE
> is supposed to have the same semantics as REQ_FLUSH, that implies a
> preflush if nr_segments > 0, not a postflush like I did.

It's worse - blkfront translates both a REQ_FLUSH or a REQ_FUA
into BLKIF_OP_FLUSH_DISKCACHE.

REQ_FLUSH either is a pre flush or a pure flush without a data transfer,
and REQ_FUA is a post flush.  So to get the proper semantics you'll have
to do both, _and_ sequence it so that no operation starts before the
previous one finished.
Konrad Rzeszutek Wilk - April 26, 2012, 3:41 p.m.
On Wed, Apr 25, 2012 at 01:23:35PM +0200, Christoph Hellwig wrote:
> On Wed, Apr 25, 2012 at 12:21:53PM +0100, Stefano Stabellini wrote:
> > That is true, in fact I couldn't figure out what I had to implement just
> > reading the comment. So I went through the blkback code and tried to
> > understand what I had to do, but I got it wrong.
> > 
> > Reading the code again it seems to me that BLKIF_OP_FLUSH_DISKCACHE
> > is supposed to have the same semantics as REQ_FLUSH, that implies a
> > preflush if nr_segments > 0, not a postflush like I did.
> 
> It's worse - blkfront translates both a REQ_FLUSH or a REQ_FUA
> into BLKIF_OP_FLUSH_DISKCACHE.

I think that is what remained of the BARRIER request.
> 
> REQ_FLUSH either is a pre flush or a pure flush without a data transfer,
> and REQ_FUA is a post flush.  So to get the proper semantics you'll have
> to do both, _and_ sequence it so that no operation starts before the
> previous one finished.

If I were to emulate the SCSI SYNC command which one would it be?

I think REQ_FLUSH? In which I would think that the blkfront needs to
get rid of the REQ_FUA part?
Stefano Stabellini - May 9, 2012, 12:42 p.m.
On Thu, 26 Apr 2012, Konrad Rzeszutek Wilk wrote:
> On Wed, Apr 25, 2012 at 01:23:35PM +0200, Christoph Hellwig wrote:
> > On Wed, Apr 25, 2012 at 12:21:53PM +0100, Stefano Stabellini wrote:
> > > That is true, in fact I couldn't figure out what I had to implement just
> > > reading the comment. So I went through the blkback code and tried to
> > > understand what I had to do, but I got it wrong.
> > > 
> > > Reading the code again it seems to me that BLKIF_OP_FLUSH_DISKCACHE
> > > is supposed to have the same semantics as REQ_FLUSH, that implies a
> > > preflush if nr_segments > 0, not a postflush like I did.
> > 
> > It's worse - blkfront translates both a REQ_FLUSH or a REQ_FUA
> > into BLKIF_OP_FLUSH_DISKCACHE.
> 
> I think that is what remained of the BARRIER request.
> > 
> > REQ_FLUSH either is a pre flush or a pure flush without a data transfer,
> > and REQ_FUA is a post flush.  So to get the proper semantics you'll have
> > to do both, _and_ sequence it so that no operation starts before the
> > previous one finished.
> 
> If I were to emulate the SCSI SYNC command which one would it be?
> 
> I think REQ_FLUSH? In which I would think that the blkfront needs to
> get rid of the REQ_FUA part?
> 

ping?
Konrad Rzeszutek Wilk - Dec. 19, 2012, 6:46 p.m.
On Wed, May 09, 2012 at 01:42:41PM +0100, Stefano Stabellini wrote:
> On Thu, 26 Apr 2012, Konrad Rzeszutek Wilk wrote:
> > On Wed, Apr 25, 2012 at 01:23:35PM +0200, Christoph Hellwig wrote:
> > > On Wed, Apr 25, 2012 at 12:21:53PM +0100, Stefano Stabellini wrote:
> > > > That is true, in fact I couldn't figure out what I had to implement just
> > > > reading the comment. So I went through the blkback code and tried to
> > > > understand what I had to do, but I got it wrong.
> > > > 
> > > > Reading the code again it seems to me that BLKIF_OP_FLUSH_DISKCACHE
> > > > is supposed to have the same semantics as REQ_FLUSH, that implies a
> > > > preflush if nr_segments > 0, not a postflush like I did.
> > > 
> > > It's worse - blkfront translates both a REQ_FLUSH or a REQ_FUA
> > > into BLKIF_OP_FLUSH_DISKCACHE.
> > 
> > I think that is what remained of the BARRIER request.
> > > 
> > > REQ_FLUSH either is a pre flush or a pure flush without a data transfer,
> > > and REQ_FUA is a post flush.  So to get the proper semantics you'll have
> > > to do both, _and_ sequence it so that no operation starts before the
> > > previous one finished.
> > 
> > If I were to emulate the SCSI SYNC command which one would it be?
> > 
> > I think REQ_FLUSH? In which I would think that the blkfront needs to
> > get rid of the REQ_FUA part?
> > 
> 
> ping?

And just shy of 7 months later I answer :-)

I think you are right. Getting rid of REQ_FUA looks like the
right way. Oh, and blkfront already does that!

1290         err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
1291                             "feature-flush-cache", "%d", &flush,
1292                             NULL);
1293 
1294         if (!err && flush) {
1295                 info->feature_flush = REQ_FLUSH;
1296                 info->flush_op = BLKIF_OP_FLUSH_DISKCACHE;
1297         }
1298 

So what I am missing?
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
Stefano Stabellini - Jan. 9, 2013, 6:44 p.m.
On Wed, 19 Dec 2012, Konrad Rzeszutek Wilk wrote:
> On Wed, May 09, 2012 at 01:42:41PM +0100, Stefano Stabellini wrote:
> > On Thu, 26 Apr 2012, Konrad Rzeszutek Wilk wrote:
> > > On Wed, Apr 25, 2012 at 01:23:35PM +0200, Christoph Hellwig wrote:
> > > > On Wed, Apr 25, 2012 at 12:21:53PM +0100, Stefano Stabellini wrote:
> > > > > That is true, in fact I couldn't figure out what I had to implement just
> > > > > reading the comment. So I went through the blkback code and tried to
> > > > > understand what I had to do, but I got it wrong.
> > > > > 
> > > > > Reading the code again it seems to me that BLKIF_OP_FLUSH_DISKCACHE
> > > > > is supposed to have the same semantics as REQ_FLUSH, that implies a
> > > > > preflush if nr_segments > 0, not a postflush like I did.
> > > > 
> > > > It's worse - blkfront translates both a REQ_FLUSH or a REQ_FUA
> > > > into BLKIF_OP_FLUSH_DISKCACHE.
> > > 
> > > I think that is what remained of the BARRIER request.
> > > > 
> > > > REQ_FLUSH either is a pre flush or a pure flush without a data transfer,
> > > > and REQ_FUA is a post flush.  So to get the proper semantics you'll have
> > > > to do both, _and_ sequence it so that no operation starts before the
> > > > previous one finished.
> > > 
> > > If I were to emulate the SCSI SYNC command which one would it be?
> > > 
> > > I think REQ_FLUSH? In which I would think that the blkfront needs to
> > > get rid of the REQ_FUA part?
> > > 
> > 
> > ping?
> 
> And just shy of 7 months later I answer :-)
> 
> I think you are right. Getting rid of REQ_FUA looks like the
> right way. Oh, and blkfront already does that!
> 
> 1290         err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
> 1291                             "feature-flush-cache", "%d", &flush,
> 1292                             NULL);
> 1293 
> 1294         if (!err && flush) {
> 1295                 info->feature_flush = REQ_FLUSH;
> 1296                 info->flush_op = BLKIF_OP_FLUSH_DISKCACHE;
> 1297         }
> 1298 
> 
> So what I am missing?

Nothing, thanks. I have updated and resent the patch, fixing the
implementation of BLKIF_OP_FLUSH_DISKCACHE.

Patch

diff --git a/hw/xen_disk.c b/hw/xen_disk.c
index 015d2af..3e4a47b 100644
--- a/hw/xen_disk.c
+++ b/hw/xen_disk.c
@@ -183,12 +183,12 @@  static int ioreq_parse(struct ioreq *ioreq)
     case BLKIF_OP_READ:
         ioreq->prot = PROT_WRITE; /* to memory */
         break;
-    case BLKIF_OP_WRITE_BARRIER:
+    case BLKIF_OP_FLUSH_DISKCACHE:
         if (!ioreq->req.nr_segments) {
             ioreq->presync = 1;
             return 0;
         }
-        ioreq->presync = ioreq->postsync = 1;
+        ioreq->postsync = 1;
         /* fall through */
     case BLKIF_OP_WRITE:
         ioreq->prot = PROT_READ; /* from memory */
@@ -354,7 +354,7 @@  static int ioreq_runio_qemu_aio(struct ioreq *ioreq)
                        qemu_aio_complete, ioreq);
         break;
     case BLKIF_OP_WRITE:
-    case BLKIF_OP_WRITE_BARRIER:
+    case BLKIF_OP_FLUSH_DISKCACHE:
         if (!ioreq->req.nr_segments) {
             break;
         }
@@ -627,7 +627,7 @@  static int blk_init(struct XenDevice *xendev)
                   blkdev->file_size, blkdev->file_size >> 20);
 
     /* fill info */
-    xenstore_write_be_int(&blkdev->xendev, "feature-barrier", 1);
+    xenstore_write_be_int(&blkdev->xendev, "feature-flush-cache", 1);
     xenstore_write_be_int(&blkdev->xendev, "info",            info);
     xenstore_write_be_int(&blkdev->xendev, "sector-size",     blkdev->file_blk);
     xenstore_write_be_int(&blkdev->xendev, "sectors",