diff mbox

[v2,05/12] block: Introduce "drained begin/end" API

Message ID 1444650651-26227-6-git-send-email-famz@redhat.com
State New
Headers show

Commit Message

Fam Zheng Oct. 12, 2015, 11:50 a.m. UTC
The semantics is that after bdrv_drained_begin(bs), bs will not get new external
requests until the matching bdrv_drained_end(bs).

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 block.c                   |  2 ++
 block/io.c                | 18 ++++++++++++++++++
 include/block/block.h     | 19 +++++++++++++++++++
 include/block/block_int.h |  2 ++
 4 files changed, 41 insertions(+)

Comments

Kevin Wolf Oct. 12, 2015, 1:59 p.m. UTC | #1
Am 12.10.2015 um 13:50 hat Fam Zheng geschrieben:
> The semantics is that after bdrv_drained_begin(bs), bs will not get new external
> requests until the matching bdrv_drained_end(bs).
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  block.c                   |  2 ++
>  block/io.c                | 18 ++++++++++++++++++
>  include/block/block.h     | 19 +++++++++++++++++++
>  include/block/block_int.h |  2 ++
>  4 files changed, 41 insertions(+)
> 
> diff --git a/block.c b/block.c
> index 1f90b47..9b28a07 100644
> --- a/block.c
> +++ b/block.c
> @@ -2058,6 +2058,8 @@ static void bdrv_move_feature_fields(BlockDriverState *bs_dest,
>      bs_dest->device_list = bs_src->device_list;
>      bs_dest->blk = bs_src->blk;
>  
> +    bs_dest->quiesce_counter = bs_src->quiesce_counter;
> +
>      memcpy(bs_dest->op_blockers, bs_src->op_blockers,
>             sizeof(bs_dest->op_blockers));
>  }

This feels wrong. As I understand it, bdrv_drained_begin/end works on
specific nodes and not on trees. Including the field in
bdrv_move_feature_fields() means that it moves to the top of the tree
(i.e. it stays at in the same C object, which however belongs to a
different logical node now).

What I could imagine is that you did this so you can use
bdrv_draind_end() on the same BDS as you called bdrv_drained_start() on.
However, that's not the interface of bdrv_swap(), which really means
that the BDSes are swapped. So with this hunk you just end up having a
bug that cancels out the weirdness of the bdrv_swap() interface.

If you rebase on my bdrv_swap() removal series, things become a bit more
obvious. If you don't, you should drop this hunk and change some
bdrv_drained_end() calls, e.g. in the next patch, you'd have to call
bdrv_drained_begin(state->old_bs), but bdrv_drained_end(state->new_bs).

The rest of this patch looks good.

Kevin
Paolo Bonzini Oct. 12, 2015, 2:27 p.m. UTC | #2
On 12/10/2015 13:50, Fam Zheng wrote:
> +void bdrv_drained_begin(BlockDriverState *bs)
> +{
> +    if (bs->quiesce_counter++) {
> +        return;
> +    }
> +    aio_disable_external(bdrv_get_aio_context(bs));
> +    bdrv_drain(bs);
> +}

I think bdrv_drain should be called unconditionally, i.e. before the
"if".  This should also solve Kevin's doubt about new allocating write
request reenabling the timer: any write request from the drained section
happens normally, until you get a nested drain request and then the
callback completes the requests.

Paolo
Kevin Wolf Oct. 13, 2015, 9:31 a.m. UTC | #3
Am 12.10.2015 um 16:27 hat Paolo Bonzini geschrieben:
> 
> 
> On 12/10/2015 13:50, Fam Zheng wrote:
> > +void bdrv_drained_begin(BlockDriverState *bs)
> > +{
> > +    if (bs->quiesce_counter++) {
> > +        return;
> > +    }
> > +    aio_disable_external(bdrv_get_aio_context(bs));
> > +    bdrv_drain(bs);
> > +}
> 
> I think bdrv_drain should be called unconditionally, i.e. before the
> "if".  This should also solve Kevin's doubt about new allocating write
> request reenabling the timer: any write request from the drained section
> happens normally, until you get a nested drain request and then the
> callback completes the requests.

This would mean that once you've sent an I/O request inside a drain
section, you have to expect that more internal I/O might be going on
after the request has completed. If you don't want this, you have to
issue another bdrv_drain() or use a nested bdrv_drained_begin/end()
section.

Sounds reasonable enough to me, but I guess this should be explicitly
documented.

Kevin
Paolo Bonzini Oct. 13, 2015, 10:39 a.m. UTC | #4
On 13/10/2015 11:31, Kevin Wolf wrote:
> This would mean that once you've sent an I/O request inside a drain
> section, you have to expect that more internal I/O might be going on
> after the request has completed. If you don't want this, you have to
> issue another bdrv_drain() or use a nested bdrv_drained_begin/end()
> section.

Yes.

> Sounds reasonable enough to me, but I guess this should be explicitly
> documented.

I agree.  Perhaps bdrv_drained_begin/end() could be renamed to
bdrv_drain_and_lock() / bdrv_unlock()?

Paolo
Kevin Wolf Oct. 13, 2015, 11:12 a.m. UTC | #5
Am 13.10.2015 um 12:39 hat Paolo Bonzini geschrieben:
> 
> 
> On 13/10/2015 11:31, Kevin Wolf wrote:
> > This would mean that once you've sent an I/O request inside a drain
> > section, you have to expect that more internal I/O might be going on
> > after the request has completed. If you don't want this, you have to
> > issue another bdrv_drain() or use a nested bdrv_drained_begin/end()
> > section.
> 
> Yes.
> 
> > Sounds reasonable enough to me, but I guess this should be explicitly
> > documented.
> 
> I agree.  Perhaps bdrv_drained_begin/end() could be renamed to
> bdrv_drain_and_lock() / bdrv_unlock()?

It's not very obvious what bdrv_unlock() refers to, so I prefer the
current naming. Just making sure that the comment for bdrv_drained_begin
explains the exact semantics should be good enough.

Kevin
diff mbox

Patch

diff --git a/block.c b/block.c
index 1f90b47..9b28a07 100644
--- a/block.c
+++ b/block.c
@@ -2058,6 +2058,8 @@  static void bdrv_move_feature_fields(BlockDriverState *bs_dest,
     bs_dest->device_list = bs_src->device_list;
     bs_dest->blk = bs_src->blk;
 
+    bs_dest->quiesce_counter = bs_src->quiesce_counter;
+
     memcpy(bs_dest->op_blockers, bs_src->op_blockers,
            sizeof(bs_dest->op_blockers));
 }
diff --git a/block/io.c b/block/io.c
index 94e18e6..5c088d5 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2618,3 +2618,21 @@  void bdrv_flush_io_queue(BlockDriverState *bs)
     }
     bdrv_start_throttled_reqs(bs);
 }
+
+void bdrv_drained_begin(BlockDriverState *bs)
+{
+    if (bs->quiesce_counter++) {
+        return;
+    }
+    aio_disable_external(bdrv_get_aio_context(bs));
+    bdrv_drain(bs);
+}
+
+void bdrv_drained_end(BlockDriverState *bs)
+{
+    assert(bs->quiesce_counter > 0);
+    if (--bs->quiesce_counter > 0) {
+        return;
+    }
+    aio_enable_external(bdrv_get_aio_context(bs));
+}
diff --git a/include/block/block.h b/include/block/block.h
index 2dd6630..c4f6eef 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -619,4 +619,23 @@  void bdrv_flush_io_queue(BlockDriverState *bs);
 
 BlockAcctStats *bdrv_get_stats(BlockDriverState *bs);
 
+/**
+ * bdrv_drained_begin:
+ *
+ * Begin a quiesced section for exclusive access to the BDS, by disabling
+ * external request sources including NBD server and device model. Note that
+ * this doesn't block timers or coroutines from submitting more requests, which
+ * means block_job_pause is still necessary.
+ *
+ * This function can be recursive.
+ */
+void bdrv_drained_begin(BlockDriverState *bs);
+
+/**
+ * bdrv_drained_end:
+ *
+ * End a quiescent section started by bdrv_drained_begin().
+ */
+void bdrv_drained_end(BlockDriverState *bs);
+
 #endif
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 14ad4c3..7c58221 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -456,6 +456,8 @@  struct BlockDriverState {
     /* threshold limit for writes, in bytes. "High water mark". */
     uint64_t write_threshold_offset;
     NotifierWithReturn write_threshold_notifier;
+
+    int quiesce_counter;
 };