diff mbox

[v9,2/3] quorum: implement bdrv_add_child() and bdrv_del_child()

Message ID 1451035376-7252-3-git-send-email-xiecl.fnst@cn.fujitsu.com
State New
Headers show

Commit Message

Changlong Xie Dec. 25, 2015, 9:22 a.m. UTC
From: Wen Congyang <wency@cn.fujitsu.com>

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
---
 block.c               |   8 ++--
 block/quorum.c        | 122 +++++++++++++++++++++++++++++++++++++++++++++++++-
 include/block/block.h |   4 ++
 3 files changed, 128 insertions(+), 6 deletions(-)

Comments

Alberto Garcia Jan. 20, 2016, 3:43 p.m. UTC | #1
On Fri 25 Dec 2015 10:22:55 AM CET, Changlong Xie wrote:
> @@ -875,9 +878,9 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
>          ret = -EINVAL;
>          goto exit;
>      }
> -    if (s->num_children < 2) {
> +    if (s->num_children < 1) {
>          error_setg(&local_err,
> -                   "Number of provided children must be greater than 1");
> +                   "Number of provided children must be 1 or more");
>          ret = -EINVAL;
>          goto exit;
>      }

I have a question: if you have a Quorum with just one member and you add
a new one, how do you know if it has the same data as the existing one?

In general, what do you do to make sure that the data in a new Quorum
child is consistent with that of the rest of the array?

Berto
Wen Congyang Jan. 21, 2016, 1:54 a.m. UTC | #2
On 01/20/2016 11:43 PM, Alberto Garcia wrote:
> On Fri 25 Dec 2015 10:22:55 AM CET, Changlong Xie wrote:
>> @@ -875,9 +878,9 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
>>          ret = -EINVAL;
>>          goto exit;
>>      }
>> -    if (s->num_children < 2) {
>> +    if (s->num_children < 1) {
>>          error_setg(&local_err,
>> -                   "Number of provided children must be greater than 1");
>> +                   "Number of provided children must be 1 or more");
>>          ret = -EINVAL;
>>          goto exit;
>>      }
> 
> I have a question: if you have a Quorum with just one member and you add
> a new one, how do you know if it has the same data as the existing one?
> 
> In general, what do you do to make sure that the data in a new Quorum
> child is consistent with that of the rest of the array?

Quorum can have more than one child when it starts. But we don't do the
similar check. So I don't think we should do such check here.

Thanks
Wen Congyang

> 
> Berto
> 
> 
> .
>
Alberto Garcia Jan. 21, 2016, 1:05 p.m. UTC | #3
On Thu 21 Jan 2016 02:54:10 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote:

>>> @@ -875,9 +878,9 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
>>>          ret = -EINVAL;
>>>          goto exit;
>>>      }
>>> -    if (s->num_children < 2) {
>>> +    if (s->num_children < 1) {
>>>          error_setg(&local_err,
>>> -                   "Number of provided children must be greater than 1");
>>> +                   "Number of provided children must be 1 or more");
>>>          ret = -EINVAL;
>>>          goto exit;
>>>      }
>> 
>> I have a question: if you have a Quorum with just one member and you
>> add a new one, how do you know if it has the same data as the
>> existing one?
>> 
>> In general, what do you do to make sure that the data in a new Quorum
>> child is consistent with that of the rest of the array?
>
> Quorum can have more than one child when it starts. But we don't do
> the similar check. So I don't think we should do such check here.

Yes, but when you start a VM you can verify in advance that all members
of the Quorum have the same data. If you do that on a running VM how can
you know if the new disk is consistent with the others?

Berto
Eric Blake Jan. 21, 2016, 4:58 p.m. UTC | #4
On 01/21/2016 06:05 AM, Alberto Garcia wrote:
> On Thu 21 Jan 2016 02:54:10 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote:
> 
>>>> @@ -875,9 +878,9 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
>>>>          ret = -EINVAL;
>>>>          goto exit;
>>>>      }
>>>> -    if (s->num_children < 2) {
>>>> +    if (s->num_children < 1) {
>>>>          error_setg(&local_err,
>>>> -                   "Number of provided children must be greater than 1");
>>>> +                   "Number of provided children must be 1 or more");
>>>>          ret = -EINVAL;
>>>>          goto exit;
>>>>      }
>>>
>>> I have a question: if you have a Quorum with just one member and you
>>> add a new one, how do you know if it has the same data as the
>>> existing one?
>>>
>>> In general, what do you do to make sure that the data in a new Quorum
>>> child is consistent with that of the rest of the array?
>>
>> Quorum can have more than one child when it starts. But we don't do
>> the similar check. So I don't think we should do such check here.
> 
> Yes, but when you start a VM you can verify in advance that all members
> of the Quorum have the same data. If you do that on a running VM how can
> you know if the new disk is consistent with the others?

User error if it is not.  Just the same as it is user error if you
request a shallow drive-mirror but the destination is not the same
contents as the backing file.  I don't think qemu has to protect us from
user error in this case.
Alberto Garcia Jan. 22, 2016, 9:42 a.m. UTC | #5
On Thu 21 Jan 2016 05:58:42 PM CET, Eric Blake <eblake@redhat.com> wrote:
>>>> In general, what do you do to make sure that the data in a new Quorum
>>>> child is consistent with that of the rest of the array?
>>>
>>> Quorum can have more than one child when it starts. But we don't do
>>> the similar check. So I don't think we should do such check here.
>> 
>> Yes, but when you start a VM you can verify in advance that all
>> members of the Quorum have the same data. If you do that on a running
>> VM how can you know if the new disk is consistent with the others?
>
> User error if it is not.  Just the same as it is user error if you
> request a shallow drive-mirror but the destination is not the same
> contents as the backing file.  I don't think qemu has to protect us
> from user error in this case.

But the backing file is read-only so the user can guarantee that the
destination has the same data before the shallow mirror. How do you do
that in this case?

Berto
Dr. David Alan Gilbert Jan. 22, 2016, 8:02 p.m. UTC | #6
* Alberto Garcia (berto@igalia.com) wrote:
> On Thu 21 Jan 2016 05:58:42 PM CET, Eric Blake <eblake@redhat.com> wrote:
> >>>> In general, what do you do to make sure that the data in a new Quorum
> >>>> child is consistent with that of the rest of the array?
> >>>
> >>> Quorum can have more than one child when it starts. But we don't do
> >>> the similar check. So I don't think we should do such check here.
> >> 
> >> Yes, but when you start a VM you can verify in advance that all
> >> members of the Quorum have the same data. If you do that on a running
> >> VM how can you know if the new disk is consistent with the others?
> >
> > User error if it is not.  Just the same as it is user error if you
> > request a shallow drive-mirror but the destination is not the same
> > contents as the backing file.  I don't think qemu has to protect us
> > from user error in this case.
> 
> But the backing file is read-only so the user can guarantee that the
> destination has the same data before the shallow mirror. How do you do
> that in this case?

I think in the colo case they're relying on doing a block migrate
to synchronise the remote disk prior to switching into colo mode.

Dave

> Berto
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Wen Congyang Jan. 25, 2016, 1:13 a.m. UTC | #7
On 01/23/2016 04:02 AM, Dr. David Alan Gilbert wrote:
> * Alberto Garcia (berto@igalia.com) wrote:
>> On Thu 21 Jan 2016 05:58:42 PM CET, Eric Blake <eblake@redhat.com> wrote:
>>>>>> In general, what do you do to make sure that the data in a new Quorum
>>>>>> child is consistent with that of the rest of the array?
>>>>>
>>>>> Quorum can have more than one child when it starts. But we don't do
>>>>> the similar check. So I don't think we should do such check here.
>>>>
>>>> Yes, but when you start a VM you can verify in advance that all
>>>> members of the Quorum have the same data. If you do that on a running
>>>> VM how can you know if the new disk is consistent with the others?
>>>
>>> User error if it is not.  Just the same as it is user error if you
>>> request a shallow drive-mirror but the destination is not the same
>>> contents as the backing file.  I don't think qemu has to protect us
>>> from user error in this case.
>>
>> But the backing file is read-only so the user can guarantee that the
>> destination has the same data before the shallow mirror. How do you do
>> that in this case?
> 
> I think in the colo case they're relying on doing a block migrate
> to synchronise the remote disk prior to switching into colo mode.

Yes, we can do a block migration to sync the disk. After the migration finished,
we stop block migration before starting colo.

Thanks
Wen Congyang

> 
> Dave
> 
>> Berto
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
> 
> .
>
Alberto Garcia Feb. 8, 2016, 5:06 p.m. UTC | #8
On Fri 22 Jan 2016 09:02:10 PM CET, "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

>> >>>> In general, what do you do to make sure that the data in a new
>> >>>> Quorum child is consistent with that of the rest of the array?
>> >>>
>> >>> Quorum can have more than one child when it starts. But we don't
>> >>> do the similar check. So I don't think we should do such check
>> >>> here.
>> >> 
>> >> Yes, but when you start a VM you can verify in advance that all
>> >> members of the Quorum have the same data. If you do that on a
>> >> running VM how can you know if the new disk is consistent with the
>> >> others?
>> >
>> > User error if it is not.  Just the same as it is user error if you
>> > request a shallow drive-mirror but the destination is not the same
>> > contents as the backing file.  I don't think qemu has to protect us
>> > from user error in this case.
>> 
>> But the backing file is read-only so the user can guarantee that the
>> destination has the same data before the shallow mirror. How do you
>> do that in this case?
>
> I think in the colo case they're relying on doing a block migrate to
> synchronise the remote disk prior to switching into colo mode.

Yes but this is a general API that can be used independently from
COLO. I'd say if we want to allow that we should at least place a big
warning in the documentation.

Berto
Changlong Xie Feb. 16, 2016, 6:05 a.m. UTC | #9
On 02/09/2016 01:06 AM, Alberto Garcia wrote:
> On Fri 22 Jan 2016 09:02:10 PM CET, "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
>
>>>>>>> In general, what do you do to make sure that the data in a new
>>>>>>> Quorum child is consistent with that of the rest of the array?
>>>>>>
>>>>>> Quorum can have more than one child when it starts. But we don't
>>>>>> do the similar check. So I don't think we should do such check
>>>>>> here.
>>>>>
>>>>> Yes, but when you start a VM you can verify in advance that all
>>>>> members of the Quorum have the same data. If you do that on a
>>>>> running VM how can you know if the new disk is consistent with the
>>>>> others?
>>>>
>>>> User error if it is not.  Just the same as it is user error if you
>>>> request a shallow drive-mirror but the destination is not the same
>>>> contents as the backing file.  I don't think qemu has to protect us
>>>> from user error in this case.
>>>
>>> But the backing file is read-only so the user can guarantee that the
>>> destination has the same data before the shallow mirror. How do you
>>> do that in this case?
>>
>> I think in the colo case they're relying on doing a block migrate to
>> synchronise the remote disk prior to switching into colo mode.
>
> Yes but this is a general API that can be used independently from
> COLO. I'd say if we want to allow that we should at least place a big
> warning in the documentation.
>

Ok, that's fair enough. Will add in next version.

Thanks
	-Xie
> Berto
>
>
> .
>
diff mbox

Patch

diff --git a/block.c b/block.c
index a347008..b9e99da 100644
--- a/block.c
+++ b/block.c
@@ -1196,10 +1196,10 @@  static int bdrv_fill_options(QDict **options, const char *filename,
     return 0;
 }
 
-static BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
-                                    BlockDriverState *child_bs,
-                                    const char *child_name,
-                                    const BdrvChildRole *child_role)
+BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
+                             BlockDriverState *child_bs,
+                             const char *child_name,
+                             const BdrvChildRole *child_role)
 {
     BdrvChild *child = g_new(BdrvChild, 1);
     *child = (BdrvChild) {
diff --git a/block/quorum.c b/block/quorum.c
index 6793f12..e73418c 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -23,6 +23,7 @@ 
 #include "qapi/qmp/qstring.h"
 #include "qapi-event.h"
 #include "crypto/hash.h"
+#include "qemu/bitmap.h"
 
 #define HASH_LENGTH 32
 
@@ -80,6 +81,8 @@  typedef struct BDRVQuorumState {
     bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted
                             * block if Quorum is reached.
                             */
+    unsigned long *index_bitmap;
+    int bsize;
 
     QuorumReadPattern read_pattern;
 } BDRVQuorumState;
@@ -875,9 +878,9 @@  static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
         ret = -EINVAL;
         goto exit;
     }
-    if (s->num_children < 2) {
+    if (s->num_children < 1) {
         error_setg(&local_err,
-                   "Number of provided children must be greater than 1");
+                   "Number of provided children must be 1 or more");
         ret = -EINVAL;
         goto exit;
     }
@@ -926,6 +929,7 @@  static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
     /* allocate the children array */
     s->children = g_new0(BdrvChild *, s->num_children);
     opened = g_new0(bool, s->num_children);
+    s->index_bitmap = bitmap_new(s->num_children);
 
     for (i = 0; i < s->num_children; i++) {
         char indexstr[32];
@@ -941,6 +945,8 @@  static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
 
         opened[i] = true;
     }
+    bitmap_set(s->index_bitmap, 0, s->num_children);
+    s->bsize = s->num_children;
 
     g_free(opened);
     goto exit;
@@ -997,6 +1003,115 @@  static void quorum_attach_aio_context(BlockDriverState *bs,
     }
 }
 
+static int get_new_child_index(BDRVQuorumState *s)
+{
+    int index;
+
+    index = find_next_zero_bit(s->index_bitmap, s->bsize, 0);
+    if (index < s->bsize) {
+        return index;
+    }
+
+    if ((s->bsize % BITS_PER_LONG) == 0) {
+        s->index_bitmap = bitmap_zero_extend(s->index_bitmap, s->bsize,
+                                             s->bsize + 1);
+    }
+
+    return s->bsize++;
+}
+
+static void remove_child_index(BDRVQuorumState *s, int index)
+{
+    int last_index;
+    long new_len;
+
+    assert(index < s->bsize);
+
+    clear_bit(index, s->index_bitmap);
+    if (index < s->bsize - 1) {
+        /*
+         * The last bit is always set, and we don't clear
+         * the last bit.
+         */
+        return;
+    }
+
+    last_index = find_last_bit(s->index_bitmap, s->bsize);
+    s->bsize = last_index + 1;
+    if (BITS_TO_LONGS(last_index + 1) == BITS_TO_LONGS(s->bsize)) {
+        return;
+    }
+
+    new_len = BITS_TO_LONGS(last_index + 1) * sizeof(unsigned long);
+    s->index_bitmap = g_realloc(s->index_bitmap, new_len);
+}
+
+static void quorum_add_child(BlockDriverState *bs, BlockDriverState *child_bs,
+                             Error **errp)
+{
+    BDRVQuorumState *s = bs->opaque;
+    BdrvChild *child;
+    char indexstr[32];
+    int index, ret;
+
+    index = get_new_child_index(s);
+    ret = snprintf(indexstr, 32, "children.%d", index);
+    if (ret < 0 || ret >= 32) {
+        error_setg(errp, "cannot generate child name");
+        return;
+    }
+
+    bdrv_drain(bs);
+
+    assert(s->num_children <= INT_MAX / sizeof(BdrvChild *));
+    if (s->num_children == INT_MAX / sizeof(BdrvChild *)) {
+        error_setg(errp, "Too many children");
+        return;
+    }
+    s->children = g_renew(BdrvChild *, s->children, s->num_children + 1);
+
+    bdrv_ref(child_bs);
+    child = bdrv_attach_child(bs, child_bs, indexstr, &child_format);
+    s->children[s->num_children++] = child;
+    set_bit(index, s->index_bitmap);
+}
+
+static void quorum_del_child(BlockDriverState *bs, BlockDriverState *child_bs,
+                             Error **errp)
+{
+    BDRVQuorumState *s = bs->opaque;
+    BdrvChild *child;
+    int i, index;
+
+    for (i = 0; i < s->num_children; i++) {
+        if (s->children[i]->bs == child_bs) {
+            break;
+        }
+    }
+
+    /* we have checked it in bdrv_del_child() */
+    assert(i < s->num_children);
+    child = s->children[i];
+
+    if (s->num_children <= s->threshold) {
+        error_setg(errp,
+            "The number of children cannot be lower than the vote threshold %d",
+            s->threshold);
+        return;
+    }
+
+    /* child->name is "children.%d" */
+    index = atoi(child->name + 9);
+
+    bdrv_drain(bs);
+    /* We can safely remove this child now */
+    memmove(&s->children[i], &s->children[i + 1],
+            (s->num_children - i - 1) * sizeof(void *));
+    s->children = g_renew(BdrvChild *, s->children, --s->num_children);
+    remove_child_index(s, index);
+    bdrv_unref_child(bs, child);
+}
+
 static void quorum_refresh_filename(BlockDriverState *bs, QDict *options)
 {
     BDRVQuorumState *s = bs->opaque;
@@ -1052,6 +1167,9 @@  static BlockDriver bdrv_quorum = {
     .bdrv_detach_aio_context            = quorum_detach_aio_context,
     .bdrv_attach_aio_context            = quorum_attach_aio_context,
 
+    .bdrv_add_child                     = quorum_add_child,
+    .bdrv_del_child                     = quorum_del_child,
+
     .is_filter                          = true,
     .bdrv_recurse_is_first_non_filter   = quorum_recurse_is_first_non_filter,
 };
diff --git a/include/block/block.h b/include/block/block.h
index 863a7c8..6c7e54b 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -513,6 +513,10 @@  void bdrv_disable_copy_on_read(BlockDriverState *bs);
 void bdrv_ref(BlockDriverState *bs);
 void bdrv_unref(BlockDriverState *bs);
 void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child);
+BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,
+                             BlockDriverState *child_bs,
+                             const char *child_name,
+                             const BdrvChildRole *child_role);
 
 bool bdrv_op_is_blocked(BlockDriverState *bs, BlockOpType op, Error **errp);
 void bdrv_op_block(BlockDriverState *bs, BlockOpType op, Error *reason);