Message ID | 1451035376-7252-3-git-send-email-xiecl.fnst@cn.fujitsu.com |
---|---|
State | New |
Headers | show |
On Fri 25 Dec 2015 10:22:55 AM CET, Changlong Xie wrote: > @@ -875,9 +878,9 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags, > ret = -EINVAL; > goto exit; > } > - if (s->num_children < 2) { > + if (s->num_children < 1) { > error_setg(&local_err, > - "Number of provided children must be greater than 1"); > + "Number of provided children must be 1 or more"); > ret = -EINVAL; > goto exit; > } I have a question: if you have a Quorum with just one member and you add a new one, how do you know if it has the same data as the existing one? In general, what do you do to make sure that the data in a new Quorum child is consistent with that of the rest of the array? Berto
On 01/20/2016 11:43 PM, Alberto Garcia wrote: > On Fri 25 Dec 2015 10:22:55 AM CET, Changlong Xie wrote: >> @@ -875,9 +878,9 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags, >> ret = -EINVAL; >> goto exit; >> } >> - if (s->num_children < 2) { >> + if (s->num_children < 1) { >> error_setg(&local_err, >> - "Number of provided children must be greater than 1"); >> + "Number of provided children must be 1 or more"); >> ret = -EINVAL; >> goto exit; >> } > > I have a question: if you have a Quorum with just one member and you add > a new one, how do you know if it has the same data as the existing one? > > In general, what do you do to make sure that the data in a new Quorum > child is consistent with that of the rest of the array? Quorum can have more than one child when it starts. But we don't do the similar check. So I don't think we should do such check here. Thanks Wen Congyang > > Berto > > > . >
On Thu 21 Jan 2016 02:54:10 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote: >>> @@ -875,9 +878,9 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags, >>> ret = -EINVAL; >>> goto exit; >>> } >>> - if (s->num_children < 2) { >>> + if (s->num_children < 1) { >>> error_setg(&local_err, >>> - "Number of provided children must be greater than 1"); >>> + "Number of provided children must be 1 or more"); >>> ret = -EINVAL; >>> goto exit; >>> } >> >> I have a question: if you have a Quorum with just one member and you >> add a new one, how do you know if it has the same data as the >> existing one? >> >> In general, what do you do to make sure that the data in a new Quorum >> child is consistent with that of the rest of the array? > > Quorum can have more than one child when it starts. But we don't do > the similar check. So I don't think we should do such check here. Yes, but when you start a VM you can verify in advance that all members of the Quorum have the same data. If you do that on a running VM how can you know if the new disk is consistent with the others? Berto
On 01/21/2016 06:05 AM, Alberto Garcia wrote: > On Thu 21 Jan 2016 02:54:10 AM CET, Wen Congyang <wency@cn.fujitsu.com> wrote: > >>>> @@ -875,9 +878,9 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags, >>>> ret = -EINVAL; >>>> goto exit; >>>> } >>>> - if (s->num_children < 2) { >>>> + if (s->num_children < 1) { >>>> error_setg(&local_err, >>>> - "Number of provided children must be greater than 1"); >>>> + "Number of provided children must be 1 or more"); >>>> ret = -EINVAL; >>>> goto exit; >>>> } >>> >>> I have a question: if you have a Quorum with just one member and you >>> add a new one, how do you know if it has the same data as the >>> existing one? >>> >>> In general, what do you do to make sure that the data in a new Quorum >>> child is consistent with that of the rest of the array? >> >> Quorum can have more than one child when it starts. But we don't do >> the similar check. So I don't think we should do such check here. > > Yes, but when you start a VM you can verify in advance that all members > of the Quorum have the same data. If you do that on a running VM how can > you know if the new disk is consistent with the others? User error if it is not. Just the same as it is user error if you request a shallow drive-mirror but the destination is not the same contents as the backing file. I don't think qemu has to protect us from user error in this case.
On Thu 21 Jan 2016 05:58:42 PM CET, Eric Blake <eblake@redhat.com> wrote: >>>> In general, what do you do to make sure that the data in a new Quorum >>>> child is consistent with that of the rest of the array? >>> >>> Quorum can have more than one child when it starts. But we don't do >>> the similar check. So I don't think we should do such check here. >> >> Yes, but when you start a VM you can verify in advance that all >> members of the Quorum have the same data. If you do that on a running >> VM how can you know if the new disk is consistent with the others? > > User error if it is not. Just the same as it is user error if you > request a shallow drive-mirror but the destination is not the same > contents as the backing file. I don't think qemu has to protect us > from user error in this case. But the backing file is read-only so the user can guarantee that the destination has the same data before the shallow mirror. How do you do that in this case? Berto
* Alberto Garcia (berto@igalia.com) wrote: > On Thu 21 Jan 2016 05:58:42 PM CET, Eric Blake <eblake@redhat.com> wrote: > >>>> In general, what do you do to make sure that the data in a new Quorum > >>>> child is consistent with that of the rest of the array? > >>> > >>> Quorum can have more than one child when it starts. But we don't do > >>> the similar check. So I don't think we should do such check here. > >> > >> Yes, but when you start a VM you can verify in advance that all > >> members of the Quorum have the same data. If you do that on a running > >> VM how can you know if the new disk is consistent with the others? > > > > User error if it is not. Just the same as it is user error if you > > request a shallow drive-mirror but the destination is not the same > > contents as the backing file. I don't think qemu has to protect us > > from user error in this case. > > But the backing file is read-only so the user can guarantee that the > destination has the same data before the shallow mirror. How do you do > that in this case? I think in the colo case they're relying on doing a block migrate to synchronise the remote disk prior to switching into colo mode. Dave > Berto -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On 01/23/2016 04:02 AM, Dr. David Alan Gilbert wrote: > * Alberto Garcia (berto@igalia.com) wrote: >> On Thu 21 Jan 2016 05:58:42 PM CET, Eric Blake <eblake@redhat.com> wrote: >>>>>> In general, what do you do to make sure that the data in a new Quorum >>>>>> child is consistent with that of the rest of the array? >>>>> >>>>> Quorum can have more than one child when it starts. But we don't do >>>>> the similar check. So I don't think we should do such check here. >>>> >>>> Yes, but when you start a VM you can verify in advance that all >>>> members of the Quorum have the same data. If you do that on a running >>>> VM how can you know if the new disk is consistent with the others? >>> >>> User error if it is not. Just the same as it is user error if you >>> request a shallow drive-mirror but the destination is not the same >>> contents as the backing file. I don't think qemu has to protect us >>> from user error in this case. >> >> But the backing file is read-only so the user can guarantee that the >> destination has the same data before the shallow mirror. How do you do >> that in this case? > > I think in the colo case they're relying on doing a block migrate > to synchronise the remote disk prior to switching into colo mode. Yes, we can do a block migration to sync the disk. After the migration finished, we stop block migration before starting colo. Thanks Wen Congyang > > Dave > >> Berto > -- > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK > > > . >
On Fri 22 Jan 2016 09:02:10 PM CET, "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote: >> >>>> In general, what do you do to make sure that the data in a new >> >>>> Quorum child is consistent with that of the rest of the array? >> >>> >> >>> Quorum can have more than one child when it starts. But we don't >> >>> do the similar check. So I don't think we should do such check >> >>> here. >> >> >> >> Yes, but when you start a VM you can verify in advance that all >> >> members of the Quorum have the same data. If you do that on a >> >> running VM how can you know if the new disk is consistent with the >> >> others? >> > >> > User error if it is not. Just the same as it is user error if you >> > request a shallow drive-mirror but the destination is not the same >> > contents as the backing file. I don't think qemu has to protect us >> > from user error in this case. >> >> But the backing file is read-only so the user can guarantee that the >> destination has the same data before the shallow mirror. How do you >> do that in this case? > > I think in the colo case they're relying on doing a block migrate to > synchronise the remote disk prior to switching into colo mode. Yes but this is a general API that can be used independently from COLO. I'd say if we want to allow that we should at least place a big warning in the documentation. Berto
On 02/09/2016 01:06 AM, Alberto Garcia wrote: > On Fri 22 Jan 2016 09:02:10 PM CET, "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote: > >>>>>>> In general, what do you do to make sure that the data in a new >>>>>>> Quorum child is consistent with that of the rest of the array? >>>>>> >>>>>> Quorum can have more than one child when it starts. But we don't >>>>>> do the similar check. So I don't think we should do such check >>>>>> here. >>>>> >>>>> Yes, but when you start a VM you can verify in advance that all >>>>> members of the Quorum have the same data. If you do that on a >>>>> running VM how can you know if the new disk is consistent with the >>>>> others? >>>> >>>> User error if it is not. Just the same as it is user error if you >>>> request a shallow drive-mirror but the destination is not the same >>>> contents as the backing file. I don't think qemu has to protect us >>>> from user error in this case. >>> >>> But the backing file is read-only so the user can guarantee that the >>> destination has the same data before the shallow mirror. How do you >>> do that in this case? >> >> I think in the colo case they're relying on doing a block migrate to >> synchronise the remote disk prior to switching into colo mode. > > Yes but this is a general API that can be used independently from > COLO. I'd say if we want to allow that we should at least place a big > warning in the documentation. > Ok, that's fair enough. Will add in next version. Thanks -Xie > Berto > > > . >
diff --git a/block.c b/block.c index a347008..b9e99da 100644 --- a/block.c +++ b/block.c @@ -1196,10 +1196,10 @@ static int bdrv_fill_options(QDict **options, const char *filename, return 0; } -static BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs, - BlockDriverState *child_bs, - const char *child_name, - const BdrvChildRole *child_role) +BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs, + BlockDriverState *child_bs, + const char *child_name, + const BdrvChildRole *child_role) { BdrvChild *child = g_new(BdrvChild, 1); *child = (BdrvChild) { diff --git a/block/quorum.c b/block/quorum.c index 6793f12..e73418c 100644 --- a/block/quorum.c +++ b/block/quorum.c @@ -23,6 +23,7 @@ #include "qapi/qmp/qstring.h" #include "qapi-event.h" #include "crypto/hash.h" +#include "qemu/bitmap.h" #define HASH_LENGTH 32 @@ -80,6 +81,8 @@ typedef struct BDRVQuorumState { bool rewrite_corrupted;/* true if the driver must rewrite-on-read corrupted * block if Quorum is reached. */ + unsigned long *index_bitmap; + int bsize; QuorumReadPattern read_pattern; } BDRVQuorumState; @@ -875,9 +878,9 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags, ret = -EINVAL; goto exit; } - if (s->num_children < 2) { + if (s->num_children < 1) { error_setg(&local_err, - "Number of provided children must be greater than 1"); + "Number of provided children must be 1 or more"); ret = -EINVAL; goto exit; } @@ -926,6 +929,7 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags, /* allocate the children array */ s->children = g_new0(BdrvChild *, s->num_children); opened = g_new0(bool, s->num_children); + s->index_bitmap = bitmap_new(s->num_children); for (i = 0; i < s->num_children; i++) { char indexstr[32]; @@ -941,6 +945,8 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags, opened[i] = true; } + bitmap_set(s->index_bitmap, 0, s->num_children); + s->bsize = s->num_children; g_free(opened); goto exit; @@ -997,6 +1003,115 @@ static void quorum_attach_aio_context(BlockDriverState *bs, } } +static int get_new_child_index(BDRVQuorumState *s) +{ + int index; + + index = find_next_zero_bit(s->index_bitmap, s->bsize, 0); + if (index < s->bsize) { + return index; + } + + if ((s->bsize % BITS_PER_LONG) == 0) { + s->index_bitmap = bitmap_zero_extend(s->index_bitmap, s->bsize, + s->bsize + 1); + } + + return s->bsize++; +} + +static void remove_child_index(BDRVQuorumState *s, int index) +{ + int last_index; + long new_len; + + assert(index < s->bsize); + + clear_bit(index, s->index_bitmap); + if (index < s->bsize - 1) { + /* + * The last bit is always set, and we don't clear + * the last bit. + */ + return; + } + + last_index = find_last_bit(s->index_bitmap, s->bsize); + s->bsize = last_index + 1; + if (BITS_TO_LONGS(last_index + 1) == BITS_TO_LONGS(s->bsize)) { + return; + } + + new_len = BITS_TO_LONGS(last_index + 1) * sizeof(unsigned long); + s->index_bitmap = g_realloc(s->index_bitmap, new_len); +} + +static void quorum_add_child(BlockDriverState *bs, BlockDriverState *child_bs, + Error **errp) +{ + BDRVQuorumState *s = bs->opaque; + BdrvChild *child; + char indexstr[32]; + int index, ret; + + index = get_new_child_index(s); + ret = snprintf(indexstr, 32, "children.%d", index); + if (ret < 0 || ret >= 32) { + error_setg(errp, "cannot generate child name"); + return; + } + + bdrv_drain(bs); + + assert(s->num_children <= INT_MAX / sizeof(BdrvChild *)); + if (s->num_children == INT_MAX / sizeof(BdrvChild *)) { + error_setg(errp, "Too many children"); + return; + } + s->children = g_renew(BdrvChild *, s->children, s->num_children + 1); + + bdrv_ref(child_bs); + child = bdrv_attach_child(bs, child_bs, indexstr, &child_format); + s->children[s->num_children++] = child; + set_bit(index, s->index_bitmap); +} + +static void quorum_del_child(BlockDriverState *bs, BlockDriverState *child_bs, + Error **errp) +{ + BDRVQuorumState *s = bs->opaque; + BdrvChild *child; + int i, index; + + for (i = 0; i < s->num_children; i++) { + if (s->children[i]->bs == child_bs) { + break; + } + } + + /* we have checked it in bdrv_del_child() */ + assert(i < s->num_children); + child = s->children[i]; + + if (s->num_children <= s->threshold) { + error_setg(errp, + "The number of children cannot be lower than the vote threshold %d", + s->threshold); + return; + } + + /* child->name is "children.%d" */ + index = atoi(child->name + 9); + + bdrv_drain(bs); + /* We can safely remove this child now */ + memmove(&s->children[i], &s->children[i + 1], + (s->num_children - i - 1) * sizeof(void *)); + s->children = g_renew(BdrvChild *, s->children, --s->num_children); + remove_child_index(s, index); + bdrv_unref_child(bs, child); +} + static void quorum_refresh_filename(BlockDriverState *bs, QDict *options) { BDRVQuorumState *s = bs->opaque; @@ -1052,6 +1167,9 @@ static BlockDriver bdrv_quorum = { .bdrv_detach_aio_context = quorum_detach_aio_context, .bdrv_attach_aio_context = quorum_attach_aio_context, + .bdrv_add_child = quorum_add_child, + .bdrv_del_child = quorum_del_child, + .is_filter = true, .bdrv_recurse_is_first_non_filter = quorum_recurse_is_first_non_filter, }; diff --git a/include/block/block.h b/include/block/block.h index 863a7c8..6c7e54b 100644 --- a/include/block/block.h +++ b/include/block/block.h @@ -513,6 +513,10 @@ void bdrv_disable_copy_on_read(BlockDriverState *bs); void bdrv_ref(BlockDriverState *bs); void bdrv_unref(BlockDriverState *bs); void bdrv_unref_child(BlockDriverState *parent, BdrvChild *child); +BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs, + BlockDriverState *child_bs, + const char *child_name, + const BdrvChildRole *child_role); bool bdrv_op_is_blocked(BlockDriverState *bs, BlockOpType op, Error **errp); void bdrv_op_block(BlockDriverState *bs, BlockOpType op, Error *reason);