From patchwork Mon Aug 27 07:30:26 2012
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Benoit Canet
X-Patchwork-Id: 180146
From: "=?UTF-8?q?Beno=C3=AEt=20Canet?="
To: qemu-devel@nongnu.org
Date: Mon, 27 Aug 2012 09:30:26 +0200
Message-Id: <1346052629-15686-9-git-send-email-benoit@irqsave.net>
X-Mailer: git-send-email 1.7.9.5
In-Reply-To: <1346052629-15686-1-git-send-email-benoit@irqsave.net>
References: <1346052629-15686-1-git-send-email-benoit@irqsave.net>
Cc: kwolf@redhat.com, stefanha@linux.vnet.ibm.com, blauwirbel@gmail.com,
 pbonzini@redhat.com, eblake@redhat.com,
 =?UTF-8?q?Beno=C3=AEt=20Canet?=
Subject: [Qemu-devel] [RFC V5 08/11] quorum: Add quorum mechanism.

Signed-off-by: Benoit Canet
---
 block/quorum.c |  222 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 221 insertions(+), 1 deletion(-)

diff --git a/block/quorum.c b/block/quorum.c
index 791ef4a..3fa9d53 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -14,6 +14,20 @@
  */
 
 #include "block_int.h"
+#include "zlib.h"
+
+typedef struct QuorumVoteItem {
+    int index;
+    QLIST_ENTRY(QuorumVoteItem) next;
+} QuorumVoteItem;
+
+typedef struct QuorumVoteVersion {
+    unsigned long value;
+    int index;
+    int vote_count;
+    QLIST_HEAD(, QuorumVoteItem) items;
+    QLIST_ENTRY(QuorumVoteVersion) next;
+} QuorumVoteVersion;
 
 typedef struct {
     BlockDriverState **bs;
@@ -31,6 +45,10 @@ typedef struct QuorumSingleAIOCB {
     QuorumAIOCB *parent;
 } QuorumSingleAIOCB;
 
+typedef struct QuorumVotes {
+    QLIST_HEAD(, QuorumVoteVersion) vote_list;
+} QuorumVotes;
+
 struct QuorumAIOCB {
     BlockDriverAIOCB common;
     BDRVQuorumState *bqs;
@@ -48,6 +66,8 @@ struct QuorumAIOCB {
     int success_count;          /* number of successfully completed AIOCB */
     bool *finished;             /* completion signal for cancel */
 
+    QuorumVotes votes;
+
     void (*vote)(QuorumAIOCB *acb);
     int vote_ret;
 };
@@ -204,6 +224,11 @@ static void quorum_aio_bh(void *opaque)
     }
 
     qemu_bh_delete(acb->bh);
+
+    if (acb->vote_ret) {
+        ret = acb->vote_ret;
+    }
+
     acb->common.cb(acb->common.opaque, ret);
     if (acb->finished) {
         *acb->finished = true;
@@ -239,6 +264,7 @@ static QuorumAIOCB *quorum_aio_get(BDRVQuorumState *s,
     acb->nb_sectors = nb_sectors;
     acb->vote = NULL;
     acb->vote_ret = 0;
+    QLIST_INIT(&acb->votes.vote_list);
 
     for (i = 0; i < s->total; i++) {
         acb->aios[i].buf = NULL;
@@ -266,10 +292,202 @@ static void quorum_aio_cb(void *opaque, int ret)
         return;
     }
 
+    /* Do the vote */
+    if (acb->vote) {
+        acb->vote(acb);
+    }
+
     acb->bh = qemu_bh_new(quorum_aio_bh, acb);
     qemu_bh_schedule(acb->bh);
 }
 
+static void quorum_print_bad(QuorumAIOCB *acb, const char *filename)
+{
+    fprintf(stderr, "quorum: corrected error in quorum file %s: sector_num=%"
+            PRId64 " nb_sectors=%i\n", filename, acb->sector_num,
+            acb->nb_sectors);
+}
+
+static void quorum_print_failure(QuorumAIOCB *acb)
+{
+    fprintf(stderr, "quorum: failure sector_num=%" PRId64 " nb_sectors=%i\n",
+            acb->sector_num, acb->nb_sectors);
+}
+
+static void quorum_print_bad_versions(QuorumAIOCB *acb,
+                                      unsigned long checksum)
+{
+    QuorumVoteVersion *version;
+    QuorumVoteItem *item;
+    BDRVQuorumState *s = acb->bqs;
+
+    QLIST_FOREACH(version, &acb->votes.vote_list, next) {
+        if (version->value == checksum) {
+            continue;
+        }
+        QLIST_FOREACH(item, &version->items, next) {
+            quorum_print_bad(acb, s->filenames[item->index]);
+        }
+    }
+}
+
+static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
+{
+    int i;
+    assert(dest->niov == source->niov);
+    assert(dest->size == source->size);
+    for (i = 0; i < source->niov; i++) {
+        assert(dest->iov[i].iov_len == source->iov[i].iov_len);
+        memcpy(dest->iov[i].iov_base,
+               source->iov[i].iov_base,
+               source->iov[i].iov_len);
+    }
+}
+
+static void quorum_count_vote(QuorumVotes *votes,
+                              unsigned long checksum,
+                              int index)
+{
+    QuorumVoteVersion *v = NULL, *version = NULL;
+    QuorumVoteItem *item;
+
+    /* look if we have something with this checksum */
+    QLIST_FOREACH(v, &votes->vote_list, next) {
+        if (v->value == checksum) {
+            version = v;
+            break;
+        }
+    }
+
+    /* It's a version not yet in the list: add it */
+    if (!version) {
+        version = g_new0(QuorumVoteVersion, 1);
+        QLIST_INIT(&version->items);
+        version->value = checksum;
+        version->index = index;
+        version->vote_count = 0;
+        QLIST_INSERT_HEAD(&votes->vote_list, version, next);
+    }
+
+    version->vote_count++;
+
+    item = g_new0(QuorumVoteItem, 1);
+    item->index = index;
+    QLIST_INSERT_HEAD(&version->items, item, next);
+}
+
+static void quorum_free_vote_list(QuorumVotes *votes)
+{
+    QuorumVoteVersion *version, *next_version;
+    QuorumVoteItem *item, *next_item;
+
+    QLIST_FOREACH_SAFE(version, &votes->vote_list, next, next_version) {
+        QLIST_REMOVE(version, next);
+        QLIST_FOREACH_SAFE(item, &version->items, next, next_item) {
+            QLIST_REMOVE(item, next);
+            g_free(item);
+        }
+        g_free(version);
+    }
+}
+
+static unsigned long quorum_compute_checksum(QuorumAIOCB *acb, int i)
+{
+    int j;
+    unsigned long adler = adler32(0L, Z_NULL, 0);
+    QEMUIOVector *qiov = &acb->qiovs[i];
+
+    for (j = 0; j < qiov->niov; j++) {
+        adler = adler32(adler,
+                        qiov->iov[j].iov_base,
+                        qiov->iov[j].iov_len);
+    }
+
+    return adler;
+}
+
+static QuorumVoteVersion *quorum_get_vote_winner(QuorumVotes *votes)
+{
+    int i = 0;
+    QuorumVoteVersion *candidate, *winner = NULL;
+
+    QLIST_FOREACH(candidate, &votes->vote_list, next) {
+        if (candidate->vote_count > i) {
+            i = candidate->vote_count;
+            winner = candidate;
+        }
+    }
+
+    return winner;
+}
+
+static void quorum_vote(QuorumAIOCB *acb)
+{
+    bool quorum = true;
+    int i, j;
+    unsigned long checksum = 0;
+    BDRVQuorumState *s = acb->bqs;
+    QuorumVoteVersion *winner;
+
+    /* get the index of the first successful read */
+    for (i = 0; i < s->total; i++) {
+        if (!acb->aios[i].ret) {
+            break;
+        }
+    }
+
+    /* compare this read with all other successful reads looking for quorum */
+    for (j = i + 1; j < s->total; j++) {
+        if (acb->aios[j].ret) {
+            continue;
+        }
+        if (qemu_iovec_compare(&acb->qiovs[i],
+                               &acb->qiovs[j]) != -1) {
+            quorum = false;
+            break;
+        }
+    }
+
+    /* Every successful read agrees -> Quorum */
+    if (quorum) {
+        quorum_copy_qiov(acb->qiov, &acb->qiovs[i]);
+        return;
+    }
+
+    /* compute checksums for each successful read, also store indexes */
+    for (i = 0; i < s->total; i++) {
+        if (acb->aios[i].ret) {
+            continue;
+        }
+        checksum = quorum_compute_checksum(acb, i);
+        quorum_count_vote(&acb->votes, checksum, i);
+    }
+
+    /* vote to select the most represented version */
+    winner = quorum_get_vote_winner(&acb->votes);
+    assert(winner != NULL);
+
+    /* if the winner count is smaller than the threshold the read fails */
+    if (winner->vote_count < s->threshold) {
+        quorum_print_failure(acb);
+        acb->vote_ret = -EIO;
+        fprintf(stderr, "quorum: vote result inferior to threshold\n");
+        goto free_exit;
+    }
+
+    /* we have a winner: copy it */
+    quorum_copy_qiov(acb->qiov, &acb->qiovs[winner->index]);
+
+    /* if some versions are bad print them */
+    if (i < s->total) {
+        quorum_print_bad_versions(acb, winner->value);
+    }
+
+free_exit:
+    /* free lists */
+    quorum_free_vote_list(&acb->votes);
+}
+
 static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
                                          int64_t sector_num,
                                          QEMUIOVector *qiov,
@@ -282,6 +500,8 @@ static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
                                       nb_sectors, cb, opaque);
     int i;
 
+    acb->vote = quorum_vote;
+
     for (i = 0; i < s->total; i++) {
         acb->aios[i].buf = qemu_blockalign(bs->file, qiov->size);
         qemu_iovec_init(&acb->qiovs[i], qiov->niov);
@@ -289,7 +509,7 @@ static BlockDriverAIOCB *quorum_aio_readv(BlockDriverState *bs,
     }
 
     for (i = 0; i < s->total; i++) {
-        bdrv_aio_readv(s->bs[i], sector_num, qiov, nb_sectors,
+        bdrv_aio_readv(s->bs[i], sector_num, &acb->qiovs[i], nb_sectors,
                        quorum_aio_cb, &acb->aios[i]);
     }
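
For readers who want to try the voting idea outside of QEMU, here is a
minimal standalone sketch. It is not part of the patch and makes several
assumptions: sketch_adler32() is a hand-written stand-in for zlib's
adler32(), sketch_vote() and MAX_REPLICAS are hypothetical names, and flat
buffers plus fixed-size arrays replace the QEMUIOVector and QLIST
structures used above. As in quorum_vote(), a non-zero ret[i] marks a
failed read, and the read fails when the most represented version has
fewer votes than the threshold.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REPLICAS 16 /* hypothetical upper bound, for this sketch only */

/* Plain Adler-32 over one buffer (stand-in for zlib's adler32()). */
static uint32_t sketch_adler32(const unsigned char *buf, size_t len)
{
    uint32_t a = 1, b = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        a = (a + buf[i]) % 65521;
        b = (b + a) % 65521;
    }
    return (b << 16) | a;
}

/*
 * Return the index of a replica holding the most represented content,
 * or -1 when fewer than `threshold` successful reads agree.
 * ret[i] != 0 marks replica i as a failed read.
 */
static int sketch_vote(const unsigned char *bufs[], const int ret[],
                       int total, size_t len, int threshold)
{
    uint32_t sums[MAX_REPLICAS];
    int i, j, best = -1, best_count = 0;

    assert(total <= MAX_REPLICAS);

    /* checksum every successful read */
    for (i = 0; i < total; i++) {
        if (!ret[i]) {
            sums[i] = sketch_adler32(bufs[i], len);
        }
    }

    /* count, for each successful read, how many reads share its checksum */
    for (i = 0; i < total; i++) {
        int count = 0;

        if (ret[i]) {
            continue;
        }
        for (j = 0; j < total; j++) {
            if (!ret[j] && sums[j] == sums[i]) {
                count++;
            }
        }
        if (count > best_count) {
            best_count = count;
            best = i;
        }
    }

    return best_count >= threshold ? best : -1;
}
```

The quadratic counting loop is chosen only to keep the sketch short. The
patch instead groups equal checksums into QuorumVoteVersion entries, which
also records which replicas lost the vote so they can be reported.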