
sheepdog: fix overlapping metadata update

Message ID 1438142555-27011-1-git-send-email-namei.unix@gmail.com
State New

Commit Message

Liu Yuan July 29, 2015, 4:02 a.m. UTC
From: Liu Yuan <liuyuan@cmss.chinamobile.com>

The current sheepdog driver uses a ranged update_inode(min_idx, max_idx) to batch
the index updates. But there is a subtle problem in how min_idx and max_idx are
determined:

For a single create request, min_idx == max_idx, so we update exactly one index,
as expected.

Now suppose we have two create requests, create(10) and create(20). Then min == 10
and max == 20, and even though only index 10 and index 20 need updating,
update_inode(10,20) will actually update the whole range from 10 to 20. This would
still work if update_inode() requests never overlapped, but unfortunately that is
not true in some corner cases. So the problem arises as follows:

req 1: update_inode(10,20)
req 2: update_inode(15,22)

req 1 and req 2 may carry different values in the overlapping range [15,20] and
cause problems. This can be illustrated by adding a printf in sd_write_done:

@@ -1976,6 +1976,7 @@ static void coroutine_fn sd_write_done(SheepdogAIOCB *acb)

     mn = s->min_dirty_data_idx;
     mx = s->max_dirty_data_idx;
+    printf("min %u, max %u\n", mn, mx);
     if (mn <= mx) {
         /* we need to update the vdi object. */
         offset = sizeof(s->inode) - sizeof(s->inode.data_vdi_id) +

...
min 4294967295, max 0
min 9221, max 9222
min 9224, max 9728
min 9223, max 9223
min 9729, max 9730
min 9731, max 9732
min 9733, max 9734
min 9736, max 10240
min 9735, max 10241
...

Every line represents one update_inode(min, max) call; the last 2 lines show 2
requests that actually overlap while I ran mkfs.ext4 on a sheepdog volume. The
overlapping requests might be reordered on the network and set inode[idx] back to
0 after it was set by the later request. A wrong remove request is then executed
internally by sheep and accidentally removes a victim object, as reported at:

https://bugs.launchpad.net/sheepdog-project/+bug/1456421

The fix is simple: update the inode index for each aio_req individually. Since
aio_reqs never overlap, the inode updates never overlap either.
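
As a standalone illustration (not QEMU code; the index values and the 0xfd32 vdi
id below are made up), the following C sketch replays the scenario above: two
batched range updates whose snapshots overlap are applied out of order, and a
freshly set index is wiped back to 0.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NR_IDX 32

static uint32_t vdi_object[NR_IDX];      /* index table stored by sheep */

/* A batched range update carries the client's snapshot of the table. */
struct range_update {
    uint32_t min, max;
    uint32_t snapshot[NR_IDX];
};

static void apply(const struct range_update *u)
{
    for (uint32_t i = u->min; i <= u->max; i++) {
        vdi_object[i] = u->snapshot[i];
    }
}

int main(void)
{
    uint32_t inode[NR_IDX] = { 0 };      /* client-side copy of the table */
    struct range_update req1 = { .min = 10, .max = 20 };
    struct range_update req2 = { .min = 15, .max = 22 };

    /* req 1: create(10) and create(20) succeeded, batch the range [10,20]. */
    inode[10] = 0xfd32;
    inode[20] = 0xfd32;
    memcpy(req1.snapshot, inode, sizeof(inode));

    /* req 2: create(15) and create(22) succeeded, batch the range [15,22]. */
    inode[15] = 0xfd32;
    inode[22] = 0xfd32;
    memcpy(req2.snapshot, inode, sizeof(inode));

    /* The network delivers the two in-flight updates in reverse order. */
    apply(&req2);
    apply(&req1);    /* req 1's snapshot still has inode[15] == 0 */

    printf("index 15 on the vdi object: %u\n", vdi_object[15]);
    /* Prints 0: the index set by req 2 is wiped by the stale overlapping
     * range of req 1, which later triggers a bogus remove inside sheep. */
    return 0;
}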

Cc: Jeff Cody <jcody@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Teruaki Ishizaki <ishizaki.teruaki@lab.ntt.co.jp>
Cc: Vasiliy Tolstov <v.tolstov@selfip.ru>
Cc: Hitoshi Mitake <mitake.hitoshi@lab.ntt.co.jp>
Signed-off-by: Liu Yuan <liuyuan@cmss.chinamobile.com>
---
 block/sheepdog.c | 77 ++++++++++++++++----------------------------------------
 1 file changed, 22 insertions(+), 55 deletions(-)

Comments

Hitoshi Mitake July 29, 2015, 5:04 a.m. UTC | #1
At Wed, 29 Jul 2015 12:02:35 +0800,
Liu Yuan wrote:
> 
> From: Liu Yuan <liuyuan@cmss.chinamobile.com>
> 
> Current sheepdog driver use a range update_inode(min_idx, max_idx) for batching
> the updates. But there is subtle problem by determining min_idx and max_idx:
> 
> for a single create request, min_idx == max_idx, so actually we just update one
> one bit as expected.
> 
> Suppose we have 2 create request, create(10) and create(20), then min == 10,
> max==20 even though we just need to update index 10 and index 20, update_inode(10,20)
> will actually update range from 10 to 20. This would work if all the update_inode()
> requests won't overlap. But unfortunately, this is not true for some corner case.
> So the problem arise as following:
> 
> req 1: update_inode(10,20)
> req 2: update_inode(15,22)
> 
> req 1 and req 2 might have different value between [15,20] and cause problems
> and can be illustrated as following by adding a printf in sd_write_done:
> 
> @@ -1976,6 +1976,7 @@ static void coroutine_fn sd_write_done(SheepdogAIOCB *acb)
> 
>      mn = s->min_dirty_data_idx;
>      mx = s->max_dirty_data_idx;
> +    printf("min %u, max %u\n", mn, mx);
>      if (mn <= mx) {
>          /* we need to update the vdi object. */
>          offset = sizeof(s->inode) - sizeof(s->inode.data_vdi_id) +
> 
> ...
> min 4294967295, max 0
> min 9221, max 9222
> min 9224, max 9728
> min 9223, max 9223
> min 9729, max 9730
> min 9731, max 9732
> min 9733, max 9734
> min 9736, max 10240
> min 9735, max 10241
> ...
> 
> Every line represents a update_inode(min, max) last 2 lines show 2 requests
> actually overlap while I ran mkfs.ext4 on a sheepdog volume. The overlapped
> requests might be reordered via network and cause inode[idx] back to 0 after
> being set by last request. Then a wrong remove request will be executed by sheep
> internally and accidentally remove a victim object, which is reported at:
> 
> https://bugs.launchpad.net/sheepdog-project/+bug/1456421
> 
> The fix is simple that we just update inode one by one for aio_req. Since
> aio_req is never overlapped, we'll have inode update never overlapped.

This patch increases the number of inode update requests compared to the existing
approach:
current:    (max - min + 1) * data object creation + 1 inode update
this patch: (max - min + 1) * data object creation + (max - min + 1) * inode update
The increased number means more network and disk I/O. Even if the overlapping
requests can be processed in parallel, the overhead seems to be large. It has an
impact especially on disk-I/O-heavy workloads.
I don't have benchmark results to compare, but it is not obvious that the
approach of this patch is better.
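To put hypothetical numbers on that count: a write that allocates indexes 100
through 110 issues 11 object creations; the current code then adds a single inode
update (12 requests in total), while this patch adds 11 inode updates (22 requests
in total).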

BTW, the sheepdog project was already forked, so why don't you fork the block
driver, too?

Thanks,
Hitoshi

> 
> Cc: Jeff Cody <jcody@redhat.com>
> Cc: Kevin Wolf <kwolf@redhat.com>
> Cc: Stefan Hajnoczi <stefanha@redhat.com>
> Cc: Teruaki Ishizaki <ishizaki.teruaki@lab.ntt.co.jp>
> Cc: Vasiliy Tolstov <v.tolstov@selfip.ru>
> Cc: Hitoshi Mitake <mitake.hitoshi@lab.ntt.co.jp>
> Signed-off-by: Liu Yuan <liuyuan@cmss.chinamobile.com>
> ---
>  block/sheepdog.c | 77 ++++++++++++++++----------------------------------------
>  1 file changed, 22 insertions(+), 55 deletions(-)
> 
> diff --git a/block/sheepdog.c b/block/sheepdog.c
> index bd7cbed..d1eeb81 100644
> --- a/block/sheepdog.c
> +++ b/block/sheepdog.c
> @@ -342,9 +342,6 @@ typedef struct BDRVSheepdogState {
>  
>      SheepdogInode inode;
>  
> -    uint32_t min_dirty_data_idx;
> -    uint32_t max_dirty_data_idx;
> -
>      char name[SD_MAX_VDI_LEN];
>      bool is_snapshot;
>      uint32_t cache_flags;
> @@ -782,6 +779,26 @@ static coroutine_fn void reconnect_to_sdog(void *opaque)
>      }
>  }
>  
> +static void  update_inode(BDRVSheepdogState *s, AIOReq *aio_req)
> +{
> +    struct iovec iov;
> +    uint32_t offset, data_len;
> +    SheepdogAIOCB *acb = aio_req->aiocb;
> +    int idx = data_oid_to_idx(aio_req->oid);
> +
> +    offset = SD_INODE_HEADER_SIZE + sizeof(uint32_t) * idx;
> +    data_len = sizeof(uint32_t);
> +
> +    iov.iov_base = &s->inode;
> +    iov.iov_len = sizeof(s->inode);
> +    aio_req = alloc_aio_req(s, acb, vid_to_vdi_oid(s->inode.vdi_id),
> +                            data_len, offset, 0, false, 0, offset);
> +    QLIST_INSERT_HEAD(&s->inflight_aio_head, aio_req, aio_siblings);
> +    add_aio_request(s, aio_req, &iov, 1, AIOCB_WRITE_UDATA);
> +
> +    return;
> +}
> +
>  /*
>   * Receive responses of the I/O requests.
>   *
> @@ -820,25 +837,15 @@ static void coroutine_fn aio_read_response(void *opaque)
>  
>      switch (acb->aiocb_type) {
>      case AIOCB_WRITE_UDATA:
> -        /* this coroutine context is no longer suitable for co_recv
> -         * because we may send data to update vdi objects */
> -        s->co_recv = NULL;
>          if (!is_data_obj(aio_req->oid)) {
>              break;
>          }
>          idx = data_oid_to_idx(aio_req->oid);
>  
>          if (aio_req->create) {
> -            /*
> -             * If the object is newly created one, we need to update
> -             * the vdi object (metadata object).  min_dirty_data_idx
> -             * and max_dirty_data_idx are changed to include updated
> -             * index between them.
> -             */
>              if (rsp.result == SD_RES_SUCCESS) {
>                  s->inode.data_vdi_id[idx] = s->inode.vdi_id;
> -                s->max_dirty_data_idx = MAX(idx, s->max_dirty_data_idx);
> -                s->min_dirty_data_idx = MIN(idx, s->min_dirty_data_idx);
> +                update_inode(s, aio_req);
>              }
>              /*
>               * Some requests may be blocked because simultaneous
> @@ -1518,8 +1525,6 @@ static int sd_open(BlockDriverState *bs, QDict *options, int flags,
>      }
>  
>      memcpy(&s->inode, buf, sizeof(s->inode));
> -    s->min_dirty_data_idx = UINT32_MAX;
> -    s->max_dirty_data_idx = 0;
>  
>      bs->total_sectors = s->inode.vdi_size / BDRV_SECTOR_SIZE;
>      pstrcpy(s->name, sizeof(s->name), vdi);
> @@ -1962,44 +1967,6 @@ static int sd_truncate(BlockDriverState *bs, int64_t offset)
>      return ret;
>  }
>  
> -/*
> - * This function is called after writing data objects.  If we need to
> - * update metadata, this sends a write request to the vdi object.
> - * Otherwise, this switches back to sd_co_readv/writev.
> - */
> -static void coroutine_fn sd_write_done(SheepdogAIOCB *acb)
> -{
> -    BDRVSheepdogState *s = acb->common.bs->opaque;
> -    struct iovec iov;
> -    AIOReq *aio_req;
> -    uint32_t offset, data_len, mn, mx;
> -
> -    mn = s->min_dirty_data_idx;
> -    mx = s->max_dirty_data_idx;
> -    if (mn <= mx) {
> -        /* we need to update the vdi object. */
> -        offset = sizeof(s->inode) - sizeof(s->inode.data_vdi_id) +
> -            mn * sizeof(s->inode.data_vdi_id[0]);
> -        data_len = (mx - mn + 1) * sizeof(s->inode.data_vdi_id[0]);
> -
> -        s->min_dirty_data_idx = UINT32_MAX;
> -        s->max_dirty_data_idx = 0;
> -
> -        iov.iov_base = &s->inode;
> -        iov.iov_len = sizeof(s->inode);
> -        aio_req = alloc_aio_req(s, acb, vid_to_vdi_oid(s->inode.vdi_id),
> -                                data_len, offset, 0, false, 0, offset);
> -        QLIST_INSERT_HEAD(&s->inflight_aio_head, aio_req, aio_siblings);
> -        add_aio_request(s, aio_req, &iov, 1, AIOCB_WRITE_UDATA);
> -
> -        acb->aio_done_func = sd_finish_aiocb;
> -        acb->aiocb_type = AIOCB_WRITE_UDATA;
> -        return;
> -    }
> -
> -    sd_finish_aiocb(acb);
> -}
> -
>  /* Delete current working VDI on the snapshot chain */
>  static bool sd_delete(BDRVSheepdogState *s)
>  {
> @@ -2231,7 +2198,7 @@ static coroutine_fn int sd_co_writev(BlockDriverState *bs, int64_t sector_num,
>      }
>  
>      acb = sd_aio_setup(bs, qiov, sector_num, nb_sectors);
> -    acb->aio_done_func = sd_write_done;
> +    acb->aio_done_func = sd_finish_aiocb;
>      acb->aiocb_type = AIOCB_WRITE_UDATA;
>  
>      ret = sd_co_rw_vector(acb);
> -- 
> 1.9.1
> 
> -- 
> sheepdog mailing list
> sheepdog@lists.wpkg.org
> https://lists.wpkg.org/mailman/listinfo/sheepdog
Liu Yuan July 29, 2015, 9:31 a.m. UTC | #2
On Wed, Jul 29, 2015 at 02:04:55PM +0900, Hitoshi Mitake wrote:
> At Wed, 29 Jul 2015 12:02:35 +0800,
> Liu Yuan wrote:
> > 
> > From: Liu Yuan <liuyuan@cmss.chinamobile.com>
> > 
> > Current sheepdog driver use a range update_inode(min_idx, max_idx) for batching
> > the updates. But there is subtle problem by determining min_idx and max_idx:
> > 
> > for a single create request, min_idx == max_idx, so actually we just update one
> > one bit as expected.
> > 
> > Suppose we have 2 create request, create(10) and create(20), then min == 10,
> > max==20 even though we just need to update index 10 and index 20, update_inode(10,20)
> > will actually update range from 10 to 20. This would work if all the update_inode()
> > requests won't overlap. But unfortunately, this is not true for some corner case.
> > So the problem arise as following:
> > 
> > req 1: update_inode(10,20)
> > req 2: update_inode(15,22)
> > 
> > req 1 and req 2 might have different value between [15,20] and cause problems
> > and can be illustrated as following by adding a printf in sd_write_done:
> > 
> > @@ -1976,6 +1976,7 @@ static void coroutine_fn sd_write_done(SheepdogAIOCB *acb)
> > 
> >      mn = s->min_dirty_data_idx;
> >      mx = s->max_dirty_data_idx;
> > +    printf("min %u, max %u\n", mn, mx);
> >      if (mn <= mx) {
> >          /* we need to update the vdi object. */
> >          offset = sizeof(s->inode) - sizeof(s->inode.data_vdi_id) +
> > 
> > ...
> > min 4294967295, max 0
> > min 9221, max 9222
> > min 9224, max 9728
> > min 9223, max 9223
> > min 9729, max 9730
> > min 9731, max 9732
> > min 9733, max 9734
> > min 9736, max 10240
> > min 9735, max 10241
> > ...
> > 
> > Every line represents a update_inode(min, max) last 2 lines show 2 requests
> > actually overlap while I ran mkfs.ext4 on a sheepdog volume. The overlapped
> > requests might be reordered via network and cause inode[idx] back to 0 after
> > being set by last request. Then a wrong remove request will be executed by sheep
> > internally and accidentally remove a victim object, which is reported at:
> > 
> > https://bugs.launchpad.net/sheepdog-project/+bug/1456421
> > 
> > The fix is simple that we just update inode one by one for aio_req. Since
> > aio_req is never overlapped, we'll have inode update never overlapped.
> 
> This patch increase a number of indoe update request than existing approach.
> current: (max - min + 1) * data object creation + 1 inode update
> this patch: (max - min + 1) * data object creation + (max - min + 1) * inode update
> The increased number means increased number of network + disk
> I/O. Even the overwrapping requests can be processed in parallel, the
> overhead seems to be large. It has an impact especially in a case of
> disk I/O heavy workload.
> I don't have a comparison of benchmark result, but it is not obvious
> that the approach of this patch is better.

Technically, it won't affect performance, because index updates are not really
ranges but concrete indexes in terms of the underlying 4M block size. Only 2 or 3
indexes in a range actually get updated, and 90+% of updates touch just 1. So if
2 updates stride a large range, it actually worsens sheepdog's performance,
because many additional object unrefs will be executed inside sheep.
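(As a rough illustration, assuming the default 4M object size: a contiguous 1M
guest write touches at most two data objects, so a single request rarely creates
more than one or two new indexes, while a ranged update such as the (9224, 9728)
case quoted above rewrites hundreds of untouched slots in between.)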

This is not a performance problem but a question of the right fix. Even with your
patch, inode updates can overlap; you just don't allow overlapping requests to go
to sheepdog, which is an overkill approach. IMHO, we should only adjust the code
to avoid the overlapping inode updates, which can be done easily and incrementally
on top of the old code, rather than take on a completely new, untested, overkill
mechanism. So what do we get from your patch? Covering up the problem and locking
every request?

Your patch actually fixes nothing; it just covers up the problem by slowing down
the requests, and even with your patch the problem still exists because inode
updates can overlap. Your commit log doesn't explain what the real problem is or
why your fix works. This is not your toy project where you can commit whatever
you want.

> BTW, sheepdog project was already forked, why don't you fork the block
> driver, too?

What makes you think you own the block driver?

We forked the sheepdog project partly because of its low code quality and mostly
because some company is trying to make it a private project. It is not as open
source friendly as before, and that is the main reason Kazutaka and I chose to
fork the sheepdog project. But this doesn't mean we need to fork the QEMU project;
it is an open source project and not your home toy.

Kazutaka and I have been the biggest contributors to both sheepdog and the QEMU
sheepdog block driver for years, so I think I am eligible to review the patch and
responsible for suggesting the right fix. If you are pissed off when someone else
has other opinions, you can just fork the code and play with it at home, or you
can follow the rules of an open source project.

Yuan
Vasiliy Tolstov July 30, 2015, 6:41 a.m. UTC | #3
2015-07-29 12:31 GMT+03:00 Liu Yuan <namei.unix@gmail.com>:
> Technically, it won't affect the performance because index updates are not range
> but concrete in terms of underlying 4M block size. Only 2 or 3 indexes in a
> range will be updated and 90+% updates will be only 1. So if 2 updates stride a
> large range, it will actually worse the performance of sheepdog because many
> additional unref of object will be executed by sheep internally.
>
> It is not a performance problem but more the right fix. Even with your patch,
> updates of inode can overlap. You just don't allow overlapped requests go to
> sheepdog, which is a overkill approach. IMHO, we should only adjust to avoid
> the overlapped inode updates, which can be done easily and incrementally on top
> of old code, rather than take on a complete new untested overkill mechanism. So
> what we get from your patch? Covering the problem and lock every requests?
>
> Your patch actually fix nothing but just cover the problem by slowing down the
> request and even with your patch, the problem still exists because inode updates
> can overlap. Your commit log doesn't explain what is the real problem and why
> your fix works. This is not your toy project that can commit whatever you want.
>
>> BTW, sheepdog project was already forked, why don't you fork the block
>> driver, too?
>
> What makes you think you own the block driver?
>
> We forked the sheepdog project because it is low quality of code partly and mostly
> some company tries to make it a private project. It is not as open source friendly
> as before and that is the main reason Kazutaka and I chose to fork the sheepdog
> project. But this doesn't mean we need to fork the QEMU project, it is an
> open source project and not your home toy.
>
> Kazutaka and I are the biggest contributers of both sheepdog and QEMU sheepdog
> block driver for years, so I think I am eligible to review the patch and
> responsible to suggest the right fix. If you are pissed off when someone else
> have other opinions, you can just fork the code and play with it at home or you
> follow the rule of open source project.


Jeff Cody, please be the judge. The patch from Hitoshi solved the problem that I
emailed to the sheepdog list (I have a test environment with 8 hosts, 6 SSD disks
on each, and InfiniBand interconnect between the hosts). Before Hitoshi's patch,
massive writing to sheepdog storage broke the file system and corrupted it.
After the patch I don't see the issue.
Liu Yuan July 30, 2015, 9:13 a.m. UTC | #4
On Thu, Jul 30, 2015 at 09:41:08AM +0300, Vasiliy Tolstov wrote:
> 2015-07-29 12:31 GMT+03:00 Liu Yuan <namei.unix@gmail.com>:
> > Technically, it won't affect the performance because index updates are not range
> > but concrete in terms of underlying 4M block size. Only 2 or 3 indexes in a
> > range will be updated and 90+% updates will be only 1. So if 2 updates stride a
> > large range, it will actually worse the performance of sheepdog because many
> > additional unref of object will be executed by sheep internally.
> >
> > It is not a performance problem but more the right fix. Even with your patch,
> > updates of inode can overlap. You just don't allow overlapped requests go to
> > sheepdog, which is a overkill approach. IMHO, we should only adjust to avoid
> > the overlapped inode updates, which can be done easily and incrementally on top
> > of old code, rather than take on a complete new untested overkill mechanism. So
> > what we get from your patch? Covering the problem and lock every requests?
> >
> > Your patch actually fix nothing but just cover the problem by slowing down the
> > request and even with your patch, the problem still exists because inode updates
> > can overlap. Your commit log doesn't explain what is the real problem and why
> > your fix works. This is not your toy project that can commit whatever you want.
> >
> >> BTW, sheepdog project was already forked, why don't you fork the block
> >> driver, too?
> >
> > What makes you think you own the block driver?
> >
> > We forked the sheepdog project because it is low quality of code partly and mostly
> > some company tries to make it a private project. It is not as open source friendly
> > as before and that is the main reason Kazutaka and I chose to fork the sheepdog
> > project. But this doesn't mean we need to fork the QEMU project, it is an
> > open source project and not your home toy.
> >
> > Kazutaka and I are the biggest contributers of both sheepdog and QEMU sheepdog
> > block driver for years, so I think I am eligible to review the patch and
> > responsible to suggest the right fix. If you are pissed off when someone else
> > have other opinions, you can just fork the code and play with it at home or you
> > follow the rule of open source project.
> 
> 
> Jeff Cody, please be the judge, patch from Hitoshi solved my problem
> that i emailed in sheepdog list (i have test environment with 8 hosts
> on each 6 SSD disks and infiniband interconnect between hosts) before
> Hitoshi patch, massive writing to sheepdog storage breaks file system
> and corrupt it.
> After the patch i don't see issues.

Hi Vasiliy,

Did you test my patch? I think both patches can make the problem go away. The
tricky part is deciding which one is the right fix. This is quite technical,
because sometimes a problem disappears not because it was fixed but because it
was covered up.

If you have trouble applying the patch, feel free to mail me and I'll package a
patched QEMU tree for you.

Thanks,
Yuan
Vasiliy Tolstov July 30, 2015, 9:29 a.m. UTC | #5
2015-07-30 12:13 GMT+03:00 Liu Yuan <namei.unix@gmail.com>:
> Hi Vasiliy,
>
> Did you test my patch? I think both patch can make the problem gone. The
> trick is that which way is the right fix. This is quit technical because
> sometimes the problem is gone not because being fixed but is covered out.
>
> If you have problem with applying the patch, feel free to mail me and I'll
> package a patched QEMU tree for you.


OK, I'll check your patch too. Please wait.
Jeff Cody July 30, 2015, 1:27 p.m. UTC | #6
On Thu, Jul 30, 2015 at 09:41:08AM +0300, Vasiliy Tolstov wrote:
> 2015-07-29 12:31 GMT+03:00 Liu Yuan <namei.unix@gmail.com>:
> > Technically, it won't affect the performance because index updates are not range
> > but concrete in terms of underlying 4M block size. Only 2 or 3 indexes in a
> > range will be updated and 90+% updates will be only 1. So if 2 updates stride a
> > large range, it will actually worse the performance of sheepdog because many
> > additional unref of object will be executed by sheep internally.
> >
> > It is not a performance problem but more the right fix. Even with your patch,
> > updates of inode can overlap. You just don't allow overlapped requests go to
> > sheepdog, which is a overkill approach. IMHO, we should only adjust to avoid
> > the overlapped inode updates, which can be done easily and incrementally on top
> > of old code, rather than take on a complete new untested overkill mechanism. So
> > what we get from your patch? Covering the problem and lock every requests?
> >
> > Your patch actually fix nothing but just cover the problem by slowing down the
> > request and even with your patch, the problem still exists because inode updates
> > can overlap. Your commit log doesn't explain what is the real problem and why
> > your fix works. This is not your toy project that can commit whatever you want.
> >
> >> BTW, sheepdog project was already forked, why don't you fork the block
> >> driver, too?
> >
> > What makes you think you own the block driver?
> >
> > We forked the sheepdog project because it is low quality of code partly and mostly
> > some company tries to make it a private project. It is not as open source friendly
> > as before and that is the main reason Kazutaka and I chose to fork the sheepdog
> > project. But this doesn't mean we need to fork the QEMU project, it is an
> > open source project and not your home toy.
> >
> > Kazutaka and I are the biggest contributers of both sheepdog and QEMU sheepdog
> > block driver for years, so I think I am eligible to review the patch and
> > responsible to suggest the right fix. If you are pissed off when someone else
> > have other opinions, you can just fork the code and play with it at home or you
> > follow the rule of open source project.
> 
> 
> Jeff Cody, please be the judge, patch from Hitoshi solved my problem
> that i emailed in sheepdog list (i have test environment with 8 hosts
> on each 6 SSD disks and infiniband interconnect between hosts) before
> Hitoshi patch, massive writing to sheepdog storage breaks file system
> and corrupt it.
> After the patch i don't see issues.
>

I'd rather see some sort of consensus amongst Liu, Hitoshi, yourself, or
others more intimately familiar with sheepdog.

Right now, we have Hitoshi's patch in the main git repo, slated for the
2.4 release (which is Monday).  It sounds, from Liu's email, as though this
may not fix the root cause.

Vasiliy said he would test Liu's patch; if he can confirm that this new
patch fixes the issue, then I would be inclined to use Liu's patch, based on
the detailed analysis of the issue in the commit message.

Thanks,
Jeff
Vasiliy Tolstov July 31, 2015, 11:55 a.m. UTC | #7
2015-07-30 16:27 GMT+03:00 Jeff Cody <jcody@redhat.com>:
> I'd rather see some sort consensus amongst Liu, Hitoshi, yourself, or
> others more intimately familiar with sheepdog.
>
> Right now, we have Hitoshi's patch in the main git repo, slated for
> 2.4 release (which is Monday).  It sounds, from Liu's email, as this
> may not fix the root cause.
>
> Vasiliy said he would test Liu's patch; if he can confirm this new
> patch fix, then I would be inclined to use Liu's patch, based on the
> detailed analysis of the issue in the commit message.

Liu's patch also works for me. But, just like with Hitoshi's patch, it breaks
when using discards in qemu =(.
Vasiliy Tolstov July 31, 2015, 12:08 p.m. UTC | #8
2015-07-31 14:55 GMT+03:00 Vasiliy Tolstov <v.tolstov@selfip.ru>:
> Liu's patch also works for me. But also like in Hitoshi patch breaks
> when using discards in qemu =(.


Please wait for the performance comparison. As far as I can see, Liu's patch may
be slower than Hitoshi's.
Liu Yuan Aug. 2, 2015, 2:06 a.m. UTC | #9
On Fri, Jul 31, 2015 at 03:08:09PM +0300, Vasiliy Tolstov wrote:
> 2015-07-31 14:55 GMT+03:00 Vasiliy Tolstov <v.tolstov@selfip.ru>:
> > Liu's patch also works for me. But also like in Hitoshi patch breaks
> > when using discards in qemu =(.
> 
> 
> Please wait to performance comparison. As i see Liu's patch may be
> more slow then Hitoshi.
> 

Thanks for your time! Well, as far as I know, my patch should be slightly better
performance-wise because it preserves the parallelism of requests. Given the
scatter-gather characteristics of I/O requests, we can assume the following I/O
pattern as an illustration:

req1 is split into 2 sheep reqs:
create(2), create(10)

req2 is split into 2 sheep reqs:
create(5), create(100)

So there are 4 sheep requests in total, and with my patch they will be run in
parallel by the sheep cluster, with only 4 object unrefs executed internally:
update_inode(2), update_inode(10), update_inode(5), update_inode(100)

With Hitoshi's patch, however, req1 and req2 will be serialized: only when one
request has finished will the other be sent to sheep, and 9 + 96 = 105 object
unrefs will be executed internally (9 indexes in [2,10] plus 96 in [5,100]).
There is still a chance of data corruption, because update_inode(2,10) and
update_inode(5,100) both update the range [5,10], which is a potential problem
if the overlapping range carries different values when the requests are queued
with stale data.

This is really a years-old bug: we should update exactly the inode bits for the
objects we create, not bits we don't touch at all. The bug wasn't revealed for a
long time because, most of the time, min == max in update_inode(min, max), and
before generation reference counting was introduced into the snapshot reference
mechanism, updating an inode bit with 0 wouldn't cause a remove request in
sheepdog.

I'm also concerned about the completely new mechanism, since the current request
handling mechanism has proven solid over time; it has existed for years. The
completely new code might need a long time to stabilize, and we may have to fix
side effects we don't know about yet.

Thanks,
Yuan
Vasiliy Tolstov Aug. 2, 2015, 11:52 a.m. UTC | #10
2015-07-31 15:08 GMT+03:00 Vasiliy Tolstov <v.tolstov@selfip.ru>:
> Please wait to performance comparison. As i see Liu's patch may be
> more slow then Hitoshi.


I switched to the local cluster driver to test only the local SSD, without
network overhead.

But now, with Liu's qemu, the VPS cannot be completely installed:
sheep runs as:
/usr/sbin/sheep --log level=debug format=server dir=/var/log/
dst=default --bindaddr 0.0.0.0 --cluster local --directio --upgrade
--myaddr 192.168.240.134 --pidfile /var/run/sheepdog.pid
/var/lib/sheepdog/sd_meta
/var/lib/sheepdog/vg0-sd_data,/var/lib/sheepdog/vg1-sd_data,/var/lib/sheepdog/vg2-sd_data,/var/lib/sheepdog/vg3-sd_data,/var/lib/sheepdog/vg4-sd_data,/var/lib/sheepdog/vg5-sd_data

qemu runs as:
qemu-system-x86_64 -enable-kvm -name 30681 -S -machine
pc-i440fx-1.7,accel=kvm,usb=off -m 1024 -realtime
mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid
49771611-5e45-70c9-1a78-00000043bf91 -no-user-config -nodefaults
-chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/30681.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
-no-shutdown -boot strict=on -device
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device
virtio-scsi-pci,id=scsi0,num_queues=1,bus=pci.0,addr=0x4 -device
virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive
file=sheepdog:30681,if=none,id=drive-scsi0-0-0-0,format=raw,cache=none,discard=ignore,aio=native
-device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
-drive if=none,id=drive-scsi0-0-1-0,readonly=on,format=raw -device
scsi-cd,bus=scsi0.0,channel=0,scsi-id=1,lun=0,drive=drive-scsi0-0-1-0,id=scsi0-0-1-0
-netdev tap,fd=37,id=hostnet0,vhost=on,vhostfd=39 -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:00:42:8d,bus=pci.0,addr=0x3,rombar=0
-chardev pty,id=charserial0 -device
isa-serial,chardev=charserial0,id=serial0 -chardev
socket,id=charchannel0,path=/var/lib/libvirt/qemu/30681.agent,server,nowait
-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0
-device usb-mouse,id=input0 -device usb-kbd,id=input1 -vnc
[::]:0,password -device VGA,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -object
rng-random,id=objrng0,filename=/dev/random -device
virtio-rng-pci,rng=objrng0,id=rng0,max-bytes=1024,period=2000,bus=pci.0,addr=0x7
-msg timestamp=on

sheep.log:
https://gist.github.com/raw/31201454b0afe1b86d00

dmesg:
[Sun Aug  2 14:48:45 2015] qemu-system-x86[6894]: segfault at 401364
ip 00007f212f060ff8 sp 00007f213efe5f40 error 4 in
qemu-system-x86_64[7f212ecbc000+518000]

fio job file:
[randrw]
blocksize=4k
filename=/dev/sdb
rw=randrw
direct=1
buffered=0
ioengine=libaio
iodepth=32
group_reporting
numjobs=10
runtime=40

For a clean test I create a new vdi and attach it to a running VM: qemu
cache=none, io=native, discard=ignore
Hitoshi patch:
randrw: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
...
fio-2.1.11
Starting 10 processes
Jobs: 10 (f=10): [m(10)] [100.0% done] [13414KB/12291KB/0KB /s]
[3353/3072/0 iops] [eta 00m:00s]
randrw: (groupid=0, jobs=10): err= 0: pid=1318: Sun Aug  2 11:51:25 2015
  read : io=368004KB, bw=9191.9KB/s, iops=2297, runt= 40036msec
    slat (usec): min=1, max=408908, avg=1316.35, stdev=10057.52
    clat (usec): min=229, max=972302, avg=67203.98, stdev=65126.08
     lat (usec): min=294, max=972306, avg=68520.61, stdev=66741.02
    clat percentiles (msec):
     |  1.00th=[    4],  5.00th=[   16], 10.00th=[   30], 20.00th=[   36],
     | 30.00th=[   40], 40.00th=[   44], 50.00th=[   48], 60.00th=[   52],
     | 70.00th=[   59], 80.00th=[   85], 90.00th=[  128], 95.00th=[  186],
     | 99.00th=[  367], 99.50th=[  441], 99.90th=[  570], 99.95th=[  619],
     | 99.99th=[  799]
    bw (KB  /s): min=    6, max= 1824, per=10.19%, avg=936.59, stdev=434.25
  write: io=366964KB, bw=9165.9KB/s, iops=2291, runt= 40036msec
    slat (usec): min=1, max=338238, avg=1320.61, stdev=9586.81
    clat (usec): min=912, max=972334, avg=69411.58, stdev=64715.54
     lat (usec): min=920, max=972338, avg=70732.51, stdev=66193.96
    clat percentiles (msec):
     |  1.00th=[    8],  5.00th=[   30], 10.00th=[   33], 20.00th=[   38],
     | 30.00th=[   42], 40.00th=[   45], 50.00th=[   49], 60.00th=[   53],
     | 70.00th=[   60], 80.00th=[   86], 90.00th=[  128], 95.00th=[  184],
     | 99.00th=[  379], 99.50th=[  441], 99.90th=[  570], 99.95th=[  635],
     | 99.99th=[  824]
    bw (KB  /s): min=    7, max= 1771, per=10.21%, avg=935.65, stdev=439.55
    lat (usec) : 250=0.01%, 500=0.02%, 750=0.05%, 1000=0.04%
    lat (msec) : 2=0.20%, 4=0.57%, 10=1.53%, 20=1.78%, 50=50.50%
    lat (msec) : 100=28.04%, 250=14.56%, 500=2.47%, 750=0.23%, 1000=0.03%
  cpu          : usr=0.17%, sys=0.26%, ctx=16905, majf=0, minf=78
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.8%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=92001/w=91741/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: io=368004KB, aggrb=9191KB/s, minb=9191KB/s, maxb=9191KB/s,
mint=40036msec, maxt=40036msec
  WRITE: io=366964KB, aggrb=9165KB/s, minb=9165KB/s, maxb=9165KB/s,
mint=40036msec, maxt=40036msec

Disk stats (read/write):
  sdb: ios=91739/91385, merge=55/0, ticks=2854936/3040936,
in_queue=5901740, util=99.82%
Vasiliy Tolstov Aug. 2, 2015, 12:07 p.m. UTC | #11
For comparison, fio results from the host system with the file placed on the SSD
(ext4, nodiscard, trimmed before the run):
randrw: (g=0): rw=randrw, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32
2.0.8
Starting 10 processes
Jobs: 10 (f=10): [mmmmmmmmmm] [100.0% done] [50892K/49944K /s]
[12.8K/12.5K iops] [eta 00m:00s]
randrw: (groupid=0, jobs=10): err= 0: pid=7439
  read : io=739852KB, bw=18491KB/s, iops=4622 , runt= 40012msec
    slat (usec): min=2 , max=12573 , avg=308.09, stdev=955.59
    clat (usec): min=137 , max=107321 , avg=12750.90, stdev=7865.72
     lat (usec): min=167 , max=107364 , avg=13059.31, stdev=7932.79
    clat percentiles (usec):
     |  1.00th=[ 2512],  5.00th=[ 4192], 10.00th=[ 5280], 20.00th=[ 6816],
     | 30.00th=[ 8096], 40.00th=[ 9408], 50.00th=[10816], 60.00th=[12480],
     | 70.00th=[14528], 80.00th=[17536], 90.00th=[22656], 95.00th=[28032],
     | 99.00th=[40192], 99.50th=[45824], 99.90th=[59648], 99.95th=[68096],
     | 99.99th=[94720]
    bw (KB/s)  : min=    0, max= 6679, per=27.96%, avg=5170.37, stdev=665.70
  write: io=737492KB, bw=18432KB/s, iops=4607 , runt= 40012msec
    slat (usec): min=4 , max=12670 , avg=305.68, stdev=945.24
    clat (usec): min=31 , max=107147 , avg=11268.07, stdev=7542.08
     lat (usec): min=72 , max=107181 , avg=11574.10, stdev=7615.99
    clat percentiles (usec):
     |  1.00th=[ 1624],  5.00th=[ 3216], 10.00th=[ 4256], 20.00th=[ 5664],
     | 30.00th=[ 6816], 40.00th=[ 8032], 50.00th=[ 9408], 60.00th=[10944],
     | 70.00th=[12864], 80.00th=[15552], 90.00th=[20608], 95.00th=[26240],
     | 99.00th=[38144], 99.50th=[43264], 99.90th=[55552], 99.95th=[63744],
     | 99.99th=[89600]
    bw (KB/s)  : min=    0, max= 6767, per=27.39%, avg=5047.81, stdev=992.62
    lat (usec) : 50=0.01%, 100=0.01%, 250=0.06%, 500=0.12%, 750=0.06%
    lat (usec) : 1000=0.06%
    lat (msec) : 2=0.74%, 4=5.47%, 10=42.71%, 20=38.26%, 50=12.26%
    lat (msec) : 100=0.25%, 250=0.01%
  cpu          : usr=0.27%, sys=67.69%, ctx=339142, majf=0, minf=70
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=184963/w=184373/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=739852KB, aggrb=18490KB/s, minb=18490KB/s, maxb=18490KB/s,
mint=40012msec, maxt=40012msec
  WRITE: io=737492KB, aggrb=18431KB/s, minb=18431KB/s, maxb=18431KB/s,
mint=40012msec, maxt=40012msec

Disk stats (read/write):
    dm-5: ios=183885/183290, merge=0/0, ticks=1070388/977428,
in_queue=2053044, util=35.35%, aggrios=184963/184383, aggrmerge=0/3,
aggrticks=1072368/978452, aggrin_queue=2052424, aggrutil=35.39%
  sdd: ios=184963/184383, merge=0/3, ticks=1072368/978452,
in_queue=2052424, util=35.39%

2015-08-02 14:52 GMT+03:00 Vasiliy Tolstov <v.tolstov@selfip.ru>:
> 2015-07-31 15:08 GMT+03:00 Vasiliy Tolstov <v.tolstov@selfip.ru>:
>> Please wait to performance comparison. As i see Liu's patch may be
>> more slow then Hitoshi.
>
>
> I'm switch to local cluster driver to test only local ssd and not
> network overhead.
>
> But now Liu's qemu does not able to completely install vps:
> sheep runs as:
> /usr/sbin/sheep --log level=debug format=server dir=/var/log/
> dst=default --bindaddr 0.0.0.0 --cluster local --directio --upgrade
> --myaddr 192.168.240.134 --pidfile /var/run/sheepdog.pid
> /var/lib/sheepdog/sd_meta
> /var/lib/sheepdog/vg0-sd_data,/var/lib/sheepdog/vg1-sd_data,/var/lib/sheepdog/vg2-sd_data,/var/lib/sheepdog/vg3-sd_data,/var/lib/sheepdog/vg4-sd_data,/var/lib/sheepdog/vg5-sd_data
>
> qemu runs as:
> qemu-system-x86_64 -enable-kvm -name 30681 -S -machine
> pc-i440fx-1.7,accel=kvm,usb=off -m 1024 -realtime
> mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid
> 49771611-5e45-70c9-1a78-00000043bf91 -no-user-config -nodefaults
> -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/30681.monitor,server,nowait
> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
> -no-shutdown -boot strict=on -device
> piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device
> virtio-scsi-pci,id=scsi0,num_queues=1,bus=pci.0,addr=0x4 -device
> virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive
> file=sheepdog:30681,if=none,id=drive-scsi0-0-0-0,format=raw,cache=none,discard=ignore,aio=native
> -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1
> -drive if=none,id=drive-scsi0-0-1-0,readonly=on,format=raw -device
> scsi-cd,bus=scsi0.0,channel=0,scsi-id=1,lun=0,drive=drive-scsi0-0-1-0,id=scsi0-0-1-0
> -netdev tap,fd=37,id=hostnet0,vhost=on,vhostfd=39 -device
> virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:00:42:8d,bus=pci.0,addr=0x3,rombar=0
> -chardev pty,id=charserial0 -device
> isa-serial,chardev=charserial0,id=serial0 -chardev
> socket,id=charchannel0,path=/var/lib/libvirt/qemu/30681.agent,server,nowait
> -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0
> -device usb-mouse,id=input0 -device usb-kbd,id=input1 -vnc
> [::]:0,password -device VGA,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2
> -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -object
> rng-random,id=objrng0,filename=/dev/random -device
> virtio-rng-pci,rng=objrng0,id=rng0,max-bytes=1024,period=2000,bus=pci.0,addr=0x7
> -msg timestamp=on
>
> sheep.log:
> https://gist.github.com/raw/31201454b0afe1b86d00
>
> dmesg:
> [Sun Aug  2 14:48:45 2015] qemu-system-x86[6894]: segfault at 401364
> ip 00007f212f060ff8 sp 00007f213efe5f40 error 4 in
> qemu-system-x86_64[7f212ecbc000+518000]
>
> fio job file:
> [randrw]
> blocksize=4k
> filename=/dev/sdb
> rw=randrw
> direct=1
> buffered=0
> ioengine=libaio
> iodepth=32
> group_reporting
> numjobs=10
> runtime=40
>
> For clean test i'm create new vdi and attach it to running vm: qemu
> cache=none, io=native, discard=ignore
> Hitoshi patch:
> randrw: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
> ...
> fio-2.1.11
> Starting 10 processes
> Jobs: 10 (f=10): [m(10)] [100.0% done] [13414KB/12291KB/0KB /s]
> [3353/3072/0 iops] [eta 00m:00s]
> randrw: (groupid=0, jobs=10): err= 0: pid=1318: Sun Aug  2 11:51:25 2015
>   read : io=368004KB, bw=9191.9KB/s, iops=2297, runt= 40036msec
>     slat (usec): min=1, max=408908, avg=1316.35, stdev=10057.52
>     clat (usec): min=229, max=972302, avg=67203.98, stdev=65126.08
>      lat (usec): min=294, max=972306, avg=68520.61, stdev=66741.02
>     clat percentiles (msec):
>      |  1.00th=[    4],  5.00th=[   16], 10.00th=[   30], 20.00th=[   36],
>      | 30.00th=[   40], 40.00th=[   44], 50.00th=[   48], 60.00th=[   52],
>      | 70.00th=[   59], 80.00th=[   85], 90.00th=[  128], 95.00th=[  186],
>      | 99.00th=[  367], 99.50th=[  441], 99.90th=[  570], 99.95th=[  619],
>      | 99.99th=[  799]
>     bw (KB  /s): min=    6, max= 1824, per=10.19%, avg=936.59, stdev=434.25
>   write: io=366964KB, bw=9165.9KB/s, iops=2291, runt= 40036msec
>     slat (usec): min=1, max=338238, avg=1320.61, stdev=9586.81
>     clat (usec): min=912, max=972334, avg=69411.58, stdev=64715.54
>      lat (usec): min=920, max=972338, avg=70732.51, stdev=66193.96
>     clat percentiles (msec):
>      |  1.00th=[    8],  5.00th=[   30], 10.00th=[   33], 20.00th=[   38],
>      | 30.00th=[   42], 40.00th=[   45], 50.00th=[   49], 60.00th=[   53],
>      | 70.00th=[   60], 80.00th=[   86], 90.00th=[  128], 95.00th=[  184],
>      | 99.00th=[  379], 99.50th=[  441], 99.90th=[  570], 99.95th=[  635],
>      | 99.99th=[  824]
>     bw (KB  /s): min=    7, max= 1771, per=10.21%, avg=935.65, stdev=439.55
>     lat (usec) : 250=0.01%, 500=0.02%, 750=0.05%, 1000=0.04%
>     lat (msec) : 2=0.20%, 4=0.57%, 10=1.53%, 20=1.78%, 50=50.50%
>     lat (msec) : 100=28.04%, 250=14.56%, 500=2.47%, 750=0.23%, 1000=0.03%
>   cpu          : usr=0.17%, sys=0.26%, ctx=16905, majf=0, minf=78
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.8%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
>      issued    : total=r=92001/w=91741/d=0, short=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
>    READ: io=368004KB, aggrb=9191KB/s, minb=9191KB/s, maxb=9191KB/s,
> mint=40036msec, maxt=40036msec
>   WRITE: io=366964KB, aggrb=9165KB/s, minb=9165KB/s, maxb=9165KB/s,
> mint=40036msec, maxt=40036msec
>
> Disk stats (read/write):
>   sdb: ios=91739/91385, merge=55/0, ticks=2854936/3040936,
> in_queue=5901740, util=99.82%
>
>
> --
> Vasiliy Tolstov,
> e-mail: v.tolstov@selfip.ru
Liu Yuan Aug. 3, 2015, 12:41 a.m. UTC | #12
On Sun, Aug 02, 2015 at 02:52:08PM +0300, Vasiliy Tolstov wrote:
> 2015-07-31 15:08 GMT+03:00 Vasiliy Tolstov <v.tolstov@selfip.ru>:
> > Please wait to performance comparison. As i see Liu's patch may be
> > more slow then Hitoshi.
> 
> 
> I'm switch to local cluster driver to test only local ssd and not
> network overhead.
> 
> But now Liu's qemu does not able to completely install vps:
> sheep runs as:

What did you mean by 'not able to completely install'? Is it an installation
problem or something else?

My QEMU or my patch? I suggest you test the same QEMU with my patch and with
Hitoshi's patch separately; you know, a different QEMU base might mask the
difference.

Thanks,
Yuan
Liu Yuan Aug. 3, 2015, 2:01 a.m. UTC | #13
On Thu, Jul 30, 2015 at 09:27:44AM -0400, Jeff Cody wrote:
> On Thu, Jul 30, 2015 at 09:41:08AM +0300, Vasiliy Tolstov wrote:
> > 2015-07-29 12:31 GMT+03:00 Liu Yuan <namei.unix@gmail.com>:
> > > Technically, it won't affect the performance because index updates are not range
> > > but concrete in terms of underlying 4M block size. Only 2 or 3 indexes in a
> > > range will be updated and 90+% updates will be only 1. So if 2 updates stride a
> > > large range, it will actually worse the performance of sheepdog because many
> > > additional unref of object will be executed by sheep internally.
> > >
> > > It is not a performance problem but more the right fix. Even with your patch,
> > > updates of inode can overlap. You just don't allow overlapped requests go to
> > > sheepdog, which is a overkill approach. IMHO, we should only adjust to avoid
> > > the overlapped inode updates, which can be done easily and incrementally on top
> > > of old code, rather than take on a complete new untested overkill mechanism. So
> > > what we get from your patch? Covering the problem and lock every requests?
> > >
> > > Your patch actually fix nothing but just cover the problem by slowing down the
> > > request and even with your patch, the problem still exists because inode updates
> > > can overlap. Your commit log doesn't explain what is the real problem and why
> > > your fix works. This is not your toy project that can commit whatever you want.
> > >
> > >> BTW, sheepdog project was already forked, why don't you fork the block
> > >> driver, too?
> > >
> > > What makes you think you own the block driver?
> > >
> > > We forked the sheepdog project because it is low quality of code partly and mostly
> > > some company tries to make it a private project. It is not as open source friendly
> > > as before and that is the main reason Kazutaka and I chose to fork the sheepdog
> > > project. But this doesn't mean we need to fork the QEMU project, it is an
> > > open source project and not your home toy.
> > >
> > > Kazutaka and I are the biggest contributers of both sheepdog and QEMU sheepdog
> > > block driver for years, so I think I am eligible to review the patch and
> > > responsible to suggest the right fix. If you are pissed off when someone else
> > > have other opinions, you can just fork the code and play with it at home or you
> > > follow the rule of open source project.
> > 
> > 
> > Jeff Cody, please be the judge, patch from Hitoshi solved my problem
> > that i emailed in sheepdog list (i have test environment with 8 hosts
> > on each 6 SSD disks and infiniband interconnect between hosts) before
> > Hitoshi patch, massive writing to sheepdog storage breaks file system
> > and corrupt it.
> > After the patch i don't see issues.
> >
> 
> I'd rather see some sort consensus amongst Liu, Hitoshi, yourself, or
> others more intimately familiar with sheepdog.
> 
> Right now, we have Hitoshi's patch in the main git repo, slated for
> 2.4 release (which is Monday).  It sounds, from Liu's email, as this
> may not fix the root cause.
> 
> Vasiliy said he would test Liu's patch; if he can confirm this new
> patch fix, then I would be inclined to use Liu's patch, based on the
> detailed analysis of the issue in the commit message.
> 

This is my performance comparison on top of the latest QEMU, on my laptop with an SSD.

The sheepdog cluster runs with 3 nodes, using '-n' to get the best volume performance.
QEMU command:
qemu-system-x86_64 -m 1024 --enable-kvm \
	-drive file=debian_squeeze_amd64_standard.qcow2,cache=writeback,if=virtio \
	-drive file=sheepdog:test,if=virtio

sheepdog:test is created as 'dog vdi create test 80G'

I test both the mkfs time and the fio write IOPS.

fio.conf:
[global]
ioengine=libaio
direct=1
thread=1
norandommap=1
runtime=60
size=300M
directory=/mnt

[write4k-rand]
stonewall
group_reporting
bs=4k
rw=randwrite
numjobs=8
iodepth=32

Result:
================================================
sheep formatted with -c 2:1 (erasure coding)
       mkfs      fio
Yuan   0.069     4578    
Hitosh 0.071     3722

sheep formatted with -c 2 (replication)
       mkfs      fio
Yuan   0.074     6873  
Hitosh 0.081     6174
================================================

Thanks,
Yuan
Vasiliy Tolstov Aug. 4, 2015, 8:07 a.m. UTC | #14
2015-08-03 3:41 GMT+03:00 Liu Yuan <namei.unix@gmail.com>:
> What did you mean by 'not able to completely install'? It is a installation
> problem or something else?
>

I have a simple test that works as follows:
1) boot a 3.18.x kernel and an initrd with a statically compiled golang binary
2) the binary brings up the network via DHCP
3) it fetches a gzipped raw qemu image over http
4) in parallel it decompresses the gzipped image and writes it to disk

This use case triggered the previous error in qemu (which started this discussion).

> My QEMU or my patch? I guess you can test the same QEMU with my patch and with
> Hitoshi's patch separately. You know, different QEMU base might cover the
> difference.
>

I am testing your patch and Hitoshi's with qemu 2.4-rc0: I compiled two
independent Debian packages, one with your patch and one with Hitoshi's,
installed the qemu with your patch on one node and the one with Hitoshi's on
another, and ran sheep with the local cluster driver.
I don't know how to diagnose the qemu segfault in my case; is it possible to
build qemu with debugging enabled via configure flags and get a more detailed
description?

> Thanks,
> Yuan
Jeff Cody Aug. 5, 2015, 6:58 p.m. UTC | #15
On Tue, Aug 04, 2015 at 11:07:16AM +0300, Vasiliy Tolstov wrote:
> 2015-08-03 3:41 GMT+03:00 Liu Yuan <namei.unix@gmail.com>:
> > What did you mean by 'not able to completely install'? It is a installation
> > problem or something else?
> >
> 
> i have simple test that works as:
> 1) boot 3.18.x kernel and initrd with static compiled golang binary
> 2) binary bringup dhcp network
> 3) fetch from http gzipped raw qemu image
> 3) in parallel decompress gzipped image and write it to disk
> 
> this use-case determine previous error in qemu (that brings this discussion)
> 
> > My QEMU or my patch? I guess you can test the same QEMU with my patch and with
> > Hitoshi's patch separately. You know, different QEMU base might cover the
> > difference.
> >
> 
> I'm test you and Hitoshi patch with qemu 2.4-rc0, i'm compile two
> independent debian packages with you patch and with Hitoshi.
> Install qemu with you patch to one node and with Hitoshi on another.
> Runs sheep with local cluster driver.
> I don't know how to determine qemu segfault in my case, does it
> possible to build qemu with debugging via configure flags ang get more
> detailed descriptions?
>

Hi Vasiliy,

If you run configure with  --disable-strip, it will not strip the
debugging symbols from the binary after the build.  Then, you can run
gdb on qemu, and do a backtrace after you hit the segfault ('bt').
That may shed some light, and is a good place to start.

Thanks,
Jeff
Vasiliy Tolstov Aug. 9, 2015, 2:03 p.m. UTC | #16
2015-08-05 21:58 GMT+03:00 Jeff Cody <jcody@redhat.com>:
> Hi Vasiliy,
>
> If you run configure with  --disable-strip, it will not strip the
> debugging symbols from the binary after the build.  Then, you can run
> gdb on qemu, and do a backtrace after you hit the segfault ('bt').
> That may shed some light, and is a good place to start.


I tried to debug (with --disable-strip), but I start the VPS from libvirt (I
added the -s flag to qemu to start gdb), and by the time I attach to the remote
session qemu has either already died or everything works fine.
Is it possible to tell from this dmesg line what is happening in a qemu binary
with debug symbols?
qemu-system-x86[34046]: segfault at 401364 ip 00007f33f52a1ff8 sp
00007f3401ecad30 error 4 in qemu-system-x86_64[7f33f4efd000+518000]
Stefan Hajnoczi Aug. 10, 2015, 10:44 a.m. UTC | #17
On Sun, Aug 09, 2015 at 05:03:14PM +0300, Vasiliy Tolstov wrote:
> 2015-08-05 21:58 GMT+03:00 Jeff Cody <jcody@redhat.com>:
> > Hi Vasiliy,
> >
> > If you run configure with  --disable-strip, it will not strip the
> > debugging symbols from the binary after the build.  Then, you can run
> > gdb on qemu, and do a backtrace after you hit the segfault ('bt').
> > That may shed some light, and is a good place to start.
> 
> 
> I'm try to debug (disable-strip), but i'm start vps from libvirt (i'm
> add -s flag to qemu, to start gdb), but when i'm attach to remote
> session,qemu aready dies, or all works fine.
> does it possible to determine by this dmesg what happening in qemu
> binary with debug symbols?
> qemu-system-x86[34046]: segfault at 401364 ip 00007f33f52a1ff8 sp
> 00007f3401ecad30 error 4 in qemu-system-x86_64[7f33f4efd000+518000]

-s launches QEMU's gdbstub for *guest* debugging.  It allows you to see
the CPU and memory state inside the guest.  It won't help you debug this
segfault.

Since the problem you encountered is a *QEMU* segfault, you need to use
*host* GDB on the QEMU process instead.

I'm not sure if libvirt sets the RLIMIT_CORE rlimit for the QEMU process
(which could disable coredumps), but if your host uses systemd you may
be able to inspect the coredump with gdb using coredumpctl(1).

Patch

diff --git a/block/sheepdog.c b/block/sheepdog.c
index bd7cbed..d1eeb81 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -342,9 +342,6 @@  typedef struct BDRVSheepdogState {
 
     SheepdogInode inode;
 
-    uint32_t min_dirty_data_idx;
-    uint32_t max_dirty_data_idx;
-
     char name[SD_MAX_VDI_LEN];
     bool is_snapshot;
     uint32_t cache_flags;
@@ -782,6 +779,26 @@  static coroutine_fn void reconnect_to_sdog(void *opaque)
     }
 }
 
+static void  update_inode(BDRVSheepdogState *s, AIOReq *aio_req)
+{
+    struct iovec iov;
+    uint32_t offset, data_len;
+    SheepdogAIOCB *acb = aio_req->aiocb;
+    int idx = data_oid_to_idx(aio_req->oid);
+
+    offset = SD_INODE_HEADER_SIZE + sizeof(uint32_t) * idx;
+    data_len = sizeof(uint32_t);
+
+    iov.iov_base = &s->inode;
+    iov.iov_len = sizeof(s->inode);
+    aio_req = alloc_aio_req(s, acb, vid_to_vdi_oid(s->inode.vdi_id),
+                            data_len, offset, 0, false, 0, offset);
+    QLIST_INSERT_HEAD(&s->inflight_aio_head, aio_req, aio_siblings);
+    add_aio_request(s, aio_req, &iov, 1, AIOCB_WRITE_UDATA);
+
+    return;
+}
+
 /*
  * Receive responses of the I/O requests.
  *
@@ -820,25 +837,15 @@  static void coroutine_fn aio_read_response(void *opaque)
 
     switch (acb->aiocb_type) {
     case AIOCB_WRITE_UDATA:
-        /* this coroutine context is no longer suitable for co_recv
-         * because we may send data to update vdi objects */
-        s->co_recv = NULL;
         if (!is_data_obj(aio_req->oid)) {
             break;
         }
         idx = data_oid_to_idx(aio_req->oid);
 
         if (aio_req->create) {
-            /*
-             * If the object is newly created one, we need to update
-             * the vdi object (metadata object).  min_dirty_data_idx
-             * and max_dirty_data_idx are changed to include updated
-             * index between them.
-             */
             if (rsp.result == SD_RES_SUCCESS) {
                 s->inode.data_vdi_id[idx] = s->inode.vdi_id;
-                s->max_dirty_data_idx = MAX(idx, s->max_dirty_data_idx);
-                s->min_dirty_data_idx = MIN(idx, s->min_dirty_data_idx);
+                update_inode(s, aio_req);
             }
             /*
              * Some requests may be blocked because simultaneous
@@ -1518,8 +1525,6 @@  static int sd_open(BlockDriverState *bs, QDict *options, int flags,
     }
 
     memcpy(&s->inode, buf, sizeof(s->inode));
-    s->min_dirty_data_idx = UINT32_MAX;
-    s->max_dirty_data_idx = 0;
 
     bs->total_sectors = s->inode.vdi_size / BDRV_SECTOR_SIZE;
     pstrcpy(s->name, sizeof(s->name), vdi);
@@ -1962,44 +1967,6 @@  static int sd_truncate(BlockDriverState *bs, int64_t offset)
     return ret;
 }
 
-/*
- * This function is called after writing data objects.  If we need to
- * update metadata, this sends a write request to the vdi object.
- * Otherwise, this switches back to sd_co_readv/writev.
- */
-static void coroutine_fn sd_write_done(SheepdogAIOCB *acb)
-{
-    BDRVSheepdogState *s = acb->common.bs->opaque;
-    struct iovec iov;
-    AIOReq *aio_req;
-    uint32_t offset, data_len, mn, mx;
-
-    mn = s->min_dirty_data_idx;
-    mx = s->max_dirty_data_idx;
-    if (mn <= mx) {
-        /* we need to update the vdi object. */
-        offset = sizeof(s->inode) - sizeof(s->inode.data_vdi_id) +
-            mn * sizeof(s->inode.data_vdi_id[0]);
-        data_len = (mx - mn + 1) * sizeof(s->inode.data_vdi_id[0]);
-
-        s->min_dirty_data_idx = UINT32_MAX;
-        s->max_dirty_data_idx = 0;
-
-        iov.iov_base = &s->inode;
-        iov.iov_len = sizeof(s->inode);
-        aio_req = alloc_aio_req(s, acb, vid_to_vdi_oid(s->inode.vdi_id),
-                                data_len, offset, 0, false, 0, offset);
-        QLIST_INSERT_HEAD(&s->inflight_aio_head, aio_req, aio_siblings);
-        add_aio_request(s, aio_req, &iov, 1, AIOCB_WRITE_UDATA);
-
-        acb->aio_done_func = sd_finish_aiocb;
-        acb->aiocb_type = AIOCB_WRITE_UDATA;
-        return;
-    }
-
-    sd_finish_aiocb(acb);
-}
-
 /* Delete current working VDI on the snapshot chain */
 static bool sd_delete(BDRVSheepdogState *s)
 {
@@ -2231,7 +2198,7 @@  static coroutine_fn int sd_co_writev(BlockDriverState *bs, int64_t sector_num,
     }
 
     acb = sd_aio_setup(bs, qiov, sector_num, nb_sectors);
-    acb->aio_done_func = sd_write_done;
+    acb->aio_done_func = sd_finish_aiocb;
     acb->aiocb_type = AIOCB_WRITE_UDATA;
 
     ret = sd_co_rw_vector(acb);