Patchwork Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble.

login
register
mail settings
Submitter Yoshiaki Tamura
Date Dec. 16, 2010, 7:36 a.m.
Message ID <AANLkTimvsozOXJpwyUYAqWBKLFsY==x8AzCkJ4CapgTg@mail.gmail.com>
Download mbox | patch
Permalink /patch/75723/
State New
Headers show

Comments

Yoshiaki Tamura - Dec. 16, 2010, 7:36 a.m.
2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
>> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
>>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
>>> >> >> last_avail_idx with inuse if there are outstanding emulation.
>>> >> >>
>>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>>> >> >
>>> >> > This changes migration format, so it will break compatibility with
>>> >> > existing drivers. More generally, I think migrating internal
>>> >> > state that is not guest visible is always a mistake
>>> >> > as it ties migration format to an internal implementation
>>> >> > (yes, I know we do this sometimes, but we should at least
>>> >> > try not to add such cases).  I think the right thing to do in this case
>>> >> > is to flush outstanding
>>> >> > work when vm is stopped.  Then, we are guaranteed that inuse is 0.
>>> >> > I sent patches that do this for virtio net and block.
>>> >>
>>> >> Could you give me the link of your patches?  I'd like to test
>>> >> whether they work with Kemari upon failover.  If they do, I'm
>>> >> happy to drop this patch.
>>> >>
>>> >> Yoshi
>>> >
>>> > Look for this:
>>> > stable migration image on a stopped vm
>>> > sent on:
>>> > Wed, 24 Nov 2010 17:52:49 +0200
>>>
>>> Thanks for the info.
>>>
>>> However, The patch series above didn't solve the issue.  In
>>> case of Kemari, inuse is mostly > 0 because it queues the
>>> output, and while last_avail_idx gets incremented
>>> immediately, not sending inuse makes the state inconsistent
>>> between Primary and Secondary.
>>
>> Hmm. Can we simply avoid incrementing last_avail_idx?
>
> I think we can calculate or prepare an internal last_avail_idx,
> and update the external when inuse is decremented.  I'll try
> whether it work w/ w/o Kemari.

Hi Michael,

Could you please take a look at the following patch?

commit 36ee7910059e6b236fe9467a609f5b4aed866912
Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Date:   Thu Dec 16 14:50:54 2010 +0900

    virtio: update last_avail_idx when inuse is decreased.

    Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>




>
>>
>>>  I'm wondering why
>>> last_avail_idx is OK to send but not inuse.
>>
>> last_avail_idx is at some level a mistake, it exposes part of
>> our internal implementation, but it does *also* express
>> a guest observable state.
>>
>> Here's the problem that it solves: just looking at the rings in virtio
>> there is no way to detect that a specific request has already been
>> completed. And the protocol forbids completing the same request twice.
>>
>> Our implementation always starts processing the requests
>> in order, and since we flush outstanding requests
>> before save, it works to just tell the remote 'process only requests
>> after this place'.
>>
>> But there's no such requirement in the virtio protocol,
>> so to be really generic we could add a bitmask of valid avail
>> ring entries that did not complete yet. This would be
>> the exact representation of the guest observable state.
>> In practice we have rings of up to 512 entries.
>> That's 64 byte per ring, not a lot at all.
>>
>> However, if we ever do change the protocol to send the bitmask,
>> we would need some code to resubmit requests
>> out of order, so it's not trivial.
>>
>> Another minor mistake with last_avail_idx is that it has
>> some redundancy: the high bits in the index
>> (> vq size) are not necessary as they can be
>> got from avail idx.  There's a consistency check
>> in load but we really should try to use formats
>> that are always consistent.
>>
>>> The following patch does the same thing as original, yet
>>> keeps the format of the virtio.  It shouldn't break live
>>> migration either because inuse should be 0.
>>>
>>> Yoshi
>>
>> Question is, can you flush to make inuse 0 in kemari too?
>> And if not, how do you handle the fact that some requests
>> are in flight on the primary?
>
> Although we try flushing requests one by one making inuse 0,
> there are cases when it failovers to the secondary when inuse
> isn't 0.  We handle these in flight request on the primary by
> replaying on the secondary.
>
>>
>>> diff --git a/hw/virtio.c b/hw/virtio.c
>>> index c8a0fc6..875c7ca 100644
>>> --- a/hw/virtio.c
>>> +++ b/hw/virtio.c
>>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>>>      qemu_put_be32(f, i);
>>>
>>>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>>> +        uint16_t last_avail_idx;
>>> +
>>>          if (vdev->vq[i].vring.num == 0)
>>>              break;
>>>
>>> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
>>> +
>>>          qemu_put_be32(f, vdev->vq[i].vring.num);
>>>          qemu_put_be64(f, vdev->vq[i].pa);
>>> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>>> +        qemu_put_be16s(f, &last_avail_idx);
>>>          if (vdev->binding->save_queue)
>>>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>>>      }
>>>
>>>
>>
>> This looks wrong to me.  Requests can complete in any order, can they
>> not?  So if request 0 did not complete and request 1 did not,
>> you send avail - inuse and on the secondary you will process and
>> complete request 1 the second time, crashing the guest.
>
> In case of Kemari, no.  We sit between devices and net/block, and
> queue the requests.  After completing each transaction, we flush
> the requests one by one.  So there won't be completion inversion,
> and therefore won't be visible to the guest.
>
> Yoshi
>
>>
>>>
>>> >
>>> >> >
>>> >> >> ---
>>> >> >>  hw/virtio.c |    8 +++++++-
>>> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
>>> >> >>
>>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>>> >> >> index 849a60f..5509644 100644
>>> >> >> --- a/hw/virtio.c
>>> >> >> +++ b/hw/virtio.c
>>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
>>> >> >>      VRing vring;
>>> >> >>      target_phys_addr_t pa;
>>> >> >>      uint16_t last_avail_idx;
>>> >> >> -    int inuse;
>>> >> >> +    uint16_t inuse;
>>> >> >>      uint16_t vector;
>>> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>>> >> >>      VirtIODevice *vdev;
>>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>>> >> >>          qemu_put_be32(f, vdev->vq[i].vring.num);
>>> >> >>          qemu_put_be64(f, vdev->vq[i].pa);
>>> >> >>          qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>>> >> >> +        qemu_put_be16s(f, &vdev->vq[i].inuse);
>>> >> >>          if (vdev->binding->save_queue)
>>> >> >>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>>> >> >>      }
>>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>>> >> >>          vdev->vq[i].vring.num = qemu_get_be32(f);
>>> >> >>          vdev->vq[i].pa = qemu_get_be64(f);
>>> >> >>          qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>>> >> >> +        qemu_get_be16s(f, &vdev->vq[i].inuse);
>>> >> >> +
>>> >> >> +        /* revert last_avail_idx if there are outstanding emulation. */
>>> >> >> +        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>>> >> >> +        vdev->vq[i].inuse = 0;
>>> >> >>
>>> >> >>          if (vdev->vq[i].pa) {
>>> >> >>              virtqueue_init(&vdev->vq[i]);
>>> >> >> --
>>> >> >> 1.7.1.2
>>> >> >>
>>> >> >> --
>>> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
>>> >> >> the body of a message to majordomo@vger.kernel.org
>>> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >> > --
>>> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>>> >> > the body of a message to majordomo@vger.kernel.org
>>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >> >
>>> > --
>>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>>> > the body of a message to majordomo@vger.kernel.org
>>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >
>> --
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
Michael S. Tsirkin - Dec. 16, 2010, 9:51 a.m.
On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
> >>> >> >>
> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >>> >> >
> >>> >> > This changes migration format, so it will break compatibility with
> >>> >> > existing drivers. More generally, I think migrating internal
> >>> >> > state that is not guest visible is always a mistake
> >>> >> > as it ties migration format to an internal implementation
> >>> >> > (yes, I know we do this sometimes, but we should at least
> >>> >> > try not to add such cases).  I think the right thing to do in this case
> >>> >> > is to flush outstanding
> >>> >> > work when vm is stopped.  Then, we are guaranteed that inuse is 0.
> >>> >> > I sent patches that do this for virtio net and block.
> >>> >>
> >>> >> Could you give me the link of your patches?  I'd like to test
> >>> >> whether they work with Kemari upon failover.  If they do, I'm
> >>> >> happy to drop this patch.
> >>> >>
> >>> >> Yoshi
> >>> >
> >>> > Look for this:
> >>> > stable migration image on a stopped vm
> >>> > sent on:
> >>> > Wed, 24 Nov 2010 17:52:49 +0200
> >>>
> >>> Thanks for the info.
> >>>
> >>> However, The patch series above didn't solve the issue.  In
> >>> case of Kemari, inuse is mostly > 0 because it queues the
> >>> output, and while last_avail_idx gets incremented
> >>> immediately, not sending inuse makes the state inconsistent
> >>> between Primary and Secondary.
> >>
> >> Hmm. Can we simply avoid incrementing last_avail_idx?
> >
> > I think we can calculate or prepare an internal last_avail_idx,
> > and update the external when inuse is decremented.  I'll try
> > whether it work w/ w/o Kemari.
> 
> Hi Michael,
> 
> Could you please take a look at the following patch?

Which version is this against?

> commit 36ee7910059e6b236fe9467a609f5b4aed866912
> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> Date:   Thu Dec 16 14:50:54 2010 +0900
> 
>     virtio: update last_avail_idx when inuse is decreased.
> 
>     Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>

It would be better to have a commit description explaining why a change
is made, and why it is correct, not just repeating what can be seen from
the diff anyway.

> diff --git a/hw/virtio.c b/hw/virtio.c
> index c8a0fc6..6688c02 100644
> --- a/hw/virtio.c
> +++ b/hw/virtio.c
> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
>      wmb();
>      trace_virtqueue_flush(vq, count);
>      vring_used_idx_increment(vq, count);
> +    vq->last_avail_idx += count;
>      vq->inuse -= count;
>  }
> 
> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>      unsigned int i, head, max;
>      target_phys_addr_t desc_pa = vq->vring.desc;
> 
> -    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
> +    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
>          return 0;
> 
>      /* When we start there are none of either input nor output. */
> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> 
>      max = vq->vring.num;
> 
> -    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
> +    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
> 
>      if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
>          if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
> 

Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
Previous patch version sure looked simpler, and this seems functionally
equivalent, so my question still stands: here it is rephrased in a
different way:

	assume that we have in avail ring 2 requests at start of ring: A and B in this order

	host pops A, then B, then completes B and flushes

	now with this patch last_avail_idx will be 1, and then
	remote will get it, it will execute B again. As a result
	B will complete twice, and apparently A will never complete.


This is what I was saying below: assuming that there are
outstanding requests when we migrate, there is no way
a single index can be enough to figure out which requests
need to be handled and which are in flight already.

We must add some kind of bitmask to tell us which is which.

> >
> >>
> >>>  I'm wondering why
> >>> last_avail_idx is OK to send but not inuse.
> >>
> >> last_avail_idx is at some level a mistake, it exposes part of
> >> our internal implementation, but it does *also* express
> >> a guest observable state.
> >>
> >> Here's the problem that it solves: just looking at the rings in virtio
> >> there is no way to detect that a specific request has already been
> >> completed. And the protocol forbids completing the same request twice.
> >>
> >> Our implementation always starts processing the requests
> >> in order, and since we flush outstanding requests
> >> before save, it works to just tell the remote 'process only requests
> >> after this place'.
> >>
> >> But there's no such requirement in the virtio protocol,
> >> so to be really generic we could add a bitmask of valid avail
> >> ring entries that did not complete yet. This would be
> >> the exact representation of the guest observable state.
> >> In practice we have rings of up to 512 entries.
> >> That's 64 byte per ring, not a lot at all.
> >>
> >> However, if we ever do change the protocol to send the bitmask,
> >> we would need some code to resubmit requests
> >> out of order, so it's not trivial.
> >>
> >> Another minor mistake with last_avail_idx is that it has
> >> some redundancy: the high bits in the index
> >> (> vq size) are not necessary as they can be
> >> got from avail idx.  There's a consistency check
> >> in load but we really should try to use formats
> >> that are always consistent.
> >>
> >>> The following patch does the same thing as original, yet
> >>> keeps the format of the virtio.  It shouldn't break live
> >>> migration either because inuse should be 0.
> >>>
> >>> Yoshi
> >>
> >> Question is, can you flush to make inuse 0 in kemari too?
> >> And if not, how do you handle the fact that some requests
> >> are in flight on the primary?
> >
> > Although we try flushing requests one by one making inuse 0,
> > there are cases when it failovers to the secondary when inuse
> > isn't 0.  We handle these in flight request on the primary by
> > replaying on the secondary.
> >
> >>
> >>> diff --git a/hw/virtio.c b/hw/virtio.c
> >>> index c8a0fc6..875c7ca 100644
> >>> --- a/hw/virtio.c
> >>> +++ b/hw/virtio.c
> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >>>      qemu_put_be32(f, i);
> >>>
> >>>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> >>> +        uint16_t last_avail_idx;
> >>> +
> >>>          if (vdev->vq[i].vring.num == 0)
> >>>              break;
> >>>
> >>> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
> >>> +
> >>>          qemu_put_be32(f, vdev->vq[i].vring.num);
> >>>          qemu_put_be64(f, vdev->vq[i].pa);
> >>> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >>> +        qemu_put_be16s(f, &last_avail_idx);
> >>>          if (vdev->binding->save_queue)
> >>>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >>>      }
> >>>
> >>>
> >>
> >> This looks wrong to me.  Requests can complete in any order, can they
> >> not?  So if request 0 did not complete and request 1 did not,
> >> you send avail - inuse and on the secondary you will process and
> >> complete request 1 the second time, crashing the guest.
> >
> > In case of Kemari, no.  We sit between devices and net/block, and
> > queue the requests.  After completing each transaction, we flush
> > the requests one by one.  So there won't be completion inversion,
> > and therefore won't be visible to the guest.
> >
> > Yoshi
> >
> >>
> >>>
> >>> >
> >>> >> >
> >>> >> >> ---
> >>> >> >>  hw/virtio.c |    8 +++++++-
> >>> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
> >>> >> >>
> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >>> >> >> index 849a60f..5509644 100644
> >>> >> >> --- a/hw/virtio.c
> >>> >> >> +++ b/hw/virtio.c
> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
> >>> >> >>      VRing vring;
> >>> >> >>      target_phys_addr_t pa;
> >>> >> >>      uint16_t last_avail_idx;
> >>> >> >> -    int inuse;
> >>> >> >> +    uint16_t inuse;
> >>> >> >>      uint16_t vector;
> >>> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
> >>> >> >>      VirtIODevice *vdev;
> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >>> >> >>          qemu_put_be32(f, vdev->vq[i].vring.num);
> >>> >> >>          qemu_put_be64(f, vdev->vq[i].pa);
> >>> >> >>          qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >>> >> >> +        qemu_put_be16s(f, &vdev->vq[i].inuse);
> >>> >> >>          if (vdev->binding->save_queue)
> >>> >> >>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >>> >> >>      }
> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
> >>> >> >>          vdev->vq[i].vring.num = qemu_get_be32(f);
> >>> >> >>          vdev->vq[i].pa = qemu_get_be64(f);
> >>> >> >>          qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
> >>> >> >> +        qemu_get_be16s(f, &vdev->vq[i].inuse);
> >>> >> >> +
> >>> >> >> +        /* revert last_avail_idx if there are outstanding emulation. */
> >>> >> >> +        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
> >>> >> >> +        vdev->vq[i].inuse = 0;
> >>> >> >>
> >>> >> >>          if (vdev->vq[i].pa) {
> >>> >> >>              virtqueue_init(&vdev->vq[i]);
> >>> >> >> --
> >>> >> >> 1.7.1.2
> >>> >> >>
> >>> >> >> --
> >>> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
> >>> >> >> the body of a message to majordomo@vger.kernel.org
> >>> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>> >> > --
> >>> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> >>> >> > the body of a message to majordomo@vger.kernel.org
> >>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>> >> >
> >>> > --
> >>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> >>> > the body of a message to majordomo@vger.kernel.org
> >>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>> >
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >
Yoshiaki Tamura - Dec. 16, 2010, 2:28 p.m.
2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
>> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
>> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
>> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
>> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
>> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
>> >>> >> >>
>> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >>> >> >
>> >>> >> > This changes migration format, so it will break compatibility with
>> >>> >> > existing drivers. More generally, I think migrating internal
>> >>> >> > state that is not guest visible is always a mistake
>> >>> >> > as it ties migration format to an internal implementation
>> >>> >> > (yes, I know we do this sometimes, but we should at least
>> >>> >> > try not to add such cases).  I think the right thing to do in this case
>> >>> >> > is to flush outstanding
>> >>> >> > work when vm is stopped.  Then, we are guaranteed that inuse is 0.
>> >>> >> > I sent patches that do this for virtio net and block.
>> >>> >>
>> >>> >> Could you give me the link of your patches?  I'd like to test
>> >>> >> whether they work with Kemari upon failover.  If they do, I'm
>> >>> >> happy to drop this patch.
>> >>> >>
>> >>> >> Yoshi
>> >>> >
>> >>> > Look for this:
>> >>> > stable migration image on a stopped vm
>> >>> > sent on:
>> >>> > Wed, 24 Nov 2010 17:52:49 +0200
>> >>>
>> >>> Thanks for the info.
>> >>>
>> >>> However, The patch series above didn't solve the issue.  In
>> >>> case of Kemari, inuse is mostly > 0 because it queues the
>> >>> output, and while last_avail_idx gets incremented
>> >>> immediately, not sending inuse makes the state inconsistent
>> >>> between Primary and Secondary.
>> >>
>> >> Hmm. Can we simply avoid incrementing last_avail_idx?
>> >
>> > I think we can calculate or prepare an internal last_avail_idx,
>> > and update the external when inuse is decremented.  I'll try
>> > whether it work w/ w/o Kemari.
>>
>> Hi Michael,
>>
>> Could you please take a look at the following patch?
>
> Which version is this against?

Oops.  It should be very old.
67f895bfe69f323b427b284430b6219c8a62e8d4

>> commit 36ee7910059e6b236fe9467a609f5b4aed866912
>> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> Date:   Thu Dec 16 14:50:54 2010 +0900
>>
>>     virtio: update last_avail_idx when inuse is decreased.
>>
>>     Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>
> It would be better to have a commit description explaining why a change
> is made, and why it is correct, not just repeating what can be seen from
> the diff anyway.

Sorry for being lazy here.

>> diff --git a/hw/virtio.c b/hw/virtio.c
>> index c8a0fc6..6688c02 100644
>> --- a/hw/virtio.c
>> +++ b/hw/virtio.c
>> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
>>      wmb();
>>      trace_virtqueue_flush(vq, count);
>>      vring_used_idx_increment(vq, count);
>> +    vq->last_avail_idx += count;
>>      vq->inuse -= count;
>>  }
>>
>> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>>      unsigned int i, head, max;
>>      target_phys_addr_t desc_pa = vq->vring.desc;
>>
>> -    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
>> +    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
>>          return 0;
>>
>>      /* When we start there are none of either input nor output. */
>> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>>
>>      max = vq->vring.num;
>>
>> -    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
>> +    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
>>
>>      if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
>>          if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
>>
>
> Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?

I think there are two problems.

1. When to update last_avail_idx.
2. The ordering issue you're mentioning below.

The patch above is only trying to address 1 because last time you
mentioned that modifying last_avail_idx upon save may break the
guest, which I agree.  If virtio_queue_empty and
virtqueue_avail_bytes are only used internally, meaning invisible
to the guest, I guess the approach above can be applied too.

> Previous patch version sure looked simpler, and this seems functionally
> equivalent, so my question still stands: here it is rephrased in a
> different way:
>
>        assume that we have in avail ring 2 requests at start of ring: A and B in this order
>
>        host pops A, then B, then completes B and flushes
>
>        now with this patch last_avail_idx will be 1, and then
>        remote will get it, it will execute B again. As a result
>        B will complete twice, and apparently A will never complete.
>
>
> This is what I was saying below: assuming that there are
> outstanding requests when we migrate, there is no way
> a single index can be enough to figure out which requests
> need to be handled and which are in flight already.
>
> We must add some kind of bitmask to tell us which is which.

I should understand why this inversion can happen before solving
the issue.  Currently, how are you making virio-net to flush
every requests for live migration?  Is it qemu_aio_flush()?

Yoshi

>
>> >
>> >>
>> >>>  I'm wondering why
>> >>> last_avail_idx is OK to send but not inuse.
>> >>
>> >> last_avail_idx is at some level a mistake, it exposes part of
>> >> our internal implementation, but it does *also* express
>> >> a guest observable state.
>> >>
>> >> Here's the problem that it solves: just looking at the rings in virtio
>> >> there is no way to detect that a specific request has already been
>> >> completed. And the protocol forbids completing the same request twice.
>> >>
>> >> Our implementation always starts processing the requests
>> >> in order, and since we flush outstanding requests
>> >> before save, it works to just tell the remote 'process only requests
>> >> after this place'.
>> >>
>> >> But there's no such requirement in the virtio protocol,
>> >> so to be really generic we could add a bitmask of valid avail
>> >> ring entries that did not complete yet. This would be
>> >> the exact representation of the guest observable state.
>> >> In practice we have rings of up to 512 entries.
>> >> That's 64 byte per ring, not a lot at all.
>> >>
>> >> However, if we ever do change the protocol to send the bitmask,
>> >> we would need some code to resubmit requests
>> >> out of order, so it's not trivial.
>> >>
>> >> Another minor mistake with last_avail_idx is that it has
>> >> some redundancy: the high bits in the index
>> >> (> vq size) are not necessary as they can be
>> >> got from avail idx.  There's a consistency check
>> >> in load but we really should try to use formats
>> >> that are always consistent.
>> >>
>> >>> The following patch does the same thing as original, yet
>> >>> keeps the format of the virtio.  It shouldn't break live
>> >>> migration either because inuse should be 0.
>> >>>
>> >>> Yoshi
>> >>
>> >> Question is, can you flush to make inuse 0 in kemari too?
>> >> And if not, how do you handle the fact that some requests
>> >> are in flight on the primary?
>> >
>> > Although we try flushing requests one by one making inuse 0,
>> > there are cases when it failovers to the secondary when inuse
>> > isn't 0.  We handle these in flight request on the primary by
>> > replaying on the secondary.
>> >
>> >>
>> >>> diff --git a/hw/virtio.c b/hw/virtio.c
>> >>> index c8a0fc6..875c7ca 100644
>> >>> --- a/hw/virtio.c
>> >>> +++ b/hw/virtio.c
>> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >>>      qemu_put_be32(f, i);
>> >>>
>> >>>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>> >>> +        uint16_t last_avail_idx;
>> >>> +
>> >>>          if (vdev->vq[i].vring.num == 0)
>> >>>              break;
>> >>>
>> >>> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
>> >>> +
>> >>>          qemu_put_be32(f, vdev->vq[i].vring.num);
>> >>>          qemu_put_be64(f, vdev->vq[i].pa);
>> >>> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >>> +        qemu_put_be16s(f, &last_avail_idx);
>> >>>          if (vdev->binding->save_queue)
>> >>>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >>>      }
>> >>>
>> >>>
>> >>
>> >> This looks wrong to me.  Requests can complete in any order, can they
>> >> not?  So if request 0 did not complete and request 1 did not,
>> >> you send avail - inuse and on the secondary you will process and
>> >> complete request 1 the second time, crashing the guest.
>> >
>> > In case of Kemari, no.  We sit between devices and net/block, and
>> > queue the requests.  After completing each transaction, we flush
>> > the requests one by one.  So there won't be completion inversion,
>> > and therefore won't be visible to the guest.
>> >
>> > Yoshi
>> >
>> >>
>> >>>
>> >>> >
>> >>> >> >
>> >>> >> >> ---
>> >>> >> >>  hw/virtio.c |    8 +++++++-
>> >>> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
>> >>> >> >>
>> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >>> >> >> index 849a60f..5509644 100644
>> >>> >> >> --- a/hw/virtio.c
>> >>> >> >> +++ b/hw/virtio.c
>> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
>> >>> >> >>      VRing vring;
>> >>> >> >>      target_phys_addr_t pa;
>> >>> >> >>      uint16_t last_avail_idx;
>> >>> >> >> -    int inuse;
>> >>> >> >> +    uint16_t inuse;
>> >>> >> >>      uint16_t vector;
>> >>> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>> >>> >> >>      VirtIODevice *vdev;
>> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >>> >> >>          qemu_put_be32(f, vdev->vq[i].vring.num);
>> >>> >> >>          qemu_put_be64(f, vdev->vq[i].pa);
>> >>> >> >>          qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >>> >> >> +        qemu_put_be16s(f, &vdev->vq[i].inuse);
>> >>> >> >>          if (vdev->binding->save_queue)
>> >>> >> >>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >>> >> >>      }
>> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>> >>> >> >>          vdev->vq[i].vring.num = qemu_get_be32(f);
>> >>> >> >>          vdev->vq[i].pa = qemu_get_be64(f);
>> >>> >> >>          qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>> >>> >> >> +        qemu_get_be16s(f, &vdev->vq[i].inuse);
>> >>> >> >> +
>> >>> >> >> +        /* revert last_avail_idx if there are outstanding emulation. */
>> >>> >> >> +        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>> >>> >> >> +        vdev->vq[i].inuse = 0;
>> >>> >> >>
>> >>> >> >>          if (vdev->vq[i].pa) {
>> >>> >> >>              virtqueue_init(&vdev->vq[i]);
>> >>> >> >> --
>> >>> >> >> 1.7.1.2
>> >>> >> >>
>> >>> >> >> --
>> >>> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >>> >> >> the body of a message to majordomo@vger.kernel.org
>> >>> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>> >> > --
>> >>> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >>> >> > the body of a message to majordomo@vger.kernel.org
>> >>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>> >> >
>> >>> > --
>> >>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >>> > the body of a message to majordomo@vger.kernel.org
>> >>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>> >
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >> the body of a message to majordomo@vger.kernel.org
>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >>
>> >
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Michael S. Tsirkin - Dec. 16, 2010, 2:40 p.m.
On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
> >> >>> >> >>
> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >>> >> >
> >> >>> >> > This changes migration format, so it will break compatibility with
> >> >>> >> > existing drivers. More generally, I think migrating internal
> >> >>> >> > state that is not guest visible is always a mistake
> >> >>> >> > as it ties migration format to an internal implementation
> >> >>> >> > (yes, I know we do this sometimes, but we should at least
> >> >>> >> > try not to add such cases).  I think the right thing to do in this case
> >> >>> >> > is to flush outstanding
> >> >>> >> > work when vm is stopped.  Then, we are guaranteed that inuse is 0.
> >> >>> >> > I sent patches that do this for virtio net and block.
> >> >>> >>
> >> >>> >> Could you give me the link of your patches?  I'd like to test
> >> >>> >> whether they work with Kemari upon failover.  If they do, I'm
> >> >>> >> happy to drop this patch.
> >> >>> >>
> >> >>> >> Yoshi
> >> >>> >
> >> >>> > Look for this:
> >> >>> > stable migration image on a stopped vm
> >> >>> > sent on:
> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
> >> >>>
> >> >>> Thanks for the info.
> >> >>>
> >> >>> However, The patch series above didn't solve the issue.  In
> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
> >> >>> output, and while last_avail_idx gets incremented
> >> >>> immediately, not sending inuse makes the state inconsistent
> >> >>> between Primary and Secondary.
> >> >>
> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
> >> >
> >> > I think we can calculate or prepare an internal last_avail_idx,
> >> > and update the external when inuse is decremented.  I'll try
> >> > whether it work w/ w/o Kemari.
> >>
> >> Hi Michael,
> >>
> >> Could you please take a look at the following patch?
> >
> > Which version is this against?
> 
> Oops.  It should be very old.
> 67f895bfe69f323b427b284430b6219c8a62e8d4
> 
> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> Date:   Thu Dec 16 14:50:54 2010 +0900
> >>
> >>     virtio: update last_avail_idx when inuse is decreased.
> >>
> >>     Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >
> > It would be better to have a commit description explaining why a change
> > is made, and why it is correct, not just repeating what can be seen from
> > the diff anyway.
> 
> Sorry for being lazy here.
> 
> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> index c8a0fc6..6688c02 100644
> >> --- a/hw/virtio.c
> >> +++ b/hw/virtio.c
> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
> >>      wmb();
> >>      trace_virtqueue_flush(vq, count);
> >>      vring_used_idx_increment(vq, count);
> >> +    vq->last_avail_idx += count;
> >>      vq->inuse -= count;
> >>  }
> >>
> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >>      unsigned int i, head, max;
> >>      target_phys_addr_t desc_pa = vq->vring.desc;
> >>
> >> -    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
> >> +    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
> >>          return 0;
> >>
> >>      /* When we start there are none of either input nor output. */
> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >>
> >>      max = vq->vring.num;
> >>
> >> -    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
> >> +    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
> >>
> >>      if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
> >>          if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
> >>
> >
> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
> 
> I think there are two problems.
> 
> 1. When to update last_avail_idx.
> 2. The ordering issue you're mentioning below.
> 
> The patch above is only trying to address 1 because last time you
> mentioned that modifying last_avail_idx upon save may break the
> guest, which I agree.  If virtio_queue_empty and
> virtqueue_avail_bytes are only used internally, meaning invisible
> to the guest, I guess the approach above can be applied too.

So IMHO 2 is the real issue. This is what was problematic
with the save patch, otherwise of course changes in save
are better than changes all over the codebase.

> > Previous patch version sure looked simpler, and this seems functionally
> > equivalent, so my question still stands: here it is rephrased in a
> > different way:
> >
> >        assume that we have in avail ring 2 requests at start of ring: A and B in this order
> >
> >        host pops A, then B, then completes B and flushes
> >
> >        now with this patch last_avail_idx will be 1, and then
> >        remote will get it, it will execute B again. As a result
> >        B will complete twice, and apparently A will never complete.
> >
> >
> > This is what I was saying below: assuming that there are
> > outstanding requests when we migrate, there is no way
> > a single index can be enough to figure out which requests
> > need to be handled and which are in flight already.
> >
> > We must add some kind of bitmask to tell us which is which.
> 
> I should understand why this inversion can happen before solving
> the issue.

It's a fundamental thing in virtio.
I think it is currently only likely to happen with block, I think tap
currently completes things in order.  In any case relying on this in the
frontend is a mistake.

>  Currently, how are you making virio-net to flush
> every requests for live migration?  Is it qemu_aio_flush()?
> 
> Yoshi

Think so.


> >
> >> >
> >> >>
> >> >>>  I'm wondering why
> >> >>> last_avail_idx is OK to send but not inuse.
> >> >>
> >> >> last_avail_idx is at some level a mistake, it exposes part of
> >> >> our internal implementation, but it does *also* express
> >> >> a guest observable state.
> >> >>
> >> >> Here's the problem that it solves: just looking at the rings in virtio
> >> >> there is no way to detect that a specific request has already been
> >> >> completed. And the protocol forbids completing the same request twice.
> >> >>
> >> >> Our implementation always starts processing the requests
> >> >> in order, and since we flush outstanding requests
> >> >> before save, it works to just tell the remote 'process only requests
> >> >> after this place'.
> >> >>
> >> >> But there's no such requirement in the virtio protocol,
> >> >> so to be really generic we could add a bitmask of valid avail
> >> >> ring entries that did not complete yet. This would be
> >> >> the exact representation of the guest observable state.
> >> >> In practice we have rings of up to 512 entries.
> >> >> That's 64 byte per ring, not a lot at all.
> >> >>
> >> >> However, if we ever do change the protocol to send the bitmask,
> >> >> we would need some code to resubmit requests
> >> >> out of order, so it's not trivial.
> >> >>
> >> >> Another minor mistake with last_avail_idx is that it has
> >> >> some redundancy: the high bits in the index
> >> >> (> vq size) are not necessary as they can be
> >> >> got from avail idx.  There's a consistency check
> >> >> in load but we really should try to use formats
> >> >> that are always consistent.
> >> >>
> >> >>> The following patch does the same thing as original, yet
> >> >>> keeps the format of the virtio.  It shouldn't break live
> >> >>> migration either because inuse should be 0.
> >> >>>
> >> >>> Yoshi
> >> >>
> >> >> Question is, can you flush to make inuse 0 in kemari too?
> >> >> And if not, how do you handle the fact that some requests
> >> >> are in flight on the primary?
> >> >
> >> > Although we try flushing requests one by one making inuse 0,
> >> > there are cases when it failovers to the secondary when inuse
> >> > isn't 0.  We handle these in flight request on the primary by
> >> > replaying on the secondary.
> >> >
> >> >>
> >> >>> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >>> index c8a0fc6..875c7ca 100644
> >> >>> --- a/hw/virtio.c
> >> >>> +++ b/hw/virtio.c
> >> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >>>      qemu_put_be32(f, i);
> >> >>>
> >> >>>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> >> >>> +        uint16_t last_avail_idx;
> >> >>> +
> >> >>>          if (vdev->vq[i].vring.num == 0)
> >> >>>              break;
> >> >>>
> >> >>> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
> >> >>> +
> >> >>>          qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >>>          qemu_put_be64(f, vdev->vq[i].pa);
> >> >>> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >>> +        qemu_put_be16s(f, &last_avail_idx);
> >> >>>          if (vdev->binding->save_queue)
> >> >>>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >>>      }
> >> >>>
> >> >>>
> >> >>
> >> >> This looks wrong to me.  Requests can complete in any order, can they
> >> >> not?  So if request 0 did not complete and request 1 did not,
> >> >> you send avail - inuse and on the secondary you will process and
> >> >> complete request 1 the second time, crashing the guest.
> >> >
> >> > In case of Kemari, no.  We sit between devices and net/block, and
> >> > queue the requests.  After completing each transaction, we flush
> >> > the requests one by one.  So there won't be completion inversion,
> >> > and therefore won't be visible to the guest.
> >> >
> >> > Yoshi
> >> >
> >> >>
> >> >>>
> >> >>> >
> >> >>> >> >
> >> >>> >> >> ---
> >> >>> >> >>  hw/virtio.c |    8 +++++++-
> >> >>> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
> >> >>> >> >>
> >> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >>> >> >> index 849a60f..5509644 100644
> >> >>> >> >> --- a/hw/virtio.c
> >> >>> >> >> +++ b/hw/virtio.c
> >> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
> >> >>> >> >>      VRing vring;
> >> >>> >> >>      target_phys_addr_t pa;
> >> >>> >> >>      uint16_t last_avail_idx;
> >> >>> >> >> -    int inuse;
> >> >>> >> >> +    uint16_t inuse;
> >> >>> >> >>      uint16_t vector;
> >> >>> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
> >> >>> >> >>      VirtIODevice *vdev;
> >> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >>> >> >>          qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >>> >> >>          qemu_put_be64(f, vdev->vq[i].pa);
> >> >>> >> >>          qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >>> >> >> +        qemu_put_be16s(f, &vdev->vq[i].inuse);
> >> >>> >> >>          if (vdev->binding->save_queue)
> >> >>> >> >>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >>> >> >>      }
> >> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
> >> >>> >> >>          vdev->vq[i].vring.num = qemu_get_be32(f);
> >> >>> >> >>          vdev->vq[i].pa = qemu_get_be64(f);
> >> >>> >> >>          qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >>> >> >> +        qemu_get_be16s(f, &vdev->vq[i].inuse);
> >> >>> >> >> +
> >> >>> >> >> +        /* revert last_avail_idx if there are outstanding emulation. */
> >> >>> >> >> +        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
> >> >>> >> >> +        vdev->vq[i].inuse = 0;
> >> >>> >> >>
> >> >>> >> >>          if (vdev->vq[i].pa) {
> >> >>> >> >>              virtqueue_init(&vdev->vq[i]);
> >> >>> >> >> --
> >> >>> >> >> 1.7.1.2
> >> >>> >> >>
> >> >>> >> >> --
> >> >>> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >>> >> >> the body of a message to majordomo@vger.kernel.org
> >> >>> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >>> >> > --
> >> >>> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >>> >> > the body of a message to majordomo@vger.kernel.org
> >> >>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >>> >> >
> >> >>> > --
> >> >>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >>> > the body of a message to majordomo@vger.kernel.org
> >> >>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >>> >
> >> >> --
> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >> the body of a message to majordomo@vger.kernel.org
> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >>
> >> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
Yoshiaki Tamura - Dec. 16, 2010, 3:59 p.m.
2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
>> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
>> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
>> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
>> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
>> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
>> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
>> >> >>> >> >>
>> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >>> >> >
>> >> >>> >> > This changes migration format, so it will break compatibility with
>> >> >>> >> > existing drivers. More generally, I think migrating internal
>> >> >>> >> > state that is not guest visible is always a mistake
>> >> >>> >> > as it ties migration format to an internal implementation
>> >> >>> >> > (yes, I know we do this sometimes, but we should at least
>> >> >>> >> > try not to add such cases).  I think the right thing to do in this case
>> >> >>> >> > is to flush outstanding
>> >> >>> >> > work when vm is stopped.  Then, we are guaranteed that inuse is 0.
>> >> >>> >> > I sent patches that do this for virtio net and block.
>> >> >>> >>
>> >> >>> >> Could you give me the link of your patches?  I'd like to test
>> >> >>> >> whether they work with Kemari upon failover.  If they do, I'm
>> >> >>> >> happy to drop this patch.
>> >> >>> >>
>> >> >>> >> Yoshi
>> >> >>> >
>> >> >>> > Look for this:
>> >> >>> > stable migration image on a stopped vm
>> >> >>> > sent on:
>> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
>> >> >>>
>> >> >>> Thanks for the info.
>> >> >>>
>> >> >>> However, The patch series above didn't solve the issue.  In
>> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
>> >> >>> output, and while last_avail_idx gets incremented
>> >> >>> immediately, not sending inuse makes the state inconsistent
>> >> >>> between Primary and Secondary.
>> >> >>
>> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
>> >> >
>> >> > I think we can calculate or prepare an internal last_avail_idx,
>> >> > and update the external when inuse is decremented.  I'll try
>> >> > whether it work w/ w/o Kemari.
>> >>
>> >> Hi Michael,
>> >>
>> >> Could you please take a look at the following patch?
>> >
>> > Which version is this against?
>>
>> Oops.  It should be very old.
>> 67f895bfe69f323b427b284430b6219c8a62e8d4
>>
>> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
>> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> Date:   Thu Dec 16 14:50:54 2010 +0900
>> >>
>> >>     virtio: update last_avail_idx when inuse is decreased.
>> >>
>> >>     Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >
>> > It would be better to have a commit description explaining why a change
>> > is made, and why it is correct, not just repeating what can be seen from
>> > the diff anyway.
>>
>> Sorry for being lazy here.
>>
>> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> index c8a0fc6..6688c02 100644
>> >> --- a/hw/virtio.c
>> >> +++ b/hw/virtio.c
>> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
>> >>      wmb();
>> >>      trace_virtqueue_flush(vq, count);
>> >>      vring_used_idx_increment(vq, count);
>> >> +    vq->last_avail_idx += count;
>> >>      vq->inuse -= count;
>> >>  }
>> >>
>> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>> >>      unsigned int i, head, max;
>> >>      target_phys_addr_t desc_pa = vq->vring.desc;
>> >>
>> >> -    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
>> >> +    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
>> >>          return 0;
>> >>
>> >>      /* When we start there are none of either input nor output. */
>> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>> >>
>> >>      max = vq->vring.num;
>> >>
>> >> -    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
>> >> +    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
>> >>
>> >>      if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
>> >>          if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
>> >>
>> >
>> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
>>
>> I think there are two problems.
>>
>> 1. When to update last_avail_idx.
>> 2. The ordering issue you're mentioning below.
>>
>> The patch above is only trying to address 1 because last time you
>> mentioned that modifying last_avail_idx upon save may break the
>> guest, which I agree.  If virtio_queue_empty and
>> virtqueue_avail_bytes are only used internally, meaning invisible
>> to the guest, I guess the approach above can be applied too.
>
> So IMHO 2 is the real issue. This is what was problematic
> with the save patch, otherwise of course changes in save
> are better than changes all over the codebase.

All right.  Then let's focus on 2 first.

>> > Previous patch version sure looked simpler, and this seems functionally
>> > equivalent, so my question still stands: here it is rephrased in a
>> > different way:
>> >
>> >        assume that we have in avail ring 2 requests at start of ring: A and B in this order
>> >
>> >        host pops A, then B, then completes B and flushes
>> >
>> >        now with this patch last_avail_idx will be 1, and then
>> >        remote will get it, it will execute B again. As a result
>> >        B will complete twice, and apparently A will never complete.
>> >
>> >
>> > This is what I was saying below: assuming that there are
>> > outstanding requests when we migrate, there is no way
>> > a single index can be enough to figure out which requests
>> > need to be handled and which are in flight already.
>> >
>> > We must add some kind of bitmask to tell us which is which.
>>
>> I should understand why this inversion can happen before solving
>> the issue.
>
> It's a fundamental thing in virtio.
> I think it is currently only likely to happen with block, I think tap
> currently completes things in order.  In any case relying on this in the
> frontend is a mistake.
>
>>  Currently, how are you making virio-net to flush
>> every requests for live migration?  Is it qemu_aio_flush()?
>
> Think so.

If qemu_aio_flush() is responsible for flushing the outstanding
virtio-net requests, I'm wondering why it's a problem for Kemari.
As I described in the previous message, Kemari queues the
requests first.  So in you example above, it should start with

virtio-net: last_avai_idx 0 inuse 2
event-tap: {A,B}

As you know, the requests are still in order still because net
layer initiates in order.  Not about completing.

In the first synchronization, the status above is transferred.  In
the next synchronization, the status will be as following.

virtio-net: last_avai_idx 1 inuse 1
event-tap: {B}

Why? Because Kemari flushes the first virtio-net request using
qemu_aio_flush() before each synchronization.  If
qemu_aio_flush() doesn't guarantee the order, what you pointed
should be problematic.  So in the final synchronization, the
state should be,

virtio-net: last_avai_idx 2 inuse 0
event-tap: {}

where A,B were completed in order.

Yoshi


>
>> >
>> >> >
>> >> >>
>> >> >>>  I'm wondering why
>> >> >>> last_avail_idx is OK to send but not inuse.
>> >> >>
>> >> >> last_avail_idx is at some level a mistake, it exposes part of
>> >> >> our internal implementation, but it does *also* express
>> >> >> a guest observable state.
>> >> >>
>> >> >> Here's the problem that it solves: just looking at the rings in virtio
>> >> >> there is no way to detect that a specific request has already been
>> >> >> completed. And the protocol forbids completing the same request twice.
>> >> >>
>> >> >> Our implementation always starts processing the requests
>> >> >> in order, and since we flush outstanding requests
>> >> >> before save, it works to just tell the remote 'process only requests
>> >> >> after this place'.
>> >> >>
>> >> >> But there's no such requirement in the virtio protocol,
>> >> >> so to be really generic we could add a bitmask of valid avail
>> >> >> ring entries that did not complete yet. This would be
>> >> >> the exact representation of the guest observable state.
>> >> >> In practice we have rings of up to 512 entries.
>> >> >> That's 64 byte per ring, not a lot at all.
>> >> >>
>> >> >> However, if we ever do change the protocol to send the bitmask,
>> >> >> we would need some code to resubmit requests
>> >> >> out of order, so it's not trivial.
>> >> >>
>> >> >> Another minor mistake with last_avail_idx is that it has
>> >> >> some redundancy: the high bits in the index
>> >> >> (> vq size) are not necessary as they can be
>> >> >> got from avail idx.  There's a consistency check
>> >> >> in load but we really should try to use formats
>> >> >> that are always consistent.
>> >> >>
>> >> >>> The following patch does the same thing as original, yet
>> >> >>> keeps the format of the virtio.  It shouldn't break live
>> >> >>> migration either because inuse should be 0.
>> >> >>>
>> >> >>> Yoshi
>> >> >>
>> >> >> Question is, can you flush to make inuse 0 in kemari too?
>> >> >> And if not, how do you handle the fact that some requests
>> >> >> are in flight on the primary?
>> >> >
>> >> > Although we try flushing requests one by one making inuse 0,
>> >> > there are cases when it failovers to the secondary when inuse
>> >> > isn't 0.  We handle these in flight request on the primary by
>> >> > replaying on the secondary.
>> >> >
>> >> >>
>> >> >>> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >>> index c8a0fc6..875c7ca 100644
>> >> >>> --- a/hw/virtio.c
>> >> >>> +++ b/hw/virtio.c
>> >> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >> >>>      qemu_put_be32(f, i);
>> >> >>>
>> >> >>>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>> >> >>> +        uint16_t last_avail_idx;
>> >> >>> +
>> >> >>>          if (vdev->vq[i].vring.num == 0)
>> >> >>>              break;
>> >> >>>
>> >> >>> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
>> >> >>> +
>> >> >>>          qemu_put_be32(f, vdev->vq[i].vring.num);
>> >> >>>          qemu_put_be64(f, vdev->vq[i].pa);
>> >> >>> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >>> +        qemu_put_be16s(f, &last_avail_idx);
>> >> >>>          if (vdev->binding->save_queue)
>> >> >>>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >> >>>      }
>> >> >>>
>> >> >>>
>> >> >>
>> >> >> This looks wrong to me.  Requests can complete in any order, can they
>> >> >> not?  So if request 0 did not complete and request 1 did not,
>> >> >> you send avail - inuse and on the secondary you will process and
>> >> >> complete request 1 the second time, crashing the guest.
>> >> >
>> >> > In case of Kemari, no.  We sit between devices and net/block, and
>> >> > queue the requests.  After completing each transaction, we flush
>> >> > the requests one by one.  So there won't be completion inversion,
>> >> > and therefore won't be visible to the guest.
>> >> >
>> >> > Yoshi
>> >> >
>> >> >>
>> >> >>>
>> >> >>> >
>> >> >>> >> >
>> >> >>> >> >> ---
>> >> >>> >> >>  hw/virtio.c |    8 +++++++-
>> >> >>> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
>> >> >>> >> >>
>> >> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >>> >> >> index 849a60f..5509644 100644
>> >> >>> >> >> --- a/hw/virtio.c
>> >> >>> >> >> +++ b/hw/virtio.c
>> >> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
>> >> >>> >> >>      VRing vring;
>> >> >>> >> >>      target_phys_addr_t pa;
>> >> >>> >> >>      uint16_t last_avail_idx;
>> >> >>> >> >> -    int inuse;
>> >> >>> >> >> +    uint16_t inuse;
>> >> >>> >> >>      uint16_t vector;
>> >> >>> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>> >> >>> >> >>      VirtIODevice *vdev;
>> >> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >> >>> >> >>          qemu_put_be32(f, vdev->vq[i].vring.num);
>> >> >>> >> >>          qemu_put_be64(f, vdev->vq[i].pa);
>> >> >>> >> >>          qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >>> >> >> +        qemu_put_be16s(f, &vdev->vq[i].inuse);
>> >> >>> >> >>          if (vdev->binding->save_queue)
>> >> >>> >> >>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >> >>> >> >>      }
>> >> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>> >> >>> >> >>          vdev->vq[i].vring.num = qemu_get_be32(f);
>> >> >>> >> >>          vdev->vq[i].pa = qemu_get_be64(f);
>> >> >>> >> >>          qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >>> >> >> +        qemu_get_be16s(f, &vdev->vq[i].inuse);
>> >> >>> >> >> +
>> >> >>> >> >> +        /* revert last_avail_idx if there are outstanding emulation. */
>> >> >>> >> >> +        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>> >> >>> >> >> +        vdev->vq[i].inuse = 0;
>> >> >>> >> >>
>> >> >>> >> >>          if (vdev->vq[i].pa) {
>> >> >>> >> >>              virtqueue_init(&vdev->vq[i]);
>> >> >>> >> >> --
>> >> >>> >> >> 1.7.1.2
>> >> >>> >> >>
>> >> >>> >> >> --
>> >> >>> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >> >>> >> >> the body of a message to majordomo@vger.kernel.org
>> >> >>> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >>> >> > --
>> >> >>> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >> >>> >> > the body of a message to majordomo@vger.kernel.org
>> >> >>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >>> >> >
>> >> >>> > --
>> >> >>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >> >>> > the body of a message to majordomo@vger.kernel.org
>> >> >>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >>> >
>> >> >> --
>> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >> >> the body of a message to majordomo@vger.kernel.org
>> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >>
>> >> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Yoshiaki Tamura - Dec. 17, 2010, 4:22 p.m.
2010/12/17 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
>> On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
>>> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
>>> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
>>> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
>>> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
>>> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
>>> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>>> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>>> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>>> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>>> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
>>> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
>>> >> >>> >> >>
>>> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>>> >> >>> >> >
>>> >> >>> >> > This changes migration format, so it will break compatibility with
>>> >> >>> >> > existing drivers. More generally, I think migrating internal
>>> >> >>> >> > state that is not guest visible is always a mistake
>>> >> >>> >> > as it ties migration format to an internal implementation
>>> >> >>> >> > (yes, I know we do this sometimes, but we should at least
>>> >> >>> >> > try not to add such cases).  I think the right thing to do in this case
>>> >> >>> >> > is to flush outstanding
>>> >> >>> >> > work when vm is stopped.  Then, we are guaranteed that inuse is 0.
>>> >> >>> >> > I sent patches that do this for virtio net and block.
>>> >> >>> >>
>>> >> >>> >> Could you give me the link of your patches?  I'd like to test
>>> >> >>> >> whether they work with Kemari upon failover.  If they do, I'm
>>> >> >>> >> happy to drop this patch.
>>> >> >>> >>
>>> >> >>> >> Yoshi
>>> >> >>> >
>>> >> >>> > Look for this:
>>> >> >>> > stable migration image on a stopped vm
>>> >> >>> > sent on:
>>> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
>>> >> >>>
>>> >> >>> Thanks for the info.
>>> >> >>>
>>> >> >>> However, The patch series above didn't solve the issue.  In
>>> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
>>> >> >>> output, and while last_avail_idx gets incremented
>>> >> >>> immediately, not sending inuse makes the state inconsistent
>>> >> >>> between Primary and Secondary.
>>> >> >>
>>> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
>>> >> >
>>> >> > I think we can calculate or prepare an internal last_avail_idx,
>>> >> > and update the external when inuse is decremented.  I'll try
>>> >> > whether it work w/ w/o Kemari.
>>> >>
>>> >> Hi Michael,
>>> >>
>>> >> Could you please take a look at the following patch?
>>> >
>>> > Which version is this against?
>>>
>>> Oops.  It should be very old.
>>> 67f895bfe69f323b427b284430b6219c8a62e8d4
>>>
>>> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
>>> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>>> >> Date:   Thu Dec 16 14:50:54 2010 +0900
>>> >>
>>> >>     virtio: update last_avail_idx when inuse is decreased.
>>> >>
>>> >>     Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>>> >
>>> > It would be better to have a commit description explaining why a change
>>> > is made, and why it is correct, not just repeating what can be seen from
>>> > the diff anyway.
>>>
>>> Sorry for being lazy here.
>>>
>>> >> diff --git a/hw/virtio.c b/hw/virtio.c
>>> >> index c8a0fc6..6688c02 100644
>>> >> --- a/hw/virtio.c
>>> >> +++ b/hw/virtio.c
>>> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
>>> >>      wmb();
>>> >>      trace_virtqueue_flush(vq, count);
>>> >>      vring_used_idx_increment(vq, count);
>>> >> +    vq->last_avail_idx += count;
>>> >>      vq->inuse -= count;
>>> >>  }
>>> >>
>>> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>>> >>      unsigned int i, head, max;
>>> >>      target_phys_addr_t desc_pa = vq->vring.desc;
>>> >>
>>> >> -    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
>>> >> +    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
>>> >>          return 0;
>>> >>
>>> >>      /* When we start there are none of either input nor output. */
>>> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>>> >>
>>> >>      max = vq->vring.num;
>>> >>
>>> >> -    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
>>> >> +    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
>>> >>
>>> >>      if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
>>> >>          if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
>>> >>
>>> >
>>> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
>>>
>>> I think there are two problems.
>>>
>>> 1. When to update last_avail_idx.
>>> 2. The ordering issue you're mentioning below.
>>>
>>> The patch above is only trying to address 1 because last time you
>>> mentioned that modifying last_avail_idx upon save may break the
>>> guest, which I agree.  If virtio_queue_empty and
>>> virtqueue_avail_bytes are only used internally, meaning invisible
>>> to the guest, I guess the approach above can be applied too.
>>
>> So IMHO 2 is the real issue. This is what was problematic
>> with the save patch, otherwise of course changes in save
>> are better than changes all over the codebase.
>
> All right.  Then let's focus on 2 first.
>
>>> > Previous patch version sure looked simpler, and this seems functionally
>>> > equivalent, so my question still stands: here it is rephrased in a
>>> > different way:
>>> >
>>> >        assume that we have in avail ring 2 requests at start of ring: A and B in this order
>>> >
>>> >        host pops A, then B, then completes B and flushes
>>> >
>>> >        now with this patch last_avail_idx will be 1, and then
>>> >        remote will get it, it will execute B again. As a result
>>> >        B will complete twice, and apparently A will never complete.
>>> >
>>> >
>>> > This is what I was saying below: assuming that there are
>>> > outstanding requests when we migrate, there is no way
>>> > a single index can be enough to figure out which requests
>>> > need to be handled and which are in flight already.
>>> >
>>> > We must add some kind of bitmask to tell us which is which.
>>>
>>> I should understand why this inversion can happen before solving
>>> the issue.
>>
>> It's a fundamental thing in virtio.
>> I think it is currently only likely to happen with block, I think tap
>> currently completes things in order.  In any case relying on this in the
>> frontend is a mistake.
>>
>>>  Currently, how are you making virio-net to flush
>>> every requests for live migration?  Is it qemu_aio_flush()?
>>
>> Think so.
>
> If qemu_aio_flush() is responsible for flushing the outstanding
> virtio-net requests, I'm wondering why it's a problem for Kemari.
> As I described in the previous message, Kemari queues the
> requests first.  So in you example above, it should start with
>
> virtio-net: last_avai_idx 0 inuse 2
> event-tap: {A,B}
>
> As you know, the requests are still in order still because net
> layer initiates in order.  Not about completing.
>
> In the first synchronization, the status above is transferred.  In
> the next synchronization, the status will be as following.
>
> virtio-net: last_avai_idx 1 inuse 1
> event-tap: {B}
>
> Why? Because Kemari flushes the first virtio-net request using
> qemu_aio_flush() before each synchronization.  If
> qemu_aio_flush() doesn't guarantee the order, what you pointed
> should be problematic.  So in the final synchronization, the
> state should be,
>
> virtio-net: last_avai_idx 2 inuse 0
> event-tap: {}
>
> where A,B were completed in order.
>
> Yoshi

Hi Michael,

Please let me know if the discussion has gone wrong or my
explanation was incorrect.  I believe live migration shouldn't be
a problem either because it would flush until inuse gets 0.

Yoshi

>
>
>>
>>> >
>>> >> >
>>> >> >>
>>> >> >>>  I'm wondering why
>>> >> >>> last_avail_idx is OK to send but not inuse.
>>> >> >>
>>> >> >> last_avail_idx is at some level a mistake, it exposes part of
>>> >> >> our internal implementation, but it does *also* express
>>> >> >> a guest observable state.
>>> >> >>
>>> >> >> Here's the problem that it solves: just looking at the rings in virtio
>>> >> >> there is no way to detect that a specific request has already been
>>> >> >> completed. And the protocol forbids completing the same request twice.
>>> >> >>
>>> >> >> Our implementation always starts processing the requests
>>> >> >> in order, and since we flush outstanding requests
>>> >> >> before save, it works to just tell the remote 'process only requests
>>> >> >> after this place'.
>>> >> >>
>>> >> >> But there's no such requirement in the virtio protocol,
>>> >> >> so to be really generic we could add a bitmask of valid avail
>>> >> >> ring entries that did not complete yet. This would be
>>> >> >> the exact representation of the guest observable state.
>>> >> >> In practice we have rings of up to 512 entries.
>>> >> >> That's 64 byte per ring, not a lot at all.
>>> >> >>
>>> >> >> However, if we ever do change the protocol to send the bitmask,
>>> >> >> we would need some code to resubmit requests
>>> >> >> out of order, so it's not trivial.
>>> >> >>
>>> >> >> Another minor mistake with last_avail_idx is that it has
>>> >> >> some redundancy: the high bits in the index
>>> >> >> (> vq size) are not necessary as they can be
>>> >> >> got from avail idx.  There's a consistency check
>>> >> >> in load but we really should try to use formats
>>> >> >> that are always consistent.
>>> >> >>
>>> >> >>> The following patch does the same thing as original, yet
>>> >> >>> keeps the format of the virtio.  It shouldn't break live
>>> >> >>> migration either because inuse should be 0.
>>> >> >>>
>>> >> >>> Yoshi
>>> >> >>
>>> >> >> Question is, can you flush to make inuse 0 in kemari too?
>>> >> >> And if not, how do you handle the fact that some requests
>>> >> >> are in flight on the primary?
>>> >> >
>>> >> > Although we try flushing requests one by one making inuse 0,
>>> >> > there are cases when it failovers to the secondary when inuse
>>> >> > isn't 0.  We handle these in flight request on the primary by
>>> >> > replaying on the secondary.
>>> >> >
>>> >> >>
>>> >> >>> diff --git a/hw/virtio.c b/hw/virtio.c
>>> >> >>> index c8a0fc6..875c7ca 100644
>>> >> >>> --- a/hw/virtio.c
>>> >> >>> +++ b/hw/virtio.c
>>> >> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>>> >> >>>      qemu_put_be32(f, i);
>>> >> >>>
>>> >> >>>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>>> >> >>> +        uint16_t last_avail_idx;
>>> >> >>> +
>>> >> >>>          if (vdev->vq[i].vring.num == 0)
>>> >> >>>              break;
>>> >> >>>
>>> >> >>> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
>>> >> >>> +
>>> >> >>>          qemu_put_be32(f, vdev->vq[i].vring.num);
>>> >> >>>          qemu_put_be64(f, vdev->vq[i].pa);
>>> >> >>> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>>> >> >>> +        qemu_put_be16s(f, &last_avail_idx);
>>> >> >>>          if (vdev->binding->save_queue)
>>> >> >>>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>>> >> >>>      }
>>> >> >>>
>>> >> >>>
>>> >> >>
>>> >> >> This looks wrong to me.  Requests can complete in any order, can they
>>> >> >> not?  So if request 0 did not complete and request 1 did not,
>>> >> >> you send avail - inuse and on the secondary you will process and
>>> >> >> complete request 1 the second time, crashing the guest.
>>> >> >
>>> >> > In case of Kemari, no.  We sit between devices and net/block, and
>>> >> > queue the requests.  After completing each transaction, we flush
>>> >> > the requests one by one.  So there won't be completion inversion,
>>> >> > and therefore won't be visible to the guest.
>>> >> >
>>> >> > Yoshi
>>> >> >
>>> >> >>
>>> >> >>>
>>> >> >>> >
>>> >> >>> >> >
>>> >> >>> >> >> ---
>>> >> >>> >> >>  hw/virtio.c |    8 +++++++-
>>> >> >>> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
>>> >> >>> >> >>
>>> >> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>>> >> >>> >> >> index 849a60f..5509644 100644
>>> >> >>> >> >> --- a/hw/virtio.c
>>> >> >>> >> >> +++ b/hw/virtio.c
>>> >> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
>>> >> >>> >> >>      VRing vring;
>>> >> >>> >> >>      target_phys_addr_t pa;
>>> >> >>> >> >>      uint16_t last_avail_idx;
>>> >> >>> >> >> -    int inuse;
>>> >> >>> >> >> +    uint16_t inuse;
>>> >> >>> >> >>      uint16_t vector;
>>> >> >>> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>>> >> >>> >> >>      VirtIODevice *vdev;
>>> >> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>>> >> >>> >> >>          qemu_put_be32(f, vdev->vq[i].vring.num);
>>> >> >>> >> >>          qemu_put_be64(f, vdev->vq[i].pa);
>>> >> >>> >> >>          qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>>> >> >>> >> >> +        qemu_put_be16s(f, &vdev->vq[i].inuse);
>>> >> >>> >> >>          if (vdev->binding->save_queue)
>>> >> >>> >> >>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>>> >> >>> >> >>      }
>>> >> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>>> >> >>> >> >>          vdev->vq[i].vring.num = qemu_get_be32(f);
>>> >> >>> >> >>          vdev->vq[i].pa = qemu_get_be64(f);
>>> >> >>> >> >>          qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>>> >> >>> >> >> +        qemu_get_be16s(f, &vdev->vq[i].inuse);
>>> >> >>> >> >> +
>>> >> >>> >> >> +        /* revert last_avail_idx if there are outstanding emulation. */
>>> >> >>> >> >> +        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>>> >> >>> >> >> +        vdev->vq[i].inuse = 0;
>>> >> >>> >> >>
>>> >> >>> >> >>          if (vdev->vq[i].pa) {
>>> >> >>> >> >>              virtqueue_init(&vdev->vq[i]);
>>> >> >>> >> >> --
>>> >> >>> >> >> 1.7.1.2
>>> >> >>> >> >>
>>> >> >>> >> >> --
>>> >> >>> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
>>> >> >>> >> >> the body of a message to majordomo@vger.kernel.org
>>> >> >>> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >> >>> >> > --
>>> >> >>> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>>> >> >>> >> > the body of a message to majordomo@vger.kernel.org
>>> >> >>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >> >>> >> >
>>> >> >>> > --
>>> >> >>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>>> >> >>> > the body of a message to majordomo@vger.kernel.org
>>> >> >>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >> >>> >
>>> >> >> --
>>> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
>>> >> >> the body of a message to majordomo@vger.kernel.org
>>> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >> >>
>>> >> >
>>> > --
>>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>>> > the body of a message to majordomo@vger.kernel.org
>>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >
>> --
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
Michael S. Tsirkin - Dec. 24, 2010, 9:27 a.m.
On Fri, Dec 17, 2010 at 12:59:58AM +0900, Yoshiaki Tamura wrote:
> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> > On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> >> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
> >> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> >> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
> >> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
> >> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> >> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
> >> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
> >> >> >>> >> >>
> >> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >>> >> >
> >> >> >>> >> > This changes migration format, so it will break compatibility with
> >> >> >>> >> > existing drivers. More generally, I think migrating internal
> >> >> >>> >> > state that is not guest visible is always a mistake
> >> >> >>> >> > as it ties migration format to an internal implementation
> >> >> >>> >> > (yes, I know we do this sometimes, but we should at least
> >> >> >>> >> > try not to add such cases).  I think the right thing to do in this case
> >> >> >>> >> > is to flush outstanding
> >> >> >>> >> > work when vm is stopped.  Then, we are guaranteed that inuse is 0.
> >> >> >>> >> > I sent patches that do this for virtio net and block.
> >> >> >>> >>
> >> >> >>> >> Could you give me the link of your patches?  I'd like to test
> >> >> >>> >> whether they work with Kemari upon failover.  If they do, I'm
> >> >> >>> >> happy to drop this patch.
> >> >> >>> >>
> >> >> >>> >> Yoshi
> >> >> >>> >
> >> >> >>> > Look for this:
> >> >> >>> > stable migration image on a stopped vm
> >> >> >>> > sent on:
> >> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
> >> >> >>>
> >> >> >>> Thanks for the info.
> >> >> >>>
> >> >> >>> However, The patch series above didn't solve the issue.  In
> >> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
> >> >> >>> output, and while last_avail_idx gets incremented
> >> >> >>> immediately, not sending inuse makes the state inconsistent
> >> >> >>> between Primary and Secondary.
> >> >> >>
> >> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
> >> >> >
> >> >> > I think we can calculate or prepare an internal last_avail_idx,
> >> >> > and update the external when inuse is decremented.  I'll try
> >> >> > whether it work w/ w/o Kemari.
> >> >>
> >> >> Hi Michael,
> >> >>
> >> >> Could you please take a look at the following patch?
> >> >
> >> > Which version is this against?
> >>
> >> Oops.  It should be very old.
> >> 67f895bfe69f323b427b284430b6219c8a62e8d4
> >>
> >> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
> >> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> Date:   Thu Dec 16 14:50:54 2010 +0900
> >> >>
> >> >>     virtio: update last_avail_idx when inuse is decreased.
> >> >>
> >> >>     Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >
> >> > It would be better to have a commit description explaining why a change
> >> > is made, and why it is correct, not just repeating what can be seen from
> >> > the diff anyway.
> >>
> >> Sorry for being lazy here.
> >>
> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> index c8a0fc6..6688c02 100644
> >> >> --- a/hw/virtio.c
> >> >> +++ b/hw/virtio.c
> >> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
> >> >>      wmb();
> >> >>      trace_virtqueue_flush(vq, count);
> >> >>      vring_used_idx_increment(vq, count);
> >> >> +    vq->last_avail_idx += count;
> >> >>      vq->inuse -= count;
> >> >>  }
> >> >>
> >> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >>      unsigned int i, head, max;
> >> >>      target_phys_addr_t desc_pa = vq->vring.desc;
> >> >>
> >> >> -    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
> >> >> +    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
> >> >>          return 0;
> >> >>
> >> >>      /* When we start there are none of either input nor output. */
> >> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >>
> >> >>      max = vq->vring.num;
> >> >>
> >> >> -    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
> >> >> +    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
> >> >>
> >> >>      if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
> >> >>          if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
> >> >>
> >> >
> >> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
> >>
> >> I think there are two problems.
> >>
> >> 1. When to update last_avail_idx.
> >> 2. The ordering issue you're mentioning below.
> >>
> >> The patch above is only trying to address 1 because last time you
> >> mentioned that modifying last_avail_idx upon save may break the
> >> guest, which I agree.  If virtio_queue_empty and
> >> virtqueue_avail_bytes are only used internally, meaning invisible
> >> to the guest, I guess the approach above can be applied too.
> >
> > So IMHO 2 is the real issue. This is what was problematic
> > with the save patch, otherwise of course changes in save
> > are better than changes all over the codebase.
> 
> All right.  Then let's focus on 2 first.
> 
> >> > Previous patch version sure looked simpler, and this seems functionally
> >> > equivalent, so my question still stands: here it is rephrased in a
> >> > different way:
> >> >
> >> >        assume that we have in avail ring 2 requests at start of ring: A and B in this order
> >> >
> >> >        host pops A, then B, then completes B and flushes
> >> >
> >> >        now with this patch last_avail_idx will be 1, and then
> >> >        remote will get it, it will execute B again. As a result
> >> >        B will complete twice, and apparently A will never complete.
> >> >
> >> >
> >> > This is what I was saying below: assuming that there are
> >> > outstanding requests when we migrate, there is no way
> >> > a single index can be enough to figure out which requests
> >> > need to be handled and which are in flight already.
> >> >
> >> > We must add some kind of bitmask to tell us which is which.
> >>
> >> I should understand why this inversion can happen before solving
> >> the issue.
> >
> > It's a fundamental thing in virtio.
> > I think it is currently only likely to happen with block, I think tap
> > currently completes things in order.  In any case relying on this in the
> > frontend is a mistake.
> >
> >>  Currently, how are you making virio-net to flush
> >> every requests for live migration?  Is it qemu_aio_flush()?
> >
> > Think so.
> 
> If qemu_aio_flush() is responsible for flushing the outstanding
> virtio-net requests, I'm wondering why it's a problem for Kemari.
> As I described in the previous message, Kemari queues the
> requests first.  So in you example above, it should start with
> 
> virtio-net: last_avai_idx 0 inuse 2
> event-tap: {A,B}
> 
> As you know, the requests are still in order still because net
> layer initiates in order.  Not about completing.
> 
> In the first synchronization, the status above is transferred.  In
> the next synchronization, the status will be as following.
> 
> virtio-net: last_avai_idx 1 inuse 1
> event-tap: {B}

OK, this answers the ordering question.

Another question: at this point we transfer this status: both
event-tap and virtio ring have the command B,
so the remote will have:

virtio-net: inuse 0
event-tap: {B}

Is this right? This already seems to be a problem as when B completes
inuse will go negative?

Next it seems that the remote virtio will resubmit B to event-tap. The
remote will then have:

virtio-net: inuse 1
event-tap: {B, B}

This looks kind of wrong ... will two packets go out?


> Why? Because Kemari flushes the first virtio-net request using
> qemu_aio_flush() before each synchronization.  If
> qemu_aio_flush() doesn't guarantee the order, what you pointed
> should be problematic.  So in the final synchronization, the
> state should be,
> 
> virtio-net: last_avai_idx 2 inuse 0
> event-tap: {}
> 
> where A,B were completed in order.
> 
> Yoshi


It might be better to discuss block because that's where
requests can complete out of order.

So let me see if I understand:
- each command passed to event tap is queued by it,
  it is not passed directly to the backend
- later requests are passed to the backend,
  always in the same order that they were submitted
- each synchronization point flushes all requests
  passed to the backend so far
- each synchronization transfers all requests not passed to the backend,
  to the remote, and they are replayed there

Now to analyse this for correctness I am looking at the original patch
because it is smaller so easier to analyse and I think it is
functionally equivalent, correct me if I am wrong in this.

So the reason there's no out of order issue is this
(and might be a good thing to put in commit log
or a comment somewhere):


At point of save callback event tap has flushed commands
passed to the backend already. Thus at the point of
the save callback if a command has completed
all previous commands have been flushed and completed.


Therefore inuse is
in fact the # of requests passed to event tap but not yet
passed to the backend (for non-event tap case all commands are
passed to the backend immediately and because of this
inuse is 0) and these are the last inuse commands submitted.


Right?

Now a question:

When we pass last_used_index - inuse to the remote,
the remote virtio will resubmit the request.
Since request is also passed by event tap, we get
the request twice, why is this not a problem?


> >
> >> >
> >> >> >
> >> >> >>
> >> >> >>>  I'm wondering why
> >> >> >>> last_avail_idx is OK to send but not inuse.
> >> >> >>
> >> >> >> last_avail_idx is at some level a mistake, it exposes part of
> >> >> >> our internal implementation, but it does *also* express
> >> >> >> a guest observable state.
> >> >> >>
> >> >> >> Here's the problem that it solves: just looking at the rings in virtio
> >> >> >> there is no way to detect that a specific request has already been
> >> >> >> completed. And the protocol forbids completing the same request twice.
> >> >> >>
> >> >> >> Our implementation always starts processing the requests
> >> >> >> in order, and since we flush outstanding requests
> >> >> >> before save, it works to just tell the remote 'process only requests
> >> >> >> after this place'.
> >> >> >>
> >> >> >> But there's no such requirement in the virtio protocol,
> >> >> >> so to be really generic we could add a bitmask of valid avail
> >> >> >> ring entries that did not complete yet. This would be
> >> >> >> the exact representation of the guest observable state.
> >> >> >> In practice we have rings of up to 512 entries.
> >> >> >> That's 64 byte per ring, not a lot at all.
> >> >> >>
> >> >> >> However, if we ever do change the protocol to send the bitmask,
> >> >> >> we would need some code to resubmit requests
> >> >> >> out of order, so it's not trivial.
> >> >> >>
> >> >> >> Another minor mistake with last_avail_idx is that it has
> >> >> >> some redundancy: the high bits in the index
> >> >> >> (> vq size) are not necessary as they can be
> >> >> >> got from avail idx.  There's a consistency check
> >> >> >> in load but we really should try to use formats
> >> >> >> that are always consistent.
> >> >> >>
> >> >> >>> The following patch does the same thing as original, yet
> >> >> >>> keeps the format of the virtio.  It shouldn't break live
> >> >> >>> migration either because inuse should be 0.
> >> >> >>>
> >> >> >>> Yoshi
> >> >> >>
> >> >> >> Question is, can you flush to make inuse 0 in kemari too?
> >> >> >> And if not, how do you handle the fact that some requests
> >> >> >> are in flight on the primary?
> >> >> >
> >> >> > Although we try flushing requests one by one making inuse 0,
> >> >> > there are cases when it failovers to the secondary when inuse
> >> >> > isn't 0.  We handle these in flight request on the primary by
> >> >> > replaying on the secondary.
> >> >> >
> >> >> >>
> >> >> >>> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >>> index c8a0fc6..875c7ca 100644
> >> >> >>> --- a/hw/virtio.c
> >> >> >>> +++ b/hw/virtio.c
> >> >> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >> >>>      qemu_put_be32(f, i);
> >> >> >>>
> >> >> >>>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> >> >> >>> +        uint16_t last_avail_idx;
> >> >> >>> +
> >> >> >>>          if (vdev->vq[i].vring.num == 0)
> >> >> >>>              break;
> >> >> >>>
> >> >> >>> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
> >> >> >>> +
> >> >> >>>          qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >> >>>          qemu_put_be64(f, vdev->vq[i].pa);
> >> >> >>> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >>> +        qemu_put_be16s(f, &last_avail_idx);
> >> >> >>>          if (vdev->binding->save_queue)
> >> >> >>>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >> >>>      }
> >> >> >>>
> >> >> >>>
> >> >> >>
> >> >> >> This looks wrong to me.  Requests can complete in any order, can they
> >> >> >> not?  So if request 0 did not complete and request 1 did not,
> >> >> >> you send avail - inuse and on the secondary you will process and
> >> >> >> complete request 1 the second time, crashing the guest.
> >> >> >
> >> >> > In case of Kemari, no.  We sit between devices and net/block, and
> >> >> > queue the requests.  After completing each transaction, we flush
> >> >> > the requests one by one.  So there won't be completion inversion,
> >> >> > and therefore won't be visible to the guest.
> >> >> >
> >> >> > Yoshi
> >> >> >
> >> >> >>
> >> >> >>>
> >> >> >>> >
> >> >> >>> >> >
> >> >> >>> >> >> ---
> >> >> >>> >> >>  hw/virtio.c |    8 +++++++-
> >> >> >>> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
> >> >> >>> >> >>
> >> >> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >>> >> >> index 849a60f..5509644 100644
> >> >> >>> >> >> --- a/hw/virtio.c
> >> >> >>> >> >> +++ b/hw/virtio.c
> >> >> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
> >> >> >>> >> >>      VRing vring;
> >> >> >>> >> >>      target_phys_addr_t pa;
> >> >> >>> >> >>      uint16_t last_avail_idx;
> >> >> >>> >> >> -    int inuse;
> >> >> >>> >> >> +    uint16_t inuse;
> >> >> >>> >> >>      uint16_t vector;
> >> >> >>> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
> >> >> >>> >> >>      VirtIODevice *vdev;
> >> >> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >> >>> >> >>          qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >> >>> >> >>          qemu_put_be64(f, vdev->vq[i].pa);
> >> >> >>> >> >>          qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >>> >> >> +        qemu_put_be16s(f, &vdev->vq[i].inuse);
> >> >> >>> >> >>          if (vdev->binding->save_queue)
> >> >> >>> >> >>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >> >>> >> >>      }
> >> >> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
> >> >> >>> >> >>          vdev->vq[i].vring.num = qemu_get_be32(f);
> >> >> >>> >> >>          vdev->vq[i].pa = qemu_get_be64(f);
> >> >> >>> >> >>          qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >>> >> >> +        qemu_get_be16s(f, &vdev->vq[i].inuse);
> >> >> >>> >> >> +
> >> >> >>> >> >> +        /* revert last_avail_idx if there are outstanding emulation. */
> >> >> >>> >> >> +        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
> >> >> >>> >> >> +        vdev->vq[i].inuse = 0;
> >> >> >>> >> >>
> >> >> >>> >> >>          if (vdev->vq[i].pa) {
> >> >> >>> >> >>              virtqueue_init(&vdev->vq[i]);
> >> >> >>> >> >> --
> >> >> >>> >> >> 1.7.1.2
> >> >> >>> >> >>
> >> >> >>> >> >> --
> >> >> >>> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >> >>> >> >> the body of a message to majordomo@vger.kernel.org
> >> >> >>> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >>> >> > --
> >> >> >>> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >> >>> >> > the body of a message to majordomo@vger.kernel.org
> >> >> >>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >>> >> >
> >> >> >>> > --
> >> >> >>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >> >>> > the body of a message to majordomo@vger.kernel.org
> >> >> >>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >>> >
> >> >> >> --
> >> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >> >> the body of a message to majordomo@vger.kernel.org
> >> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >>
> >> >> >
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> > the body of a message to majordomo@vger.kernel.org
> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
Yoshiaki Tamura - Dec. 24, 2010, 11:42 a.m.
2010/12/24 Michael S. Tsirkin <mst@redhat.com>:
> On Fri, Dec 17, 2010 at 12:59:58AM +0900, Yoshiaki Tamura wrote:
>> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
>> > On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
>> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
>> >> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
>> >> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
>> >> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
>> >> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>> >> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>> >> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
>> >> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
>> >> >> >>> >> >>
>> >> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >> >>> >> >
>> >> >> >>> >> > This changes migration format, so it will break compatibility with
>> >> >> >>> >> > existing drivers. More generally, I think migrating internal
>> >> >> >>> >> > state that is not guest visible is always a mistake
>> >> >> >>> >> > as it ties migration format to an internal implementation
>> >> >> >>> >> > (yes, I know we do this sometimes, but we should at least
>> >> >> >>> >> > try not to add such cases).  I think the right thing to do in this case
>> >> >> >>> >> > is to flush outstanding
>> >> >> >>> >> > work when vm is stopped.  Then, we are guaranteed that inuse is 0.
>> >> >> >>> >> > I sent patches that do this for virtio net and block.
>> >> >> >>> >>
>> >> >> >>> >> Could you give me the link of your patches?  I'd like to test
>> >> >> >>> >> whether they work with Kemari upon failover.  If they do, I'm
>> >> >> >>> >> happy to drop this patch.
>> >> >> >>> >>
>> >> >> >>> >> Yoshi
>> >> >> >>> >
>> >> >> >>> > Look for this:
>> >> >> >>> > stable migration image on a stopped vm
>> >> >> >>> > sent on:
>> >> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
>> >> >> >>>
>> >> >> >>> Thanks for the info.
>> >> >> >>>
>> >> >> >>> However, The patch series above didn't solve the issue.  In
>> >> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
>> >> >> >>> output, and while last_avail_idx gets incremented
>> >> >> >>> immediately, not sending inuse makes the state inconsistent
>> >> >> >>> between Primary and Secondary.
>> >> >> >>
>> >> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
>> >> >> >
>> >> >> > I think we can calculate or prepare an internal last_avail_idx,
>> >> >> > and update the external when inuse is decremented.  I'll try
>> >> >> > whether it work w/ w/o Kemari.
>> >> >>
>> >> >> Hi Michael,
>> >> >>
>> >> >> Could you please take a look at the following patch?
>> >> >
>> >> > Which version is this against?
>> >>
>> >> Oops.  It should be very old.
>> >> 67f895bfe69f323b427b284430b6219c8a62e8d4
>> >>
>> >> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
>> >> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >> Date:   Thu Dec 16 14:50:54 2010 +0900
>> >> >>
>> >> >>     virtio: update last_avail_idx when inuse is decreased.
>> >> >>
>> >> >>     Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >
>> >> > It would be better to have a commit description explaining why a change
>> >> > is made, and why it is correct, not just repeating what can be seen from
>> >> > the diff anyway.
>> >>
>> >> Sorry for being lazy here.
>> >>
>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >> index c8a0fc6..6688c02 100644
>> >> >> --- a/hw/virtio.c
>> >> >> +++ b/hw/virtio.c
>> >> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
>> >> >>      wmb();
>> >> >>      trace_virtqueue_flush(vq, count);
>> >> >>      vring_used_idx_increment(vq, count);
>> >> >> +    vq->last_avail_idx += count;
>> >> >>      vq->inuse -= count;
>> >> >>  }
>> >> >>
>> >> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>> >> >>      unsigned int i, head, max;
>> >> >>      target_phys_addr_t desc_pa = vq->vring.desc;
>> >> >>
>> >> >> -    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
>> >> >> +    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
>> >> >>          return 0;
>> >> >>
>> >> >>      /* When we start there are none of either input nor output. */
>> >> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>> >> >>
>> >> >>      max = vq->vring.num;
>> >> >>
>> >> >> -    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
>> >> >> +    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
>> >> >>
>> >> >>      if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
>> >> >>          if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
>> >> >>
>> >> >
>> >> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
>> >>
>> >> I think there are two problems.
>> >>
>> >> 1. When to update last_avail_idx.
>> >> 2. The ordering issue you're mentioning below.
>> >>
>> >> The patch above is only trying to address 1 because last time you
>> >> mentioned that modifying last_avail_idx upon save may break the
>> >> guest, which I agree.  If virtio_queue_empty and
>> >> virtqueue_avail_bytes are only used internally, meaning invisible
>> >> to the guest, I guess the approach above can be applied too.
>> >
>> > So IMHO 2 is the real issue. This is what was problematic
>> > with the save patch, otherwise of course changes in save
>> > are better than changes all over the codebase.
>>
>> All right.  Then let's focus on 2 first.
>>
>> >> > Previous patch version sure looked simpler, and this seems functionally
>> >> > equivalent, so my question still stands: here it is rephrased in a
>> >> > different way:
>> >> >
>> >> >        assume that we have in avail ring 2 requests at start of ring: A and B in this order
>> >> >
>> >> >        host pops A, then B, then completes B and flushes
>> >> >
>> >> >        now with this patch last_avail_idx will be 1, and then
>> >> >        remote will get it, it will execute B again. As a result
>> >> >        B will complete twice, and apparently A will never complete.
>> >> >
>> >> >
>> >> > This is what I was saying below: assuming that there are
>> >> > outstanding requests when we migrate, there is no way
>> >> > a single index can be enough to figure out which requests
>> >> > need to be handled and which are in flight already.
>> >> >
>> >> > We must add some kind of bitmask to tell us which is which.
>> >>
>> >> I should understand why this inversion can happen before solving
>> >> the issue.
>> >
>> > It's a fundamental thing in virtio.
>> > I think it is currently only likely to happen with block, I think tap
>> > currently completes things in order.  In any case relying on this in the
>> > frontend is a mistake.
>> >
>> >>  Currently, how are you making virio-net to flush
>> >> every requests for live migration?  Is it qemu_aio_flush()?
>> >
>> > Think so.
>>
>> If qemu_aio_flush() is responsible for flushing the outstanding
>> virtio-net requests, I'm wondering why it's a problem for Kemari.
>> As I described in the previous message, Kemari queues the
>> requests first.  So in you example above, it should start with
>>
>> virtio-net: last_avai_idx 0 inuse 2
>> event-tap: {A,B}
>>
>> As you know, the requests are still in order still because net
>> layer initiates in order.  Not about completing.
>>
>> In the first synchronization, the status above is transferred.  In
>> the next synchronization, the status will be as following.
>>
>> virtio-net: last_avai_idx 1 inuse 1
>> event-tap: {B}
>
> OK, this answers the ordering question.

Glad to hear that!

> Another question: at this point we transfer this status: both
> event-tap and virtio ring have the command B,
> so the remote will have:
>
> virtio-net: inuse 0
> event-tap: {B}
>
> Is this right? This already seems to be a problem as when B completes
> inuse will go negative?

I think state above is wrong.  inuse 0 means there shouldn't be
any requests in event-tap.  Note that the callback is called only
when event-tap flushes the requests.

> Next it seems that the remote virtio will resubmit B to event-tap. The
> remote will then have:
>
> virtio-net: inuse 1
> event-tap: {B, B}
>
> This looks kind of wrong ... will two packets go out?

No.  Currently, we're just replaying the requests with pio/mmio.
In the situation above, it should be,

virtio-net: inuse 1
event-tap: {B}

>> Why? Because Kemari flushes the first virtio-net request using
>> qemu_aio_flush() before each synchronization.  If
>> qemu_aio_flush() doesn't guarantee the order, what you pointed
>> should be problematic.  So in the final synchronization, the
>> state should be,
>>
>> virtio-net: last_avai_idx 2 inuse 0
>> event-tap: {}
>>
>> where A,B were completed in order.
>>
>> Yoshi
>
>
> It might be better to discuss block because that's where
> requests can complete out of order.

It's same as net.  We queue requests and call bdrv_flush per
sending requests to the block.  So there shouldn't be any
inversion.

> So let me see if I understand:
> - each command passed to event tap is queued by it,
>  it is not passed directly to the backend
> - later requests are passed to the backend,
>  always in the same order that they were submitted
> - each synchronization point flushes all requests
>  passed to the backend so far
> - each synchronization transfers all requests not passed to the backend,
>  to the remote, and they are replayed there

Correct.

> Now to analyse this for correctness I am looking at the original patch
> because it is smaller so easier to analyse and I think it is
> functionally equivalent, correct me if I am wrong in this.

So you think decreasing last_avail_idx upon save is better than
updating it in the callback?

> So the reason there's no out of order issue is this
> (and might be a good thing to put in commit log
> or a comment somewhere):

I've done some in the latest patch.  Please point it out if it
wasn't enough.

> At point of save callback event tap has flushed commands
> passed to the backend already. Thus at the point of
> the save callback if a command has completed
> all previous commands have been flushed and completed.
>
>
> Therefore inuse is
> in fact the # of requests passed to event tap but not yet
> passed to the backend (for non-event tap case all commands are
> passed to the backend immediately and because of this
> inuse is 0) and these are the last inuse commands submitted.
>
>
> Right?

Yep.

> Now a question:
>
> When we pass last_used_index - inuse to the remote,
> the remote virtio will resubmit the request.
> Since request is also passed by event tap, we get
> the request twice, why is this not a problem?

It's not a problem because event-tap currently replays with
pio/mmio only, as I mentioned above.  Although event-tap receives
information about the queued requests, it won't pass it to the
backend.  The reason is the problem in setting the callbacks
which are specific to devices on the secondary.  These are
pointers, and even worse, are usually static functions, which
event-tap has no way to restore it upon failover.  I do want to
change event-tap replay to be this way in the future, pio/mmio
replay is implemented for now.

Thanks,

Yoshi

>
>
>> >
>> >> >
>> >> >> >
>> >> >> >>
>> >> >> >>>  I'm wondering why
>> >> >> >>> last_avail_idx is OK to send but not inuse.
>> >> >> >>
>> >> >> >> last_avail_idx is at some level a mistake, it exposes part of
>> >> >> >> our internal implementation, but it does *also* express
>> >> >> >> a guest observable state.
>> >> >> >>
>> >> >> >> Here's the problem that it solves: just looking at the rings in virtio
>> >> >> >> there is no way to detect that a specific request has already been
>> >> >> >> completed. And the protocol forbids completing the same request twice.
>> >> >> >>
>> >> >> >> Our implementation always starts processing the requests
>> >> >> >> in order, and since we flush outstanding requests
>> >> >> >> before save, it works to just tell the remote 'process only requests
>> >> >> >> after this place'.
>> >> >> >>
>> >> >> >> But there's no such requirement in the virtio protocol,
>> >> >> >> so to be really generic we could add a bitmask of valid avail
>> >> >> >> ring entries that did not complete yet. This would be
>> >> >> >> the exact representation of the guest observable state.
>> >> >> >> In practice we have rings of up to 512 entries.
>> >> >> >> That's 64 byte per ring, not a lot at all.
>> >> >> >>
>> >> >> >> However, if we ever do change the protocol to send the bitmask,
>> >> >> >> we would need some code to resubmit requests
>> >> >> >> out of order, so it's not trivial.
>> >> >> >>
>> >> >> >> Another minor mistake with last_avail_idx is that it has
>> >> >> >> some redundancy: the high bits in the index
>> >> >> >> (> vq size) are not necessary as they can be
>> >> >> >> got from avail idx.  There's a consistency check
>> >> >> >> in load but we really should try to use formats
>> >> >> >> that are always consistent.
>> >> >> >>
>> >> >> >>> The following patch does the same thing as original, yet
>> >> >> >>> keeps the format of the virtio.  It shouldn't break live
>> >> >> >>> migration either because inuse should be 0.
>> >> >> >>>
>> >> >> >>> Yoshi
>> >> >> >>
>> >> >> >> Question is, can you flush to make inuse 0 in kemari too?
>> >> >> >> And if not, how do you handle the fact that some requests
>> >> >> >> are in flight on the primary?
>> >> >> >
>> >> >> > Although we try flushing requests one by one making inuse 0,
>> >> >> > there are cases when it failovers to the secondary when inuse
>> >> >> > isn't 0.  We handle these in flight request on the primary by
>> >> >> > replaying on the secondary.
>> >> >> >
>> >> >> >>
>> >> >> >>> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >> >>> index c8a0fc6..875c7ca 100644
>> >> >> >>> --- a/hw/virtio.c
>> >> >> >>> +++ b/hw/virtio.c
>> >> >> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >> >> >>>      qemu_put_be32(f, i);
>> >> >> >>>
>> >> >> >>>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>> >> >> >>> +        uint16_t last_avail_idx;
>> >> >> >>> +
>> >> >> >>>          if (vdev->vq[i].vring.num == 0)
>> >> >> >>>              break;
>> >> >> >>>
>> >> >> >>> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
>> >> >> >>> +
>> >> >> >>>          qemu_put_be32(f, vdev->vq[i].vring.num);
>> >> >> >>>          qemu_put_be64(f, vdev->vq[i].pa);
>> >> >> >>> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >> >>> +        qemu_put_be16s(f, &last_avail_idx);
>> >> >> >>>          if (vdev->binding->save_queue)
>> >> >> >>>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >> >> >>>      }
>> >> >> >>>
>> >> >> >>>
>> >> >> >>
>> >> >> >> This looks wrong to me.  Requests can complete in any order, can they
>> >> >> >> not?  So if request 0 did not complete and request 1 did not,
>> >> >> >> you send avail - inuse and on the secondary you will process and
>> >> >> >> complete request 1 the second time, crashing the guest.
>> >> >> >
>> >> >> > In case of Kemari, no.  We sit between devices and net/block, and
>> >> >> > queue the requests.  After completing each transaction, we flush
>> >> >> > the requests one by one.  So there won't be completion inversion,
>> >> >> > and therefore won't be visible to the guest.
>> >> >> >
>> >> >> > Yoshi
>> >> >> >
>> >> >> >>
>> >> >> >>>
>> >> >> >>> >
>> >> >> >>> >> >
>> >> >> >>> >> >> ---
>> >> >> >>> >> >>  hw/virtio.c |    8 +++++++-
>> >> >> >>> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
>> >> >> >>> >> >>
>> >> >> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >> >>> >> >> index 849a60f..5509644 100644
>> >> >> >>> >> >> --- a/hw/virtio.c
>> >> >> >>> >> >> +++ b/hw/virtio.c
>> >> >> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
>> >> >> >>> >> >>      VRing vring;
>> >> >> >>> >> >>      target_phys_addr_t pa;
>> >> >> >>> >> >>      uint16_t last_avail_idx;
>> >> >> >>> >> >> -    int inuse;
>> >> >> >>> >> >> +    uint16_t inuse;
>> >> >> >>> >> >>      uint16_t vector;
>> >> >> >>> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>> >> >> >>> >> >>      VirtIODevice *vdev;
>> >> >> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >> >> >>> >> >>          qemu_put_be32(f, vdev->vq[i].vring.num);
>> >> >> >>> >> >>          qemu_put_be64(f, vdev->vq[i].pa);
>> >> >> >>> >> >>          qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >> >>> >> >> +        qemu_put_be16s(f, &vdev->vq[i].inuse);
>> >> >> >>> >> >>          if (vdev->binding->save_queue)
>> >> >> >>> >> >>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >> >> >>> >> >>      }
>> >> >> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>> >> >> >>> >> >>          vdev->vq[i].vring.num = qemu_get_be32(f);
>> >> >> >>> >> >>          vdev->vq[i].pa = qemu_get_be64(f);
>> >> >> >>> >> >>          qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >> >>> >> >> +        qemu_get_be16s(f, &vdev->vq[i].inuse);
>> >> >> >>> >> >> +
>> >> >> >>> >> >> +        /* revert last_avail_idx if there are outstanding emulation. */
>> >> >> >>> >> >> +        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>> >> >> >>> >> >> +        vdev->vq[i].inuse = 0;
>> >> >> >>> >> >>
>> >> >> >>> >> >>          if (vdev->vq[i].pa) {
>> >> >> >>> >> >>              virtqueue_init(&vdev->vq[i]);
>> >> >> >>> >> >> --
>> >> >> >>> >> >> 1.7.1.2
>> >> >> >>> >> >>
>> >> >> >>> >> >> --
>> >> >> >>> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >> >> >>> >> >> the body of a message to majordomo@vger.kernel.org
>> >> >> >>> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >> >>> >> > --
>> >> >> >>> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >> >> >>> >> > the body of a message to majordomo@vger.kernel.org
>> >> >> >>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >> >>> >> >
>> >> >> >>> > --
>> >> >> >>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >> >> >>> > the body of a message to majordomo@vger.kernel.org
>> >> >> >>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >> >>> >
>> >> >> >> --
>> >> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >> >> >> the body of a message to majordomo@vger.kernel.org
>> >> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >> >>
>> >> >> >
>> >> > --
>> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >> > the body of a message to majordomo@vger.kernel.org
>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Michael S. Tsirkin - Dec. 24, 2010, 1:21 p.m.
On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
> 2010/12/24 Michael S. Tsirkin <mst@redhat.com>:
> > On Fri, Dec 17, 2010 at 12:59:58AM +0900, Yoshiaki Tamura wrote:
> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> >> > On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
> >> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> >> >> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
> >> >> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> >> >> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
> >> >> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
> >> >> >> >>> >> >>
> >> >> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >> >>> >> >
> >> >> >> >>> >> > This changes migration format, so it will break compatibility with
> >> >> >> >>> >> > existing drivers. More generally, I think migrating internal
> >> >> >> >>> >> > state that is not guest visible is always a mistake
> >> >> >> >>> >> > as it ties migration format to an internal implementation
> >> >> >> >>> >> > (yes, I know we do this sometimes, but we should at least
> >> >> >> >>> >> > try not to add such cases).  I think the right thing to do in this case
> >> >> >> >>> >> > is to flush outstanding
> >> >> >> >>> >> > work when vm is stopped.  Then, we are guaranteed that inuse is 0.
> >> >> >> >>> >> > I sent patches that do this for virtio net and block.
> >> >> >> >>> >>
> >> >> >> >>> >> Could you give me the link of your patches?  I'd like to test
> >> >> >> >>> >> whether they work with Kemari upon failover.  If they do, I'm
> >> >> >> >>> >> happy to drop this patch.
> >> >> >> >>> >>
> >> >> >> >>> >> Yoshi
> >> >> >> >>> >
> >> >> >> >>> > Look for this:
> >> >> >> >>> > stable migration image on a stopped vm
> >> >> >> >>> > sent on:
> >> >> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
> >> >> >> >>>
> >> >> >> >>> Thanks for the info.
> >> >> >> >>>
> >> >> >> >>> However, The patch series above didn't solve the issue.  In
> >> >> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
> >> >> >> >>> output, and while last_avail_idx gets incremented
> >> >> >> >>> immediately, not sending inuse makes the state inconsistent
> >> >> >> >>> between Primary and Secondary.
> >> >> >> >>
> >> >> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
> >> >> >> >
> >> >> >> > I think we can calculate or prepare an internal last_avail_idx,
> >> >> >> > and update the external when inuse is decremented.  I'll try
> >> >> >> > whether it work w/ w/o Kemari.
> >> >> >>
> >> >> >> Hi Michael,
> >> >> >>
> >> >> >> Could you please take a look at the following patch?
> >> >> >
> >> >> > Which version is this against?
> >> >>
> >> >> Oops.  It should be very old.
> >> >> 67f895bfe69f323b427b284430b6219c8a62e8d4
> >> >>
> >> >> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
> >> >> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >> Date:   Thu Dec 16 14:50:54 2010 +0900
> >> >> >>
> >> >> >>     virtio: update last_avail_idx when inuse is decreased.
> >> >> >>
> >> >> >>     Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >
> >> >> > It would be better to have a commit description explaining why a change
> >> >> > is made, and why it is correct, not just repeating what can be seen from
> >> >> > the diff anyway.
> >> >>
> >> >> Sorry for being lazy here.
> >> >>
> >> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >> index c8a0fc6..6688c02 100644
> >> >> >> --- a/hw/virtio.c
> >> >> >> +++ b/hw/virtio.c
> >> >> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
> >> >> >>      wmb();
> >> >> >>      trace_virtqueue_flush(vq, count);
> >> >> >>      vring_used_idx_increment(vq, count);
> >> >> >> +    vq->last_avail_idx += count;
> >> >> >>      vq->inuse -= count;
> >> >> >>  }
> >> >> >>
> >> >> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >> >>      unsigned int i, head, max;
> >> >> >>      target_phys_addr_t desc_pa = vq->vring.desc;
> >> >> >>
> >> >> >> -    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
> >> >> >> +    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
> >> >> >>          return 0;
> >> >> >>
> >> >> >>      /* When we start there are none of either input nor output. */
> >> >> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >> >>
> >> >> >>      max = vq->vring.num;
> >> >> >>
> >> >> >> -    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
> >> >> >> +    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
> >> >> >>
> >> >> >>      if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
> >> >> >>          if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
> >> >> >>
> >> >> >
> >> >> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
> >> >>
> >> >> I think there are two problems.
> >> >>
> >> >> 1. When to update last_avail_idx.
> >> >> 2. The ordering issue you're mentioning below.
> >> >>
> >> >> The patch above is only trying to address 1 because last time you
> >> >> mentioned that modifying last_avail_idx upon save may break the
> >> >> guest, which I agree.  If virtio_queue_empty and
> >> >> virtqueue_avail_bytes are only used internally, meaning invisible
> >> >> to the guest, I guess the approach above can be applied too.
> >> >
> >> > So IMHO 2 is the real issue. This is what was problematic
> >> > with the save patch, otherwise of course changes in save
> >> > are better than changes all over the codebase.
> >>
> >> All right.  Then let's focus on 2 first.
> >>
> >> >> > Previous patch version sure looked simpler, and this seems functionally
> >> >> > equivalent, so my question still stands: here it is rephrased in a
> >> >> > different way:
> >> >> >
> >> >> >        assume that we have in avail ring 2 requests at start of ring: A and B in this order
> >> >> >
> >> >> >        host pops A, then B, then completes B and flushes
> >> >> >
> >> >> >        now with this patch last_avail_idx will be 1, and then
> >> >> >        remote will get it, it will execute B again. As a result
> >> >> >        B will complete twice, and apparently A will never complete.
> >> >> >
> >> >> >
> >> >> > This is what I was saying below: assuming that there are
> >> >> > outstanding requests when we migrate, there is no way
> >> >> > a single index can be enough to figure out which requests
> >> >> > need to be handled and which are in flight already.
> >> >> >
> >> >> > We must add some kind of bitmask to tell us which is which.
> >> >>
> >> >> I should understand why this inversion can happen before solving
> >> >> the issue.
> >> >
> >> > It's a fundamental thing in virtio.
> >> > I think it is currently only likely to happen with block, I think tap
> >> > currently completes things in order.  In any case relying on this in the
> >> > frontend is a mistake.
> >> >
> >> >>  Currently, how are you making virio-net to flush
> >> >> every requests for live migration?  Is it qemu_aio_flush()?
> >> >
> >> > Think so.
> >>
> >> If qemu_aio_flush() is responsible for flushing the outstanding
> >> virtio-net requests, I'm wondering why it's a problem for Kemari.
> >> As I described in the previous message, Kemari queues the
> >> requests first.  So in you example above, it should start with
> >>
> >> virtio-net: last_avai_idx 0 inuse 2
> >> event-tap: {A,B}
> >>
> >> As you know, the requests are still in order still because net
> >> layer initiates in order.  Not about completing.
> >>
> >> In the first synchronization, the status above is transferred.  In
> >> the next synchronization, the status will be as following.
> >>
> >> virtio-net: last_avai_idx 1 inuse 1
> >> event-tap: {B}
> >
> > OK, this answers the ordering question.
> 
> Glad to hear that!
> 
> > Another question: at this point we transfer this status: both
> > event-tap and virtio ring have the command B,
> > so the remote will have:
> >
> > virtio-net: inuse 0
> > event-tap: {B}
> >
> > Is this right? This already seems to be a problem as when B completes
> > inuse will go negative?
> 
> I think state above is wrong.  inuse 0 means there shouldn't be
> any requests in event-tap.  Note that the callback is called only
> when event-tap flushes the requests.
> 
> > Next it seems that the remote virtio will resubmit B to event-tap. The
> > remote will then have:
> >
> > virtio-net: inuse 1
> > event-tap: {B, B}
> >
> > This looks kind of wrong ... will two packets go out?
> 
> No.  Currently, we're just replaying the requests with pio/mmio.
> In the situation above, it should be,
> 
> virtio-net: inuse 1
> event-tap: {B}
> 
> >> Why? Because Kemari flushes the first virtio-net request using
> >> qemu_aio_flush() before each synchronization.  If
> >> qemu_aio_flush() doesn't guarantee the order, what you pointed
> >> should be problematic.  So in the final synchronization, the
> >> state should be,
> >>
> >> virtio-net: last_avai_idx 2 inuse 0
> >> event-tap: {}
> >>
> >> where A,B were completed in order.
> >>
> >> Yoshi
> >
> >
> > It might be better to discuss block because that's where
> > requests can complete out of order.
> 
> It's same as net.  We queue requests and call bdrv_flush per
> sending requests to the block.  So there shouldn't be any
> inversion.
> 
> > So let me see if I understand:
> > - each command passed to event tap is queued by it,
> >  it is not passed directly to the backend
> > - later requests are passed to the backend,
> >  always in the same order that they were submitted
> > - each synchronization point flushes all requests
> >  passed to the backend so far
> > - each synchronization transfers all requests not passed to the backend,
> >  to the remote, and they are replayed there
> 
> Correct.
> 
> > Now to analyse this for correctness I am looking at the original patch
> > because it is smaller so easier to analyse and I think it is
> > functionally equivalent, correct me if I am wrong in this.
> 
> So you think decreasing last_avail_idx upon save is better than
> updating it in the callback?

If this is correct, of the two equivalent approaches the one
that only touches save/load seems superiour.

> > So the reason there's no out of order issue is this
> > (and might be a good thing to put in commit log
> > or a comment somewhere):
> 
> I've done some in the latest patch.  Please point it out if it
> wasn't enough.
> 
> > At point of save callback event tap has flushed commands
> > passed to the backend already. Thus at the point of
> > the save callback if a command has completed
> > all previous commands have been flushed and completed.
> >
> >
> > Therefore inuse is
> > in fact the # of requests passed to event tap but not yet
> > passed to the backend (for non-event tap case all commands are
> > passed to the backend immediately and because of this
> > inuse is 0) and these are the last inuse commands submitted.
> >
> >
> > Right?
> 
> Yep.
> 
> > Now a question:
> >
> > When we pass last_used_index - inuse to the remote,
> > the remote virtio will resubmit the request.
> > Since request is also passed by event tap, we get
> > the request twice, why is this not a problem?
> 
> It's not a problem because event-tap currently replays with
> pio/mmio only, as I mentioned above.  Although event-tap receives
> information about the queued requests, it won't pass it to the
> backend.  The reason is the problem in setting the callbacks
> which are specific to devices on the secondary.  These are
> pointers, and even worse, are usually static functions, which
> event-tap has no way to restore it upon failover.  I do want to
> change event-tap replay to be this way in the future, pio/mmio
> replay is implemented for now.
> 
> Thanks,
> 
> Yoshi
> 
> >
> >
> >> >
> >> >> >
> >> >> >> >
> >> >> >> >>
> >> >> >> >>>  I'm wondering why
> >> >> >> >>> last_avail_idx is OK to send but not inuse.
> >> >> >> >>
> >> >> >> >> last_avail_idx is at some level a mistake, it exposes part of
> >> >> >> >> our internal implementation, but it does *also* express
> >> >> >> >> a guest observable state.
> >> >> >> >>
> >> >> >> >> Here's the problem that it solves: just looking at the rings in virtio
> >> >> >> >> there is no way to detect that a specific request has already been
> >> >> >> >> completed. And the protocol forbids completing the same request twice.
> >> >> >> >>
> >> >> >> >> Our implementation always starts processing the requests
> >> >> >> >> in order, and since we flush outstanding requests
> >> >> >> >> before save, it works to just tell the remote 'process only requests
> >> >> >> >> after this place'.
> >> >> >> >>
> >> >> >> >> But there's no such requirement in the virtio protocol,
> >> >> >> >> so to be really generic we could add a bitmask of valid avail
> >> >> >> >> ring entries that did not complete yet. This would be
> >> >> >> >> the exact representation of the guest observable state.
> >> >> >> >> In practice we have rings of up to 512 entries.
> >> >> >> >> That's 64 byte per ring, not a lot at all.
> >> >> >> >>
> >> >> >> >> However, if we ever do change the protocol to send the bitmask,
> >> >> >> >> we would need some code to resubmit requests
> >> >> >> >> out of order, so it's not trivial.
> >> >> >> >>
> >> >> >> >> Another minor mistake with last_avail_idx is that it has
> >> >> >> >> some redundancy: the high bits in the index
> >> >> >> >> (> vq size) are not necessary as they can be
> >> >> >> >> got from avail idx.  There's a consistency check
> >> >> >> >> in load but we really should try to use formats
> >> >> >> >> that are always consistent.
> >> >> >> >>
> >> >> >> >>> The following patch does the same thing as original, yet
> >> >> >> >>> keeps the format of the virtio.  It shouldn't break live
> >> >> >> >>> migration either because inuse should be 0.
> >> >> >> >>>
> >> >> >> >>> Yoshi
> >> >> >> >>
> >> >> >> >> Question is, can you flush to make inuse 0 in kemari too?
> >> >> >> >> And if not, how do you handle the fact that some requests
> >> >> >> >> are in flight on the primary?
> >> >> >> >
> >> >> >> > Although we try flushing requests one by one making inuse 0,
> >> >> >> > there are cases when it failovers to the secondary when inuse
> >> >> >> > isn't 0.  We handle these in flight request on the primary by
> >> >> >> > replaying on the secondary.
> >> >> >> >
> >> >> >> >>
> >> >> >> >>> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >> >>> index c8a0fc6..875c7ca 100644
> >> >> >> >>> --- a/hw/virtio.c
> >> >> >> >>> +++ b/hw/virtio.c
> >> >> >> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >> >> >>>      qemu_put_be32(f, i);
> >> >> >> >>>
> >> >> >> >>>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> >> >> >> >>> +        uint16_t last_avail_idx;
> >> >> >> >>> +
> >> >> >> >>>          if (vdev->vq[i].vring.num == 0)
> >> >> >> >>>              break;
> >> >> >> >>>
> >> >> >> >>> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
> >> >> >> >>> +
> >> >> >> >>>          qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >> >> >>>          qemu_put_be64(f, vdev->vq[i].pa);
> >> >> >> >>> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >> >>> +        qemu_put_be16s(f, &last_avail_idx);
> >> >> >> >>>          if (vdev->binding->save_queue)
> >> >> >> >>>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >> >> >>>      }
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>
> >> >> >> >> This looks wrong to me.  Requests can complete in any order, can they
> >> >> >> >> not?  So if request 0 did not complete and request 1 did not,
> >> >> >> >> you send avail - inuse and on the secondary you will process and
> >> >> >> >> complete request 1 the second time, crashing the guest.
> >> >> >> >
> >> >> >> > In case of Kemari, no.  We sit between devices and net/block, and
> >> >> >> > queue the requests.  After completing each transaction, we flush
> >> >> >> > the requests one by one.  So there won't be completion inversion,
> >> >> >> > and therefore won't be visible to the guest.
> >> >> >> >
> >> >> >> > Yoshi
> >> >> >> >
> >> >> >> >>
> >> >> >> >>>
> >> >> >> >>> >
> >> >> >> >>> >> >
> >> >> >> >>> >> >> ---
> >> >> >> >>> >> >>  hw/virtio.c |    8 +++++++-
> >> >> >> >>> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
> >> >> >> >>> >> >>
> >> >> >> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >> >>> >> >> index 849a60f..5509644 100644
> >> >> >> >>> >> >> --- a/hw/virtio.c
> >> >> >> >>> >> >> +++ b/hw/virtio.c
> >> >> >> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
> >> >> >> >>> >> >>      VRing vring;
> >> >> >> >>> >> >>      target_phys_addr_t pa;
> >> >> >> >>> >> >>      uint16_t last_avail_idx;
> >> >> >> >>> >> >> -    int inuse;
> >> >> >> >>> >> >> +    uint16_t inuse;
> >> >> >> >>> >> >>      uint16_t vector;
> >> >> >> >>> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
> >> >> >> >>> >> >>      VirtIODevice *vdev;
> >> >> >> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >> >> >>> >> >>          qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >> >> >>> >> >>          qemu_put_be64(f, vdev->vq[i].pa);
> >> >> >> >>> >> >>          qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >> >>> >> >> +        qemu_put_be16s(f, &vdev->vq[i].inuse);
> >> >> >> >>> >> >>          if (vdev->binding->save_queue)
> >> >> >> >>> >> >>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >> >> >>> >> >>      }
> >> >> >> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
> >> >> >> >>> >> >>          vdev->vq[i].vring.num = qemu_get_be32(f);
> >> >> >> >>> >> >>          vdev->vq[i].pa = qemu_get_be64(f);
> >> >> >> >>> >> >>          qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >> >>> >> >> +        qemu_get_be16s(f, &vdev->vq[i].inuse);
> >> >> >> >>> >> >> +
> >> >> >> >>> >> >> +        /* revert last_avail_idx if there are outstanding emulation. */
> >> >> >> >>> >> >> +        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
> >> >> >> >>> >> >> +        vdev->vq[i].inuse = 0;
> >> >> >> >>> >> >>
> >> >> >> >>> >> >>          if (vdev->vq[i].pa) {
> >> >> >> >>> >> >>              virtqueue_init(&vdev->vq[i]);
> >> >> >> >>> >> >> --
> >> >> >> >>> >> >> 1.7.1.2
> >> >> >> >>> >> >>
> >> >> >> >>> >> >> --
> >> >> >> >>> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >> >> >>> >> >> the body of a message to majordomo@vger.kernel.org
> >> >> >> >>> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >> >>> >> > --
> >> >> >> >>> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >> >> >>> >> > the body of a message to majordomo@vger.kernel.org
> >> >> >> >>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >> >>> >> >
> >> >> >> >>> > --
> >> >> >> >>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >> >> >>> > the body of a message to majordomo@vger.kernel.org
> >> >> >> >>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >> >>> >
> >> >> >> >> --
> >> >> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >> >> >> the body of a message to majordomo@vger.kernel.org
> >> >> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >> >>
> >> >> >> >
> >> >> > --
> >> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >> > the body of a message to majordomo@vger.kernel.org
> >> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> > the body of a message to majordomo@vger.kernel.org
> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
Michael S. Tsirkin - Dec. 26, 2010, 9:05 a.m.
On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
> >> If qemu_aio_flush() is responsible for flushing the outstanding
> >> virtio-net requests, I'm wondering why it's a problem for Kemari.
> >> As I described in the previous message, Kemari queues the
> >> requests first.  So in you example above, it should start with
> >>
> >> virtio-net: last_avai_idx 0 inuse 2
> >> event-tap: {A,B}
> >>
> >> As you know, the requests are still in order still because net
> >> layer initiates in order.  Not about completing.
> >>
> >> In the first synchronization, the status above is transferred.  In
> >> the next synchronization, the status will be as following.
> >>
> >> virtio-net: last_avai_idx 1 inuse 1
> >> event-tap: {B}
> >
> > OK, this answers the ordering question.
> 
> Glad to hear that!
> 
> > Another question: at this point we transfer this status: both
> > event-tap and virtio ring have the command B,
> > so the remote will have:
> >
> > virtio-net: inuse 0
> > event-tap: {B}
> >
> > Is this right? This already seems to be a problem as when B completes
> > inuse will go negative?
> 
> I think state above is wrong.  inuse 0 means there shouldn't be
> any requests in event-tap.  Note that the callback is called only
> when event-tap flushes the requests.
> 
> > Next it seems that the remote virtio will resubmit B to event-tap. The
> > remote will then have:
> >
> > virtio-net: inuse 1
> > event-tap: {B, B}
> >
> > This looks kind of wrong ... will two packets go out?
> 
> No.  Currently, we're just replaying the requests with pio/mmio.
> In the situation above, it should be,
> 
> virtio-net: inuse 1
> event-tap: {B}
> >> Why? Because Kemari flushes the first virtio-net request using
> >> qemu_aio_flush() before each synchronization.  If
> >> qemu_aio_flush() doesn't guarantee the order, what you pointed
> >> should be problematic.  So in the final synchronization, the
> >> state should be,
> >>
> >> virtio-net: last_avai_idx 2 inuse 0
> >> event-tap: {}
> >>
> >> where A,B were completed in order.
> >>
> >> Yoshi
> >
> >
> > It might be better to discuss block because that's where
> > requests can complete out of order.
> 
> It's same as net.  We queue requests and call bdrv_flush per
> sending requests to the block.  So there shouldn't be any
> inversion.
> 
> > So let me see if I understand:
> > - each command passed to event tap is queued by it,
> >  it is not passed directly to the backend
> > - later requests are passed to the backend,
> >  always in the same order that they were submitted
> > - each synchronization point flushes all requests
> >  passed to the backend so far
> > - each synchronization transfers all requests not passed to the backend,
> >  to the remote, and they are replayed there
> 
> Correct.
> 
> > Now to analyse this for correctness I am looking at the original patch
> > because it is smaller so easier to analyse and I think it is
> > functionally equivalent, correct me if I am wrong in this.
> 
> So you think decreasing last_avail_idx upon save is better than
> updating it in the callback?
> 
> > So the reason there's no out of order issue is this
> > (and might be a good thing to put in commit log
> > or a comment somewhere):
> 
> I've done some in the latest patch.  Please point it out if it
> wasn't enough.
> 
> > At point of save callback event tap has flushed commands
> > passed to the backend already. Thus at the point of
> > the save callback if a command has completed
> > all previous commands have been flushed and completed.
> >
> >
> > Therefore inuse is
> > in fact the # of requests passed to event tap but not yet
> > passed to the backend (for non-event tap case all commands are
> > passed to the backend immediately and because of this
> > inuse is 0) and these are the last inuse commands submitted.
> >
> >
> > Right?
> 
> Yep.
> 
> > Now a question:
> >
> > When we pass last_used_index - inuse to the remote,
> > the remote virtio will resubmit the request.
> > Since request is also passed by event tap, we get
> > the request twice, why is this not a problem?
> 
> It's not a problem because event-tap currently replays with
> pio/mmio only, as I mentioned above.  Although event-tap receives
> information about the queued requests, it won't pass it to the
> backend.  The reason is the problem in setting the callbacks
> which are specific to devices on the secondary.  These are
> pointers, and even worse, are usually static functions, which
> event-tap has no way to restore it upon failover.  I do want to
> change event-tap replay to be this way in the future, pio/mmio
> replay is implemented for now.
> 
> Thanks,
> 
> Yoshi
> 

Then I am still confused, sorry.  inuse != 0 means that some requests
were passed to the backend but did not complete.  I think that if you do
a flush, this waits until all requests passed to the backend will
complete.  Why does not this guarantee inuse = 0 on the origin at the
synchronization point?
Yoshiaki Tamura - Dec. 26, 2010, 10:14 a.m.
2010/12/26 Michael S. Tsirkin <mst@redhat.com>:
> On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
>> >> If qemu_aio_flush() is responsible for flushing the outstanding
>> >> virtio-net requests, I'm wondering why it's a problem for Kemari.
>> >> As I described in the previous message, Kemari queues the
>> >> requests first.  So in you example above, it should start with
>> >>
>> >> virtio-net: last_avai_idx 0 inuse 2
>> >> event-tap: {A,B}
>> >>
>> >> As you know, the requests are still in order still because net
>> >> layer initiates in order.  Not about completing.
>> >>
>> >> In the first synchronization, the status above is transferred.  In
>> >> the next synchronization, the status will be as following.
>> >>
>> >> virtio-net: last_avai_idx 1 inuse 1
>> >> event-tap: {B}
>> >
>> > OK, this answers the ordering question.
>>
>> Glad to hear that!
>>
>> > Another question: at this point we transfer this status: both
>> > event-tap and virtio ring have the command B,
>> > so the remote will have:
>> >
>> > virtio-net: inuse 0
>> > event-tap: {B}
>> >
>> > Is this right? This already seems to be a problem as when B completes
>> > inuse will go negative?
>>
>> I think state above is wrong.  inuse 0 means there shouldn't be
>> any requests in event-tap.  Note that the callback is called only
>> when event-tap flushes the requests.
>>
>> > Next it seems that the remote virtio will resubmit B to event-tap. The
>> > remote will then have:
>> >
>> > virtio-net: inuse 1
>> > event-tap: {B, B}
>> >
>> > This looks kind of wrong ... will two packets go out?
>>
>> No.  Currently, we're just replaying the requests with pio/mmio.
>> In the situation above, it should be,
>>
>> virtio-net: inuse 1
>> event-tap: {B}
>> >> Why? Because Kemari flushes the first virtio-net request using
>> >> qemu_aio_flush() before each synchronization.  If
>> >> qemu_aio_flush() doesn't guarantee the order, what you pointed
>> >> should be problematic.  So in the final synchronization, the
>> >> state should be,
>> >>
>> >> virtio-net: last_avai_idx 2 inuse 0
>> >> event-tap: {}
>> >>
>> >> where A,B were completed in order.
>> >>
>> >> Yoshi
>> >
>> >
>> > It might be better to discuss block because that's where
>> > requests can complete out of order.
>>
>> It's same as net.  We queue requests and call bdrv_flush per
>> sending requests to the block.  So there shouldn't be any
>> inversion.
>>
>> > So let me see if I understand:
>> > - each command passed to event tap is queued by it,
>> >  it is not passed directly to the backend
>> > - later requests are passed to the backend,
>> >  always in the same order that they were submitted
>> > - each synchronization point flushes all requests
>> >  passed to the backend so far
>> > - each synchronization transfers all requests not passed to the backend,
>> >  to the remote, and they are replayed there
>>
>> Correct.
>>
>> > Now to analyse this for correctness I am looking at the original patch
>> > because it is smaller so easier to analyse and I think it is
>> > functionally equivalent, correct me if I am wrong in this.
>>
>> So you think decreasing last_avail_idx upon save is better than
>> updating it in the callback?
>>
>> > So the reason there's no out of order issue is this
>> > (and might be a good thing to put in commit log
>> > or a comment somewhere):
>>
>> I've done some in the latest patch.  Please point it out if it
>> wasn't enough.
>>
>> > At point of save callback event tap has flushed commands
>> > passed to the backend already. Thus at the point of
>> > the save callback if a command has completed
>> > all previous commands have been flushed and completed.
>> >
>> >
>> > Therefore inuse is
>> > in fact the # of requests passed to event tap but not yet
>> > passed to the backend (for non-event tap case all commands are
>> > passed to the backend immediately and because of this
>> > inuse is 0) and these are the last inuse commands submitted.
>> >
>> >
>> > Right?
>>
>> Yep.
>>
>> > Now a question:
>> >
>> > When we pass last_used_index - inuse to the remote,
>> > the remote virtio will resubmit the request.
>> > Since request is also passed by event tap, we get
>> > the request twice, why is this not a problem?
>>
>> It's not a problem because event-tap currently replays with
>> pio/mmio only, as I mentioned above.  Although event-tap receives
>> information about the queued requests, it won't pass it to the
>> backend.  The reason is the problem in setting the callbacks
>> which are specific to devices on the secondary.  These are
>> pointers, and even worse, are usually static functions, which
>> event-tap has no way to restore it upon failover.  I do want to
>> change event-tap replay to be this way in the future, pio/mmio
>> replay is implemented for now.
>>
>> Thanks,
>>
>> Yoshi
>>
>
> Then I am still confused, sorry.  inuse != 0 means that some requests
> were passed to the backend but did not complete.  I think that if you do
> a flush, this waits until all requests passed to the backend will
> complete.  Why does not this guarantee inuse = 0 on the origin at the
> synchronization point?

The synchronization is done before event-tap releases requests to
the backend, so there are two types of flush: event-tap and
backend block/net.  I assume you're confused with the fact that
flushing backend with qemu_aio_flush/bdrv_flush doesn't necessary
decrease inuse if event-tap has queued requests because there are
no requests passed to the backend.  Let me do a case study again.

virtio: inuse 4
event-tap: {A,B,C}
backend: {D}

synchronization starts.  backend gets flushed.

virtio: inuse 3
event-tap: {A,B,C}
backend: {}

synchronization gets done.
# secondary is virtio: inuse 3

event-tap flushes one request.

virtio: inuse 2
event-tap: {B,C}
backend: {}

repeats above and finally it should be,

virtio: inuse 0
event-tap: {}

Hope this helps.

Yoshi

>
> --
> MST
>
>
Michael S. Tsirkin - Dec. 26, 2010, 10:46 a.m.
On Sun, Dec 26, 2010 at 07:14:44PM +0900, Yoshiaki Tamura wrote:
> 2010/12/26 Michael S. Tsirkin <mst@redhat.com>:
> > On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
> >> >> If qemu_aio_flush() is responsible for flushing the outstanding
> >> >> virtio-net requests, I'm wondering why it's a problem for Kemari.
> >> >> As I described in the previous message, Kemari queues the
> >> >> requests first.  So in you example above, it should start with
> >> >>
> >> >> virtio-net: last_avai_idx 0 inuse 2
> >> >> event-tap: {A,B}
> >> >>
> >> >> As you know, the requests are still in order still because net
> >> >> layer initiates in order.  Not about completing.
> >> >>
> >> >> In the first synchronization, the status above is transferred.  In
> >> >> the next synchronization, the status will be as following.
> >> >>
> >> >> virtio-net: last_avai_idx 1 inuse 1
> >> >> event-tap: {B}
> >> >
> >> > OK, this answers the ordering question.
> >>
> >> Glad to hear that!
> >>
> >> > Another question: at this point we transfer this status: both
> >> > event-tap and virtio ring have the command B,
> >> > so the remote will have:
> >> >
> >> > virtio-net: inuse 0
> >> > event-tap: {B}
> >> >
> >> > Is this right? This already seems to be a problem as when B completes
> >> > inuse will go negative?
> >>
> >> I think state above is wrong.  inuse 0 means there shouldn't be
> >> any requests in event-tap.  Note that the callback is called only
> >> when event-tap flushes the requests.
> >>
> >> > Next it seems that the remote virtio will resubmit B to event-tap. The
> >> > remote will then have:
> >> >
> >> > virtio-net: inuse 1
> >> > event-tap: {B, B}
> >> >
> >> > This looks kind of wrong ... will two packets go out?
> >>
> >> No.  Currently, we're just replaying the requests with pio/mmio.
> >> In the situation above, it should be,
> >>
> >> virtio-net: inuse 1
> >> event-tap: {B}
> >> >> Why? Because Kemari flushes the first virtio-net request using
> >> >> qemu_aio_flush() before each synchronization.  If
> >> >> qemu_aio_flush() doesn't guarantee the order, what you pointed
> >> >> should be problematic.  So in the final synchronization, the
> >> >> state should be,
> >> >>
> >> >> virtio-net: last_avai_idx 2 inuse 0
> >> >> event-tap: {}
> >> >>
> >> >> where A,B were completed in order.
> >> >>
> >> >> Yoshi
> >> >
> >> >
> >> > It might be better to discuss block because that's where
> >> > requests can complete out of order.
> >>
> >> It's same as net.  We queue requests and call bdrv_flush per
> >> sending requests to the block.  So there shouldn't be any
> >> inversion.
> >>
> >> > So let me see if I understand:
> >> > - each command passed to event tap is queued by it,
> >> >  it is not passed directly to the backend
> >> > - later requests are passed to the backend,
> >> >  always in the same order that they were submitted
> >> > - each synchronization point flushes all requests
> >> >  passed to the backend so far
> >> > - each synchronization transfers all requests not passed to the backend,
> >> >  to the remote, and they are replayed there
> >>
> >> Correct.
> >>
> >> > Now to analyse this for correctness I am looking at the original patch
> >> > because it is smaller so easier to analyse and I think it is
> >> > functionally equivalent, correct me if I am wrong in this.
> >>
> >> So you think decreasing last_avail_idx upon save is better than
> >> updating it in the callback?
> >>
> >> > So the reason there's no out of order issue is this
> >> > (and might be a good thing to put in commit log
> >> > or a comment somewhere):
> >>
> >> I've done some in the latest patch.  Please point it out if it
> >> wasn't enough.
> >>
> >> > At point of save callback event tap has flushed commands
> >> > passed to the backend already. Thus at the point of
> >> > the save callback if a command has completed
> >> > all previous commands have been flushed and completed.
> >> >
> >> >
> >> > Therefore inuse is
> >> > in fact the # of requests passed to event tap but not yet
> >> > passed to the backend (for non-event tap case all commands are
> >> > passed to the backend immediately and because of this
> >> > inuse is 0) and these are the last inuse commands submitted.
> >> >
> >> >
> >> > Right?
> >>
> >> Yep.
> >>
> >> > Now a question:
> >> >
> >> > When we pass last_used_index - inuse to the remote,
> >> > the remote virtio will resubmit the request.
> >> > Since request is also passed by event tap, we get
> >> > the request twice, why is this not a problem?
> >>
> >> It's not a problem because event-tap currently replays with
> >> pio/mmio only, as I mentioned above.  Although event-tap receives
> >> information about the queued requests, it won't pass it to the
> >> backend.  The reason is the problem in setting the callbacks
> >> which are specific to devices on the secondary.  These are
> >> pointers, and even worse, are usually static functions, which
> >> event-tap has no way to restore it upon failover.  I do want to
> >> change event-tap replay to be this way in the future, pio/mmio
> >> replay is implemented for now.
> >>
> >> Thanks,
> >>
> >> Yoshi
> >>
> >
> > Then I am still confused, sorry.  inuse != 0 means that some requests
> > were passed to the backend but did not complete.  I think that if you do
> > a flush, this waits until all requests passed to the backend will
> > complete.  Why does not this guarantee inuse = 0 on the origin at the
> > synchronization point?
> 
> The synchronization is done before event-tap releases requests to
> the backend, so there are two types of flush: event-tap and
> backend block/net.  I assume you're confused with the fact that
> flushing backend with qemu_aio_flush/bdrv_flush doesn't necessary
> decrease inuse if event-tap has queued requests because there are
> no requests passed to the backend.  Let me do a case study again.
> 
> virtio: inuse 4
> event-tap: {A,B,C}
> backend: {D}
> 


There are two event-tap devices, right?
PIO one is above virtio, AIO one is between virtio and backend
(e.g. bdrv)? Which one is meant here?


> synchronization starts.  backend gets flushed.
> 
> virtio: inuse 3
> event-tap: {A,B,C}
> backend: {}
> synchronization gets done.
> # secondary is virtio: inuse 3
> 
> event-tap flushes one request.
> 
> virtio: inuse 2
> event-tap: {B,C}
> backend: {}
> repeats above and finally it should be,
> 
> virtio: inuse 0
> event-tap: {}
> 
> Hope this helps.
> 
> Yoshi
> 
> >
> > --
> > MST
> >
> >
Michael S. Tsirkin - Dec. 26, 2010, 10:49 a.m.
On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
> 2010/12/24 Michael S. Tsirkin <mst@redhat.com>:
> > On Fri, Dec 17, 2010 at 12:59:58AM +0900, Yoshiaki Tamura wrote:
> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> >> > On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
> >> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> >> >> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
> >> >> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> >> >> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
> >> >> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
> >> >> >> >>> >> >>
> >> >> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >> >>> >> >
> >> >> >> >>> >> > This changes migration format, so it will break compatibility with
> >> >> >> >>> >> > existing drivers. More generally, I think migrating internal
> >> >> >> >>> >> > state that is not guest visible is always a mistake
> >> >> >> >>> >> > as it ties migration format to an internal implementation
> >> >> >> >>> >> > (yes, I know we do this sometimes, but we should at least
> >> >> >> >>> >> > try not to add such cases).  I think the right thing to do in this case
> >> >> >> >>> >> > is to flush outstanding
> >> >> >> >>> >> > work when vm is stopped.  Then, we are guaranteed that inuse is 0.
> >> >> >> >>> >> > I sent patches that do this for virtio net and block.
> >> >> >> >>> >>
> >> >> >> >>> >> Could you give me the link of your patches?  I'd like to test
> >> >> >> >>> >> whether they work with Kemari upon failover.  If they do, I'm
> >> >> >> >>> >> happy to drop this patch.
> >> >> >> >>> >>
> >> >> >> >>> >> Yoshi
> >> >> >> >>> >
> >> >> >> >>> > Look for this:
> >> >> >> >>> > stable migration image on a stopped vm
> >> >> >> >>> > sent on:
> >> >> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
> >> >> >> >>>
> >> >> >> >>> Thanks for the info.
> >> >> >> >>>
> >> >> >> >>> However, The patch series above didn't solve the issue.  In
> >> >> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
> >> >> >> >>> output, and while last_avail_idx gets incremented
> >> >> >> >>> immediately, not sending inuse makes the state inconsistent
> >> >> >> >>> between Primary and Secondary.
> >> >> >> >>
> >> >> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
> >> >> >> >
> >> >> >> > I think we can calculate or prepare an internal last_avail_idx,
> >> >> >> > and update the external when inuse is decremented.  I'll try
> >> >> >> > whether it work w/ w/o Kemari.
> >> >> >>
> >> >> >> Hi Michael,
> >> >> >>
> >> >> >> Could you please take a look at the following patch?
> >> >> >
> >> >> > Which version is this against?
> >> >>
> >> >> Oops.  It should be very old.
> >> >> 67f895bfe69f323b427b284430b6219c8a62e8d4
> >> >>
> >> >> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
> >> >> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >> Date:   Thu Dec 16 14:50:54 2010 +0900
> >> >> >>
> >> >> >>     virtio: update last_avail_idx when inuse is decreased.
> >> >> >>
> >> >> >>     Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >
> >> >> > It would be better to have a commit description explaining why a change
> >> >> > is made, and why it is correct, not just repeating what can be seen from
> >> >> > the diff anyway.
> >> >>
> >> >> Sorry for being lazy here.
> >> >>
> >> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >> index c8a0fc6..6688c02 100644
> >> >> >> --- a/hw/virtio.c
> >> >> >> +++ b/hw/virtio.c
> >> >> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
> >> >> >>      wmb();
> >> >> >>      trace_virtqueue_flush(vq, count);
> >> >> >>      vring_used_idx_increment(vq, count);
> >> >> >> +    vq->last_avail_idx += count;
> >> >> >>      vq->inuse -= count;
> >> >> >>  }
> >> >> >>
> >> >> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >> >>      unsigned int i, head, max;
> >> >> >>      target_phys_addr_t desc_pa = vq->vring.desc;
> >> >> >>
> >> >> >> -    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
> >> >> >> +    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
> >> >> >>          return 0;
> >> >> >>
> >> >> >>      /* When we start there are none of either input nor output. */
> >> >> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >> >>
> >> >> >>      max = vq->vring.num;
> >> >> >>
> >> >> >> -    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
> >> >> >> +    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
> >> >> >>
> >> >> >>      if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
> >> >> >>          if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
> >> >> >>
> >> >> >
> >> >> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
> >> >>
> >> >> I think there are two problems.
> >> >>
> >> >> 1. When to update last_avail_idx.
> >> >> 2. The ordering issue you're mentioning below.
> >> >>
> >> >> The patch above is only trying to address 1 because last time you
> >> >> mentioned that modifying last_avail_idx upon save may break the
> >> >> guest, which I agree.  If virtio_queue_empty and
> >> >> virtqueue_avail_bytes are only used internally, meaning invisible
> >> >> to the guest, I guess the approach above can be applied too.
> >> >
> >> > So IMHO 2 is the real issue. This is what was problematic
> >> > with the save patch, otherwise of course changes in save
> >> > are better than changes all over the codebase.
> >>
> >> All right.  Then let's focus on 2 first.
> >>
> >> >> > Previous patch version sure looked simpler, and this seems functionally
> >> >> > equivalent, so my question still stands: here it is rephrased in a
> >> >> > different way:
> >> >> >
> >> >> >        assume that we have in avail ring 2 requests at start of ring: A and B in this order
> >> >> >
> >> >> >        host pops A, then B, then completes B and flushes
> >> >> >
> >> >> >        now with this patch last_avail_idx will be 1, and then
> >> >> >        remote will get it, it will execute B again. As a result
> >> >> >        B will complete twice, and apparently A will never complete.
> >> >> >
> >> >> >
> >> >> > This is what I was saying below: assuming that there are
> >> >> > outstanding requests when we migrate, there is no way
> >> >> > a single index can be enough to figure out which requests
> >> >> > need to be handled and which are in flight already.
> >> >> >
> >> >> > We must add some kind of bitmask to tell us which is which.
> >> >>
> >> >> I should understand why this inversion can happen before solving
> >> >> the issue.
> >> >
> >> > It's a fundamental thing in virtio.
> >> > I think it is currently only likely to happen with block, I think tap
> >> > currently completes things in order.  In any case relying on this in the
> >> > frontend is a mistake.
> >> >
> >> >>  Currently, how are you making virio-net to flush
> >> >> every requests for live migration?  Is it qemu_aio_flush()?
> >> >
> >> > Think so.
> >>
> >> If qemu_aio_flush() is responsible for flushing the outstanding
> >> virtio-net requests, I'm wondering why it's a problem for Kemari.
> >> As I described in the previous message, Kemari queues the
> >> requests first.  So in you example above, it should start with
> >>
> >> virtio-net: last_avai_idx 0 inuse 2
> >> event-tap: {A,B}
> >>
> >> As you know, the requests are still in order still because net
> >> layer initiates in order.  Not about completing.
> >>
> >> In the first synchronization, the status above is transferred.  In
> >> the next synchronization, the status will be as following.
> >>
> >> virtio-net: last_avai_idx 1 inuse 1
> >> event-tap: {B}
> >
> > OK, this answers the ordering question.
> 
> Glad to hear that!
> 
> > Another question: at this point we transfer this status: both
> > event-tap and virtio ring have the command B,
> > so the remote will have:
> >
> > virtio-net: inuse 0
> > event-tap: {B}
> >
> > Is this right? This already seems to be a problem as when B completes
> > inuse will go negative?
> 
> I think state above is wrong.  inuse 0 means there shouldn't be
> any requests in event-tap.  Note that the callback is called only
> when event-tap flushes the requests.
> 
> > Next it seems that the remote virtio will resubmit B to event-tap. The
> > remote will then have:
> >
> > virtio-net: inuse 1
> > event-tap: {B, B}
> >
> > This looks kind of wrong ... will two packets go out?
> 
> No.  Currently, we're just replaying the requests with pio/mmio.

You do?  What purpose do the hooks in bdrv/net serve then?
A placeholder for the future?

> In the situation above, it should be,
> 
> virtio-net: inuse 1
> event-tap: {B}
> 
> >> Why? Because Kemari flushes the first virtio-net request using
> >> qemu_aio_flush() before each synchronization.  If
> >> qemu_aio_flush() doesn't guarantee the order, what you pointed
> >> should be problematic.  So in the final synchronization, the
> >> state should be,
> >>
> >> virtio-net: last_avai_idx 2 inuse 0
> >> event-tap: {}
> >>
> >> where A,B were completed in order.
> >>
> >> Yoshi
> >
> >
> > It might be better to discuss block because that's where
> > requests can complete out of order.
> 
> It's same as net.  We queue requests and call bdrv_flush per
> sending requests to the block.  So there shouldn't be any
> inversion.
> 
> > So let me see if I understand:
> > - each command passed to event tap is queued by it,
> >  it is not passed directly to the backend
> > - later requests are passed to the backend,
> >  always in the same order that they were submitted
> > - each synchronization point flushes all requests
> >  passed to the backend so far
> > - each synchronization transfers all requests not passed to the backend,
> >  to the remote, and they are replayed there
> 
> Correct.
> 
> > Now to analyse this for correctness I am looking at the original patch
> > because it is smaller so easier to analyse and I think it is
> > functionally equivalent, correct me if I am wrong in this.
> 
> So you think decreasing last_avail_idx upon save is better than
> updating it in the callback?
> 
> > So the reason there's no out of order issue is this
> > (and might be a good thing to put in commit log
> > or a comment somewhere):
> 
> I've done some in the latest patch.  Please point it out if it
> wasn't enough.
> 
> > At point of save callback event tap has flushed commands
> > passed to the backend already. Thus at the point of
> > the save callback if a command has completed
> > all previous commands have been flushed and completed.
> >
> >
> > Therefore inuse is
> > in fact the # of requests passed to event tap but not yet
> > passed to the backend (for non-event tap case all commands are
> > passed to the backend immediately and because of this
> > inuse is 0) and these are the last inuse commands submitted.
> >
> >
> > Right?
> 
> Yep.
> 
> > Now a question:
> >
> > When we pass last_used_index - inuse to the remote,
> > the remote virtio will resubmit the request.
> > Since request is also passed by event tap, we get
> > the request twice, why is this not a problem?
> 
> It's not a problem because event-tap currently replays with
> pio/mmio only, as I mentioned above.  Although event-tap receives
> information about the queued requests, it won't pass it to the
> backend.  The reason is the problem in setting the callbacks
> which are specific to devices on the secondary.  These are
> pointers, and even worse, are usually static functions, which
> event-tap has no way to restore it upon failover.  I do want to
> change event-tap replay to be this way in the future, pio/mmio
> replay is implemented for now.
> 
> Thanks,
> 
> Yoshi
> 
> >
> >
> >> >
> >> >> >
> >> >> >> >
> >> >> >> >>
> >> >> >> >>>  I'm wondering why
> >> >> >> >>> last_avail_idx is OK to send but not inuse.
> >> >> >> >>
> >> >> >> >> last_avail_idx is at some level a mistake, it exposes part of
> >> >> >> >> our internal implementation, but it does *also* express
> >> >> >> >> a guest observable state.
> >> >> >> >>
> >> >> >> >> Here's the problem that it solves: just looking at the rings in virtio
> >> >> >> >> there is no way to detect that a specific request has already been
> >> >> >> >> completed. And the protocol forbids completing the same request twice.
> >> >> >> >>
> >> >> >> >> Our implementation always starts processing the requests
> >> >> >> >> in order, and since we flush outstanding requests
> >> >> >> >> before save, it works to just tell the remote 'process only requests
> >> >> >> >> after this place'.
> >> >> >> >>
> >> >> >> >> But there's no such requirement in the virtio protocol,
> >> >> >> >> so to be really generic we could add a bitmask of valid avail
> >> >> >> >> ring entries that did not complete yet. This would be
> >> >> >> >> the exact representation of the guest observable state.
> >> >> >> >> In practice we have rings of up to 512 entries.
> >> >> >> >> That's 64 byte per ring, not a lot at all.
> >> >> >> >>
> >> >> >> >> However, if we ever do change the protocol to send the bitmask,
> >> >> >> >> we would need some code to resubmit requests
> >> >> >> >> out of order, so it's not trivial.
> >> >> >> >>
> >> >> >> >> Another minor mistake with last_avail_idx is that it has
> >> >> >> >> some redundancy: the high bits in the index
> >> >> >> >> (> vq size) are not necessary as they can be
> >> >> >> >> got from avail idx.  There's a consistency check
> >> >> >> >> in load but we really should try to use formats
> >> >> >> >> that are always consistent.
> >> >> >> >>
> >> >> >> >>> The following patch does the same thing as original, yet
> >> >> >> >>> keeps the format of the virtio.  It shouldn't break live
> >> >> >> >>> migration either because inuse should be 0.
> >> >> >> >>>
> >> >> >> >>> Yoshi
> >> >> >> >>
> >> >> >> >> Question is, can you flush to make inuse 0 in kemari too?
> >> >> >> >> And if not, how do you handle the fact that some requests
> >> >> >> >> are in flight on the primary?
> >> >> >> >
> >> >> >> > Although we try flushing requests one by one making inuse 0,
> >> >> >> > there are cases when it failovers to the secondary when inuse
> >> >> >> > isn't 0.  We handle these in flight request on the primary by
> >> >> >> > replaying on the secondary.
> >> >> >> >
> >> >> >> >>
> >> >> >> >>> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >> >>> index c8a0fc6..875c7ca 100644
> >> >> >> >>> --- a/hw/virtio.c
> >> >> >> >>> +++ b/hw/virtio.c
> >> >> >> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >> >> >>>      qemu_put_be32(f, i);
> >> >> >> >>>
> >> >> >> >>>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> >> >> >> >>> +        uint16_t last_avail_idx;
> >> >> >> >>> +
> >> >> >> >>>          if (vdev->vq[i].vring.num == 0)
> >> >> >> >>>              break;
> >> >> >> >>>
> >> >> >> >>> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
> >> >> >> >>> +
> >> >> >> >>>          qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >> >> >>>          qemu_put_be64(f, vdev->vq[i].pa);
> >> >> >> >>> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >> >>> +        qemu_put_be16s(f, &last_avail_idx);
> >> >> >> >>>          if (vdev->binding->save_queue)
> >> >> >> >>>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >> >> >>>      }
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>
> >> >> >> >> This looks wrong to me.  Requests can complete in any order, can they
> >> >> >> >> not?  So if request 0 did not complete and request 1 did not,
> >> >> >> >> you send avail - inuse and on the secondary you will process and
> >> >> >> >> complete request 1 the second time, crashing the guest.
> >> >> >> >
> >> >> >> > In case of Kemari, no.  We sit between devices and net/block, and
> >> >> >> > queue the requests.  After completing each transaction, we flush
> >> >> >> > the requests one by one.  So there won't be completion inversion,
> >> >> >> > and therefore won't be visible to the guest.
> >> >> >> >
> >> >> >> > Yoshi
> >> >> >> >
> >> >> >> >>
> >> >> >> >>>
> >> >> >> >>> >
> >> >> >> >>> >> >
> >> >> >> >>> >> >> ---
> >> >> >> >>> >> >>  hw/virtio.c |    8 +++++++-
> >> >> >> >>> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
> >> >> >> >>> >> >>
> >> >> >> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >> >>> >> >> index 849a60f..5509644 100644
> >> >> >> >>> >> >> --- a/hw/virtio.c
> >> >> >> >>> >> >> +++ b/hw/virtio.c
> >> >> >> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
> >> >> >> >>> >> >>      VRing vring;
> >> >> >> >>> >> >>      target_phys_addr_t pa;
> >> >> >> >>> >> >>      uint16_t last_avail_idx;
> >> >> >> >>> >> >> -    int inuse;
> >> >> >> >>> >> >> +    uint16_t inuse;
> >> >> >> >>> >> >>      uint16_t vector;
> >> >> >> >>> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
> >> >> >> >>> >> >>      VirtIODevice *vdev;
> >> >> >> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >> >> >>> >> >>          qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >> >> >>> >> >>          qemu_put_be64(f, vdev->vq[i].pa);
> >> >> >> >>> >> >>          qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >> >>> >> >> +        qemu_put_be16s(f, &vdev->vq[i].inuse);
> >> >> >> >>> >> >>          if (vdev->binding->save_queue)
> >> >> >> >>> >> >>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >> >> >>> >> >>      }
> >> >> >> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
> >> >> >> >>> >> >>          vdev->vq[i].vring.num = qemu_get_be32(f);
> >> >> >> >>> >> >>          vdev->vq[i].pa = qemu_get_be64(f);
> >> >> >> >>> >> >>          qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >> >>> >> >> +        qemu_get_be16s(f, &vdev->vq[i].inuse);
> >> >> >> >>> >> >> +
> >> >> >> >>> >> >> +        /* revert last_avail_idx if there are outstanding emulation. */
> >> >> >> >>> >> >> +        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
> >> >> >> >>> >> >> +        vdev->vq[i].inuse = 0;
> >> >> >> >>> >> >>
> >> >> >> >>> >> >>          if (vdev->vq[i].pa) {
> >> >> >> >>> >> >>              virtqueue_init(&vdev->vq[i]);
> >> >> >> >>> >> >> --
> >> >> >> >>> >> >> 1.7.1.2
> >> >> >> >>> >> >>
> >> >> >> >>> >> >> --
> >> >> >> >>> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >> >> >>> >> >> the body of a message to majordomo@vger.kernel.org
> >> >> >> >>> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >> >>> >> > --
> >> >> >> >>> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >> >> >>> >> > the body of a message to majordomo@vger.kernel.org
> >> >> >> >>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >> >>> >> >
> >> >> >> >>> > --
> >> >> >> >>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >> >> >>> > the body of a message to majordomo@vger.kernel.org
> >> >> >> >>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >> >>> >
> >> >> >> >> --
> >> >> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >> >> >> the body of a message to majordomo@vger.kernel.org
> >> >> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >> >>
> >> >> >> >
> >> >> > --
> >> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >> > the body of a message to majordomo@vger.kernel.org
> >> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> > the body of a message to majordomo@vger.kernel.org
> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
Yoshiaki Tamura - Dec. 26, 2010, 10:50 a.m.
2010/12/26 Michael S. Tsirkin <mst@redhat.com>:
> On Sun, Dec 26, 2010 at 07:14:44PM +0900, Yoshiaki Tamura wrote:
>> 2010/12/26 Michael S. Tsirkin <mst@redhat.com>:
>> > On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
>> >> >> If qemu_aio_flush() is responsible for flushing the outstanding
>> >> >> virtio-net requests, I'm wondering why it's a problem for Kemari.
>> >> >> As I described in the previous message, Kemari queues the
>> >> >> requests first.  So in you example above, it should start with
>> >> >>
>> >> >> virtio-net: last_avai_idx 0 inuse 2
>> >> >> event-tap: {A,B}
>> >> >>
>> >> >> As you know, the requests are still in order still because net
>> >> >> layer initiates in order.  Not about completing.
>> >> >>
>> >> >> In the first synchronization, the status above is transferred.  In
>> >> >> the next synchronization, the status will be as following.
>> >> >>
>> >> >> virtio-net: last_avai_idx 1 inuse 1
>> >> >> event-tap: {B}
>> >> >
>> >> > OK, this answers the ordering question.
>> >>
>> >> Glad to hear that!
>> >>
>> >> > Another question: at this point we transfer this status: both
>> >> > event-tap and virtio ring have the command B,
>> >> > so the remote will have:
>> >> >
>> >> > virtio-net: inuse 0
>> >> > event-tap: {B}
>> >> >
>> >> > Is this right? This already seems to be a problem as when B completes
>> >> > inuse will go negative?
>> >>
>> >> I think state above is wrong.  inuse 0 means there shouldn't be
>> >> any requests in event-tap.  Note that the callback is called only
>> >> when event-tap flushes the requests.
>> >>
>> >> > Next it seems that the remote virtio will resubmit B to event-tap. The
>> >> > remote will then have:
>> >> >
>> >> > virtio-net: inuse 1
>> >> > event-tap: {B, B}
>> >> >
>> >> > This looks kind of wrong ... will two packets go out?
>> >>
>> >> No.  Currently, we're just replaying the requests with pio/mmio.
>> >> In the situation above, it should be,
>> >>
>> >> virtio-net: inuse 1
>> >> event-tap: {B}
>> >> >> Why? Because Kemari flushes the first virtio-net request using
>> >> >> qemu_aio_flush() before each synchronization.  If
>> >> >> qemu_aio_flush() doesn't guarantee the order, what you pointed
>> >> >> should be problematic.  So in the final synchronization, the
>> >> >> state should be,
>> >> >>
>> >> >> virtio-net: last_avai_idx 2 inuse 0
>> >> >> event-tap: {}
>> >> >>
>> >> >> where A,B were completed in order.
>> >> >>
>> >> >> Yoshi
>> >> >
>> >> >
>> >> > It might be better to discuss block because that's where
>> >> > requests can complete out of order.
>> >>
>> >> It's same as net.  We queue requests and call bdrv_flush per
>> >> sending requests to the block.  So there shouldn't be any
>> >> inversion.
>> >>
>> >> > So let me see if I understand:
>> >> > - each command passed to event tap is queued by it,
>> >> >  it is not passed directly to the backend
>> >> > - later requests are passed to the backend,
>> >> >  always in the same order that they were submitted
>> >> > - each synchronization point flushes all requests
>> >> >  passed to the backend so far
>> >> > - each synchronization transfers all requests not passed to the backend,
>> >> >  to the remote, and they are replayed there
>> >>
>> >> Correct.
>> >>
>> >> > Now to analyse this for correctness I am looking at the original patch
>> >> > because it is smaller so easier to analyse and I think it is
>> >> > functionally equivalent, correct me if I am wrong in this.
>> >>
>> >> So you think decreasing last_avail_idx upon save is better than
>> >> updating it in the callback?
>> >>
>> >> > So the reason there's no out of order issue is this
>> >> > (and might be a good thing to put in commit log
>> >> > or a comment somewhere):
>> >>
>> >> I've done some in the latest patch.  Please point it out if it
>> >> wasn't enough.
>> >>
>> >> > At point of save callback event tap has flushed commands
>> >> > passed to the backend already. Thus at the point of
>> >> > the save callback if a command has completed
>> >> > all previous commands have been flushed and completed.
>> >> >
>> >> >
>> >> > Therefore inuse is
>> >> > in fact the # of requests passed to event tap but not yet
>> >> > passed to the backend (for non-event tap case all commands are
>> >> > passed to the backend immediately and because of this
>> >> > inuse is 0) and these are the last inuse commands submitted.
>> >> >
>> >> >
>> >> > Right?
>> >>
>> >> Yep.
>> >>
>> >> > Now a question:
>> >> >
>> >> > When we pass last_used_index - inuse to the remote,
>> >> > the remote virtio will resubmit the request.
>> >> > Since request is also passed by event tap, we get
>> >> > the request twice, why is this not a problem?
>> >>
>> >> It's not a problem because event-tap currently replays with
>> >> pio/mmio only, as I mentioned above.  Although event-tap receives
>> >> information about the queued requests, it won't pass it to the
>> >> backend.  The reason is the problem in setting the callbacks
>> >> which are specific to devices on the secondary.  These are
>> >> pointers, and even worse, are usually static functions, which
>> >> event-tap has no way to restore it upon failover.  I do want to
>> >> change event-tap replay to be this way in the future, pio/mmio
>> >> replay is implemented for now.
>> >>
>> >> Thanks,
>> >>
>> >> Yoshi
>> >>
>> >
>> > Then I am still confused, sorry.  inuse != 0 means that some requests
>> > were passed to the backend but did not complete.  I think that if you do
>> > a flush, this waits until all requests passed to the backend will
>> > complete.  Why does not this guarantee inuse = 0 on the origin at the
>> > synchronization point?
>>
>> The synchronization is done before event-tap releases requests to
>> the backend, so there are two types of flush: event-tap and
>> backend block/net.  I assume you're confused with the fact that
>> flushing backend with qemu_aio_flush/bdrv_flush doesn't necessary
>> decrease inuse if event-tap has queued requests because there are
>> no requests passed to the backend.  Let me do a case study again.
>>
>> virtio: inuse 4
>> event-tap: {A,B,C}
>> backend: {D}
>>
>
>
> There are two event-tap devices, right?
> PIO one is above virtio, AIO one is between virtio and backend
> (e.g. bdrv)? Which one is meant here?

Right.  I'm mentioning about the latter, between virtio and
backend.  Note that event-tap function in pio/mmio doesn't queue
but just records what initiated the requests.

Yoshi

>
>
>> synchronization starts.  backend gets flushed.
>>
>> virtio: inuse 3
>> event-tap: {A,B,C}
>> backend: {}
>> synchronization gets done.
>> # secondary is virtio: inuse 3
>>
>> event-tap flushes one request.
>>
>> virtio: inuse 2
>> event-tap: {B,C}
>> backend: {}
>> repeats above and finally it should be,
>>
>> virtio: inuse 0
>> event-tap: {}
>>
>> Hope this helps.
>>
>> Yoshi
>>
>> >
>> > --
>> > MST
>> >
>> >
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Yoshiaki Tamura - Dec. 26, 2010, 10:57 a.m.
2010/12/26 Michael S. Tsirkin <mst@redhat.com>:
> On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
>> 2010/12/24 Michael S. Tsirkin <mst@redhat.com>:
>> > On Fri, Dec 17, 2010 at 12:59:58AM +0900, Yoshiaki Tamura wrote:
>> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
>> >> > On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
>> >> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
>> >> >> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
>> >> >> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
>> >> >> >> >>> >> >>
>> >> >> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >> >> >>> >> >
>> >> >> >> >>> >> > This changes migration format, so it will break compatibility with
>> >> >> >> >>> >> > existing drivers. More generally, I think migrating internal
>> >> >> >> >>> >> > state that is not guest visible is always a mistake
>> >> >> >> >>> >> > as it ties migration format to an internal implementation
>> >> >> >> >>> >> > (yes, I know we do this sometimes, but we should at least
>> >> >> >> >>> >> > try not to add such cases).  I think the right thing to do in this case
>> >> >> >> >>> >> > is to flush outstanding
>> >> >> >> >>> >> > work when vm is stopped.  Then, we are guaranteed that inuse is 0.
>> >> >> >> >>> >> > I sent patches that do this for virtio net and block.
>> >> >> >> >>> >>
>> >> >> >> >>> >> Could you give me the link of your patches?  I'd like to test
>> >> >> >> >>> >> whether they work with Kemari upon failover.  If they do, I'm
>> >> >> >> >>> >> happy to drop this patch.
>> >> >> >> >>> >>
>> >> >> >> >>> >> Yoshi
>> >> >> >> >>> >
>> >> >> >> >>> > Look for this:
>> >> >> >> >>> > stable migration image on a stopped vm
>> >> >> >> >>> > sent on:
>> >> >> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
>> >> >> >> >>>
>> >> >> >> >>> Thanks for the info.
>> >> >> >> >>>
>> >> >> >> >>> However, The patch series above didn't solve the issue.  In
>> >> >> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
>> >> >> >> >>> output, and while last_avail_idx gets incremented
>> >> >> >> >>> immediately, not sending inuse makes the state inconsistent
>> >> >> >> >>> between Primary and Secondary.
>> >> >> >> >>
>> >> >> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
>> >> >> >> >
>> >> >> >> > I think we can calculate or prepare an internal last_avail_idx,
>> >> >> >> > and update the external when inuse is decremented.  I'll try
>> >> >> >> > whether it work w/ w/o Kemari.
>> >> >> >>
>> >> >> >> Hi Michael,
>> >> >> >>
>> >> >> >> Could you please take a look at the following patch?
>> >> >> >
>> >> >> > Which version is this against?
>> >> >>
>> >> >> Oops.  It should be very old.
>> >> >> 67f895bfe69f323b427b284430b6219c8a62e8d4
>> >> >>
>> >> >> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
>> >> >> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >> >> Date:   Thu Dec 16 14:50:54 2010 +0900
>> >> >> >>
>> >> >> >>     virtio: update last_avail_idx when inuse is decreased.
>> >> >> >>
>> >> >> >>     Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >> >
>> >> >> > It would be better to have a commit description explaining why a change
>> >> >> > is made, and why it is correct, not just repeating what can be seen from
>> >> >> > the diff anyway.
>> >> >>
>> >> >> Sorry for being lazy here.
>> >> >>
>> >> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >> >> index c8a0fc6..6688c02 100644
>> >> >> >> --- a/hw/virtio.c
>> >> >> >> +++ b/hw/virtio.c
>> >> >> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
>> >> >> >>      wmb();
>> >> >> >>      trace_virtqueue_flush(vq, count);
>> >> >> >>      vring_used_idx_increment(vq, count);
>> >> >> >> +    vq->last_avail_idx += count;
>> >> >> >>      vq->inuse -= count;
>> >> >> >>  }
>> >> >> >>
>> >> >> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>> >> >> >>      unsigned int i, head, max;
>> >> >> >>      target_phys_addr_t desc_pa = vq->vring.desc;
>> >> >> >>
>> >> >> >> -    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
>> >> >> >> +    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
>> >> >> >>          return 0;
>> >> >> >>
>> >> >> >>      /* When we start there are none of either input nor output. */
>> >> >> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>> >> >> >>
>> >> >> >>      max = vq->vring.num;
>> >> >> >>
>> >> >> >> -    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
>> >> >> >> +    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
>> >> >> >>
>> >> >> >>      if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
>> >> >> >>          if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
>> >> >> >>
>> >> >> >
>> >> >> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
>> >> >>
>> >> >> I think there are two problems.
>> >> >>
>> >> >> 1. When to update last_avail_idx.
>> >> >> 2. The ordering issue you're mentioning below.
>> >> >>
>> >> >> The patch above is only trying to address 1 because last time you
>> >> >> mentioned that modifying last_avail_idx upon save may break the
>> >> >> guest, which I agree.  If virtio_queue_empty and
>> >> >> virtqueue_avail_bytes are only used internally, meaning invisible
>> >> >> to the guest, I guess the approach above can be applied too.
>> >> >
>> >> > So IMHO 2 is the real issue. This is what was problematic
>> >> > with the save patch, otherwise of course changes in save
>> >> > are better than changes all over the codebase.
>> >>
>> >> All right.  Then let's focus on 2 first.
>> >>
>> >> >> > Previous patch version sure looked simpler, and this seems functionally
>> >> >> > equivalent, so my question still stands: here it is rephrased in a
>> >> >> > different way:
>> >> >> >
>> >> >> >        assume that we have in avail ring 2 requests at start of ring: A and B in this order
>> >> >> >
>> >> >> >        host pops A, then B, then completes B and flushes
>> >> >> >
>> >> >> >        now with this patch last_avail_idx will be 1, and then
>> >> >> >        remote will get it, it will execute B again. As a result
>> >> >> >        B will complete twice, and apparently A will never complete.
>> >> >> >
>> >> >> >
>> >> >> > This is what I was saying below: assuming that there are
>> >> >> > outstanding requests when we migrate, there is no way
>> >> >> > a single index can be enough to figure out which requests
>> >> >> > need to be handled and which are in flight already.
>> >> >> >
>> >> >> > We must add some kind of bitmask to tell us which is which.
>> >> >>
>> >> >> I should understand why this inversion can happen before solving
>> >> >> the issue.
>> >> >
>> >> > It's a fundamental thing in virtio.
>> >> > I think it is currently only likely to happen with block, I think tap
>> >> > currently completes things in order.  In any case relying on this in the
>> >> > frontend is a mistake.
>> >> >
>> >> >>  Currently, how are you making virio-net to flush
>> >> >> every requests for live migration?  Is it qemu_aio_flush()?
>> >> >
>> >> > Think so.
>> >>
>> >> If qemu_aio_flush() is responsible for flushing the outstanding
>> >> virtio-net requests, I'm wondering why it's a problem for Kemari.
>> >> As I described in the previous message, Kemari queues the
>> >> requests first.  So in you example above, it should start with
>> >>
>> >> virtio-net: last_avai_idx 0 inuse 2
>> >> event-tap: {A,B}
>> >>
>> >> As you know, the requests are still in order still because net
>> >> layer initiates in order.  Not about completing.
>> >>
>> >> In the first synchronization, the status above is transferred.  In
>> >> the next synchronization, the status will be as following.
>> >>
>> >> virtio-net: last_avai_idx 1 inuse 1
>> >> event-tap: {B}
>> >
>> > OK, this answers the ordering question.
>>
>> Glad to hear that!
>>
>> > Another question: at this point we transfer this status: both
>> > event-tap and virtio ring have the command B,
>> > so the remote will have:
>> >
>> > virtio-net: inuse 0
>> > event-tap: {B}
>> >
>> > Is this right? This already seems to be a problem as when B completes
>> > inuse will go negative?
>>
>> I think state above is wrong.  inuse 0 means there shouldn't be
>> any requests in event-tap.  Note that the callback is called only
>> when event-tap flushes the requests.
>>
>> > Next it seems that the remote virtio will resubmit B to event-tap. The
>> > remote will then have:
>> >
>> > virtio-net: inuse 1
>> > event-tap: {B, B}
>> >
>> > This looks kind of wrong ... will two packets go out?
>>
>> No.  Currently, we're just replaying the requests with pio/mmio.
>
> You do?  What purpose do the hooks in bdrv/net serve then?
> A placeholder for the future?

Not only for that reason.  The hooks in bdrv/net is the main
function that queues requests and starts synchronization.
pio/mmio hooks are there for recording what initiated the
requests monitored in bdrv/net layer.  I would like to remove
pio/mmio part if we could make bdrv/net level replay is possible.

Yoshi

>
>> In the situation above, it should be,
>>
>> virtio-net: inuse 1
>> event-tap: {B}
>>
>> >> Why? Because Kemari flushes the first virtio-net request using
>> >> qemu_aio_flush() before each synchronization.  If
>> >> qemu_aio_flush() doesn't guarantee the order, what you pointed
>> >> should be problematic.  So in the final synchronization, the
>> >> state should be,
>> >>
>> >> virtio-net: last_avai_idx 2 inuse 0
>> >> event-tap: {}
>> >>
>> >> where A,B were completed in order.
>> >>
>> >> Yoshi
>> >
>> >
>> > It might be better to discuss block because that's where
>> > requests can complete out of order.
>>
>> It's same as net.  We queue requests and call bdrv_flush per
>> sending requests to the block.  So there shouldn't be any
>> inversion.
>>
>> > So let me see if I understand:
>> > - each command passed to event tap is queued by it,
>> >  it is not passed directly to the backend
>> > - later requests are passed to the backend,
>> >  always in the same order that they were submitted
>> > - each synchronization point flushes all requests
>> >  passed to the backend so far
>> > - each synchronization transfers all requests not passed to the backend,
>> >  to the remote, and they are replayed there
>>
>> Correct.
>>
>> > Now to analyse this for correctness I am looking at the original patch
>> > because it is smaller so easier to analyse and I think it is
>> > functionally equivalent, correct me if I am wrong in this.
>>
>> So you think decreasing last_avail_idx upon save is better than
>> updating it in the callback?
>>
>> > So the reason there's no out of order issue is this
>> > (and might be a good thing to put in commit log
>> > or a comment somewhere):
>>
>> I've done some in the latest patch.  Please point it out if it
>> wasn't enough.
>>
>> > At point of save callback event tap has flushed commands
>> > passed to the backend already. Thus at the point of
>> > the save callback if a command has completed
>> > all previous commands have been flushed and completed.
>> >
>> >
>> > Therefore inuse is
>> > in fact the # of requests passed to event tap but not yet
>> > passed to the backend (for non-event tap case all commands are
>> > passed to the backend immediately and because of this
>> > inuse is 0) and these are the last inuse commands submitted.
>> >
>> >
>> > Right?
>>
>> Yep.
>>
>> > Now a question:
>> >
>> > When we pass last_used_index - inuse to the remote,
>> > the remote virtio will resubmit the request.
>> > Since request is also passed by event tap, we get
>> > the request twice, why is this not a problem?
>>
>> It's not a problem because event-tap currently replays with
>> pio/mmio only, as I mentioned above.  Although event-tap receives
>> information about the queued requests, it won't pass it to the
>> backend.  The reason is the problem in setting the callbacks
>> which are specific to devices on the secondary.  These are
>> pointers, and even worse, are usually static functions, which
>> event-tap has no way to restore it upon failover.  I do want to
>> change event-tap replay to be this way in the future, pio/mmio
>> replay is implemented for now.
>>
>> Thanks,
>>
>> Yoshi
>>
>> >
>> >
>> >> >
>> >> >> >
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >>>  I'm wondering why
>> >> >> >> >>> last_avail_idx is OK to send but not inuse.
>> >> >> >> >>
>> >> >> >> >> last_avail_idx is at some level a mistake, it exposes part of
>> >> >> >> >> our internal implementation, but it does *also* express
>> >> >> >> >> a guest observable state.
>> >> >> >> >>
>> >> >> >> >> Here's the problem that it solves: just looking at the rings in virtio
>> >> >> >> >> there is no way to detect that a specific request has already been
>> >> >> >> >> completed. And the protocol forbids completing the same request twice.
>> >> >> >> >>
>> >> >> >> >> Our implementation always starts processing the requests
>> >> >> >> >> in order, and since we flush outstanding requests
>> >> >> >> >> before save, it works to just tell the remote 'process only requests
>> >> >> >> >> after this place'.
>> >> >> >> >>
>> >> >> >> >> But there's no such requirement in the virtio protocol,
>> >> >> >> >> so to be really generic we could add a bitmask of valid avail
>> >> >> >> >> ring entries that did not complete yet. This would be
>> >> >> >> >> the exact representation of the guest observable state.
>> >> >> >> >> In practice we have rings of up to 512 entries.
>> >> >> >> >> That's 64 byte per ring, not a lot at all.
>> >> >> >> >>
>> >> >> >> >> However, if we ever do change the protocol to send the bitmask,
>> >> >> >> >> we would need some code to resubmit requests
>> >> >> >> >> out of order, so it's not trivial.
>> >> >> >> >>
>> >> >> >> >> Another minor mistake with last_avail_idx is that it has
>> >> >> >> >> some redundancy: the high bits in the index
>> >> >> >> >> (> vq size) are not necessary as they can be
>> >> >> >> >> got from avail idx.  There's a consistency check
>> >> >> >> >> in load but we really should try to use formats
>> >> >> >> >> that are always consistent.
>> >> >> >> >>
>> >> >> >> >>> The following patch does the same thing as original, yet
>> >> >> >> >>> keeps the format of the virtio.  It shouldn't break live
>> >> >> >> >>> migration either because inuse should be 0.
>> >> >> >> >>>
>> >> >> >> >>> Yoshi
>> >> >> >> >>
>> >> >> >> >> Question is, can you flush to make inuse 0 in kemari too?
>> >> >> >> >> And if not, how do you handle the fact that some requests
>> >> >> >> >> are in flight on the primary?
>> >> >> >> >
>> >> >> >> > Although we try flushing requests one by one making inuse 0,
>> >> >> >> > there are cases when it failovers to the secondary when inuse
>> >> >> >> > isn't 0.  We handle these in flight request on the primary by
>> >> >> >> > replaying on the secondary.
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >>> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >> >> >>> index c8a0fc6..875c7ca 100644
>> >> >> >> >>> --- a/hw/virtio.c
>> >> >> >> >>> +++ b/hw/virtio.c
>> >> >> >> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >> >> >> >>>      qemu_put_be32(f, i);
>> >> >> >> >>>
>> >> >> >> >>>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>> >> >> >> >>> +        uint16_t last_avail_idx;
>> >> >> >> >>> +
>> >> >> >> >>>          if (vdev->vq[i].vring.num == 0)
>> >> >> >> >>>              break;
>> >> >> >> >>>
>> >> >> >> >>> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
>> >> >> >> >>> +
>> >> >> >> >>>          qemu_put_be32(f, vdev->vq[i].vring.num);
>> >> >> >> >>>          qemu_put_be64(f, vdev->vq[i].pa);
>> >> >> >> >>> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >> >> >>> +        qemu_put_be16s(f, &last_avail_idx);
>> >> >> >> >>>          if (vdev->binding->save_queue)
>> >> >> >> >>>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >> >> >> >>>      }
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>
>> >> >> >> >> This looks wrong to me.  Requests can complete in any order, can they
>> >> >> >> >> not?  So if request 0 did not complete and request 1 did not,
>> >> >> >> >> you send avail - inuse and on the secondary you will process and
>> >> >> >> >> complete request 1 the second time, crashing the guest.
>> >> >> >> >
>> >> >> >> > In case of Kemari, no.  We sit between devices and net/block, and
>> >> >> >> > queue the requests.  After completing each transaction, we flush
>> >> >> >> > the requests one by one.  So there won't be completion inversion,
>> >> >> >> > and therefore won't be visible to the guest.
>> >> >> >> >
>> >> >> >> > Yoshi
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >>>
>> >> >> >> >>> >
>> >> >> >> >>> >> >
>> >> >> >> >>> >> >> ---
>> >> >> >> >>> >> >>  hw/virtio.c |    8 +++++++-
>> >> >> >> >>> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
>> >> >> >> >>> >> >>
>> >> >> >> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >> >> >>> >> >> index 849a60f..5509644 100644
>> >> >> >> >>> >> >> --- a/hw/virtio.c
>> >> >> >> >>> >> >> +++ b/hw/virtio.c
>> >> >> >> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
>> >> >> >> >>> >> >>      VRing vring;
>> >> >> >> >>> >> >>      target_phys_addr_t pa;
>> >> >> >> >>> >> >>      uint16_t last_avail_idx;
>> >> >> >> >>> >> >> -    int inuse;
>> >> >> >> >>> >> >> +    uint16_t inuse;
>> >> >> >> >>> >> >>      uint16_t vector;
>> >> >> >> >>> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>> >> >> >> >>> >> >>      VirtIODevice *vdev;
>> >> >> >> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >> >> >> >>> >> >>          qemu_put_be32(f, vdev->vq[i].vring.num);
>> >> >> >> >>> >> >>          qemu_put_be64(f, vdev->vq[i].pa);
>> >> >> >> >>> >> >>          qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >> >> >>> >> >> +        qemu_put_be16s(f, &vdev->vq[i].inuse);
>> >> >> >> >>> >> >>          if (vdev->binding->save_queue)
>> >> >> >> >>> >> >>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >> >> >> >>> >> >>      }
>> >> >> >> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>> >> >> >> >>> >> >>          vdev->vq[i].vring.num = qemu_get_be32(f);
>> >> >> >> >>> >> >>          vdev->vq[i].pa = qemu_get_be64(f);
>> >> >> >> >>> >> >>          qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >> >> >>> >> >> +        qemu_get_be16s(f, &vdev->vq[i].inuse);
>> >> >> >> >>> >> >> +
>> >> >> >> >>> >> >> +        /* revert last_avail_idx if there are outstanding emulation. */
>> >> >> >> >>> >> >> +        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>> >> >> >> >>> >> >> +        vdev->vq[i].inuse = 0;
>> >> >> >> >>> >> >>
>> >> >> >> >>> >> >>          if (vdev->vq[i].pa) {
>> >> >> >> >>> >> >>              virtqueue_init(&vdev->vq[i]);
>> >> >> >> >>> >> >> --
>> >> >> >> >>> >> >> 1.7.1.2
>> >> >> >> >>> >> >>
>> >> >> >> >>> >> >> --
>> >> >> >> >>> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >> >> >> >>> >> >> the body of a message to majordomo@vger.kernel.org
>> >> >> >> >>> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >> >> >>> >> > --
>> >> >> >> >>> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >> >> >> >>> >> > the body of a message to majordomo@vger.kernel.org
>> >> >> >> >>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >> >> >>> >> >
>> >> >> >> >>> > --
>> >> >> >> >>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >> >> >> >>> > the body of a message to majordomo@vger.kernel.org
>> >> >> >> >>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >> >> >>> >
>> >> >> >> >> --
>> >> >> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >> >> >> >> the body of a message to majordomo@vger.kernel.org
>> >> >> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >> >> >>
>> >> >> >> >
>> >> >> > --
>> >> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >> >> > the body of a message to majordomo@vger.kernel.org
>> >> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >> >
>> >> > --
>> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >> > the body of a message to majordomo@vger.kernel.org
>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

Patch

diff --git a/hw/virtio.c b/hw/virtio.c
index c8a0fc6..6688c02 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -237,6 +237,7 @@  void virtqueue_flush(VirtQueue *vq, unsigned int count)
     wmb();
     trace_virtqueue_flush(vq, count);
     vring_used_idx_increment(vq, count);
+    vq->last_avail_idx += count;
     vq->inuse -= count;
 }

@@ -385,7 +386,7 @@  int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
     unsigned int i, head, max;
     target_phys_addr_t desc_pa = vq->vring.desc;

-    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
+    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
         return 0;

     /* When we start there are none of either input nor output. */
@@ -393,7 +394,7 @@  int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)

     max = vq->vring.num;

-    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
+    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);

     if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
         if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {