Patchwork Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble.

login
register
mail settings
Submitter Yoshiaki Tamura
Date Dec. 1, 2010, 8:03 a.m.
Message ID <AANLkTim2CojXtyGVDNngzaYMCXfan3kkS=thDE9i-yT=@mail.gmail.com>
Download mbox | patch
Permalink /patch/73826/
State New
Headers show

Comments

Yoshiaki Tamura - Dec. 1, 2010, 8:03 a.m.
2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>> >> Modify inuse type to uint16_t, let save/load to handle, and revert
>> >> last_avail_idx with inuse if there are outstanding emulation.
>> >>
>> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >
>> > This changes migration format, so it will break compatibility with
>> > existing drivers. More generally, I think migrating internal
>> > state that is not guest visible is always a mistake
>> > as it ties migration format to an internal implementation
>> > (yes, I know we do this sometimes, but we should at least
>> > try not to add such cases).  I think the right thing to do in this case
>> > is to flush outstanding
>> > work when vm is stopped.  Then, we are guaranteed that inuse is 0.
>> > I sent patches that do this for virtio net and block.
>>
>> Could you give me the link of your patches?  I'd like to test
>> whether they work with Kemari upon failover.  If they do, I'm
>> happy to drop this patch.
>>
>> Yoshi
>
> Look for this:
> stable migration image on a stopped vm
> sent on:
> Wed, 24 Nov 2010 17:52:49 +0200

Thanks for the info.

However, The patch series above didn't solve the issue.  In
case of Kemari, inuse is mostly > 0 because it queues the
output, and while last_avail_idx gets incremented
immediately, not sending inuse makes the state inconsistent
between Primary and Secondary.  I'm wondering why
last_avail_idx is OK to send but not inuse.

The following patch does the same thing as original, yet
keeps the format of the virtio.  It shouldn't break live
migration either because inuse should be 0.

Yoshi




>
>> >
>> >> ---
>> >>  hw/virtio.c |    8 +++++++-
>> >>  1 files changed, 7 insertions(+), 1 deletions(-)
>> >>
>> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> index 849a60f..5509644 100644
>> >> --- a/hw/virtio.c
>> >> +++ b/hw/virtio.c
>> >> @@ -72,7 +72,7 @@ struct VirtQueue
>> >>      VRing vring;
>> >>      target_phys_addr_t pa;
>> >>      uint16_t last_avail_idx;
>> >> -    int inuse;
>> >> +    uint16_t inuse;
>> >>      uint16_t vector;
>> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>> >>      VirtIODevice *vdev;
>> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >>          qemu_put_be32(f, vdev->vq[i].vring.num);
>> >>          qemu_put_be64(f, vdev->vq[i].pa);
>> >>          qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> +        qemu_put_be16s(f, &vdev->vq[i].inuse);
>> >>          if (vdev->binding->save_queue)
>> >>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >>      }
>> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>> >>          vdev->vq[i].vring.num = qemu_get_be32(f);
>> >>          vdev->vq[i].pa = qemu_get_be64(f);
>> >>          qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> +        qemu_get_be16s(f, &vdev->vq[i].inuse);
>> >> +
>> >> +        /* revert last_avail_idx if there are outstanding emulation. */
>> >> +        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>> >> +        vdev->vq[i].inuse = 0;
>> >>
>> >>          if (vdev->vq[i].pa) {
>> >>              virtqueue_init(&vdev->vq[i]);
>> >> --
>> >> 1.7.1.2
>> >>
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >> the body of a message to majordomo@vger.kernel.org
>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Michael S. Tsirkin - Dec. 2, 2010, 12:02 p.m.
On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
> >> >> last_avail_idx with inuse if there are outstanding emulation.
> >> >>
> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >
> >> > This changes migration format, so it will break compatibility with
> >> > existing drivers. More generally, I think migrating internal
> >> > state that is not guest visible is always a mistake
> >> > as it ties migration format to an internal implementation
> >> > (yes, I know we do this sometimes, but we should at least
> >> > try not to add such cases).  I think the right thing to do in this case
> >> > is to flush outstanding
> >> > work when vm is stopped.  Then, we are guaranteed that inuse is 0.
> >> > I sent patches that do this for virtio net and block.
> >>
> >> Could you give me the link of your patches?  I'd like to test
> >> whether they work with Kemari upon failover.  If they do, I'm
> >> happy to drop this patch.
> >>
> >> Yoshi
> >
> > Look for this:
> > stable migration image on a stopped vm
> > sent on:
> > Wed, 24 Nov 2010 17:52:49 +0200
> 
> Thanks for the info.
> 
> However, The patch series above didn't solve the issue.  In
> case of Kemari, inuse is mostly > 0 because it queues the
> output, and while last_avail_idx gets incremented
> immediately, not sending inuse makes the state inconsistent
> between Primary and Secondary.

Hmm. Can we simply avoid incrementing last_avail_idx?

>  I'm wondering why
> last_avail_idx is OK to send but not inuse.

last_avail_idx is at some level a mistake, it exposes part of
our internal implementation, but it does *also* express
a guest observable state.

Here's the problem that it solves: just looking at the rings in virtio
there is no way to detect that a specific request has already been
completed. And the protocol forbids completing the same request twice.

Our implementation always starts processing the requests
in order, and since we flush outstanding requests
before save, it works to just tell the remote 'process only requests
after this place'.

But there's no such requirement in the virtio protocol,
so to be really generic we could add a bitmask of valid avail
ring entries that did not complete yet. This would be
the exact representation of the guest observable state.
In practice we have rings of up to 512 entries.
That's 64 byte per ring, not a lot at all.

However, if we ever do change the protocol to send the bitmask,
we would need some code to resubmit requests
out of order, so it's not trivial.

Another minor mistake with last_avail_idx is that it has
some redundancy: the high bits in the index
(> vq size) are not necessary as they can be
got from avail idx.  There's a consistency check
in load but we really should try to use formats
that are always consistent.

> The following patch does the same thing as original, yet
> keeps the format of the virtio.  It shouldn't break live
> migration either because inuse should be 0.
> 
> Yoshi

Question is, can you flush to make inuse 0 in kemari too?
And if not, how do you handle the fact that some requests
are in flight on the primary?

> diff --git a/hw/virtio.c b/hw/virtio.c
> index c8a0fc6..875c7ca 100644
> --- a/hw/virtio.c
> +++ b/hw/virtio.c
> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>      qemu_put_be32(f, i);
> 
>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> +        uint16_t last_avail_idx;
> +
>          if (vdev->vq[i].vring.num == 0)
>              break;
> 
> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
> +
>          qemu_put_be32(f, vdev->vq[i].vring.num);
>          qemu_put_be64(f, vdev->vq[i].pa);
> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> +        qemu_put_be16s(f, &last_avail_idx);
>          if (vdev->binding->save_queue)
>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>      }
> 
> 

This looks wrong to me.  Requests can complete in any order, can they
not?  So if request 0 did not complete and request 1 did not,
you send avail - inuse and on the secondary you will process and
complete request 1 the second time, crashing the guest.

> 
> >
> >> >
> >> >> ---
> >> >>  hw/virtio.c |    8 +++++++-
> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
> >> >>
> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> index 849a60f..5509644 100644
> >> >> --- a/hw/virtio.c
> >> >> +++ b/hw/virtio.c
> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
> >> >>      VRing vring;
> >> >>      target_phys_addr_t pa;
> >> >>      uint16_t last_avail_idx;
> >> >> -    int inuse;
> >> >> +    uint16_t inuse;
> >> >>      uint16_t vector;
> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
> >> >>      VirtIODevice *vdev;
> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >>          qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >>          qemu_put_be64(f, vdev->vq[i].pa);
> >> >>          qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> +        qemu_put_be16s(f, &vdev->vq[i].inuse);
> >> >>          if (vdev->binding->save_queue)
> >> >>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >>      }
> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
> >> >>          vdev->vq[i].vring.num = qemu_get_be32(f);
> >> >>          vdev->vq[i].pa = qemu_get_be64(f);
> >> >>          qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> +        qemu_get_be16s(f, &vdev->vq[i].inuse);
> >> >> +
> >> >> +        /* revert last_avail_idx if there are outstanding emulation. */
> >> >> +        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
> >> >> +        vdev->vq[i].inuse = 0;
> >> >>
> >> >>          if (vdev->vq[i].pa) {
> >> >>              virtqueue_init(&vdev->vq[i]);
> >> >> --
> >> >> 1.7.1.2
> >> >>
> >> >> --
> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >> the body of a message to majordomo@vger.kernel.org
> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> > the body of a message to majordomo@vger.kernel.org
> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
Yoshiaki Tamura - Dec. 3, 2010, 6:28 a.m.
2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
>> >> >> last_avail_idx with inuse if there are outstanding emulation.
>> >> >>
>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >
>> >> > This changes migration format, so it will break compatibility with
>> >> > existing drivers. More generally, I think migrating internal
>> >> > state that is not guest visible is always a mistake
>> >> > as it ties migration format to an internal implementation
>> >> > (yes, I know we do this sometimes, but we should at least
>> >> > try not to add such cases).  I think the right thing to do in this case
>> >> > is to flush outstanding
>> >> > work when vm is stopped.  Then, we are guaranteed that inuse is 0.
>> >> > I sent patches that do this for virtio net and block.
>> >>
>> >> Could you give me the link of your patches?  I'd like to test
>> >> whether they work with Kemari upon failover.  If they do, I'm
>> >> happy to drop this patch.
>> >>
>> >> Yoshi
>> >
>> > Look for this:
>> > stable migration image on a stopped vm
>> > sent on:
>> > Wed, 24 Nov 2010 17:52:49 +0200
>>
>> Thanks for the info.
>>
>> However, The patch series above didn't solve the issue.  In
>> case of Kemari, inuse is mostly > 0 because it queues the
>> output, and while last_avail_idx gets incremented
>> immediately, not sending inuse makes the state inconsistent
>> between Primary and Secondary.
>
> Hmm. Can we simply avoid incrementing last_avail_idx?

I think we can calculate or prepare an internal last_avail_idx,
and update the external when inuse is decremented.  I'll try
whether it work w/ w/o Kemari.

>
>>  I'm wondering why
>> last_avail_idx is OK to send but not inuse.
>
> last_avail_idx is at some level a mistake, it exposes part of
> our internal implementation, but it does *also* express
> a guest observable state.
>
> Here's the problem that it solves: just looking at the rings in virtio
> there is no way to detect that a specific request has already been
> completed. And the protocol forbids completing the same request twice.
>
> Our implementation always starts processing the requests
> in order, and since we flush outstanding requests
> before save, it works to just tell the remote 'process only requests
> after this place'.
>
> But there's no such requirement in the virtio protocol,
> so to be really generic we could add a bitmask of valid avail
> ring entries that did not complete yet. This would be
> the exact representation of the guest observable state.
> In practice we have rings of up to 512 entries.
> That's 64 byte per ring, not a lot at all.
>
> However, if we ever do change the protocol to send the bitmask,
> we would need some code to resubmit requests
> out of order, so it's not trivial.
>
> Another minor mistake with last_avail_idx is that it has
> some redundancy: the high bits in the index
> (> vq size) are not necessary as they can be
> got from avail idx.  There's a consistency check
> in load but we really should try to use formats
> that are always consistent.
>
>> The following patch does the same thing as original, yet
>> keeps the format of the virtio.  It shouldn't break live
>> migration either because inuse should be 0.
>>
>> Yoshi
>
> Question is, can you flush to make inuse 0 in kemari too?
> And if not, how do you handle the fact that some requests
> are in flight on the primary?

Although we try flushing requests one by one making inuse 0,
there are cases when it failovers to the secondary when inuse
isn't 0.  We handle these in flight request on the primary by
replaying on the secondary.

>
>> diff --git a/hw/virtio.c b/hw/virtio.c
>> index c8a0fc6..875c7ca 100644
>> --- a/hw/virtio.c
>> +++ b/hw/virtio.c
>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>>      qemu_put_be32(f, i);
>>
>>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>> +        uint16_t last_avail_idx;
>> +
>>          if (vdev->vq[i].vring.num == 0)
>>              break;
>>
>> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
>> +
>>          qemu_put_be32(f, vdev->vq[i].vring.num);
>>          qemu_put_be64(f, vdev->vq[i].pa);
>> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> +        qemu_put_be16s(f, &last_avail_idx);
>>          if (vdev->binding->save_queue)
>>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>>      }
>>
>>
>
> This looks wrong to me.  Requests can complete in any order, can they
> not?  So if request 0 did not complete and request 1 did not,
> you send avail - inuse and on the secondary you will process and
> complete request 1 the second time, crashing the guest.

In case of Kemari, no.  We sit between devices and net/block, and
queue the requests.  After completing each transaction, we flush
the requests one by one.  So there won't be completion inversion,
and therefore won't be visible to the guest.

Yoshi

>
>>
>> >
>> >> >
>> >> >> ---
>> >> >>  hw/virtio.c |    8 +++++++-
>> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
>> >> >>
>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >> index 849a60f..5509644 100644
>> >> >> --- a/hw/virtio.c
>> >> >> +++ b/hw/virtio.c
>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
>> >> >>      VRing vring;
>> >> >>      target_phys_addr_t pa;
>> >> >>      uint16_t last_avail_idx;
>> >> >> -    int inuse;
>> >> >> +    uint16_t inuse;
>> >> >>      uint16_t vector;
>> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>> >> >>      VirtIODevice *vdev;
>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >> >>          qemu_put_be32(f, vdev->vq[i].vring.num);
>> >> >>          qemu_put_be64(f, vdev->vq[i].pa);
>> >> >>          qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >> +        qemu_put_be16s(f, &vdev->vq[i].inuse);
>> >> >>          if (vdev->binding->save_queue)
>> >> >>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >> >>      }
>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>> >> >>          vdev->vq[i].vring.num = qemu_get_be32(f);
>> >> >>          vdev->vq[i].pa = qemu_get_be64(f);
>> >> >>          qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >> +        qemu_get_be16s(f, &vdev->vq[i].inuse);
>> >> >> +
>> >> >> +        /* revert last_avail_idx if there are outstanding emulation. */
>> >> >> +        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>> >> >> +        vdev->vq[i].inuse = 0;
>> >> >>
>> >> >>          if (vdev->vq[i].pa) {
>> >> >>              virtqueue_init(&vdev->vq[i]);
>> >> >> --
>> >> >> 1.7.1.2
>> >> >>
>> >> >> --
>> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >> >> the body of a message to majordomo@vger.kernel.org
>> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> > --
>> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >> > the body of a message to majordomo@vger.kernel.org
>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

Patch

diff --git a/hw/virtio.c b/hw/virtio.c
index c8a0fc6..875c7ca 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -664,12 +664,16 @@  void virtio_save(VirtIODevice *vdev, QEMUFile *f)
     qemu_put_be32(f, i);

     for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
+        uint16_t last_avail_idx;
+
         if (vdev->vq[i].vring.num == 0)
             break;

+        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
+
         qemu_put_be32(f, vdev->vq[i].vring.num);
         qemu_put_be64(f, vdev->vq[i].pa);
-        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
+        qemu_put_be16s(f, &last_avail_idx);
         if (vdev->binding->save_queue)
             vdev->binding->save_queue(vdev->binding_opaque, i, f);
     }