
Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse variable.

Message ID 20101226120151.GA1926@redhat.com
State New

Commit Message

Michael S. Tsirkin Dec. 26, 2010, 12:01 p.m. UTC
On Sun, Dec 26, 2010 at 07:57:52PM +0900, Yoshiaki Tamura wrote:
> 2010/12/26 Michael S. Tsirkin <mst@redhat.com>:
> > On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
> >> 2010/12/24 Michael S. Tsirkin <mst@redhat.com>:
> >> > On Fri, Dec 17, 2010 at 12:59:58AM +0900, Yoshiaki Tamura wrote:
> >> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> >> >> > On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
> >> >> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> >> >> >> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
> >> >> >> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
> >> >> >> >> >>> >> >>
> >> >> >> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >> >> >>> >> >
> >> >> >> >> >>> >> > This changes migration format, so it will break compatibility with
> >> >> >> >> >>> >> > existing drivers. More generally, I think migrating internal
> >> >> >> >> >>> >> > state that is not guest visible is always a mistake
> >> >> >> >> >>> >> > as it ties migration format to an internal implementation
> >> >> >> >> >>> >> > (yes, I know we do this sometimes, but we should at least
> >> >> >> >> >>> >> > try not to add such cases).  I think the right thing to do in this case
> >> >> >> >> >>> >> > is to flush outstanding
> >> >> >> >> >>> >> > work when vm is stopped.  Then, we are guaranteed that inuse is 0.
> >> >> >> >> >>> >> > I sent patches that do this for virtio net and block.
> >> >> >> >> >>> >>
> >> >> >> >> >>> >> Could you give me the link of your patches?  I'd like to test
> >> >> >> >> >>> >> whether they work with Kemari upon failover.  If they do, I'm
> >> >> >> >> >>> >> happy to drop this patch.
> >> >> >> >> >>> >>
> >> >> >> >> >>> >> Yoshi
> >> >> >> >> >>> >
> >> >> >> >> >>> > Look for this:
> >> >> >> >> >>> > stable migration image on a stopped vm
> >> >> >> >> >>> > sent on:
> >> >> >> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
> >> >> >> >> >>>
> >> >> >> >> >>> Thanks for the info.
> >> >> >> >> >>>
> >> >> >> >> >>> However, The patch series above didn't solve the issue.  In
> >> >> >> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
> >> >> >> >> >>> output, and while last_avail_idx gets incremented
> >> >> >> >> >>> immediately, not sending inuse makes the state inconsistent
> >> >> >> >> >>> between Primary and Secondary.
> >> >> >> >> >>
> >> >> >> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
> >> >> >> >> >
> >> >> >> >> > I think we can calculate or prepare an internal last_avail_idx,
> >> >> >> >> > and update the external when inuse is decremented.  I'll try
> >> >> >> >> > whether it work w/ w/o Kemari.
> >> >> >> >>
> >> >> >> >> Hi Michael,
> >> >> >> >>
> >> >> >> >> Could you please take a look at the following patch?
> >> >> >> >
> >> >> >> > Which version is this against?
> >> >> >>
> >> >> >> Oops.  It should be very old.
> >> >> >> 67f895bfe69f323b427b284430b6219c8a62e8d4
> >> >> >>
> >> >> >> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
> >> >> >> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >> >> Date:   Thu Dec 16 14:50:54 2010 +0900
> >> >> >> >>
> >> >> >> >>     virtio: update last_avail_idx when inuse is decreased.
> >> >> >> >>
> >> >> >> >>     Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >> >
> >> >> >> > It would be better to have a commit description explaining why a change
> >> >> >> > is made, and why it is correct, not just repeating what can be seen from
> >> >> >> > the diff anyway.
> >> >> >>
> >> >> >> Sorry for being lazy here.
> >> >> >>
> >> >> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >> >> index c8a0fc6..6688c02 100644
> >> >> >> >> --- a/hw/virtio.c
> >> >> >> >> +++ b/hw/virtio.c
> >> >> >> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
> >> >> >> >>      wmb();
> >> >> >> >>      trace_virtqueue_flush(vq, count);
> >> >> >> >>      vring_used_idx_increment(vq, count);
> >> >> >> >> +    vq->last_avail_idx += count;
> >> >> >> >>      vq->inuse -= count;
> >> >> >> >>  }
> >> >> >> >>
> >> >> >> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >> >> >>      unsigned int i, head, max;
> >> >> >> >>      target_phys_addr_t desc_pa = vq->vring.desc;
> >> >> >> >>
> >> >> >> >> -    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
> >> >> >> >> +    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
> >> >> >> >>          return 0;
> >> >> >> >>
> >> >> >> >>      /* When we start there are none of either input nor output. */
> >> >> >> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >> >> >>
> >> >> >> >>      max = vq->vring.num;
> >> >> >> >>
> >> >> >> >> -    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
> >> >> >> >> +    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
> >> >> >> >>
> >> >> >> >>      if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
> >> >> >> >>          if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
> >> >> >> >>
> >> >> >> >
> >> >> >> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
> >> >> >>
> >> >> >> I think there are two problems.
> >> >> >>
> >> >> >> 1. When to update last_avail_idx.
> >> >> >> 2. The ordering issue you're mentioning below.
> >> >> >>
> >> >> >> The patch above is only trying to address 1 because last time you
> >> >> >> mentioned that modifying last_avail_idx upon save may break the
> >> >> >> guest, which I agree.  If virtio_queue_empty and
> >> >> >> virtqueue_avail_bytes are only used internally, meaning invisible
> >> >> >> to the guest, I guess the approach above can be applied too.
> >> >> >
> >> >> > So IMHO 2 is the real issue. This is what was problematic
> >> >> > with the save patch, otherwise of course changes in save
> >> >> > are better than changes all over the codebase.
> >> >>
> >> >> All right.  Then let's focus on 2 first.
> >> >>
> >> >> >> > Previous patch version sure looked simpler, and this seems functionally
> >> >> >> > equivalent, so my question still stands: here it is rephrased in a
> >> >> >> > different way:
> >> >> >> >
> >> >> >> >        assume that we have in avail ring 2 requests at start of ring: A and B in this order
> >> >> >> >
> >> >> >> >        host pops A, then B, then completes B and flushes
> >> >> >> >
> >> >> >> >        now with this patch last_avail_idx will be 1, and then
> >> >> >> >        remote will get it, it will execute B again. As a result
> >> >> >> >        B will complete twice, and apparently A will never complete.
> >> >> >> >
> >> >> >> >
> >> >> >> > This is what I was saying below: assuming that there are
> >> >> >> > outstanding requests when we migrate, there is no way
> >> >> >> > a single index can be enough to figure out which requests
> >> >> >> > need to be handled and which are in flight already.
> >> >> >> >
> >> >> >> > We must add some kind of bitmask to tell us which is which.
> >> >> >>
> >> >> >> I should understand why this inversion can happen before solving
> >> >> >> the issue.
> >> >> >
> >> >> > It's a fundamental thing in virtio.
> >> >> > I think it is currently only likely to happen with block, I think tap
> >> >> > currently completes things in order.  In any case relying on this in the
> >> >> > frontend is a mistake.
> >> >> >
> >> >> >>  Currently, how are you making virtio-net flush
> >> >> >> every request for live migration?  Is it qemu_aio_flush()?
> >> >> >
> >> >> > Think so.
> >> >>
> >> >> If qemu_aio_flush() is responsible for flushing the outstanding
> >> >> virtio-net requests, I'm wondering why it's a problem for Kemari.
> >> >> As I described in the previous message, Kemari queues the
> >> >> requests first.  So in your example above, it should start with
> >> >>
> >> >> virtio-net: last_avail_idx 0 inuse 2
> >> >> event-tap: {A,B}
> >> >>
> >> >> As you know, the requests are still in order because the net
> >> >> layer initiates them in order.  This is not about completion order.
> >> >>
> >> >> In the first synchronization, the status above is transferred.  In
> >> >> the next synchronization, the status will be as follows.
> >> >>
> >> >> virtio-net: last_avail_idx 1 inuse 1
> >> >> event-tap: {B}
> >> >
> >> > OK, this answers the ordering question.
> >>
> >> Glad to hear that!
> >>
> >> > Another question: at this point we transfer this status: both
> >> > event-tap and virtio ring have the command B,
> >> > so the remote will have:
> >> >
> >> > virtio-net: inuse 0
> >> > event-tap: {B}
> >> >
> >> > Is this right? This already seems to be a problem as when B completes
> >> > inuse will go negative?
> >>
> >> I think state above is wrong.  inuse 0 means there shouldn't be
> >> any requests in event-tap.  Note that the callback is called only
> >> when event-tap flushes the requests.
> >>
> >> > Next it seems that the remote virtio will resubmit B to event-tap. The
> >> > remote will then have:
> >> >
> >> > virtio-net: inuse 1
> >> > event-tap: {B, B}
> >> >
> >> > This looks kind of wrong ... will two packets go out?
> >>
> >> No.  Currently, we're just replaying the requests with pio/mmio.
> >
> > You do?  What purpose do the hooks in bdrv/net serve then?
> > A placeholder for the future?
> 
> Not only for that reason.  The hooks in bdrv/net are the main
> functions that queue requests and start synchronization.
> The pio/mmio hooks are there for recording what initiated the
> requests monitored in the bdrv/net layer.  I would like to remove
> the pio/mmio part if we can make bdrv/net level replay possible.
> 
> Yoshi

I think I begin to see. So when event-tap does a replay,
we will probably need to pass the inuse value.
But since we generally don't try to support new->old
cross-version migrations in qemu, my guess is that
it is better not to change the format in anticipation
right now.

So basically for now we just need to add a comment explaining
the reason for moving last_avail_idx back.
Does something like the below (completely untested) make sense?

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
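
A minimal sketch of the arithmetic behind "moving last_avail_idx back",
not the patch itself. last_avail_idx is a free-running 16-bit counter, so
subtracting inuse has to wrap modulo 2^16. The names rewound_avail_idx and
check_rewind_examples are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the index a saver would send if it wants the
 * remote side to see the last `inuse` requests again. Both operands are
 * promoted to int for the subtraction; the cast back to uint16_t gives
 * the modulo-65536 result that a 16-bit ring index needs. */
uint16_t rewound_avail_idx(uint16_t last_avail_idx, uint16_t inuse)
{
    return (uint16_t)(last_avail_idx - inuse);
}

/* Small self-check covering the no-op, the simple and the wrapping case. */
void check_rewind_examples(void)
{
    assert(rewound_avail_idx(2, 0) == 2);     /* nothing in flight */
    assert(rewound_avail_idx(2, 1) == 1);     /* one request to repeat */
    assert(rewound_avail_idx(0, 2) == 65534); /* counter wrapped around */
}
```

Keeping the result in a uint16_t is what makes the wrapped case come out
as a valid ring index instead of a negative number.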

Comments

Yoshiaki Tamura Dec. 26, 2010, 12:16 p.m. UTC | #1
2010/12/26 Michael S. Tsirkin <mst@redhat.com>:
> On Sun, Dec 26, 2010 at 07:57:52PM +0900, Yoshiaki Tamura wrote:
>> 2010/12/26 Michael S. Tsirkin <mst@redhat.com>:
>> > On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
>> >> 2010/12/24 Michael S. Tsirkin <mst@redhat.com>:
>> >> > On Fri, Dec 17, 2010 at 12:59:58AM +0900, Yoshiaki Tamura wrote:
>> >> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> > On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> >> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
>> >> >> >> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> >> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> >> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> >> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
>> >> >> >> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
>> >> >> >> >> >>> >> >>
>> >> >> >> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >> >> >> >>> >> >
>> >> >> >> >> >>> >> > This changes migration format, so it will break compatibility with
>> >> >> >> >> >>> >> > existing drivers. More generally, I think migrating internal
>> >> >> >> >> >>> >> > state that is not guest visible is always a mistake
>> >> >> >> >> >>> >> > as it ties migration format to an internal implementation
>> >> >> >> >> >>> >> > (yes, I know we do this sometimes, but we should at least
>> >> >> >> >> >>> >> > try not to add such cases).  I think the right thing to do in this case
>> >> >> >> >> >>> >> > is to flush outstanding
>> >> >> >> >> >>> >> > work when vm is stopped.  Then, we are guaranteed that inuse is 0.
>> >> >> >> >> >>> >> > I sent patches that do this for virtio net and block.
>> >> >> >> >> >>> >>
>> >> >> >> >> >>> >> Could you give me the link of your patches?  I'd like to test
>> >> >> >> >> >>> >> whether they work with Kemari upon failover.  If they do, I'm
>> >> >> >> >> >>> >> happy to drop this patch.
>> >> >> >> >> >>> >>
>> >> >> >> >> >>> >> Yoshi
>> >> >> >> >> >>> >
>> >> >> >> >> >>> > Look for this:
>> >> >> >> >> >>> > stable migration image on a stopped vm
>> >> >> >> >> >>> > sent on:
>> >> >> >> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
>> >> >> >> >> >>>
>> >> >> >> >> >>> Thanks for the info.
>> >> >> >> >> >>>
>> >> >> >> >> >>> However, The patch series above didn't solve the issue.  In
>> >> >> >> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
>> >> >> >> >> >>> output, and while last_avail_idx gets incremented
>> >> >> >> >> >>> immediately, not sending inuse makes the state inconsistent
>> >> >> >> >> >>> between Primary and Secondary.
>> >> >> >> >> >>
>> >> >> >> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
>> >> >> >> >> >
>> >> >> >> >> > I think we can calculate or prepare an internal last_avail_idx,
>> >> >> >> >> > and update the external when inuse is decremented.  I'll try
>> >> >> >> >> > whether it work w/ w/o Kemari.
>> >> >> >> >>
>> >> >> >> >> Hi Michael,
>> >> >> >> >>
>> >> >> >> >> Could you please take a look at the following patch?
>> >> >> >> >
>> >> >> >> > Which version is this against?
>> >> >> >>
>> >> >> >> Oops.  It should be very old.
>> >> >> >> 67f895bfe69f323b427b284430b6219c8a62e8d4
>> >> >> >>
>> >> >> >> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
>> >> >> >> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >> >> >> Date:   Thu Dec 16 14:50:54 2010 +0900
>> >> >> >> >>
>> >> >> >> >>     virtio: update last_avail_idx when inuse is decreased.
>> >> >> >> >>
>> >> >> >> >>     Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >> >> >
>> >> >> >> > It would be better to have a commit description explaining why a change
>> >> >> >> > is made, and why it is correct, not just repeating what can be seen from
>> >> >> >> > the diff anyway.
>> >> >> >>
>> >> >> >> Sorry for being lazy here.
>> >> >> >>
>> >> >> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >> >> >> index c8a0fc6..6688c02 100644
>> >> >> >> >> --- a/hw/virtio.c
>> >> >> >> >> +++ b/hw/virtio.c
>> >> >> >> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
>> >> >> >> >>      wmb();
>> >> >> >> >>      trace_virtqueue_flush(vq, count);
>> >> >> >> >>      vring_used_idx_increment(vq, count);
>> >> >> >> >> +    vq->last_avail_idx += count;
>> >> >> >> >>      vq->inuse -= count;
>> >> >> >> >>  }
>> >> >> >> >>
>> >> >> >> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>> >> >> >> >>      unsigned int i, head, max;
>> >> >> >> >>      target_phys_addr_t desc_pa = vq->vring.desc;
>> >> >> >> >>
>> >> >> >> >> -    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
>> >> >> >> >> +    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
>> >> >> >> >>          return 0;
>> >> >> >> >>
>> >> >> >> >>      /* When we start there are none of either input nor output. */
>> >> >> >> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>> >> >> >> >>
>> >> >> >> >>      max = vq->vring.num;
>> >> >> >> >>
>> >> >> >> >> -    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
>> >> >> >> >> +    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
>> >> >> >> >>
>> >> >> >> >>      if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
>> >> >> >> >>          if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
>> >> >> >>
>> >> >> >> I think there are two problems.
>> >> >> >>
>> >> >> >> 1. When to update last_avail_idx.
>> >> >> >> 2. The ordering issue you're mentioning below.
>> >> >> >>
>> >> >> >> The patch above is only trying to address 1 because last time you
>> >> >> >> mentioned that modifying last_avail_idx upon save may break the
>> >> >> >> guest, which I agree.  If virtio_queue_empty and
>> >> >> >> virtqueue_avail_bytes are only used internally, meaning invisible
>> >> >> >> to the guest, I guess the approach above can be applied too.
>> >> >> >
>> >> >> > So IMHO 2 is the real issue. This is what was problematic
>> >> >> > with the save patch, otherwise of course changes in save
>> >> >> > are better than changes all over the codebase.
>> >> >>
>> >> >> All right.  Then let's focus on 2 first.
>> >> >>
>> >> >> >> > Previous patch version sure looked simpler, and this seems functionally
>> >> >> >> > equivalent, so my question still stands: here it is rephrased in a
>> >> >> >> > different way:
>> >> >> >> >
>> >> >> >> >        assume that we have in avail ring 2 requests at start of ring: A and B in this order
>> >> >> >> >
>> >> >> >> >        host pops A, then B, then completes B and flushes
>> >> >> >> >
>> >> >> >> >        now with this patch last_avail_idx will be 1, and then
>> >> >> >> >        remote will get it, it will execute B again. As a result
>> >> >> >> >        B will complete twice, and apparently A will never complete.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > This is what I was saying below: assuming that there are
>> >> >> >> > outstanding requests when we migrate, there is no way
>> >> >> >> > a single index can be enough to figure out which requests
>> >> >> >> > need to be handled and which are in flight already.
>> >> >> >> >
>> >> >> >> > We must add some kind of bitmask to tell us which is which.
>> >> >> >>
>> >> >> >> I should understand why this inversion can happen before solving
>> >> >> >> the issue.
>> >> >> >
>> >> >> > It's a fundamental thing in virtio.
>> >> >> > I think it is currently only likely to happen with block, I think tap
>> >> >> > currently completes things in order.  In any case relying on this in the
>> >> >> > frontend is a mistake.
>> >> >> >
>> >> >> >>  Currently, how are you making virtio-net flush
>> >> >> >> every request for live migration?  Is it qemu_aio_flush()?
>> >> >> >
>> >> >> > Think so.
>> >> >>
>> >> >> If qemu_aio_flush() is responsible for flushing the outstanding
>> >> >> virtio-net requests, I'm wondering why it's a problem for Kemari.
>> >> >> As I described in the previous message, Kemari queues the
>> >> >> requests first.  So in your example above, it should start with
>> >> >>
>> >> >> virtio-net: last_avail_idx 0 inuse 2
>> >> >> event-tap: {A,B}
>> >> >>
>> >> >> As you know, the requests are still in order because the net
>> >> >> layer initiates them in order.  This is not about completion order.
>> >> >>
>> >> >> In the first synchronization, the status above is transferred.  In
>> >> >> the next synchronization, the status will be as follows.
>> >> >>
>> >> >> virtio-net: last_avail_idx 1 inuse 1
>> >> >> event-tap: {B}
>> >> >
>> >> > OK, this answers the ordering question.
>> >>
>> >> Glad to hear that!
>> >>
>> >> > Another question: at this point we transfer this status: both
>> >> > event-tap and virtio ring have the command B,
>> >> > so the remote will have:
>> >> >
>> >> > virtio-net: inuse 0
>> >> > event-tap: {B}
>> >> >
>> >> > Is this right? This already seems to be a problem as when B completes
>> >> > inuse will go negative?
>> >>
>> >> I think state above is wrong.  inuse 0 means there shouldn't be
>> >> any requests in event-tap.  Note that the callback is called only
>> >> when event-tap flushes the requests.
>> >>
>> >> > Next it seems that the remote virtio will resubmit B to event-tap. The
>> >> > remote will then have:
>> >> >
>> >> > virtio-net: inuse 1
>> >> > event-tap: {B, B}
>> >> >
>> >> > This looks kind of wrong ... will two packets go out?
>> >>
>> >> No.  Currently, we're just replaying the requests with pio/mmio.
>> >
>> > You do?  What purpose do the hooks in bdrv/net serve then?
>> > A placeholder for the future?
>>
>> Not only for that reason.  The hooks in bdrv/net are the main
>> functions that queue requests and start synchronization.
>> The pio/mmio hooks are there for recording what initiated the
>> requests monitored in the bdrv/net layer.  I would like to remove
>> the pio/mmio part if we can make bdrv/net level replay possible.
>>
>> Yoshi
>
> I think I begin to see. So when event-tap does a replay,
> we will probably need to pass the inuse value.

Completely correct.

> But since we generally don't try to support new->old
> cross-version migrations in qemu, my guess is that
> it is better not to change the format in anticipation
> right now.

I agree.

> So basically for now we just need to add a comment explaining
> the reason for moving last_avail_idx back.
> Does something like the below (completely untested) make sense?

Yes, it does.  Thank you for writing a decent comment.  Can I put
the patch into my series as is?

Yoshi

>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>
> diff --git a/hw/virtio.c b/hw/virtio.c
> index 07dbf86..d1509f28 100644
> --- a/hw/virtio.c
> +++ b/hw/virtio.c
> @@ -665,12 +665,20 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>     qemu_put_be32(f, i);
>
>     for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> +        /* For regular migration inuse == 0 always, as
> +         * requests are flushed before save. However, the
> +         * event-tap log, when enabled, introduces an extra
> +         * queue for requests which is not flushed, so the
> +         * last inuse requests are left in the event-tap queue.
> +         * Move the last_avail_idx value sent to the remote back
> +         * so that it repeats the last inuse requests. */
> +        uint16_t last_avail = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
>         if (vdev->vq[i].vring.num == 0)
>             break;
>
>         qemu_put_be32(f, vdev->vq[i].vring.num);
>         qemu_put_be64(f, vdev->vq[i].pa);
> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> +        qemu_put_be16s(f, &last_avail);
>         if (vdev->binding->save_queue)
>             vdev->binding->save_queue(vdev->binding_opaque, i, f);
>     }
>
>
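
The effect of the patch quoted above can be modelled with a toy queue.
This is an illustration under assumptions, not QEMU code: ToyQueue,
toy_pop, toy_flush and toy_saved_idx are hypothetical names, and the model
assumes requests complete in order, which earlier in the thread was said
to hold for tap but not necessarily for block.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two VirtQueue fields under discussion. */
typedef struct ToyQueue {
    uint16_t last_avail_idx; /* next avail-ring slot the host will pop */
    uint16_t inuse;          /* popped but not yet flushed requests */
} ToyQueue;

/* Popping a request advances the index immediately and marks it in use. */
void toy_pop(ToyQueue *q)
{
    q->last_avail_idx++;
    q->inuse++;
}

/* Flushing `count` completed requests only lowers the in-use counter. */
void toy_flush(ToyQueue *q, uint16_t count)
{
    assert(count <= q->inuse);
    q->inuse = (uint16_t)(q->inuse - count);
}

/* Index written to the migration stream: last_avail_idx moved back by
 * inuse, so the remote pops the still-outstanding requests again. */
uint16_t toy_saved_idx(const ToyQueue *q)
{
    return (uint16_t)(q->last_avail_idx - q->inuse);
}
```

With requests A and B popped and nothing flushed, the saved index is 0, so
the remote would repeat both. Once A is flushed the saved index becomes 1
and only B is repeated, matching the two synchronization states described
in the thread.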
Michael S. Tsirkin Dec. 26, 2010, 12:17 p.m. UTC | #2
On Sun, Dec 26, 2010 at 09:16:28PM +0900, Yoshiaki Tamura wrote:
> 2010/12/26 Michael S. Tsirkin <mst@redhat.com>:
> > On Sun, Dec 26, 2010 at 07:57:52PM +0900, Yoshiaki Tamura wrote:
> >> 2010/12/26 Michael S. Tsirkin <mst@redhat.com>:
> >> > On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
> >> >> 2010/12/24 Michael S. Tsirkin <mst@redhat.com>:
> >> >> > On Fri, Dec 17, 2010 at 12:59:58AM +0900, Yoshiaki Tamura wrote:
> >> >> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> > On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> >> >> >> >> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
> >> >> >> >> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
> >> >> >> >> >> >>> >> >>
> >> >> >> >> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >> >> >> >>> >> >
> >> >> >> >> >> >>> >> > This changes migration format, so it will break compatibility with
> >> >> >> >> >> >>> >> > existing drivers. More generally, I think migrating internal
> >> >> >> >> >> >>> >> > state that is not guest visible is always a mistake
> >> >> >> >> >> >>> >> > as it ties migration format to an internal implementation
> >> >> >> >> >> >>> >> > (yes, I know we do this sometimes, but we should at least
> >> >> >> >> >> >>> >> > try not to add such cases).  I think the right thing to do in this case
> >> >> >> >> >> >>> >> > is to flush outstanding
> >> >> >> >> >> >>> >> > work when vm is stopped.  Then, we are guaranteed that inuse is 0.
> >> >> >> >> >> >>> >> > I sent patches that do this for virtio net and block.
> >> >> >> >> >> >>> >>
> >> >> >> >> >> >>> >> Could you give me the link of your patches?  I'd like to test
> >> >> >> >> >> >>> >> whether they work with Kemari upon failover.  If they do, I'm
> >> >> >> >> >> >>> >> happy to drop this patch.
> >> >> >> >> >> >>> >>
> >> >> >> >> >> >>> >> Yoshi
> >> >> >> >> >> >>> >
> >> >> >> >> >> >>> > Look for this:
> >> >> >> >> >> >>> > stable migration image on a stopped vm
> >> >> >> >> >> >>> > sent on:
> >> >> >> >> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
> >> >> >> >> >> >>>
> >> >> >> >> >> >>> Thanks for the info.
> >> >> >> >> >> >>>
> >> >> >> >> >> >>> However, The patch series above didn't solve the issue.  In
> >> >> >> >> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
> >> >> >> >> >> >>> output, and while last_avail_idx gets incremented
> >> >> >> >> >> >>> immediately, not sending inuse makes the state inconsistent
> >> >> >> >> >> >>> between Primary and Secondary.
> >> >> >> >> >> >>
> >> >> >> >> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
> >> >> >> >> >> >
> >> >> >> >> >> > I think we can calculate or prepare an internal last_avail_idx,
> >> >> >> >> >> > and update the external when inuse is decremented.  I'll try
> >> >> >> >> >> > whether it work w/ w/o Kemari.
> >> >> >> >> >>
> >> >> >> >> >> Hi Michael,
> >> >> >> >> >>
> >> >> >> >> >> Could you please take a look at the following patch?
> >> >> >> >> >
> >> >> >> >> > Which version is this against?
> >> >> >> >>
> >> >> >> >> Oops.  It should be very old.
> >> >> >> >> 67f895bfe69f323b427b284430b6219c8a62e8d4
> >> >> >> >>
> >> >> >> >> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
> >> >> >> >> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >> >> >> Date:   Thu Dec 16 14:50:54 2010 +0900
> >> >> >> >> >>
> >> >> >> >> >>     virtio: update last_avail_idx when inuse is decreased.
> >> >> >> >> >>
> >> >> >> >> >>     Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >> >> >
> >> >> >> >> > It would be better to have a commit description explaining why a change
> >> >> >> >> > is made, and why it is correct, not just repeating what can be seen from
> >> >> >> >> > the diff anyway.
> >> >> >> >>
> >> >> >> >> Sorry for being lazy here.
> >> >> >> >>
> >> >> >> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >> >> >> index c8a0fc6..6688c02 100644
> >> >> >> >> >> --- a/hw/virtio.c
> >> >> >> >> >> +++ b/hw/virtio.c
> >> >> >> >> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
> >> >> >> >> >>      wmb();
> >> >> >> >> >>      trace_virtqueue_flush(vq, count);
> >> >> >> >> >>      vring_used_idx_increment(vq, count);
> >> >> >> >> >> +    vq->last_avail_idx += count;
> >> >> >> >> >>      vq->inuse -= count;
> >> >> >> >> >>  }
> >> >> >> >> >>
> >> >> >> >> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >> >> >> >>      unsigned int i, head, max;
> >> >> >> >> >>      target_phys_addr_t desc_pa = vq->vring.desc;
> >> >> >> >> >>
> >> >> >> >> >> -    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
> >> >> >> >> >> +    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
> >> >> >> >> >>          return 0;
> >> >> >> >> >>
> >> >> >> >> >>      /* When we start there are none of either input nor output. */
> >> >> >> >> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >> >> >> >>
> >> >> >> >> >>      max = vq->vring.num;
> >> >> >> >> >>
> >> >> >> >> >> -    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
> >> >> >> >> >> +    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
> >> >> >> >> >>
> >> >> >> >> >>      if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
> >> >> >> >> >>          if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
> >> >> >> >>
> >> >> >> >> I think there are two problems.
> >> >> >> >>
> >> >> >> >> 1. When to update last_avail_idx.
> >> >> >> >> 2. The ordering issue you're mentioning below.
> >> >> >> >>
> >> >> >> >> The patch above is only trying to address 1 because last time you
> >> >> >> >> mentioned that modifying last_avail_idx upon save may break the
> >> >> >> >> guest, which I agree.  If virtio_queue_empty and
> >> >> >> >> virtqueue_avail_bytes are only used internally, meaning invisible
> >> >> >> >> to the guest, I guess the approach above can be applied too.
> >> >> >> >
> >> >> >> > So IMHO 2 is the real issue. This is what was problematic
> >> >> >> > with the save patch, otherwise of course changes in save
> >> >> >> > are better than changes all over the codebase.
> >> >> >>
> >> >> >> All right.  Then let's focus on 2 first.
> >> >> >>
> >> >> >> >> > Previous patch version sure looked simpler, and this seems functionally
> >> >> >> >> > equivalent, so my question still stands: here it is rephrased in a
> >> >> >> >> > different way:
> >> >> >> >> >
> >> >> >> >> >        assume that we have in avail ring 2 requests at start of ring: A and B in this order
> >> >> >> >> >
> >> >> >> >> >        host pops A, then B, then completes B and flushes
> >> >> >> >> >
> >> >> >> >> >        now with this patch last_avail_idx will be 1, and then
> >> >> >> >> >        remote will get it, it will execute B again. As a result
> >> >> >> >> >        B will complete twice, and apparently A will never complete.
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > This is what I was saying below: assuming that there are
> >> >> >> >> > outstanding requests when we migrate, there is no way
> >> >> >> >> > a single index can be enough to figure out which requests
> >> >> >> >> > need to be handled and which are in flight already.
> >> >> >> >> >
> >> >> >> >> > We must add some kind of bitmask to tell us which is which.
> >> >> >> >>
> >> >> >> >> I should understand why this inversion can happen before solving
> >> >> >> >> the issue.
> >> >> >> >
> >> >> >> > It's a fundamental thing in virtio.
> >> >> >> > I think it is currently only likely to happen with block; I think tap
> >> >> >> > currently completes things in order.  In any case relying on this in the
> >> >> >> > frontend is a mistake.
> >> >> >> >
> >> >> >> >>  Currently, how are you making virtio-net flush
> >> >> >> >> every request for live migration?  Is it qemu_aio_flush()?
> >> >> >> >
> >> >> >> > Think so.
> >> >> >>
> >> >> >> If qemu_aio_flush() is responsible for flushing the outstanding
> >> >> >> virtio-net requests, I'm wondering why it's a problem for Kemari.
> >> >> >> As I described in the previous message, Kemari queues the
> >> >> >> requests first.  So in your example above, it should start with
> >> >> >>
> >> >> >> virtio-net: last_avail_idx 0 inuse 2
> >> >> >> event-tap: {A,B}
> >> >> >>
> >> >> >> As you know, the requests are still in order because the net
> >> >> >> layer initiates them in order.  This is not about completion order.
> >> >> >>
> >> >> >> In the first synchronization, the status above is transferred.  In
> >> >> >> the next synchronization, the status will be as following.
> >> >> >>
> >> >> >> virtio-net: last_avail_idx 1 inuse 1
> >> >> >> event-tap: {B}
> >> >> >
> >> >> > OK, this answers the ordering question.
> >> >>
> >> >> Glad to hear that!
> >> >>
> >> >> > Another question: at this point we transfer this status: both
> >> >> > event-tap and virtio ring have the command B,
> >> >> > so the remote will have:
> >> >> >
> >> >> > virtio-net: inuse 0
> >> >> > event-tap: {B}
> >> >> >
> >> >> > Is this right? This already seems to be a problem as when B completes
> >> >> > inuse will go negative?
> >> >>
> >> >> I think the state above is wrong.  inuse 0 means there shouldn't be
> >> >> any requests in event-tap.  Note that the callback is called only
> >> >> when event-tap flushes the requests.
> >> >>
> >> >> > Next it seems that the remote virtio will resubmit B to event-tap. The
> >> >> > remote will then have:
> >> >> >
> >> >> > virtio-net: inuse 1
> >> >> > event-tap: {B, B}
> >> >> >
> >> >> > This looks kind of wrong ... will two packets go out?
> >> >>
> >> >> No.  Currently, we're just replaying the requests with pio/mmio.
> >> >
> >> > You do?  What purpose do the hooks in bdrv/net serve then?
> >> > A placeholder for the future?
> >>
> >> Not only for that reason.  The hooks in bdrv/net are the main
> >> functions that queue requests and start synchronization.
> >> The pio/mmio hooks are there for recording what initiated the
> >> requests monitored in the bdrv/net layer.  I would like to remove
> >> the pio/mmio part if we can make bdrv/net level replay possible.
> >>
> >> Yoshi
> >
> > I think I begin to see. So when event-tap does a replay,
> > we will probably need to pass the inuse value.
> 
> Completely correct.
> 
> > But since we generally don't try to support new->old
> > cross-version migrations in qemu, my guess is that
> > it is better not to change the format in anticipation
> > right now.
> 
> I agree.
> 
> > So basically for now we just need to add a comment explaining
> > the reason for moving last_avail_idx back.
> > Does something like the below (completely untested) make sense?
> 
> Yes, it does.  Thank you for adding a decent comment.  Can I put
> the patch into my series as is?
> 
> Yoshi

Sure.

> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >
> > diff --git a/hw/virtio.c b/hw/virtio.c
> > index 07dbf86..d1509f28 100644
> > --- a/hw/virtio.c
> > +++ b/hw/virtio.c
> > @@ -665,12 +665,20 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >     qemu_put_be32(f, i);
> >
> >     for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> > +        /* For regular migration inuse == 0 always as
> > +         * requests are flushed before save. However,
> > +         * event-tap log when enabled introduces an extra
> > +         * queue for requests which is not being flushed,
> > +         * thus the last inuse requests are left in the event-tap queue.
> > +         * Move the last_avail_idx value sent to the remote back
> > +         * to make it repeat the last inuse requests. */
> > +        uint16_t last_avail = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
> >         if (vdev->vq[i].vring.num == 0)
> >             break;
> >
> >         qemu_put_be32(f, vdev->vq[i].vring.num);
> >         qemu_put_be64(f, vdev->vq[i].pa);
> > -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> > +        qemu_put_be16s(f, &last_avail);
> >         if (vdev->binding->save_queue)
> >             vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >     }
> >
> >

Patch

diff --git a/hw/virtio.c b/hw/virtio.c
index 07dbf86..d1509f28 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -665,12 +665,20 @@  void virtio_save(VirtIODevice *vdev, QEMUFile *f)
     qemu_put_be32(f, i);
 
     for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
+        /* For regular migration inuse == 0 always as
+         * requests are flushed before save. However, 
+         * event-tap log when enabled introduces an extra
+         * queue for requests which is not being flushed,
+         * thus the last inuse requests are left in the event-tap queue.
+         * Move the last_avail_idx value sent to the remote back
+         * to make it repeat the last inuse requests. */
+        uint16_t last_avail = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
         if (vdev->vq[i].vring.num == 0)
             break;
 
         qemu_put_be32(f, vdev->vq[i].vring.num);
         qemu_put_be64(f, vdev->vq[i].pa);
-        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
+        qemu_put_be16s(f, &last_avail);
         if (vdev->binding->save_queue)
             vdev->binding->save_queue(vdev->binding_opaque, i, f);
     }
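
The subtraction in the patch relies on 16-bit wraparound when the index has recently wrapped. A minimal sketch of that arithmetic follows; `saved_last_avail` is a hypothetical helper, and treating `inuse` as `uint16_t` is an assumption taken from the description of the original patch in this series.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the expression used in the patch.
 * Both operands are promoted to int for the subtraction; converting the
 * result back to uint16_t reduces it modulo 65536, so a free-running
 * 16-bit ring index that has wrapped past zero is rewound correctly. */
static uint16_t saved_last_avail(uint16_t last_avail_idx, uint16_t inuse)
{
    return (uint16_t)(last_avail_idx - inuse);
}
```

With `inuse == 0`, the regular migration case, the value sent is unchanged. With `last_avail_idx == 1` and `inuse == 2`, the result wraps to 65535 rather than going negative.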