[PATCHv2] virtio: verify that all outstanding buffers are flushed

Message ID 20121212105101.GA6461@redhat.com
State New

Commit Message

Michael S. Tsirkin Dec. 12, 2012, 10:51 a.m. UTC
Add sanity check to address the following concern:

During migration, all we pass is the index of the request;
the rest can be re-read from the ring.

This is not generally enough if any available requests are outstanding.
Imagine a ring of size 4.  Below, A means available and U means used.

A 1
A 2
U 2
A 2
U 2
A 2
U 2
A 2
U 2

At this point the available ring has wrapped around; the only
way to know head 1 is outstanding is because backend
has stored this info somewhere.

The reason we manage to migrate without tracking this in migration
state is because we flush outstanding requests before
migration.
This flush is device-specific though, so let's add
a safeguard in the virtio core to ensure it's done properly.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---

Changes from v1:
    v1 was against the wrong tree, it didn't build against qemu.git

 hw/virtio.c | 2 ++
 1 file changed, 2 insertions(+)
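
The scenario in the commit message can be reproduced with a small
standalone program (a simplified sketch, not QEMU code; the ring layout
and the inuse bookkeeping here are illustrative assumptions only):

/*
 * Head 1 is made available once and never completed, while head 2 is
 * made available and completed four times, wrapping the size-4 ring.
 */
#include <stdio.h>
#include <stdint.h>

#define RING_SIZE 4

int main(void)
{
    uint16_t avail[RING_SIZE];     /* guest-written available ring entries */
    uint16_t avail_idx = 0;        /* guest's free-running available index */
    uint16_t last_avail_idx = 0;   /* device's consumption index           */
    unsigned inuse = 0;            /* buffers popped but not yet completed */

    /* A 1: guest posts head 1, device pops it but never completes it. */
    avail[avail_idx++ % RING_SIZE] = 1;
    last_avail_idx++;
    inuse++;

    /* Four rounds of A 2 / U 2: head 2 is posted and completed each time. */
    for (int i = 0; i < 4; i++) {
        avail[avail_idx++ % RING_SIZE] = 2;
        last_avail_idx++;
        inuse++;        /* device pops head 2 */
        inuse--;        /* device completes head 2, used ring updated */
    }

    /* The slot that once held head 1 has been overwritten by a later
     * head 2, so the ring contents plus last_avail_idx no longer show
     * that head 1 is outstanding; only backend state (inuse here) does. */
    for (int i = 0; i < RING_SIZE; i++) {
        printf("avail[%d] = %u\n", i, avail[i]);
    }
    printf("avail_idx = %u, last_avail_idx = %u, inuse = %u\n",
           avail_idx, last_avail_idx, inuse);
    return 0;
}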

Comments

Stefan Hajnoczi Dec. 12, 2012, 1:50 p.m. UTC | #1
On Wed, Dec 12, 2012 at 12:51:01PM +0200, Michael S. Tsirkin wrote:
> Add sanity check to address the following concern:
> 
> During migration, all we pass is the index of the request;
> the rest can be re-read from the ring.
> 
> This is not generally enough if any available requests are outstanding.
> Imagine a ring of size 4.  Below A means available U means used.
> 
> A 1
> A 2
> U 2
> A 2
> U 2
> A 2
> U 2
> A 2
> U 2
> 
> At this point available ring has wrapped around, the only
> way to know head 1 is outstanding is because backend
> has stored this info somewhere.
> 
> The reason we manage to migrate without tracking this in migration
> state is because we flush outstanding requests before
> migration.
> This flush is device-specific though, let's add
> a safeguard in virtio core to ensure it's done properly.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> 
> Changes from v1:
>     v1 was against the wrong tree, it didn't build against qemu.git
> 
>  hw/virtio.c | 2 ++
>  1 file changed, 2 insertions(+)

VirtIOBlock->rq can trigger the assertion.

IIUC hw/virtio-blk.c may handle I/O errors by keeping the request
pending and on a list (->rq).  This allows the user to restart them
after, for example, adding more space to the host file system containing
the disk image file.

We keep a list of failed requests and we migrate this list.  So I think
inuse != 0 when migrating with pending failed I/O requests.

Stefan
Paolo Bonzini Dec. 12, 2012, 2:01 p.m. UTC | #2
Il 12/12/2012 14:50, Stefan Hajnoczi ha scritto:
> VirtIOBlock->rq can trigger the assertion.
> 
> IIUC hw/virtio-blk.c may handle I/O errors by keeping the request
> pending and on a list (->rq).  This allows the user to restart them
> after, for example, adding more space to the host file system containing
> the disk image file.
> 
> We keep a list of failed requests and we migrate this list.  So I think
> inuse != 0 when migrating with pending failed I/O requests.

Same for virtio-scsi.  Each request in that case is sent as part of the
SCSIDevice that it refers to, via callbacks in SCSIBusInfo.

Paolo
Michael S. Tsirkin Dec. 12, 2012, 2:26 p.m. UTC | #3
On Wed, Dec 12, 2012 at 02:50:50PM +0100, Stefan Hajnoczi wrote:
> On Wed, Dec 12, 2012 at 12:51:01PM +0200, Michael S. Tsirkin wrote:
> > Add sanity check to address the following concern:
> > 
> > During migration, all we pass is the index of the request;
> > the rest can be re-read from the ring.
> > 
> > This is not generally enough if any available requests are outstanding.
> > Imagine a ring of size 4.  Below A means available U means used.
> > 
> > A 1
> > A 2
> > U 2
> > A 2
> > U 2
> > A 2
> > U 2
> > A 2
> > U 2
> > 
> > At this point available ring has wrapped around, the only
> > way to know head 1 is outstanding is because backend
> > has stored this info somewhere.
> > 
> > The reason we manage to migrate without tracking this in migration
> > state is because we flush outstanding requests before
> > migration.
> > This flush is device-specific though, let's add
> > a safeguard in virtio core to ensure it's done properly.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> > 
> > Changes from v1:
> >     v1 was against the wrong tree, it didn't build against qemu.git
> > 
> >  hw/virtio.c | 2 ++
> >  1 file changed, 2 insertions(+)
> 
> VirtIOBlock->rq can trigger the assertion.
> 
> IIUC hw/virtio-blk.c may handle I/O errors by keeping the request
> pending and on a list (->rq).  This allows the user to restart them
> after, for example, adding more space to the host file system containing
> the disk image file.
> 
> We keep a list of failed requests and we migrate this list.

Could not find it. It needs to be in virtio-blk in order
to know to update the used ring once it completes, right?

>  So I think
> inuse != 0 when migrating with pending failed I/O requests.
> 
> Stefan

Okay but let's make sure this is not a bug.
Michael S. Tsirkin Dec. 12, 2012, 2:30 p.m. UTC | #4
On Wed, Dec 12, 2012 at 03:01:55PM +0100, Paolo Bonzini wrote:
> Il 12/12/2012 14:50, Stefan Hajnoczi ha scritto:
> > VirtIOBlock->rq can trigger the assertion.
> > 
> > IIUC hw/virtio-blk.c may handle I/O errors by keeping the request
> > pending and on a list (->rq).  This allows the user to restart them
> > after, for example, adding more space to the host file system containing
> > the disk image file.
> > 
> > We keep a list of failed requests and we migrate this list.  So I think
> > inuse != 0 when migrating with pending failed I/O requests.
> 
> Same for virtio-scsi.  Each request in that case is sent as part of the
> SCSIDevice that it refers to, via callbacks in SCSIBusInfo.
> 
> Paolo

Looks like this will leak ring entries.

All I see is: virtio_scsi_load calling virtio_load.
When the loading side gets the last avail index, it
will assume all requests up to that value have
completed, so it will never put the missing heads
in the used ring.

And it is at this point too late for
the source to change the used ring as guest memory
has migrated.
Paolo Bonzini Dec. 12, 2012, 2:36 p.m. UTC | #5
Il 12/12/2012 15:30, Michael S. Tsirkin ha scritto:
> > Same for virtio-scsi.  Each request in that case is sent as part of the
> > SCSIDevice that it refers to, via callbacks in SCSIBusInfo.

It is in virtio_scsi_load_request.

> Looks like this will leak ring entries.
> 
> All I see is: virtio_scsi_load calling virtio_load.
> When the loading side will get last avail index it
> will assume all requests up to that value have
> completed, so it will never put the missing heads
> in the used ring.

Ok, so we need some API for virtio-{blk,scsi} to communicate back the
indexes of in-flight requests to virtio.  The indexes are known from the
VirtQueueElement, so that's fine.

Even better would be a virtio_save_request/virtio_load_request API...

Paolo
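
A rough sketch of what such a virtio_save_request/virtio_load_request
pair could look like (hypothetical: neither function exists in qemu.git
at this point, virtqueue_remap_element() is an assumed helper, and only
QEMUFile, qemu_put_be32/qemu_get_be32 and the VirtQueueElement index
field are taken from real QEMU code):

/* Save only the descriptor head index; re-read and re-validate the
 * chain from guest memory on the destination. */
void virtio_save_request(QEMUFile *f, VirtQueue *vq,
                         const VirtQueueElement *elem)
{
    qemu_put_be32(f, elem->index);
}

void virtio_load_request(QEMUFile *f, VirtQueue *vq, VirtQueueElement *elem)
{
    unsigned int head = qemu_get_be32(f);

    /* Re-map and re-validate the descriptor chain and bump vq->inuse so
     * the core knows this buffer is still outstanding.
     * virtqueue_remap_element() is an assumed helper, not existing API. */
    virtqueue_remap_element(vq, head, elem);
}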
Michael S. Tsirkin Dec. 12, 2012, 2:47 p.m. UTC | #6
On Wed, Dec 12, 2012 at 03:36:10PM +0100, Paolo Bonzini wrote:
> Il 12/12/2012 15:30, Michael S. Tsirkin ha scritto:
> > > Same for virtio-scsi.  Each request in that case is sent as part of the
> > > SCSIDevice that it refers to, via callbacks in SCSIBusInfo.
> 
> It is in virtio_scsi_load_request.
> 
> > Looks like this will leak ring entries.
> > 
> > All I see is: virtio_scsi_load calling virtio_load.
> > When the loading side will get last avail index it
> > will assume all requests up to that value have
> > completed, so it will never put the missing heads
> > in the used ring.
> 
> Ok, so we need some API for virtio-{blk,scsi} to communicate back the
> indexes of in-flight requests to virtio.  The indexes are known from the
> VirtQueueElement, so that's fine.
> 
> Even better would be a virtio_save_request/virtio_load_request API...
> 
> Paolo

So you are saying this is a bug then? Great. This is exactly what
the assert above is out there to catch.
And you really can't fix it without breaking
migration compatibility.

As step 1, I think we should just complete all outstanding
requests when VM stops.

Yes it means you can't do the retry hack after migration
but this is hardly common scenario.
Paolo Bonzini Dec. 12, 2012, 3:01 p.m. UTC | #7
Il 12/12/2012 15:47, Michael S. Tsirkin ha scritto:
> > Ok, so we need some API for virtio-{blk,scsi} to communicate back the
> > indexes of in-flight requests to virtio.  The indexes are known from the
> > VirtQueueElement, so that's fine.
> > 
> > Even better would be a virtio_save_request/virtio_load_request API...
> 
> So you are saying this is a bug then? Great.

I'm not sure what you mean by "it will never put the missing heads
in the used ring".  The serialized requests are put in the used rings
when they are completed and virtio-{blk,scsi} calls virtqueue_push.  Is
the problem if you have a vring that looks like this:

    A A U A U U A A

?  Which heads are leaked?  {0,1}, {2} or {6,7}?  Or a combination thereof?

Also, I'm not sure your "fix" (crash QEMU) is correct.  I hope we can
make it work.

> This is exactly what the assert above is out there to catch.
> And you really can't fix it without breaking migration compatibility.

Why not?  The index in the vring is in the migration data.

> As step 1, I think we should just complete all outstanding
> requests when VM stops.
> 
> Yes it means you can't do the retry hack after migration
> but this is hardly common scenario.

I disagree...

Paolo
Michael S. Tsirkin Dec. 12, 2012, 3:25 p.m. UTC | #8
On Wed, Dec 12, 2012 at 04:01:27PM +0100, Paolo Bonzini wrote:
> Il 12/12/2012 15:47, Michael S. Tsirkin ha scritto:
> > > Ok, so we need some API for virtio-{blk,scsi} to communicate back the
> > > indexes of in-flight requests to virtio.  The indexes are known from the
> > > VirtQueueElement, so that's fine.
> > > 
> > > Even better would be a virtio_save_request/virtio_load_request API...
> > 
> > So you are saying this is a bug then? Great.
> 
> I'm not sure what you mean by "it will never put the missing heads
> in the used ring".  The serialized requests are put in the used rings
> when they are completed and virtio-{blk,scsi} calls virtqueue_push.  Is
> the problem if you have a vring that looks like this:
> 
>     A A U A U U A A
> 
> ?  Which heads are leaked?  {0,1}, {2} or {6,7}?  Or a combination thereof?

I don't know what A A U A U U A A means.
Pls see the example of the problem in the original commit log.

> Also, I'm not sure your "fix" (crash QEMU) is correct.  I hope we can
> make it work.

Of course assert is not a fix.
The patch is just to catch bugs like this one.

To make it work, complete all requests when vm is stopped.
This needs to be done in the specific device; the assert will catch
buggy devices.

> > This is exactly what the assert above is out there to catch.
> > And you really can't fix it without breaking migration compatibility.
> 
> Why not?  The index in the vring is in the migration data.

index is not enough if requests are outstanding.
Pls check the example in the log of the patch.

> > As step 1, I think we should just complete all outstanding
> > requests when VM stops.
> > 
> > Yes it means you can't do the retry hack after migration
> > but this is hardly common scenario.
> 
> I disagree...
> 
> Paolo

Disagree with what? You are saying it's common?
Paolo Bonzini Dec. 12, 2012, 3:52 p.m. UTC | #9
Il 12/12/2012 16:25, Michael S. Tsirkin ha scritto:
> On Wed, Dec 12, 2012 at 04:01:27PM +0100, Paolo Bonzini wrote:
>> Il 12/12/2012 15:47, Michael S. Tsirkin ha scritto:
>>>> Ok, so we need some API for virtio-{blk,scsi} to communicate back the
>>>> indexes of in-flight requests to virtio.  The indexes are known from the
>>>> VirtQueueElement, so that's fine.
>>>>
>>>> Even better would be a virtio_save_request/virtio_load_request API...
>>>
>>> So you are saying this is a bug then? Great.
>>
>> I'm not sure what you mean by "it will never put the missing heads
>> in the used ring".  The serialized requests are put in the used rings
>> when they are completed and virtio-{blk,scsi} calls virtqueue_push.  Is
>> the problem if you have a vring that looks like this:
>>
>>     A A U A U U A A
>>
>> ?  Which heads are leaked?  {0,1}, {2} or {6,7}?  Or a combination thereof?
> 
> I don't know what A A U A U U A A means.

Right, not very clear.... It means that descriptors 2, 4 and 5 are
free, while the others are in flight.

> To make it work, complete all requests when vm is stopped.

That's not a choice, sorry.

>>> This is exactly what the assert above is out there to catch.
>>> And you really can't fix it without breaking migration compatibility.
>>
>> Why not?  The index in the vring is in the migration data.
> 
> index is not enough if requests are outstanding.

Sorry, I meant the descriptor index.

> Pls check the example in the log of the patch.

I'm likewise not sure what you meant by

   A 1
   A 2
   U 2
   A 2
   U 2
   A 2
   U 2
   A 2     <---
   U 2

If I understand it, before the point marked with the arrow, the avail
ring is

   1 2 2 2

   vring_avail_idx(vq) == 3
   last_avail_idx == 3

After, it is

   2 2 2 2
   vring_avail_idx(vq) == 4
   last_avail_idx == 3

What's wrong with that?

You wrote "the only way to know head 1 is outstanding is because backend
has stored this info somewhere".  But the backend _is_ tracking it (by
serializing and then restoring the VirtQueueElement) and no leak happens
because virtqueue_fill/flush will put the head on the used ring sooner
or later.

>>> As step 1, I think we should just complete all outstanding
>>> requests when VM stops.
>>>
>>> Yes it means you can't do the retry hack after migration
>>> but this is hardly common scenario.
>>
>> I disagree...
> 
> Disagree with what? You are saying it's common?

It's not common, but you cannot block migration because you have an I/O
error.  Solving the error may involve migrating the guests away from
that host.

Paolo
Michael S. Tsirkin Dec. 12, 2012, 4:37 p.m. UTC | #10
On Wed, Dec 12, 2012 at 04:52:53PM +0100, Paolo Bonzini wrote:
> Il 12/12/2012 16:25, Michael S. Tsirkin ha scritto:
> > On Wed, Dec 12, 2012 at 04:01:27PM +0100, Paolo Bonzini wrote:
> >> Il 12/12/2012 15:47, Michael S. Tsirkin ha scritto:
> >>>> Ok, so we need some API for virtio-{blk,scsi} to communicate back the
> >>>> indexes of in-flight requests to virtio.  The indexes are known from the
> >>>> VirtQueueElement, so that's fine.
> >>>>
> >>>> Even better would be a virtio_save_request/virtio_load_request API...
> >>>
> >>> So you are saying this is a bug then? Great.
> >>
> >> I'm not sure what you mean by "it will never put the missing heads
> >> in the used ring".  The serialized requests are put in the used rings
> >> when they are completed and virtio-{blk,scsi} calls virtqueue_push.  Is
> >> the problem if you have a vring that looks like this:
> >>
> >>     A A U A U U A A
> >>
> >> ?  Which heads are leaked?  {0,1}, {2} or {6,7}?  Or a combination thereof?
> > 
> > I don't know what A A U A U U A A means.
> 
> Right, not very clear.... It means that descriptor 2 and 4 and 5 are
> free, while the others are in-flight.
> 
> > To make it work, complete all requests when vm is stopped.
> 
> That's not a choice, sorry.
> 
> >>> This is exactly what the assert above is out there to catch.
> >>> And you really can't fix it without breaking migration compatibility.
> >>
> >> Why not?  The index in the vring is in the migration data.
> > 
> > index is not enough if requests are outstanding.
> 
> Sorry, I meant the descriptor index.
> 
> > Pls check the example in the log of the patch.
> 
> I'm likewise not sure what you meant by
> 
>    A 1
>    A 2
>    U 2
>    A 2
>    U 2
>    A 2
>    U 2
>    A 2     <---
>    U 2
> 
> If I understand it, before the point marked with the arrow, the avail
> ring is
> 
>    1 2 2 2
> 
>    vring_avail_idx(vq) == 3
>    last_avail_idx == 3
> 
> After, it is
> 
>    2 2 2 2
>    vring_avail_idx(vq) == 4
>    last_avail_idx == 3
> 
> What's wrong with that?
> 
> You wrote "the only way to know head 1 is outstanding is because backend
> has stored this info somewhere".  But the backend _is_ tracking it (by
> serializing and then restoring the VirtQueueElement) and no leak happens
> because virtqueue_fill/flush will put the head on the used ring sooner
> or later.

If you did this before save vm inuse would be 0.

You said that at the point where we save state,
some entries are outstanding. It is too late to
put head at that point.


> >>> As step 1, I think we should just complete all outstanding
> >>> requests when VM stops.
> >>>
> >>> Yes it means you can't do the retry hack after migration
> >>> but this is hardly common scenario.
> >>
> >> I disagree...
> > 
> > Disagree with what? You are saying it's common?
> 
> It's not common, but you cannot block migration because you have an I/O
> error.  Solving the error may involve migrating the guests away from
> that host.
> 
> Paolo

No, you should complete with error.
Paolo Bonzini Dec. 12, 2012, 4:51 p.m. UTC | #11
Il 12/12/2012 17:37, Michael S. Tsirkin ha scritto:
> > You wrote "the only way to know head 1 is outstanding is because backend
> > has stored this info somewhere".  But the backend _is_ tracking it (by
> > serializing and then restoring the VirtQueueElement) and no leak happens
> > because virtqueue_fill/flush will put the head on the used ring sooner
> > or later.
> 
> If you did this before save vm inuse would be 0.

No, I won't.  I want a simple API that the device can call to keep inuse
up-to-date.  Perhaps a bit ugly compared to just saving inuse, but it
works.  Or are there other bits that need resyncing besides inuse?  Bits
that cannot be recovered from the existing migration data?

> You said that at the point where we save state,
> some entries are outstanding. It is too late to
> put head at that point.

I don't want to put head on the source.  I want to put it on the
destination, when the request is completed.  Same as it is done now,
with bugfixes of course.  Are there any problems doing so, except that
inuse will not be up-to-date (easily fixed)?

> > It's not common, but you cannot block migration because you have an I/O
> > error.  Solving the error may involve migrating the guests away from
> > that host.
> 
> No, you should complete with error.

Knowing that the request will fail, the admin will not be able to do
migration, even if that will solve the error transparently.

Paolo
Michael S. Tsirkin Dec. 12, 2012, 5:14 p.m. UTC | #12
On Wed, Dec 12, 2012 at 05:51:51PM +0100, Paolo Bonzini wrote:
> Il 12/12/2012 17:37, Michael S. Tsirkin ha scritto:
> > > You wrote "the only way to know head 1 is outstanding is because backend
> > > has stored this info somewhere".  But the backend _is_ tracking it (by
> > > serializing and then restoring the VirtQueueElement) and no leak happens
> > > because virtqueue_fill/flush will put the head on the used ring sooner
> > > or later.
> > 
> > If you did this before save vm inuse would be 0.
> 
> No, I won't.  I want a simple API that the device can call to keep inuse
> up-to-date.  Perhaps a bit ugly compared to just saving inuse, but it
> works.  Or are there other bits that need resyncing besides inuse?  Bits
> that cannot be recovered from the existing migration data?

Saving inuse counter is useless. We need to know which requests
are outstanding if we want to retry them on remote.

> > You said that at the point where we save state,
> > some entries are outstanding. It is too late to
> > put head at that point.
> 
> I don't want to put head on the source.  I want to put it on the
> destination, when the request is completed.  Same as it is done now,
> with bugfixes of course.  Are there any problems doing so, except that
> inuse will not be up-to-date (easily fixed)?

You have an outstanding request that is behind last avail index.
You do not want to complete it. You migrate. There is no
way for remote to understand that the request is outstanding.

> > > It's not common, but you cannot block migration because you have an I/O
> > > error.  Solving the error may involve migrating the guests away from
> > > that host.
> > 
> > No, you should complete with error.
> 
> Knowing that the request will fail, the admin will not be able to do
> migration, even if that will solve the error transparently.
> 
> Paolo

You are saying there's no way to complete all requests?
Paolo Bonzini Dec. 12, 2012, 5:39 p.m. UTC | #13
Il 12/12/2012 18:14, Michael S. Tsirkin ha scritto:
> On Wed, Dec 12, 2012 at 05:51:51PM +0100, Paolo Bonzini wrote:
>> Il 12/12/2012 17:37, Michael S. Tsirkin ha scritto:
>>>> You wrote "the only way to know head 1 is outstanding is because backend
>>>> has stored this info somewhere".  But the backend _is_ tracking it (by
>>>> serializing and then restoring the VirtQueueElement) and no leak happens
>>>> because virtqueue_fill/flush will put the head on the used ring sooner
>>>> or later.
>>>
>>> If you did this before save vm inuse would be 0.
>>
>> No, I won't.  I want a simple API that the device can call to keep inuse
>> up-to-date.  Perhaps a bit ugly compared to just saving inuse, but it
>> works.  Or are there other bits that need resyncing besides inuse?  Bits
>> that cannot be recovered from the existing migration data?
> 
> Saving inuse counter is useless. We need to know which requests
> are outstanding if we want to retry them on remote.

And that's what virtio-blk and virtio-scsi have been doing for years.
They store the VirtQueueElement including the index and the sglists.
Can you explain *why* the index is not enough to reconstruct the state
on the destination?  There may be bugs and you may need help from
virtio_blk_load, but that's okay.

>>> You said that at the point where we save state,
>>> some entries are outstanding. It is too late to
>>> put head at that point.
>>
>> I don't want to put head on the source.  I want to put it on the
>> destination, when the request is completed.  Same as it is done now,
>> with bugfixes of course.  Are there any problems doing so, except that
>> inuse will not be up-to-date (easily fixed)?
> 
> You have an outstanding request that is behind last avail index.
> You do not want to complete it. You migrate. There is no
> way for remote to understand that the request is outstanding.

The savevm callbacks know which request is outstanding and pass the
information to the destination.  See virtio_blk_save and virtio_blk_load.

What is not clear, and you haven't explained, is how you get to a bug in
the handling of the avail ring.  What's wrong with this explanation:

   A 1
   A 2
   U 2
   A 2
   U 2
   A 2
   U 2
   A 2     <---
   U 2

where before the point marked with the arrow, the avail ring is

   1 2 2 2

   vring_avail_idx(vq) == 3
   last_avail_idx == 3

and after the point marked with the arrow, the avail ring is

   2 2 2 2
   vring_avail_idx(vq) == 4
   last_avail_idx == 3

?!?

>>>> It's not common, but you cannot block migration because you have an I/O
>>>> error.  Solving the error may involve migrating the guests away from
>>>> that host.
>>>
>>> No, you should complete with error.
>>
>> Knowing that the request will fail, the admin will not be able to do
>> migration, even if that will solve the error transparently.
> 
> You are saying there's no way to complete all requests?

With an error, yes.  Transparently after fixing the error (which may
involve migration), no.

Paolo
Michael S. Tsirkin Dec. 12, 2012, 7:23 p.m. UTC | #14
On Wed, Dec 12, 2012 at 06:39:15PM +0100, Paolo Bonzini wrote:
> Il 12/12/2012 18:14, Michael S. Tsirkin ha scritto:
> > On Wed, Dec 12, 2012 at 05:51:51PM +0100, Paolo Bonzini wrote:
> >> Il 12/12/2012 17:37, Michael S. Tsirkin ha scritto:
> >>>> You wrote "the only way to know head 1 is outstanding is because backend
> >>>> has stored this info somewhere".  But the backend _is_ tracking it (by
> >>>> serializing and then restoring the VirtQueueElement) and no leak happens
> >>>> because virtqueue_fill/flush will put the head on the used ring sooner
> >>>> or later.
> >>>
> >>> If you did this before save vm inuse would be 0.
> >>
> >> No, I won't.  I want a simple API that the device can call to keep inuse
> >> up-to-date.  Perhaps a bit ugly compared to just saving inuse, but it
> >> works.  Or are there other bits that need resyncing besides inuse?  Bits
> >> that cannot be recovered from the existing migration data?
> > 
> > Saving inuse counter is useless. We need to know which requests
> > are outstanding if we want to retry them on remote.
> 
> And that's what virtio-blk and virtio-scsi have been doing for years.

I don't see it - all I see in save is virtio_save.
Where's the extra code to save the elements in flight
and send them to the remote?

> They store the VirtQueueElement including the index and the sglists.
> Can you explain *why* the index is not enough to reconstruct the state
> on the destination?  There may be bugs and you may need help from
> virtio_blk_load, but that's okay.
> 
> >>> You said that at the point where we save state,
> >>> some entries are outstanding. It is too late to
> >>> put head at that point.
> >>
> >> I don't want to put head on the source.  I want to put it on the
> >> destination, when the request is completed.  Same as it is done now,
> >> with bugfixes of course.  Are there any problems doing so, except that
> >> inuse will not be up-to-date (easily fixed)?
> > 
> > You have an outstanding request that is behind last avail index.
> > You do not want to complete it. You migrate. There is no
> > way for remote to understand that the request is outstanding.
> 
> The savevm callbacks know which request is outstanding and pass the
> information to the destination.  See virtio_blk_save and virtio_blk_load.
> 
> What is not clear, and you haven't explained, is how you get to a bug in
> the handling of the avail ring.  What's wrong with this explanation:
> 
>    A 1
>    A 2
>    U 2
>    A 2
>    U 2
>    A 2
>    U 2
>    A 2     <---
>    U 2
> 
> where before the point marked with the arrow, the avail ring is
> 
>    1 2 2 2
> 
>    vring_avail_idx(vq) == 3
>    last_avail_idx == 3
> 
> and after the point marked with the arrow, the avail ring is
> 
>    2 2 2 2
>    vring_avail_idx(vq) == 4
>    last_avail_idx == 3
> 
> ?!?

You need to retry A1 on remote. How do you do that? There's
no way to find out it has not been completed
from the ring itself.


> >>>> It's not common, but you cannot block migration because you have an I/O
> >>>> error.  Solving the error may involve migrating the guests away from
> >>>> that host.
> >>>
> >>> No, you should complete with error.
> >>
> >> Knowing that the request will fail, the admin will not be able to do
> >> migration, even if that will solve the error transparently.
> > 
> > You are saying there's no way to complete all requests?
> 
> With an error, yes.  Transparently after fixing the error (which may
> involve migration), no.
> 
> Paolo
Paolo Bonzini Dec. 12, 2012, 9 p.m. UTC | #15
Il 12/12/2012 20:23, Michael S. Tsirkin ha scritto:
>>> Saving inuse counter is useless. We need to know which requests
>>> are outstanding if we want to retry them on remote.
>>
>> And that's what virtio-blk and virtio-scsi have been doing for years.
> 
> I don't see it - all I see in save is virtio_save.

static void virtio_blk_save(QEMUFile *f, void *opaque)
{
    VirtIOBlock *s = opaque;
    VirtIOBlockReq *req = s->rq;

    virtio_save(&s->vdev, f);
    
    while (req) {
        qemu_put_sbyte(f, 1);
        qemu_put_buffer(f, (unsigned char*)&req->elem, sizeof(req->elem));
        req = req->next;
    }
    qemu_put_sbyte(f, 0);
}


virtio-scsi does it in virtio_scsi_save_request.

> You need to retry A1 on remote. How do you do that? There's
> no way to find out it has not been completed
> from the ring itself.

virtio_blk_dma_restart_bh and scsi_dma_restart_bh do it.

Paolo
Michael S. Tsirkin Dec. 12, 2012, 9:19 p.m. UTC | #16
On Wed, Dec 12, 2012 at 10:00:37PM +0100, Paolo Bonzini wrote:
> Il 12/12/2012 20:23, Michael S. Tsirkin ha scritto:
> >>> Saving inuse counter is useless. We need to know which requests
> >>> are outstanding if we want to retry them on remote.
> >>
> >> And that's what virtio-blk and virtio-scsi have been doing for years.
> > 
> > I don't see it - all I see in save is virtio_save.
> 
> static void virtio_blk_save(QEMUFile *f, void *opaque)
> {
>     VirtIOBlock *s = opaque;
>     VirtIOBlockReq *req = s->rq;
> 
>     virtio_save(&s->vdev, f);
>     
>     while (req) {
>         qemu_put_sbyte(f, 1);
>         qemu_put_buffer(f, (unsigned char*)&req->elem, sizeof(req->elem));

Ow. Does it really save VirtQueueElement?

typedef struct VirtQueueElement
{        
    unsigned int index;
    unsigned int out_num;
    unsigned int in_num;
    hwaddr in_addr[VIRTQUEUE_MAX_SIZE];
    hwaddr out_addr[VIRTQUEUE_MAX_SIZE];
    struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
    struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
} VirtQueueElement; 

Complete with pointers into qemu memory and all?
That's got to hurt.

All we really need is the index.

>         req = req->next;
>     }
>     qemu_put_sbyte(f, 0);
> }
> 
> 
> virtio-scsi does it in virtio_scsi_save_request.
> 
> > You need to retry A1 on remote. How do you do that? There's
> > no way to find out it has not been completed
> > from the ring itself.
> 
> virtio_blk_dma_restart_bh and scsi_dma_restart_bh do it.
> 
> Paolo

Okay, so the only bug is inuse getting negative right?
So all we need to do is fix up the inuse value
after restoring the outstanding requests - basically
count the restored buffers and set inuse accordingly.
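
For virtio-blk that fixup could look roughly like this (a sketch only:
virtio_queue_set_inuse() is a hypothetical core helper, while s->rq,
s->vq and VirtIOBlockReq are the existing virtio-blk fields quoted
above), called at the end of virtio_blk_load() once the outstanding
request list has been rebuilt:

static void virtio_blk_fixup_inuse(VirtIOBlock *s)
{
    VirtIOBlockReq *req;
    unsigned int outstanding = 0;

    /* Every request on the restored list is still in flight. */
    for (req = s->rq; req; req = req->next) {
        outstanding++;
    }

    /* Hypothetical helper: let the core set vq->inuse for this queue. */
    virtio_queue_set_inuse(s->vq, outstanding);
}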
Anthony Liguori Dec. 12, 2012, 9:33 p.m. UTC | #17
"Michael S. Tsirkin" <mst@redhat.com> writes:

> On Wed, Dec 12, 2012 at 10:00:37PM +0100, Paolo Bonzini wrote:
>> Il 12/12/2012 20:23, Michael S. Tsirkin ha scritto:
>> >>> Saving inuse counter is useless. We need to know which requests
>> >>> are outstanding if we want to retry them on remote.
>> >>
>> >> And that's what virtio-blk and virtio-scsi have been doing for years.
>> > 
>> > I don't see it - all I see in save is virtio_save.
>> 
>> static void virtio_blk_save(QEMUFile *f, void *opaque)
>> {
>>     VirtIOBlock *s = opaque;
>>     VirtIOBlockReq *req = s->rq;
>> 
>>     virtio_save(&s->vdev, f);
>>     
>>     while (req) {
>>         qemu_put_sbyte(f, 1);
>>         qemu_put_buffer(f, (unsigned char*)&req->elem, sizeof(req->elem));
>
> Ow. Does it really save VirtQueueElement?
>
> typedef struct VirtQueueElement
> {        
>     unsigned int index;
>     unsigned int out_num;
>     unsigned int in_num;
>     hwaddr in_addr[VIRTQUEUE_MAX_SIZE];
>     hwaddr out_addr[VIRTQUEUE_MAX_SIZE];
>     struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
>     struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
> } VirtQueueElement; 
>
> Complete with pointers into qemu memory and all?
> That's got to hurt.
>
> All we really need is the index.
>

Yes, take a look at the series I sent out that scrubs all of this to
just send the index and the addresses of the element.

We technically should save the addresses and sizes too.  It makes it a
heck of a lot safer than re-reading guest memory since we do some
validation on the size of the sg elements.

But we could get away with only saving the index if we really wanted to.

Regards,

Anthony Liguori

>>         req = req->next;
>>     }
>>     qemu_put_sbyte(f, 0);
>> }
>> 
>> 
>> virtio-scsi does it in virtio_scsi_save_request.
>> 
>> > You need to retry A1 on remote. How do you do that? There's
>> > no way to find out it has not been completed
>> > from the ring itself.
>> 
>> virtio_blk_dma_restart_bh and scsi_dma_restart_bh do it.
>> 
>> Paolo
>
> Okay, so the only bug is inuse getting negative right?
> So all we need to do is fix up the inuse value
> after restoring the outstanding requests - basically
> count the restored buffers and set inuse accordingly.
>
> -- 
> MST
Michael S. Tsirkin Dec. 12, 2012, 9:51 p.m. UTC | #18
On Wed, Dec 12, 2012 at 03:33:32PM -0600, Anthony Liguori wrote:
> "Michael S. Tsirkin" <mst@redhat.com> writes:
> 
> > On Wed, Dec 12, 2012 at 10:00:37PM +0100, Paolo Bonzini wrote:
> >> Il 12/12/2012 20:23, Michael S. Tsirkin ha scritto:
> >> >>> Saving inuse counter is useless. We need to know which requests
> >> >>> are outstanding if we want to retry them on remote.
> >> >>
> >> >> And that's what virtio-blk and virtio-scsi have been doing for years.
> >> > 
> >> > I don't see it - all I see in save is virtio_save.
> >> 
> >> static void virtio_blk_save(QEMUFile *f, void *opaque)
> >> {
> >>     VirtIOBlock *s = opaque;
> >>     VirtIOBlockReq *req = s->rq;
> >> 
> >>     virtio_save(&s->vdev, f);
> >>     
> >>     while (req) {
> >>         qemu_put_sbyte(f, 1);
> >>         qemu_put_buffer(f, (unsigned char*)&req->elem, sizeof(req->elem));
> >
> > Ow. Does it really save VirtQueueElement?
> >
> > typedef struct VirtQueueElement
> > {        
> >     unsigned int index;
> >     unsigned int out_num;
> >     unsigned int in_num;

BTW there's a hole after in_num which is uninitialized; that's
also a nasty thing to send on the wire.

> >     hwaddr in_addr[VIRTQUEUE_MAX_SIZE];
> >     hwaddr out_addr[VIRTQUEUE_MAX_SIZE];
> >     struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
> >     struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
> > } VirtQueueElement; 
> >
> > Complete with pointers into qemu memory and all?
> > That's got to hurt.
> >
> > All we really need is the index.
> >
> 
> Yes, take a look at the series I sent out that scrubs all of this to
> just send the index and the addresses of the element.

Will do.

> We technically should save the addresses and sizes too.

I guess as long as these are guest addresses, not the qemu ones.

>  It makes it a
> heck of a lot safer then re-reading guest memory since we do some
> validation on the size of the sg elements.

> But we could get away with only saving the index if we really wanted to.

I guess re-validating it is needed anyway: we should not
trust remote more than we trust the guest.

> Regards,
> 
> Anthony Liguori
> 
> >>         req = req->next;
> >>     }
> >>     qemu_put_sbyte(f, 0);
> >> }
> >> 
> >> 
> >> virtio-scsi does it in virtio_scsi_save_request.
> >> 
> >> > You need to retry A1 on remote. How do you do that? There's
> >> > no way to find out it has not been completed
> >> > from the ring itself.
> >> 
> >> virtio_blk_dma_restart_bh and scsi_dma_restart_bh do it.
> >> 
> >> Paolo
> >
> > Okay, so the only bug is inuse getting negative right?
> > So all we need to do is fix up the inuse value
> > after restoring the outstanding requests - basically
> > count the restored buffers and set inuse accordingly.
> >
> > -- 
> > MST
Paolo Bonzini Dec. 13, 2012, 7:39 a.m. UTC | #19
Il 12/12/2012 22:19, Michael S. Tsirkin ha scritto:
> On Wed, Dec 12, 2012 at 10:00:37PM +0100, Paolo Bonzini wrote:
>> Il 12/12/2012 20:23, Michael S. Tsirkin ha scritto:
>>>>> Saving inuse counter is useless. We need to know which requests
>>>>> are outstanding if we want to retry them on remote.
>>>>
>>>> And that's what virtio-blk and virtio-scsi have been doing for years.
>>>
>>> I don't see it - all I see in save is virtio_save.
>>
>> static void virtio_blk_save(QEMUFile *f, void *opaque)
>> {
>>     VirtIOBlock *s = opaque;
>>     VirtIOBlockReq *req = s->rq;
>>
>>     virtio_save(&s->vdev, f);
>>     
>>     while (req) {
>>         qemu_put_sbyte(f, 1);
>>         qemu_put_buffer(f, (unsigned char*)&req->elem, sizeof(req->elem));
> 
> Ow. Does it really save VirtQueueElement?
> 
> typedef struct VirtQueueElement
> {        
>     unsigned int index;
>     unsigned int out_num;
>     unsigned int in_num;
>     hwaddr in_addr[VIRTQUEUE_MAX_SIZE];
>     hwaddr out_addr[VIRTQUEUE_MAX_SIZE];
>     struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
>     struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
> } VirtQueueElement; 
> 
> Complete with pointers into qemu memory and all?

Yes, that sucks (also the endianness is totally broken), but the
pointers into QEMU memory are restored afterwards by remapping the
address.  Like Anthony, I'm not sure whether reloading the sglist from
guest memory is safe.  It would require re-validation of everything.

We _can_ trust remote.  Things like races on connecting to the
incoming-migration socket are best handled outside QEMU with a firewall
or a separate network that is not guest-accessible.  I'm pretty sure
that virtio-blk is the least of our worries here.

> Okay, so the only bug is inuse getting negative right?
> So all we need to do is fix up the inuse value
> after restoring the outstanding requests - basically
> count the restored buffers and set inuse accordingly.

Yes, I agree.

Paolo
Kevin Wolf Dec. 13, 2012, 10:48 a.m. UTC | #20
Am 12.12.2012 17:37, schrieb Michael S. Tsirkin:
> On Wed, Dec 12, 2012 at 04:52:53PM +0100, Paolo Bonzini wrote:
>> Il 12/12/2012 16:25, Michael S. Tsirkin ha scritto:
>>> On Wed, Dec 12, 2012 at 04:01:27PM +0100, Paolo Bonzini wrote:
>>>> Il 12/12/2012 15:47, Michael S. Tsirkin ha scritto:
>>>>> As step 1, I think we should just complete all outstanding
>>>>> requests when VM stops.
>>>>>
>>>>> Yes it means you can't do the retry hack after migration
>>>>> but this is hardly common scenario.
>>>>
>>>> I disagree...
>>>
>>> Disagree with what? You are saying it's common?
>>
>> It's not common, but you cannot block migration because you have an I/O
>> error.  Solving the error may involve migrating the guests away from
>> that host.
>>
>> Paolo
> 
> No, you should complete with error.

Just to confirm that this isn't possible: rerror/werror=stop is a
supported feature, and I do get bug reports when it breaks (including
migration while the VM is stopped for an I/O error).

Makes it common enough that breaking it is definitely not an option, right?

Kevin
Rusty Russell Dec. 14, 2012, 1:06 a.m. UTC | #21
Anthony Liguori <aliguori@us.ibm.com> writes:
> Yes, take a look at the series I sent out that scrubs all of this to
> just send the index and the addresses of the element.
>
> We technically should save the addresses and sizes too.  It makes it a
> heck of a lot safer than re-reading guest memory since we do some
> validation on the size of the sg elements.

Not really.

The guest puts the descriptors in the ring and leaves them there until
the device acks.  If it changes them once they're exposed but before
they're acked, it can get either before or after version, and always
could.

Cheers,
Rusty.
Paolo Bonzini Dec. 14, 2012, 7:51 a.m. UTC | #22
> > We technically should save the addresses and sizes too.  It makes
> > it a heck of a lot safer than re-reading guest memory since we do some
> > validation on the size of the sg elements.
> 
> Not really.
> 
> The guest puts the descriptors in the ring and leaves them there until
> the device acks.  If it changes them once they're exposed but before
> they're acked, it can get either before or after version, and always
> could.

The problems start when the guest tries to race against QEMU and defy
the validation.  Always using the validated version is a bit easier
than redoing the validation after migration.

Paolo
Michael S. Tsirkin Dec. 16, 2012, 4:14 p.m. UTC | #23
On Thu, Dec 13, 2012 at 11:48:17AM +0100, Kevin Wolf wrote:
> Am 12.12.2012 17:37, schrieb Michael S. Tsirkin:
> > On Wed, Dec 12, 2012 at 04:52:53PM +0100, Paolo Bonzini wrote:
> >> Il 12/12/2012 16:25, Michael S. Tsirkin ha scritto:
> >>> On Wed, Dec 12, 2012 at 04:01:27PM +0100, Paolo Bonzini wrote:
> >>>> Il 12/12/2012 15:47, Michael S. Tsirkin ha scritto:
> >>>>> As step 1, I think we should just complete all outstanding
> >>>>> requests when VM stops.
> >>>>>
> >>>>> Yes it means you can't do the retry hack after migration
> >>>>> but this is hardly common scenario.
> >>>>
> >>>> I disagree...
> >>>
> >>> Disagree with what? You are saying it's common?
> >>
> >> It's not common, but you cannot block migration because you have an I/O
> >> error.  Solving the error may involve migrating the guests away from
> >> that host.
> >>
> >> Paolo
> > 
> > No, you should complete with error.
> 
> Just to confirm that this isn't possible: rerror/werror=stop is a
> supported feature, and I do get bug reports when it breaks (including
> migration while the VM is stopped for an I/O error).
> 
> Makes it common enough that breaking it is definitely not an option, right?
> 
> Kevin

Nod. So we need to fix inuse, which is currently broken.
We can still have an assert, but it needs to be done
more carefully: counting outstanding requests and
checking that the count matches inuse.
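
One possible shape for that more careful check, as a sketch (the
count_outstanding() callback is hypothetical and does not exist in
QEMU; vq[n].inuse is the real field used by the patch below):

static void virtio_check_queue_inuse(VirtIODevice *vdev, int n)
{
    unsigned int outstanding = 0;

    /* Hypothetical per-device callback reporting how many requests the
     * device still holds for queue n (e.g. the length of virtio-blk's
     * ->rq list). */
    if (vdev->count_outstanding) {
        outstanding = vdev->count_outstanding(vdev, n);
    }

    /* Relaxed safeguard: outstanding requests are allowed across
     * save/load as long as the bookkeeping stays consistent. */
    assert(vdev->vq[n].inuse == outstanding);
}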
Anthony Liguori Dec. 16, 2012, 8:36 p.m. UTC | #24
Paolo Bonzini <pbonzini@redhat.com> writes:

>> > We technically should save the addresses and sizes too.  It makes
>> > it a heck of a lot safer than re-reading guest memory since we do some
>> > validation on the size of the sg elements.
>> 
>> Not really.
>> 
>> The guest puts the descriptors in the ring and leaves them there until
>> the device acks.  If it changes them once they're exposed but before
>> they're acked, it can get either before or after version, and always
>> could.
>
> The problems start when the guest tries to race against QEMU and defy
> the validation.  Always using the validated version is a bit easier
> than redoing the validation after migration.

Exactly.

Regards,

Anthony Liguori

>
> Paolo

Patch

diff --git a/hw/virtio.c b/hw/virtio.c
index f40a8c5..6227642 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -788,6 +788,8 @@  void virtio_save(VirtIODevice *vdev, QEMUFile *f)
         if (vdev->vq[i].vring.num == 0)
             break;
 
+        assert(!vdev->vq[i].inuse);
+
         qemu_put_be32(f, vdev->vq[i].vring.num);
         qemu_put_be64(f, vdev->vq[i].pa);
         qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);