
[RFC] New Migration Protocol using Visitor Interface

Message ID 4E89CE20.6050706@codemonkey.ws
State New

Commit Message

Anthony Liguori Oct. 3, 2011, 3 p.m. UTC
On 10/03/2011 09:41 AM, Michael S. Tsirkin wrote:
> On Mon, Oct 03, 2011 at 08:51:10AM -0500, Anthony Liguori wrote:
>> On 10/03/2011 08:38 AM, Michael S. Tsirkin wrote:
>>> On Mon, Oct 03, 2011 at 07:55:48AM -0500, Anthony Liguori wrote:
>>>> On 10/02/2011 04:08 PM, Michael S. Tsirkin wrote:
>>>>> On Sun, Oct 02, 2011 at 04:21:47PM -0400, Stefan Berger wrote:
>>>>>>
>>>>>>> 4) Implement the BERVisitor and make this the default migration protocol.
>>>>>>>
>>>>>>> Most of the work will be in 1), though with the implementation in this series we should be able to do it incrementally. I'm not sure if the best approach is doing the mechanical phase 1 conversion, then doing phase 2 sometime after 4), doing phase 1 + 2 as part of 1), or just doing VMState conversions which gives basically the same capabilities as phase 1 + 2.
>>>>>>>
>>>>>>> Thoughts?
>>>>>> Is anyone working on this? If not I may give it a shot (tomorrow++)
>>>>>> for at least some of the primitives... for enabling vNVRAM metadata
>>>>>> of course. Indefinite length encoding of constructed data types I
>>>>>> suppose won't be used; otherwise the visitor interface seems wrong
>>>>>> for parsing and skipping of extra data towards the end of a
>>>>>> structure: if version n wrote the stream and appended some of its
>>>>>> version n data, and now version m < n is trying to read the struct
>>>>>> and needs to skip the version [m+1, n] data fields ... in that case
>>>>>> the de-serialization of the stream should probably be stream-driven
>>>>>> rather than structure-driven.
>>>>>>
>>>>>>     Stefan
>>>>>
>>>>> Yes I've been struggling with that exactly.
>>>>> Anthony, any thoughts?
>>>>
>>>> It just depends on how you write your visitor.  If you used
>>>> sequences, you'd probably do something like this:
>>>>
>>>> start_struct ->
>>>>    check for sequence tag, push starting offset and size onto stack
>>>>    increment offset to next tag
>>>>
>>>> type_int (et al) ->
>>>>    check for explicit type, parse data
>>>>    increment offset to next tag
>>>>
>>>> end_struct ->
>>>>    pop starting offset and size to temp variables
>>>>    set offset to starting offset + size
>>>>
>>>> This is roughly how the QMP input marshaller works FWIW.
>>>>
>>>> Regards,
>>>>
>>>> Anthony Liguori
>>>
>>> One thing I worry about is enabling zero copy for
>>> large string types (e.g. memory migration).
>>
>> Memory shouldn't be done through Visitors.  It should be handled as a special case.
>
> OK, that's fine then.
>
>>> So we need to be able to see a tag for memory page + address,
>>> read that from socket directly at the correct virtual address.
>>>
>>> Probably, we can avoid using visitors for memory, and hope
>>> everything else can stand an extra copy since it's small.
>>>
>>> But then, why do we worry about the size of
>>> encoded device state as Anthony seems to do?
>>
>> There's a significant difference between the cost of something on
>> the wire and the cost of doing a memcpy.  The cost of the data on
>> the wire is directly proportional to downtime.  So if we increase
>> the size of the device state by a factor of 10, we increase the
>> minimum downtime by a factor of 10.
>>
>> Of course, *if* the size of device state is already negligible with
>> respect to the minimum downtime, then it doesn't matter.  This is
>> easy to quantify though.  For a normal migration session today,
>> what's the total size of the device state in relation to the
>> calculated bandwidth of the minimum downtime?
>>
>> If it's very small, then we can add names and not worry about it.
>>
>> Regards,
>>
>> Anthony Liguori
>
> Yes, it's easy to quantify. I think the following gives us
> the offset before and after, so the difference is the size
> we seek, right?

Yeah, you'll also want the hunk shown in the Patch section at the bottom of this page.

You'll want to compare the size to max bwidth.

BTW, putting this info properly into migration stats would probably be pretty 
useful.

Regards,

Anthony Liguori
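
A minimal sketch of the skipping scheme described in the quoted exchange above,
assuming a BER-style tag/length stream; the type names and helpers here are
illustrative assumptions, not the actual QEMU visitor API:

/* Sketch of an input visitor that can skip unknown trailing fields,
 * assuming a BER-like stream of (tag, length, payload) records.
 * Names and helpers are hypothetical, not the real QEMU visitor API. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NESTING 32

typedef struct BERInputVisitor {
    const uint8_t *buf;                 /* serialized stream */
    size_t offset;                      /* current read position */
    size_t seq_start[MAX_NESTING];      /* payload start of each open sequence */
    size_t seq_size[MAX_NESTING];       /* declared size of each open sequence */
    int depth;
} BERInputVisitor;

/* Hypothetical helper: read a one-byte tag and a short-form one-byte
 * length (real BER also has long-form lengths, omitted here). */
static size_t parse_tag(BERInputVisitor *v, uint8_t *tag_out)
{
    *tag_out = v->buf[v->offset++];
    return v->buf[v->offset++];
}

static void ber_start_struct(BERInputVisitor *v)
{
    uint8_t tag;
    size_t len = parse_tag(v, &tag);    /* expect a SEQUENCE tag */
    (void)tag;                          /* a full visitor would check it */

    assert(v->depth < MAX_NESTING);
    /* Remember where this struct's payload starts and how big it is,
     * so end_struct can jump past fields this version does not know. */
    v->seq_start[v->depth] = v->offset;
    v->seq_size[v->depth] = len;
    v->depth++;
}

static void ber_type_int(BERInputVisitor *v, int64_t *val)
{
    uint8_t tag;
    size_t len = parse_tag(v, &tag);    /* expect an INTEGER tag */
    int64_t x = 0;
    size_t i;
    (void)tag;                          /* a full visitor would check it */

    for (i = 0; i < len; i++) {         /* big-endian, unsigned decode for brevity */
        x = (x << 8) | v->buf[v->offset + i];
    }
    *val = x;
    v->offset += len;                   /* advance to the next tag */
}

static void ber_end_struct(BERInputVisitor *v)
{
    v->depth--;
    /* Seek to the end of the sequence regardless of how many fields were
     * parsed: trailing fields written by a newer version are skipped. */
    v->offset = v->seq_start[v->depth] + v->seq_size[v->depth];
}

The key point is that end_struct seeks by the recorded length rather than by
the number of fields parsed, which is what lets an older reader skip newer
trailing fields.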

Comments

Michael S. Tsirkin Oct. 3, 2011, 3:45 p.m. UTC | #1
On Mon, Oct 03, 2011 at 10:00:48AM -0500, Anthony Liguori wrote:
> On 10/03/2011 09:41 AM, Michael S. Tsirkin wrote:
> >On Mon, Oct 03, 2011 at 08:51:10AM -0500, Anthony Liguori wrote:
> >>On 10/03/2011 08:38 AM, Michael S. Tsirkin wrote:
> >>>On Mon, Oct 03, 2011 at 07:55:48AM -0500, Anthony Liguori wrote:
> >>>>On 10/02/2011 04:08 PM, Michael S. Tsirkin wrote:
> >>>>>On Sun, Oct 02, 2011 at 04:21:47PM -0400, Stefan Berger wrote:
> >>>>>>
> >>>>>>>4) Implement the BERVisitor and make this the default migration protocol.
> >>>>>>>
> >>>>>>>Most of the work will be in 1), though with the implementation in this series we should be able to do it incrementally. I'm not sure if the best approach is doing the mechanical phase 1 conversion, then doing phase 2 sometime after 4), doing phase 1 + 2 as part of 1), or just doing VMState conversions which gives basically the same capabilities as phase 1 + 2.
> >>>>>>>
> >>>>>>>Thoughts?
> >>>>>>Is anyone working on this? If not I may give it a shot (tomorrow++)
> >>>>>>for at least some of the primitives... for enabling vNVRAM metadata
> >>>>>>of course. Indefinite length encoding of constructed data types I
> >>>>>>suppose won't be used otherwise the visitor interface seems wrong
> >>>>>>for parsing and skipping of extra data towards the end of a
> >>>>>>structure if version n wrote the stream and appended some of its
> >>>>>>version n data and now version m < n is trying to read the struct
> >>>>>>and needs to skip the version [m+1, n] data fields ... in that case
> >>>>>>the de-serialization of the stream should probably be stream-driven
> >>>>>>rather than structure-driven.
> >>>>>>
> >>>>>>    Stefan
> >>>>>
> >>>>>Yes I've been struggling with that exactly.
> >>>>>Anthony, any thoughts?
> >>>>
> >>>>It just depends on how you write your visitor.  If you used
> >>>>sequences, you'd probably do something like this:
> >>>>
> >>>>start_struct ->
> >>>>   check for sequence tag, push starting offset and size onto stack
> >>>>   increment offset to next tag
> >>>>
> >>>>type_int (et al) ->
> >>>>   check for explicit type, parse data
> >>>>   increment offset to next tag
> >>>>
> >>>>end_struct ->
> >>>>   pop starting offset and size to temp variables
> >>>>   set offset to starting offset + size
> >>>>
> >>>>This is roughly how the QMP input marshaller works FWIW.
> >>>>
> >>>>Regards,
> >>>>
> >>>>Anthony Liguori
> >>>
> >>>One thing I worry about is enabling zero copy for
> >>>large string types (e.g. memory migration).
> >>
> >>Memory shouldn't be done through Visitors.  It should be handled as a special case.
> >
> >OK, that's fine then.
> >
> >>>So we need to be able to see a tag for memory page + address,
> >>>read that from socket directly at the correct virtual address.
> >>>
> >>>Probably, we can avoid using visitors for memory, and hope
> >>>everything else can stand an extra copy since it's small.
> >>>
> >>>But then, why do we worry about the size of
> >>>encoded device state as Anthony seems to do?
> >>
> >>There's a significant difference between the cost of something on
> >>the wire and the cost of doing a memcpy.  The cost of the data on
> >>the wire is directly proportional to downtime.  So if we increase
> >>the size of the device state by a factor of 10, we increase the
> >>minimum downtime by a factor of 10.
> >>
> >>Of course, *if* the size of device state is already negligible with
> >>respect to the minimum downtime, then it doesn't matter.  This is
> >>easy to quantify though.  For a normal migration session today,
> >>what's the total size of the device state in relation to the
> >>calculated bandwidth of the minimum downtime?
> >>
> >>If it's very small, then we can add names and not worry about it.
> >>
> >>Regards,
> >>
> >>Anthony Liguori
> >
> >Yes, it's easy to quantify. I think the following gives us
> >the offset before and after, so the difference is the size
> >we seek, right?
> 
> Yeah, you'll also want:
> 
> diff --git a/arch_init.c b/arch_init.c
> index a6c69c7..0d64200 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -334,6 +334,10 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, voi
> 
>      expected_time = ram_save_remaining() * TARGET_PAGE_SIZE / bwidth;
> 
> +    if (stage == 2 && expected_time <= migrate_max_downtime()) {
> +        fprintf(stderr, "max bwidth: %lld\n", (long long)(expected_time * bwidth));
> +    }
> +
>      return (stage == 2) && (expected_time <= migrate_max_downtime());
>  }
> 
> You'll want to compare the size to max bwidth.

Well, that depends on how the guest behaves, etc. I'm guessing
the full memory size is a sane thing to compare against.

I don't have a problem sticking this fprintf in as well.

> BTW, putting this info properly into migration stats would probably
> be pretty useful.
> 
> Regards,
> 
> Anthony Liguori

The problem is that adding anything to the monitor makes me worry
about future compatibility so much that I usually just give up.
IMO we really need a namespace for in-development experimental
commands, like "unsupported-XXX"; this would belong there.
Anthony Liguori Oct. 3, 2011, 4:05 p.m. UTC | #2
On 10/03/2011 10:45 AM, Michael S. Tsirkin wrote:
>> BTW, putting this info properly into migration stats would probably
>> be pretty useful.
>>
>> Regards,
>>
>> Anthony Liguori
>
> Problem is adding anything to monitor makes me worry
> about future compatibility so much I usually just give up.
> IMO we really need a namespace for in-development experimental
> commands, like "unsupported-XXX", this would belong.

Or just make all of QMP unsupported across any given version.  I'm not kidding 
about that actually.

If we document what the protocol is for any given version, then a layer like 
libvirt can deal with providing a consistent interface.

I often wonder who we're trying to preserve compatibility for.  Part of 
libvirt's mission statement is providing a stable API so why not leverage that 
mission and get us out of the compatibility business.

Regards,

Anthony Liguori

>
Daniel P. Berrangé Oct. 3, 2011, 4:24 p.m. UTC | #3
On Mon, Oct 03, 2011 at 11:05:02AM -0500, Anthony Liguori wrote:
> On 10/03/2011 10:45 AM, Michael S. Tsirkin wrote:
> >>BTW, putting this info properly into migration stats would probably
> >>be pretty useful.
> >>
> >>Regards,
> >>
> >>Anthony Liguori
> >
> >Problem is adding anything to monitor makes me worry
> >about future compatibility so much I usually just give up.
> >IMO we really need a namespace for in-development experimental
> >commands, like "unsupported-XXX", this would belong.
> 
> Or just make all of QMP unsupported across any given version.  I'm
> not kidding about that actually.
> 
> If we document what the protocol is for any given version, then a
> layer like libvirt can deal with providing a consistent interface.
> 
> I often wonder who we're trying to preserve compatibility for.  Part
> of libvirt's mission statement is providing a stable API so why not
> leverage that mission and get us out of the compatibility business.

You are correct, that ultimately libvirt should be isolating applications
from any changes in QEMU's monitor protocol that take place across
versions. We just ask that QEMU try not to make too many gratuitous
changes, if it is reasonable to avoid them without negative impact on
QEMU. The more time we have to spend adding support for different
commands across each QEMU release, the less time we have to spend on
adding useful features elsewhere.

To me the biggest benefit of QMP is not that the commands are to be
supported long term, but rather than it is a sanely parsable format
with detectable error reporting and asynchronous events.

The important thing to us is not to make semantic changes to any
*existing* commands. If an existing command is flawed, either leave
it alone & introduce a new one, or if compatibility can't be maintained
for some reason remove the old one & introduce a new one with a *new*
name to avoid any confusion.

So if you have to deprecate some existing QMP commands and introduce
new ones to replace them across releases we can & will cope with that
in libvirt.

Daniel
Michael S. Tsirkin Oct. 3, 2011, 4:51 p.m. UTC | #4
On Mon, Oct 03, 2011 at 11:05:02AM -0500, Anthony Liguori wrote:
> On 10/03/2011 10:45 AM, Michael S. Tsirkin wrote:
> >>BTW, putting this info properly into migration stats would probably
> >>be pretty useful.
> >>
> >>Regards,
> >>
> >>Anthony Liguori
> >
> >Problem is adding anything to monitor makes me worry
> >about future compatibility so much I usually just give up.
> >IMO we really need a namespace for in-development experimental
> >commands, like "unsupported-XXX", this would belong.
> 
> Or just make all of QMP unsupported across any given version.  I'm
> not kidding about that actually.
> 
> If we document what the protocol is for any given version, then a
> layer like libvirt can deal with providing a consistent interface.

Once distros ship qemu, people will use it. Once they use it
in some way for long enough, it becomes the supported interface.
No one reads documentation, obviously :) - do you expect
people to re-read qemu documentation *with every release*?

> 
> I often wonder who we're trying to preserve compatibility for.  Part
> of libvirt's mission statement is providing a stable API so why not
> leverage that mission and get us out of the compatibility business.
> 
> Regards,
> 
> Anthony Liguori

You can't force everyone to use libvirt.
Part of QMP's mission statement was a stable interface.
If it's not that, I don't know what it's for.
Michael S. Tsirkin Oct. 5, 2011, 11:28 a.m. UTC | #5
On Mon, Oct 03, 2011 at 10:00:48AM -0500, Anthony Liguori wrote:
> >Yes, it's easy to quantify. I think the following gives us
> >the offset before and after, so the difference is the size
> >we seek, right?

OK, Orit (Cc'd) did some research - this is a guest booting, still
in grub; the size probably does not get much smaller than that.
Windows is said to be much more aggressive in allocating memory.

 start offset: 9600673
 end offset: 9614933

So we get 15K out of 9M.
By the way, most of the memory here is pretty much
all uniform I guess, because compressing it gets us:
gzip: 1934169
bzip2 -9: 1462551

So even with aggressive compression, we probably won't be able to get
below 1.5M for memory, two orders of magnitude above device state.

Sounds convincing?
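
A minimal sketch of how such start/end offsets could be captured, assuming
QEMUFile's qemu_ftell() and a hypothetical hook around the device-state save;
the placement and the save_device_state() name are assumptions, not the patch
that was actually used:

/* Sketch: measure serialized device-state size by recording the stream
 * offset before and after the device-state sections are written.
 * QEMUFile and qemu_ftell() come from QEMU's QEMUFile code;
 * save_device_state() is a stand-in for whatever call writes the
 * device sections at this point in the migration path. */
#include <inttypes.h>
#include <stdio.h>

static void save_device_state(QEMUFile *f);   /* hypothetical */

static void log_device_state_size(QEMUFile *f)
{
    int64_t start = qemu_ftell(f);   /* offset before device state */

    save_device_state(f);            /* write the device sections */

    int64_t end = qemu_ftell(f);     /* offset after device state */
    fprintf(stderr, "device state: %" PRId64 " bytes\n", end - start);
}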
Anthony Liguori Oct. 5, 2011, 12:46 p.m. UTC | #6
On 10/05/2011 06:28 AM, Michael S. Tsirkin wrote:
> On Mon, Oct 03, 2011 at 10:00:48AM -0500, Anthony Liguori wrote:
>>> Yes, it's easy to quantify. I think the following gives us
>>> the offset before and after, so the difference is the size
>>> we seek, right?
>
> OK, Orit (Cc'd) did some research - this is a booting
> while still in grub, size probably does not get much less than that.
> windows is said to be much more aggressive in allocating memory.
>
>   start offset: 9600673
>   end offset: 9614933
>
> So we get 15K out of 9M.

So, let's do some napkin math.

Assume that it works out that most of that 15k is 4-byte integers.  If we 
assume an average name length of 6 characters, each name string should be encoded in 8 
bytes.  That means for every 4 bytes of data, we add 8 bytes of name, which means we're 
increasing the size by 200%.

That means 45k.

A 1Gbit link can transmit at most ~128k in 1ms.  So that extra 30k is going to cost 
~250us in transmit time if we can get 1Gbit.  At the default rate limit, it 
should cost us right around 1ms.

I guess it's livable, although with 30 network cards I suspect it gets a heck 
of a lot worse.

Regards,

Anthony Liguori

> By the way, most of the memory here is pretty much
> all uniform I guess, because compressing it gets us:
> gzip: 1934169
> bzip2 -9: 1462551
>
> So even with aggressive compression, we probably won't be able to get
> below 1.5M for memory, two orders of magnitude above device state.
>
> Sounds convincing?
>
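
A quick restatement of the napkin math in comment #6 above, with the assumed
inputs spelled out (15k of device state made of 4-byte integers, ~8 bytes of
encoded name per field, a 1 Gbit/s link - all taken from the estimate, none of
it measured):

/* Napkin math from the thread, restated: adding encoded field names to
 * ~15k of device state made of 4-byte integers, sent over 1 Gbit/s. */
#include <stdio.h>

int main(void)
{
    double state = 15 * 1024.0;        /* bytes of raw device state */
    double per_field = 4.0;            /* one 4-byte integer per field */
    double name_cost = 8.0;            /* ~6-char name encodes to ~8 bytes */

    double fields = state / per_field;
    double named = state + fields * name_cost;       /* ~45k total */

    double link = 1e9 / 8.0;                         /* 1 Gbit/s in bytes/s */
    double extra_us = (named - state) / link * 1e6;  /* ~250 us extra */

    printf("with names: %.0f bytes, extra transmit time: %.0f us\n",
           named, extra_us);
    return 0;
}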

Patch

diff --git a/arch_init.c b/arch_init.c
index a6c69c7..0d64200 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -334,6 +334,10 @@  int ram_save_live(Monitor *mon, QEMUFile *f, int stage, voi

      expected_time = ram_save_remaining() * TARGET_PAGE_SIZE / bwidth;

+    if (stage == 2 && expected_time <= migrate_max_downtime()) {
+        fprintf(stderr, "max bwidth: %lld\n", (long long)(expected_time * bwidth));
+    }
+
      return (stage == 2) && (expected_time <= migrate_max_downtime());
  }