diff mbox

[3/5] QMP: Introduce MIGRATION events

Message ID 9b6575587d22a5c85ec536172810520ee3b945d5.1274796992.git.quintela@redhat.com
State New
Headers show

Commit Message

Juan Quintela May 25, 2010, 2:21 p.m. UTC
They are emitted when migration starts, ends, has a failure or is canceled.

Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 QMP/qmp-events.txt |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 monitor.c          |   12 ++++++++++++
 monitor.h          |    4 ++++
 3 files changed, 66 insertions(+), 0 deletions(-)

Comments

Anthony Liguori May 25, 2010, 3:09 p.m. UTC | #1
On 05/25/2010 09:21 AM, Juan Quintela wrote:
> They are emitted when migration starts, ends, has a failure or is canceled.
>
> Signed-off-by: Juan Quintela<quintela@redhat.com>
> ---
>   QMP/qmp-events.txt |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
>   monitor.c          |   12 ++++++++++++
>   monitor.h          |    4 ++++
>   3 files changed, 66 insertions(+), 0 deletions(-)
>
> diff --git a/QMP/qmp-events.txt b/QMP/qmp-events.txt
> index 01ec85f..93caa4d 100644
> --- a/QMP/qmp-events.txt
> +++ b/QMP/qmp-events.txt
> @@ -26,6 +26,56 @@ Example:
>   Note: If action is "stop", a STOP event will eventually follow the
>   BLOCK_IO_ERROR event.
>    


> +MIGRATION_CANCELED
> +------------------
> +
> +Emitted when migration is canceled.  This is emitted in the source.
> +Target will emit MIGRATION_FAILED (no way to differentiate a FAILED
> +and CANCELED migration for target).
>    

But the management tool is the one that cancels so surely, it knows why 
already.

> +Data: None
> +
> +Example:
> +
> +{ "event": "MIGRATION_CANCELED",
> +    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
> +
> +MIGRATION_ENDED
> +---------------
> +
> +Emitted when migration ends (both in source and target)
>    

A start event is going to be generated already, no?

> +Data: None
> +
> +Example:
> +
> +{ "event": "MIGRATION_ENDED",
> +    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
> +
> +MIGRATION_FAILED
> +----------------
> +
> +Emitted when migration fails (both is source and target).
> +
> +Data: None
>    

There should be some information about why it failed, no? Preferrably in 
a QError format.

> +Example:
> +
> +{ "event": "MIGRATION_FAILED",
> +    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
> +
> +MIGRATION_STARTED
> +-----------------
> +
> +Emitted when migration starts (both in source and target).
>    

I think this makes more sense as a MIGRATION_CONNECTED event.  It 
probably should carry peer information too.

Regards,

Anthony Liguori

> +Data: None
> +
> +Example:
> +
> +{ "event": "MIGRATION_STARTED",
> +    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
> +
>   RESET
>   -----
>
> diff --git a/monitor.c b/monitor.c
> index ad50f12..5158780 100644
> --- a/monitor.c
> +++ b/monitor.c
> @@ -444,6 +444,18 @@ void monitor_protocol_event(MonitorEvent event, QObject *data)
>           case QEVENT_WATCHDOG:
>               event_name = "WATCHDOG";
>               break;
> +        case QEVENT_MIGRATION_STARTED:
> +            event_name = "MIGRATION_STARTED";
> +            break;
> +        case QEVENT_MIGRATION_ENDED:
> +            event_name = "MIGRATION_ENDED";
> +            break;
> +        case QEVENT_MIGRATION_FAILED:
> +            event_name = "MIGRATION_FAILED";
> +            break;
> +        case QEVENT_MIGRATION_CANCELED:
> +            event_name = "MIGRATION_CANCELED";
> +            break;
>           default:
>               abort();
>               break;
> diff --git a/monitor.h b/monitor.h
> index ea15469..34bcd38 100644
> --- a/monitor.h
> +++ b/monitor.h
> @@ -28,6 +28,10 @@ typedef enum MonitorEvent {
>       QEVENT_BLOCK_IO_ERROR,
>       QEVENT_RTC_CHANGE,
>       QEVENT_WATCHDOG,
> +    QEVENT_MIGRATION_STARTED,
> +    QEVENT_MIGRATION_ENDED,
> +    QEVENT_MIGRATION_FAILED,
> +    QEVENT_MIGRATION_CANCELED,
>       QEVENT_MAX,
>   } MonitorEvent;
>
>
Juan Quintela May 25, 2010, 3:35 p.m. UTC | #2
Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 05/25/2010 09:21 AM, Juan Quintela wrote:

>> +MIGRATION_CANCELED
>> +------------------
>> +
>> +Emitted when migration is canceled.  This is emitted in the source.
>> +Target will emit MIGRATION_FAILED (no way to differentiate a FAILED
>> +and CANCELED migration for target).
>>    
>
> But the management tool is the one that cancels so surely, it knows
> why already.

ok, then that one is ok.

>
>> +Data: None
>> +
>> +Example:
>> +
>> +{ "event": "MIGRATION_CANCELED",
>> +    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
>> +
>> +MIGRATION_ENDED
>> +---------------
>> +
>> +Emitted when migration ends (both in source and target)
>>    
>
> A start event is going to be generated already, no?

problem here is that libvirt start target with -S, and waits to do the
"cont" as soon as possible.  As of know, only way to do it is to poll
info migrate on source faster.

>> +Data: None
>> +
>> +Example:
>> +
>> +{ "event": "MIGRATION_ENDED",
>> +    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
>> +
>> +MIGRATION_FAILED
>> +----------------
>> +
>> +Emitted when migration fails (both is source and target).
>> +
>> +Data: None
>>    
>
> There should be some information about why it failed, no? Preferrably
> in a QError format.

At this point, we have basically -1 :(

I can add a field with an error number, but we are very bad at the
moment about moving errno's upstack.

>> +Example:
>> +
>> +{ "event": "MIGRATION_FAILED",
>> +    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
>> +
>> +MIGRATION_STARTED
>> +-----------------
>> +
>> +Emitted when migration starts (both in source and target).
>>    
>
> I think this makes more sense as a MIGRATION_CONNECTED event.  It
> probably should carry peer information too.

What kind of peer information?

We have tcp/fd/exec/unix migrations.  calling it CONNECTED vs STARTED, I
don't care.  But adding information?  Notice that the management
application knows what it did, I can put the:

 "exec: gzip -d < /tmp/foo"

string, but not much more that I can put here.

Later, Juan.
Daniel P. Berrangé May 25, 2010, 3:48 p.m. UTC | #3
On Tue, May 25, 2010 at 10:09:55AM -0500, Anthony Liguori wrote:
> On 05/25/2010 09:21 AM, Juan Quintela wrote:
> >They are emitted when migration starts, ends, has a failure or is canceled.
> >
> >+Data: None
> >+
> >+Example:
> >+
> >+{ "event": "MIGRATION_CANCELED",
> >+    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
> >+
> >+MIGRATION_ENDED
> >+---------------
> >+
> >+Emitted when migration ends (both in source and target)
> >   
> 
> A start event is going to be generated already, no?
> 
> >+Data: None
> >+
> >+Example:
> >+
> >+{ "event": "MIGRATION_ENDED",
> >+    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
> >+
> >+MIGRATION_FAILED
> >+----------------
> >+
> >+Emitted when migration fails (both is source and target).
> >+
> >+Data: None
> >   
> 
> There should be some information about why it failed, no? Preferrably in 
> a QError format.
> 
> >+Example:
> >+
> >+{ "event": "MIGRATION_FAILED",
> >+    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
> >+
> >+MIGRATION_STARTED
> >+-----------------
> >+
> >+Emitted when migration starts (both in source and target).
> >   
> 
> I think this makes more sense as a MIGRATION_CONNECTED event.  It 
> probably should carry peer information too.

FYI the original request for these events from a libvirt POV
for in terms of identifying the lifecycle transitions.

Currently we issue a migration commend on source, and then have
to poll very frequently on 'info migrate' to get progress stats,
and to determine completion. We want to poll much less frequently
for stats, and get async notification of completion/errors on the
source. 

Simiarly on the destination, we need to know when any migration
operation is taking place, so we can avoid issuing monitor
commands to the QEMU process during that time, and also track
success/failure + eventually get progress information via an
equivalent of 'info migrate' on destination.

So this is really focused on lifecycle transitions, rather than
network client connections. I'm not convinced that we should mix
up the two sorts of data. If we want to track network client
connections IMHO they ought to be separate events. Perhaps
there should be a generic QEMU  network CONNECT/DISCONNECT
event that works for all QEMU network sockets (migration, chardevs,
netdev sockets, vnc, spice, and whatever we invent in future using
sockets).


Daniel
Daniel P. Berrangé May 25, 2010, 3:52 p.m. UTC | #4
On Tue, May 25, 2010 at 05:35:53PM +0200, Juan Quintela wrote:
> Anthony Liguori <anthony@codemonkey.ws> wrote:
> 
> >> +Example:
> >> +
> >> +{ "event": "MIGRATION_FAILED",
> >> +    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
> >> +
> >> +MIGRATION_STARTED
> >> +-----------------
> >> +
> >> +Emitted when migration starts (both in source and target).
> >>    
> >
> > I think this makes more sense as a MIGRATION_CONNECTED event.  It
> > probably should carry peer information too.
> 
> What kind of peer information?
> 
> We have tcp/fd/exec/unix migrations.  calling it CONNECTED vs STARTED, I
> don't care.  But adding information?  Notice that the management
> application knows what it did, I can put the:
> 
>  "exec: gzip -d < /tmp/foo"
> 
> string, but not much more that I can put here.

This is why I think network client CONNECT/DISCONNECT events should be
separate from MIGRATION START/END events. They happen to occur at roughly
the same time if using a TCP / UNIX socket based migration transport,
but CONNECT/DISCONNECT + peer info is meaningless for exec or fd based
migration.


Daniel
Anthony Liguori May 25, 2010, 3:57 p.m. UTC | #5
On 05/25/2010 10:35 AM, Juan Quintela wrote:
> Anthony Liguori<anthony@codemonkey.ws>  wrote:
>    
>> On 05/25/2010 09:21 AM, Juan Quintela wrote:
>>      
>    
>>> +MIGRATION_CANCELED
>>> +------------------
>>> +
>>> +Emitted when migration is canceled.  This is emitted in the source.
>>> +Target will emit MIGRATION_FAILED (no way to differentiate a FAILED
>>> +and CANCELED migration for target).
>>>
>>>        
>> But the management tool is the one that cancels so surely, it knows
>> why already.
>>      
> ok, then that one is ok.
>
>    
>>      
>>> +Data: None
>>> +
>>> +Example:
>>> +
>>> +{ "event": "MIGRATION_CANCELED",
>>> +    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
>>> +
>>> +MIGRATION_ENDED
>>> +---------------
>>> +
>>> +Emitted when migration ends (both in source and target)
>>>
>>>        
>> A start event is going to be generated already, no?
>>      
> problem here is that libvirt start target with -S, and waits to do the
> "cont" as soon as possible.  As of know, only way to do it is to poll
> info migrate on source faster.
>    

Why does it do that??

That sound like a terrible idea.

>>> +Data: None
>>> +
>>> +Example:
>>> +
>>> +{ "event": "MIGRATION_ENDED",
>>> +    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
>>> +
>>> +MIGRATION_FAILED
>>> +----------------
>>> +
>>> +Emitted when migration fails (both is source and target).
>>> +
>>> +Data: None
>>>
>>>        
>> There should be some information about why it failed, no? Preferrably
>> in a QError format.
>>      
> At this point, we have basically -1 :(
>
> I can add a field with an error number, but we are very bad at the
> moment about moving errno's upstack.
>    

We need a better solution for reporting errors via notifications.

>> I think this makes more sense as a MIGRATION_CONNECTED event.  It
>> probably should carry peer information too.
>>      
> What kind of peer information?
>
> We have tcp/fd/exec/unix migrations.  calling it CONNECTED vs STARTED, I
> don't care.  But adding information?  Notice that the management
> application knows what it did, I can put the:
>
>   "exec: gzip -d<  /tmp/foo"
>
> string, but not much more that I can put here.
>    

Basically, do we have any useful information in info migrate that we can 
include?

Regards,

Anthony Liguori

> Later, Juan.
>
Juan Quintela May 25, 2010, 4:04 p.m. UTC | #6
Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 05/25/2010 10:35 AM, Juan Quintela wrote:

>> problem here is that libvirt start target with -S, and waits to do the
>> "cont" as soon as possible.  As of know, only way to do it is to poll
>> info migrate on source faster.
>>    
>
> Why does it do that??
>
> That sound like a terrible idea.

Becaues migration is not reliable, and they don't have a way to issue
cont only in one of the sides :(

We make migration protocol reliable, or management application have to
decide when migration suceeded or not.

This new events help then a lot.  But they issue the cont really fast
(before migration ends).  I don't remember why they did that.

danp?

>>> There should be some information about why it failed, no? Preferrably
>>> in a QError format.
>>>      
>> At this point, we have basically -1 :(
>>
>> I can add a field with an error number, but we are very bad at the
>> moment about moving errno's upstack.
>>    
>
> We need a better solution for reporting errors via notifications.

Suggestions?

Notice that what we need now is a way to know if migration ended with
success or in any other way, as soon as possible.

>>> I think this makes more sense as a MIGRATION_CONNECTED event.  It
>>> probably should carry peer information too.
>>>      
>> What kind of peer information?
>>
>> We have tcp/fd/exec/unix migrations.  calling it CONNECTED vs STARTED, I
>> don't care.  But adding information?  Notice that the management
>> application knows what it did, I can put the:
>>
>>   "exec: gzip -d<  /tmp/foo"
>>
>> string, but not much more that I can put here.
>>    
>
> Basically, do we have any useful information in info migrate that we
> can include?

(qemu) info migrate
Migration status: active
transferred ram: 874808 kbytes
remaining ram: 227912 kbytes
total ram: 1065344 kbytes
(qemu) 

I can't see anything interesting to put here :(

About the CONNECTED/STARTED distintion, I fully agree with danp.  We
just want STARTED event for migration, CONNECTION should be generated
(or not) for all sockets/char devices.  it don't make sense for fd/exec
for instance.

Later, Juan.
Daniel P. Berrangé May 25, 2010, 4:04 p.m. UTC | #7
On Tue, May 25, 2010 at 10:57:33AM -0500, Anthony Liguori wrote:
> On 05/25/2010 10:35 AM, Juan Quintela wrote:
> >Anthony Liguori<anthony@codemonkey.ws>  wrote:
> >   
> >   
> >>     
> >>>+Data: None
> >>>+
> >>>+Example:
> >>>+
> >>>+{ "event": "MIGRATION_CANCELED",
> >>>+    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
> >>>+
> >>>+MIGRATION_ENDED
> >>>+---------------
> >>>+
> >>>+Emitted when migration ends (both in source and target)
> >>>
> >>>       
> >>A start event is going to be generated already, no?
> >>     
> >problem here is that libvirt start target with -S, and waits to do the
> >"cont" as soon as possible.  As of know, only way to do it is to poll
> >info migrate on source faster.
> >   
> 
> Why does it do that??
> 
> That sound like a terrible idea.

Historically QEMU gave no alternative. Adding these STARTED/ENDED 
events is to allow libvirt to detect start + end of migration 
reliably, avoiding the previous hacks QEMU forced us todo on the
dest, and avoid the high rate polling we had no choice but todo
on the source.

> >>I think this makes more sense as a MIGRATION_CONNECTED event.  It
> >>probably should carry peer information too.
> >>     
> >What kind of peer information?
> >
> >We have tcp/fd/exec/unix migrations.  calling it CONNECTED vs STARTED, I
> >don't care.  But adding information?  Notice that the management
> >application knows what it did, I can put the:
> >
> >  "exec: gzip -d<  /tmp/foo"
> >
> >string, but not much more that I can put here.
> 
> Basically, do we have any useful information in info migrate that we can 
> include?

info migrate just includes the progress info + state (running, finished, 
cancelled, failed). The event itself replicates state. I don't see a hugely
compelling need to include the progress info in the FINISHED/CANCELLED
events. If really needed, the app can still call 'info migrate' to get it.

Regards,
Daniel
Juan Quintela May 25, 2010, 4:04 p.m. UTC | #8
Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 05/25/2010 10:35 AM, Juan Quintela wrote:

>> problem here is that libvirt start target with -S, and waits to do the
>> "cont" as soon as possible.  As of know, only way to do it is to poll
>> info migrate on source faster.
>>    
>
> Why does it do that??
>
> That sound like a terrible idea.

Becaues migration is not reliable, and they don't have a way to issue
cont only in one of the sides :(

We make migration protocol reliable, or management application have to
decide when migration suceeded or not.

This new events help then a lot.  But they issue the cont really fast
(before migration ends).  I don't remember why they did that.

danp?

>>> There should be some information about why it failed, no? Preferrably
>>> in a QError format.
>>>      
>> At this point, we have basically -1 :(
>>
>> I can add a field with an error number, but we are very bad at the
>> moment about moving errno's upstack.
>>    
>
> We need a better solution for reporting errors via notifications.

Suggestions?

Notice that what we need now is a way to know if migration ended with
success or in any other way, as soon as possible.

>>> I think this makes more sense as a MIGRATION_CONNECTED event.  It
>>> probably should carry peer information too.
>>>      
>> What kind of peer information?
>>
>> We have tcp/fd/exec/unix migrations.  calling it CONNECTED vs STARTED, I
>> don't care.  But adding information?  Notice that the management
>> application knows what it did, I can put the:
>>
>>   "exec: gzip -d<  /tmp/foo"
>>
>> string, but not much more that I can put here.
>>    
>
> Basically, do we have any useful information in info migrate that we
> can include?

(qemu) info migrate
Migration status: active
transferred ram: 874808 kbytes
remaining ram: 227912 kbytes
total ram: 1065344 kbytes
(qemu) 

I can't see anything interesting to put here :(

About the CONNECTED/STARTED distintion, I fully agree with danp.  We
just want STARTED event for migration, CONNECTION should be generated
(or not) for all sockets/char devices.  it don't make sense for fd/exec
for instance.

Later, Juan.
Anthony Liguori May 25, 2010, 4:10 p.m. UTC | #9
On 05/25/2010 11:04 AM, Juan Quintela wrote:
> Anthony Liguori<anthony@codemonkey.ws>  wrote:
>    
>> On 05/25/2010 10:35 AM, Juan Quintela wrote:
>>      
>    
>>> problem here is that libvirt start target with -S, and waits to do the
>>> "cont" as soon as possible.  As of know, only way to do it is to poll
>>> info migrate on source faster.
>>>
>>>        
>> Why does it do that??
>>
>> That sound like a terrible idea.
>>      
> Becaues migration is not reliable, and they don't have a way to issue
> cont only in one of the sides :(
>    

I don't know what you mean by reliable.

When the migration completes on the destination, it will start 
automatically.

The source will not start unless explicitly invoked.  If you 
successfully cancel a migration on the source, it's guaranteed that it 
won't start on the destination.  So the sequence looks like:

src) // decide we want to give up migration
src) migrate_cancel
src) // check migration status
src) cont // if migration cancelled
src) //if migration succeeded, check destination for completion
dst) // if not responsive and not completed in appropriate amount of 
time, kill guest
src) cont // if killed destination

I don't see what the problem is.

> We make migration protocol reliable, or management application have to
> decide when migration suceeded or not.
>    

Reliability has nothing to do with the protocol and everything to do 
with the presence of the third node.

> This new events help then a lot.  But they issue the cont really fast
> (before migration ends).  I don't remember why they did that.
>    

If libvirt is launching the destination with -S, it's doing the wrong 
thing and we ought make sure the proper fix gets implemented.

> danp?
>
>    
>>>> There should be some information about why it failed, no? Preferrably
>>>> in a QError format.
>>>>
>>>>          
>>> At this point, we have basically -1 :(
>>>
>>> I can add a field with an error number, but we are very bad at the
>>> moment about moving errno's upstack.
>>>
>>>        
>> We need a better solution for reporting errors via notifications.
>>      
> Suggestions?
>
> Notice that what we need now is a way to know if migration ended with
> success or in any other way, as soon as possible.
>    

Markus/Luiz?

>>>> I think this makes more sense as a MIGRATION_CONNECTED event.  It
>>>> probably should carry peer information too.
>>>>
>>>>          
>>> What kind of peer information?
>>>
>>> We have tcp/fd/exec/unix migrations.  calling it CONNECTED vs STARTED, I
>>> don't care.  But adding information?  Notice that the management
>>> application knows what it did, I can put the:
>>>
>>>    "exec: gzip -d<   /tmp/foo"
>>>
>>> string, but not much more that I can put here.
>>>
>>>        
>> Basically, do we have any useful information in info migrate that we
>> can include?
>>      
> (qemu) info migrate
> Migration status: active
> transferred ram: 874808 kbytes
> remaining ram: 227912 kbytes
> total ram: 1065344 kbytes
> (qemu)
>
> I can't see anything interesting to put here :(
>    

Ugh.

> About the CONNECTED/STARTED distintion, I fully agree with danp.  We
> just want STARTED event for migration, CONNECTION should be generated
> (or not) for all sockets/char devices.  it don't make sense for fd/exec
> for instance.
>    

That makes sense to me.

Regards,

Anthony Liguori

> Later, Juan.
>
Daniel P. Berrangé May 25, 2010, 4:25 p.m. UTC | #10
On Tue, May 25, 2010 at 06:04:17PM +0200, Juan Quintela wrote:
> Anthony Liguori <anthony@codemonkey.ws> wrote:
> > On 05/25/2010 10:35 AM, Juan Quintela wrote:
> 
> >> problem here is that libvirt start target with -S, and waits to do the
> >> "cont" as soon as possible.  As of know, only way to do it is to poll
> >> info migrate on source faster.
> >>    
> >
> > Why does it do that??
> >
> > That sound like a terrible idea.
> 
> Becaues migration is not reliable, and they don't have a way to issue
> cont only in one of the sides :(
> 
> We make migration protocol reliable, or management application have to
> decide when migration suceeded or not.
> 
> This new events help then a lot.  But they issue the cont really fast
> (before migration ends).  I don't remember why they did that.

The use of '-S / cont' isn't really because of reliability. There
are several scenarios though. There's a migrate API option to leave
the guest paused upon completion, hence we need to start it with -S
to stop it auto-running upon completion. With some disk locking 
approaches we need todo a lock transfer before allowing the dest
to continue running. It could be optimized to avoid the -S /cont
in cases where those two scenarios aren't relevant, but only if
we can get a separate async notification of when migration starts
and completes on the destination, so we can notify mgmt apps that
need this lifecycle event.

So in summary these lifecycle events on source + dest for start,
complete, fail, cancel are all focused on allowing libvirt to 
remove its existing hacks in migration support for current QEMU.

Regards,
Daniel
Anthony Liguori May 25, 2010, 4:33 p.m. UTC | #11
On 05/25/2010 11:25 AM, Daniel P. Berrange wrote:
> On Tue, May 25, 2010 at 06:04:17PM +0200, Juan Quintela wrote:
>    
>> Anthony Liguori<anthony@codemonkey.ws>  wrote:
>>      
>>> On 05/25/2010 10:35 AM, Juan Quintela wrote:
>>>        
>>      
>>>> problem here is that libvirt start target with -S, and waits to do the
>>>> "cont" as soon as possible.  As of know, only way to do it is to poll
>>>> info migrate on source faster.
>>>>
>>>>          
>>> Why does it do that??
>>>
>>> That sound like a terrible idea.
>>>        
>> Becaues migration is not reliable, and they don't have a way to issue
>> cont only in one of the sides :(
>>
>> We make migration protocol reliable, or management application have to
>> decide when migration suceeded or not.
>>
>> This new events help then a lot.  But they issue the cont really fast
>> (before migration ends).  I don't remember why they did that.
>>      
> The use of '-S / cont' isn't really because of reliability. There
> are several scenarios though. There's a migrate API option to leave
> the guest paused upon completion, hence we need to start it with -S
> to stop it auto-running upon completion.

That's a strange API.  Why would you want to do that?  Why not just stop 
and then migrate?  You're just wasting bandwidth doing a live migration 
and then leaving it stopped.  This is a critical period of time for the 
guest and generally speaking, you don't want to involve many layers of 
management tooling in these decisions because the result is going to be 
that you break the migration downtime contract.

>   With some disk locking
> approaches we need todo a lock transfer before allowing the dest
> to continue running.

QEMU is going to read the disk before the migration completes so the 
lock transfer is not going to work with the current implementation (it 
needs to read the disk to do probing).  I assume this is not something 
that's actually been implemented.

>   It could be optimized to avoid the -S /cont
> in cases where those two scenarios aren't relevant, but only if
> we can get a separate async notification of when migration starts
> and completes on the destination, so we can notify mgmt apps that
> need this lifecycle event.
>    

Migration completes == guest starts running.  You'll get a notification 
of that but you're not getting that now because you're doing -S which 
I'd argue is a functional problem on the part of libvirt (you're 
breaking the downtime contract).

I'm not sure why you would need a notification of when migration starts 
(since you know when you've started migration).

Regards,

Anthony Liguori

> So in summary these lifecycle events on source + dest for start,
> complete, fail, cancel are all focused on allowing libvirt to
> remove its existing hacks in migration support for current QEMU.
>
> Regards,
> Daniel
>
Juan Quintela May 25, 2010, 4:43 p.m. UTC | #12
Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 05/25/2010 11:25 AM, Daniel P. Berrange wrote:
>> On Tue, May 25, 2010 at 06:04:17PM +0200, Juan Quintela wrote:
>>    
>>> Anthony Liguori<anthony@codemonkey.ws>  wrote:

> I'm not sure why you would need a notification of when migration
> starts (since you know when you've started migration).

But you don't know if the other end "knows" that it has also started.

started is needed only in incoming part, because .... we don't have a
monitor to ask if migration has started.

Later, Juan.
Luiz Capitulino May 25, 2010, 6:13 p.m. UTC | #13
On Tue, 25 May 2010 11:10:23 -0500
Anthony Liguori <anthony@codemonkey.ws> wrote:

> >>>> There should be some information about why it failed, no? Preferrably
> >>>> in a QError format.
> >>>>
> >>>>          
> >>> At this point, we have basically -1 :(
> >>>
> >>> I can add a field with an error number, but we are very bad at the
> >>> moment about moving errno's upstack.
> >>>
> >>>        
> >> We need a better solution for reporting errors via notifications.
> >>      
> > Suggestions?
> >
> > Notice that what we need now is a way to know if migration ended with
> > success or in any other way, as soon as possible.
> >    
> 
> Markus/Luiz?

 We need to redesign QError. I could give it a try, but quite frankly, I
don't know how do it good enough..

 Markus has worked more on error handling than me though and I think he's
the best person to do it, but he's busy at other things atm.

 Note that major work is design, not code churn (could turn out into an issue,
but I doubt it).
Luiz Capitulino May 25, 2010, 6:21 p.m. UTC | #14
On Tue, 25 May 2010 17:35:53 +0200
Juan Quintela <quintela@redhat.com> wrote:

> Anthony Liguori <anthony@codemonkey.ws> wrote:
> > On 05/25/2010 09:21 AM, Juan Quintela wrote:
> 
> >> +MIGRATION_CANCELED
> >> +------------------
> >> +
> >> +Emitted when migration is canceled.  This is emitted in the source.
> >> +Target will emit MIGRATION_FAILED (no way to differentiate a FAILED
> >> +and CANCELED migration for target).
> >>    
> >
> > But the management tool is the one that cancels so surely, it knows
> > why already.
> 
> ok, then that one is ok.

 Isn't this one important for the destination instead?
Luiz Capitulino May 25, 2010, 6:31 p.m. UTC | #15
On Tue, 25 May 2010 16:21:03 +0200
Juan Quintela <quintela@redhat.com> wrote:

> They are emitted when migration starts, ends, has a failure or is canceled.
> 
> Signed-off-by: Juan Quintela <quintela@redhat.com>
> ---
>  QMP/qmp-events.txt |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  monitor.c          |   12 ++++++++++++
>  monitor.h          |    4 ++++
>  3 files changed, 66 insertions(+), 0 deletions(-)
> 
> diff --git a/QMP/qmp-events.txt b/QMP/qmp-events.txt
> index 01ec85f..93caa4d 100644
> --- a/QMP/qmp-events.txt
> +++ b/QMP/qmp-events.txt
> @@ -26,6 +26,56 @@ Example:
>  Note: If action is "stop", a STOP event will eventually follow the
>  BLOCK_IO_ERROR event.
> 
> +MIGRATION_CANCELED
> +------------------
> +
> +Emitted when migration is canceled.  This is emitted in the source.

 Shouldn't this one be emitted in the destination?

> +Target will emit MIGRATION_FAILED (no way to differentiate a FAILED
> +and CANCELED migration for target).
> +
> +Data: None
> +
> +Example:
> +
> +{ "event": "MIGRATION_CANCELED",
> +    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
> +
> +MIGRATION_ENDED
> +---------------
> +
> +Emitted when migration ends (both in source and target)
> +
> +Data: None
> +
> +Example:
> +
> +{ "event": "MIGRATION_ENDED",
> +    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
> +
> +MIGRATION_FAILED
> +----------------
> +
> +Emitted when migration fails (both is source and target).
> +
> +Data: None
> +
> +Example:
> +
> +{ "event": "MIGRATION_FAILED",
> +    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }

 What about a MIGRATION_FINISHED event, which contains a 'success'
key which is a bool?

 The only disadvantage of this is if we decide to add more information
to the event (say, stats) then it'd get ugly. Otherwise, one event is enough.

 Anyway, the counterpart of MIGRATION_FAILED is MIGRATION_SUCCEEDED.

> +
> +MIGRATION_STARTED
> +-----------------
> +
> +Emitted when migration starts (both in source and target).

 Don't you need this only on the destination?
Juan Quintela May 25, 2010, 6:38 p.m. UTC | #16
Luiz Capitulino <lcapitulino@redhat.com> wrote:
> On Tue, 25 May 2010 17:35:53 +0200
> Juan Quintela <quintela@redhat.com> wrote:
>
>> Anthony Liguori <anthony@codemonkey.ws> wrote:
>> > On 05/25/2010 09:21 AM, Juan Quintela wrote:
>> 
>> >> +MIGRATION_CANCELED
>> >> +------------------
>> >> +
>> >> +Emitted when migration is canceled.  This is emitted in the source.
>> >> +Target will emit MIGRATION_FAILED (no way to differentiate a FAILED
>> >> +and CANCELED migration for target).
>> >>    
>> >
>> > But the management tool is the one that cancels so surely, it knows
>> > why already.
>> 
>> ok, then that one is ok.
>
>  Isn't this one important for the destination instead?

Destination don't know what happened, only that conection/data is not
coming anymore.

From management prespective, management application knows that it has
canceled the migration, so no need to be told.

Later, Juan.
Anthony Liguori May 25, 2010, 6:51 p.m. UTC | #17
On 05/25/2010 01:31 PM, Luiz Capitulino wrote:
> On Tue, 25 May 2010 16:21:03 +0200
> Juan Quintela<quintela@redhat.com>  wrote:
>
>    
>> They are emitted when migration starts, ends, has a failure or is canceled.
>>
>> Signed-off-by: Juan Quintela<quintela@redhat.com>
>> ---
>>   QMP/qmp-events.txt |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>   monitor.c          |   12 ++++++++++++
>>   monitor.h          |    4 ++++
>>   3 files changed, 66 insertions(+), 0 deletions(-)
>>
>> diff --git a/QMP/qmp-events.txt b/QMP/qmp-events.txt
>> index 01ec85f..93caa4d 100644
>> --- a/QMP/qmp-events.txt
>> +++ b/QMP/qmp-events.txt
>> @@ -26,6 +26,56 @@ Example:
>>   Note: If action is "stop", a STOP event will eventually follow the
>>   BLOCK_IO_ERROR event.
>>
>> +MIGRATION_CANCELED
>> +------------------
>> +
>> +Emitted when migration is canceled.  This is emitted in the source.
>>      
>   Shouldn't this one be emitted in the destination?
>    

Destination can't distinguish a cancelled from a closed pipe.  But the 
idea is that a third party is talking to both source and destination so 
it knows if it's cancelled the migration.

>> +Target will emit MIGRATION_FAILED (no way to differentiate a FAILED
>> +and CANCELED migration for target).
>> +
>> +Data: None
>> +
>> +Example:
>> +
>> +{ "event": "MIGRATION_CANCELED",
>> +    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
>> +
>> +MIGRATION_ENDED
>> +---------------
>> +
>> +Emitted when migration ends (both in source and target)
>> +
>> +Data: None
>> +
>> +Example:
>> +
>> +{ "event": "MIGRATION_ENDED",
>> +    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
>> +
>> +MIGRATION_FAILED
>> +----------------
>> +
>> +Emitted when migration fails (both is source and target).
>> +
>> +Data: None
>> +
>> +Example:
>> +
>> +{ "event": "MIGRATION_FAILED",
>> +    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
>>      
>   What about a MIGRATION_FINISHED event, which contains a 'success'
> key which is a bool?
>
>   The only disadvantage of this is if we decide to add more information
> to the event (say, stats) then it'd get ugly. Otherwise, one event is enough.
>
>   Anyway, the counterpart of MIGRATION_FAILED is MIGRATION_SUCCEEDED.
>    

I see MIGRATION_FAILED as being very similar to block I/O error events.  
I think we'll need a very similar solution for both.  It boils down to, 
how do we raise asynchronous events when something fails?

Regards,

Anthony Liguori
Daniel P. Berrangé May 26, 2010, 10:16 a.m. UTC | #18
On Tue, May 25, 2010 at 11:33:15AM -0500, Anthony Liguori wrote:
> On 05/25/2010 11:25 AM, Daniel P. Berrange wrote:
> >  With some disk locking
> >approaches we need todo a lock transfer before allowing the dest
> >to continue running.
> 
> QEMU is going to read the disk before the migration completes so the 
> lock transfer is not going to work with the current implementation (it 
> needs to read the disk to do probing).  I assume this is not something 
> that's actually been implemented.

Lock transfer upon migration is a two-phase process. When the QEMU
process for incoming migration starts a shared lock is held. Upon
completion of migration an attempt is made to upgrade this to an
exclusive lock. Only if this upgrade succeeds the dest is allowed 
to start running. 

> >  It could be optimized to avoid the -S /cont
> >in cases where those two scenarios aren't relevant, but only if
> >we can get a separate async notification of when migration starts
> >and completes on the destination, so we can notify mgmt apps that
> >need this lifecycle event.
> >   
> 
> Migration completes == guest starts running.  You'll get a notification 
> of that but you're not getting that now because you're doing -S which 
> I'd argue is a functional problem on the part of libvirt (you're 
> breaking the downtime contract).

If doing non-live migration, or restoring a saved image which was
originally in the paused state, there is no 'guest starts running'
event. You cannot reliably infer that migration completed by looking
for a RESUME event. Hence the need for an explicit notification
for the end of migration events.

> I'm not sure why you would need a notification of when migration starts 
> (since you know when you've started migration).

The destination only knows that it has started a QEMU instance ready
for incoming migration. It doesn't know when / if an incoming migration
has actually been accepted & started processing. Hence the need for an
explicit notification for the start of migration.


Daniel
Daniel P. Berrangé May 26, 2010, 10:33 a.m. UTC | #19
On Tue, May 25, 2010 at 06:43:13PM +0200, Juan Quintela wrote:
> Anthony Liguori <anthony@codemonkey.ws> wrote:
> > On 05/25/2010 11:25 AM, Daniel P. Berrange wrote:
> >> On Tue, May 25, 2010 at 06:04:17PM +0200, Juan Quintela wrote:
> >>    
> >>> Anthony Liguori<anthony@codemonkey.ws>  wrote:
> 
> > I'm not sure why you would need a notification of when migration
> > starts (since you know when you've started migration).
> 
> But you don't know if the other end "knows" that it has also started.
> 
> started is needed only in incoming part, because .... we don't have a
> monitor to ask if migration has started.

If we ever want to get closer to allowing multiple monitors, or allowing
apps to issue QMP commands directly via libvirt, then we still need the
'migration started' event on the source, because something else can 
have issued the 'migrate' command without the mgmt app knowing.

MIGRATED_STARTED+STOPPED really *is* needed if we're to make QMP cope 
with all possible use cases. If we rely on inferring it from STOP+RESUME
events, it is going to exclude a significant set of use cases, and likely
result in this being proposed all over again in 12 months time :-(

Regards,
Daniel
Luiz Capitulino May 26, 2010, 1:14 p.m. UTC | #20
On Tue, 25 May 2010 13:51:01 -0500
Anthony Liguori <anthony@codemonkey.ws> wrote:

> On 05/25/2010 01:31 PM, Luiz Capitulino wrote:
> > On Tue, 25 May 2010 16:21:03 +0200
> > Juan Quintela<quintela@redhat.com>  wrote:
> >
> >    
> >> They are emitted when migration starts, ends, has a failure or is canceled.
> >>
> >> Signed-off-by: Juan Quintela<quintela@redhat.com>
> >> ---
> >>   QMP/qmp-events.txt |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >>   monitor.c          |   12 ++++++++++++
> >>   monitor.h          |    4 ++++
> >>   3 files changed, 66 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/QMP/qmp-events.txt b/QMP/qmp-events.txt
> >> index 01ec85f..93caa4d 100644
> >> --- a/QMP/qmp-events.txt
> >> +++ b/QMP/qmp-events.txt
> >> @@ -26,6 +26,56 @@ Example:
> >>   Note: If action is "stop", a STOP event will eventually follow the
> >>   BLOCK_IO_ERROR event.
> >>
> >> +MIGRATION_CANCELED
> >> +------------------
> >> +
> >> +Emitted when migration is canceled.  This is emitted in the source.
> >>      
> >   Shouldn't this one be emitted in the destination?
> >    
> 
> Destination can't distinguish a cancelled from a closed pipe.  But the 
> idea is that a third party is talking to both source and destination so 
> it knows if it's cancelled the migration.

 Yeah, got it.

> >> +Target will emit MIGRATION_FAILED (no way to differentiate a FAILED
> >> +and CANCELED migration for target).
> >> +
> >> +Data: None
> >> +
> >> +Example:
> >> +
> >> +{ "event": "MIGRATION_CANCELED",
> >> +    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
> >> +
> >> +MIGRATION_ENDED
> >> +---------------
> >> +
> >> +Emitted when migration ends (both in source and target)
> >> +
> >> +Data: None
> >> +
> >> +Example:
> >> +
> >> +{ "event": "MIGRATION_ENDED",
> >> +    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
> >> +
> >> +MIGRATION_FAILED
> >> +----------------
> >> +
> >> +Emitted when migration fails (both is source and target).
> >> +
> >> +Data: None
> >> +
> >> +Example:
> >> +
> >> +{ "event": "MIGRATION_FAILED",
> >> +    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
> >>      
> >   What about a MIGRATION_FINISHED event, which contains a 'success'
> > key which is a bool?
> >
> >   The only disadvantage of this is if we decide to add more information
> > to the event (say, stats) then it'd get ugly. Otherwise, one event is enough.
> >
> >   Anyway, the counterpart of MIGRATION_FAILED is MIGRATION_SUCCEEDED.
> >    
> 
> I see MIGRATION_FAILED as being very similar to block I/O error events.  
> I think we'll need a very similar solution for both.  It boils down to, 
> how do we raise asynchronous events when something fails?

 You mean we should have a sort of error specific events?
Anthony Liguori May 26, 2010, 2:54 p.m. UTC | #21
On 05/26/2010 05:33 AM, Daniel P. Berrange wrote:
>>> I'm not sure why you would need a notification of when migration
>>> starts (since you know when you've started migration).
>>>        
>> But you don't know if the other end "knows" that it has also started.
>>
>> started is needed only in incoming part, because .... we don't have a
>> monitor to ask if migration has started.
>>      
> If we ever want to get closer to allowing multiple monitors, or allowing
> apps to issue QMP commands directly via libvirt, then we still need the
> 'migration started' event on the source, because something else can
> have issued the 'migrate' command without the mgmt app knowing.
>    

Migration started doesn't help multiple monitors.  You need locking of 
some sort.

Part of the problem is the QMP migrate command is implemented as a 
synchronous command.  It really ought to be an asynchronous command.  
That tells you when the migration has actually completed without polling.

On the source end, I can't think of any events that would be useful.  
The migrate command can complete with a failure so that gives you 
failure notifications.

On the destination side, we're really limited by the fact that we don't 
do live incoming migrations.  The monitor doesn't get a chance to run at 
all with exec: migration, for instance.

For tcp: and unix:, a CONNECTED event absolutely makes sense (every 
socket server should emit a CONNECTED event).  Unfortunately, after 
CONNECTED you lose the monitor until migration is complete.  If 
something bad happens, you have to exit qemu so once the monitor 
returns, migration has completed successfully.

If we introduce live incoming migration, we'll need to rethink things.  
I would actually suggest that we deprecate the incoming command if we do 
that and make incoming migration a monitor command.  I would think it 
should have the same semantics as migrate (as an asynchronous command).  
A CONNECTED event still makes sense for tcp and unix protocols but I 
don't think events make sense for start stop vs. an asynchronous command 
completion.

Regards,

Anthony Liguori

> MIGRATED_STARTED+STOPPED really *is* needed if we're to make QMP cope
> with all possible use cases. If we rely on inferring it from STOP+RESUME
> events, it is going to exclude a significant set of use cases, and likely
> result in this being proposed all over again in 12 months time :-(
>
> Regards,
> Daniel
>
Daniel P. Berrangé May 26, 2010, 3:15 p.m. UTC | #22
On Wed, May 26, 2010 at 09:54:22AM -0500, Anthony Liguori wrote:
> On 05/26/2010 05:33 AM, Daniel P. Berrange wrote:
> >>>I'm not sure why you would need a notification of when migration
> >>>starts (since you know when you've started migration).
> >>>       
> >>But you don't know if the other end "knows" that it has also started.
> >>
> >>started is needed only in incoming part, because .... we don't have a
> >>monitor to ask if migration has started.
> >>     
> >If we ever want to get closer to allowing multiple monitors, or allowing
> >apps to issue QMP commands directly via libvirt, then we still need the
> >'migration started' event on the source, because something else can
> >have issued the 'migrate' command without the mgmt app knowing.
> >   
> 
> Migration started doesn't help multiple monitors.  You need locking of 
> some sort.
> 
> Part of the problem is the QMP migrate command is implemented as a 
> synchronous command.  It really ought to be an asynchronous command.  
> That tells you when the migration has actually completed without polling.

Handling asynchronous commands is alot more complicated and error 
prone for client apps, than providing a asynchronous event notification
of the lifecycle stages. If you want to also query status while waiting
for the completion, it means you can have to deal with overlapping 
command  execute+return pairs within a single monitor connection.
AFAICT this requires a change to QMP to require a unique ID to be 
sent with the {'execute'..} command and be sent back with the later
corresponding {'return'...} data,  so you can actually correlate 
reliably.

> On the destination side, we're really limited by the fact that we don't 
> do live incoming migrations.  The monitor doesn't get a chance to run at 
> all with exec: migration, for instance.

If QEMU let you issue a monitor command for starting incoming
migration, instead of using -incoming this wouldn't such a bad
problem. eg you can launch QEMU in the desired config, with CPUs
stopped, do the normal QMP handshake + whatever else is required
then issue 'migrate_incoming URI' which blocked the caller for
the duration, to allow completion to be detected.

> For tcp: and unix:, a CONNECTED event absolutely makes sense (every 
> socket server should emit a CONNECTED event).  Unfortunately, after 
> CONNECTED you lose the monitor until migration is complete.  If 
> something bad happens, you have to exit qemu so once the monitor 
> returns, migration has completed successfully.
> 
> If we introduce live incoming migration, we'll need to rethink things.  
> I would actually suggest that we deprecate the incoming command if we do 
> that and make incoming migration a monitor command.  I would think it 
> should have the same semantics as migrate (as an asynchronous command).  
> A CONNECTED event still makes sense for tcp and unix protocols but I 
> don't think events make sense for start stop vs. an asynchronous command 
> completion.

Do you actually mean 'deprecate -incoming arg' here ? 

Daniel
Anthony Liguori May 26, 2010, 4:55 p.m. UTC | #23
On 05/26/2010 10:15 AM, Daniel P. Berrange wrote:
> On Wed, May 26, 2010 at 09:54:22AM -0500, Anthony Liguori wrote:
>    
>> On 05/26/2010 05:33 AM, Daniel P. Berrange wrote:
>>      
>>>>> I'm not sure why you would need a notification of when migration
>>>>> starts (since you know when you've started migration).
>>>>>
>>>>>            
>>>> But you don't know if the other end "knows" that it has also started.
>>>>
>>>> started is needed only in incoming part, because .... we don't have a
>>>> monitor to ask if migration has started.
>>>>
>>>>          
>>> If we ever want to get closer to allowing multiple monitors, or allowing
>>> apps to issue QMP commands directly via libvirt, then we still need the
>>> 'migration started' event on the source, because something else can
>>> have issued the 'migrate' command without the mgmt app knowing.
>>>
>>>        
>> Migration started doesn't help multiple monitors.  You need locking of
>> some sort.
>>
>> Part of the problem is the QMP migrate command is implemented as a
>> synchronous command.  It really ought to be an asynchronous command.
>> That tells you when the migration has actually completed without polling.
>>      
> Handling asynchronous commands is alot more complicated and error
> prone for client apps, than providing a asynchronous event notification
> of the lifecycle stages. If you want to also query status while waiting
> for the completion, it means you can have to deal with overlapping
> command  execute+return pairs within a single monitor connection.
> AFAICT this requires a change to QMP to require a unique ID to be
> sent with the {'execute'..} command and be sent back with the later
> corresponding {'return'...} data,  so you can actually correlate
> reliably.
>    

That's exactly how the protocol is designed.  That was one of the major 
improvements of QMP over the human monior.

This is how the info balloon command works, BTW.

Since there's a clear correlation between the request and the result of 
the request, an asynchronous command is what makes the most sense.  It 
eliminates the problem of how to pass QErrors via an event which is one 
of the problems with the current event proposal.

Events are meant for when there is no clear request that the event is 
associated with.  A VNC CONNECT event is a good example of this.

>    
>> On the destination side, we're really limited by the fact that we don't
>> do live incoming migrations.  The monitor doesn't get a chance to run at
>> all with exec: migration, for instance.
>>      
> If QEMU let you issue a monitor command for starting incoming
> migration, instead of using -incoming this wouldn't such a bad
> problem. eg you can launch QEMU in the desired config, with CPUs
> stopped, do the normal QMP handshake + whatever else is required
> then issue 'migrate_incoming URI' which blocked the caller for
> the duration, to allow completion to be detected.
>
>    
>> For tcp: and unix:, a CONNECTED event absolutely makes sense (every
>> socket server should emit a CONNECTED event).  Unfortunately, after
>> CONNECTED you lose the monitor until migration is complete.  If
>> something bad happens, you have to exit qemu so once the monitor
>> returns, migration has completed successfully.
>>
>> If we introduce live incoming migration, we'll need to rethink things.
>> I would actually suggest that we deprecate the incoming command if we do
>> that and make incoming migration a monitor command.  I would think it
>> should have the same semantics as migrate (as an asynchronous command).
>> A CONNECTED event still makes sense for tcp and unix protocols but I
>> don't think events make sense for start stop vs. an asynchronous command
>> completion.
>>      
> Do you actually mean 'deprecate -incoming arg' here ?
>    

Yes.  And by deprecate, I really mean that -incoming just becomes 
syntactic sugar for executing a monitor command immediately.

Regards,

Anthony Liguori

> Daniel
>
Luiz Capitulino May 27, 2010, 1:48 p.m. UTC | #24
On Wed, 26 May 2010 11:55:31 -0500
Anthony Liguori <anthony@codemonkey.ws> wrote:

> On 05/26/2010 10:15 AM, Daniel P. Berrange wrote:
> > On Wed, May 26, 2010 at 09:54:22AM -0500, Anthony Liguori wrote:
> >    
> >> On 05/26/2010 05:33 AM, Daniel P. Berrange wrote:
> >>      
> >>>>> I'm not sure why you would need a notification of when migration
> >>>>> starts (since you know when you've started migration).
> >>>>>
> >>>>>            
> >>>> But you don't know if the other end "knows" that it has also started.
> >>>>
> >>>> started is needed only in incoming part, because .... we don't have a
> >>>> monitor to ask if migration has started.
> >>>>
> >>>>          
> >>> If we ever want to get closer to allowing multiple monitors, or allowing
> >>> apps to issue QMP commands directly via libvirt, then we still need the
> >>> 'migration started' event on the source, because something else can
> >>> have issued the 'migrate' command without the mgmt app knowing.
> >>>
> >>>        
> >> Migration started doesn't help multiple monitors.  You need locking of
> >> some sort.
> >>
> >> Part of the problem is the QMP migrate command is implemented as a
> >> synchronous command.  It really ought to be an asynchronous command.
> >> That tells you when the migration has actually completed without polling.
> >>      
> > Handling asynchronous commands is alot more complicated and error
> > prone for client apps, than providing a asynchronous event notification
> > of the lifecycle stages. If you want to also query status while waiting
> > for the completion, it means you can have to deal with overlapping
> > command  execute+return pairs within a single monitor connection.
> > AFAICT this requires a change to QMP to require a unique ID to be
> > sent with the {'execute'..} command and be sent back with the later
> > corresponding {'return'...} data,  so you can actually correlate
> > reliably.
> >    
> 
> That's exactly how the protocol is designed.  That was one of the major 
> improvements of QMP over the human monior.

 Yes and it already has 'id' support:

{ "execute": "cont", "id": "luiz" }
{"timestamp": {"seconds": 1274966635, "microseconds": 776813}, "event": "RESUME"}
{"return": {}, "id": "luiz"}

 But it doesn't detect duplicates, this is something I think it's up
to the client to do, do you agree?

> This is how the info balloon command works, BTW.

 I won't remember the details now, but that interface has some issues and it
has to be reviewed.

> Since there's a clear correlation between the request and the result of 
> the request, an asynchronous command is what makes the most sense.  It 
> eliminates the problem of how to pass QErrors via an event which is one 
> of the problems with the current event proposal.

 Not exactly, this is a problem with QError not the event proposal. We'll
have the same issue if we decide to include errno in the migrate errors and
the problem still exists with the BLOCK_IO_ERROR event.

 That said, I do agree that migrate should be asynchronous. This yet another
thing we may want to fix before 0.13.

[...]

> >> For tcp: and unix:, a CONNECTED event absolutely makes sense (every
> >> socket server should emit a CONNECTED event).  Unfortunately, after
> >> CONNECTED you lose the monitor until migration is complete.  If
> >> something bad happens, you have to exit qemu so once the monitor
> >> returns, migration has completed successfully.
> >>
> >> If we introduce live incoming migration, we'll need to rethink things.
> >> I would actually suggest that we deprecate the incoming command if we do
> >> that and make incoming migration a monitor command.  I would think it
> >> should have the same semantics as migrate (as an asynchronous command).
> >> A CONNECTED event still makes sense for tcp and unix protocols but I
> >> don't think events make sense for start stop vs. an asynchronous command
> >> completion.
> >>      
> > Do you actually mean 'deprecate -incoming arg' here ?
> >    
> 
> Yes.  And by deprecate, I really mean that -incoming just becomes 
> syntactic sugar for executing a monitor command immediately.

 But we can't change -incoming itself, since our command-line is supposed
to be stable, right?

 Also, Juan has said that replacing that arg with a monitor command
doesn't work, as qemu would have to be started in paused monitor for this
to work.

 So, what about introducing a -incoming-monitor command, which puts qemu
in the right state for migration, but requires a migrate_incoming command
to actually start migration?
Juan Quintela May 27, 2010, 3:58 p.m. UTC | #25
Luiz Capitulino <lcapitulino@redhat.com> wrote:
> On Wed, 26 May 2010 11:55:31 -0500
> Anthony Liguori <anthony@codemonkey.ws> wrote:

>> That's exactly how the protocol is designed.  That was one of the major 
>> improvements of QMP over the human monior.
>
>  Yes and it already has 'id' support:
>
> { "execute": "cont", "id": "luiz" }
> {"timestamp": {"seconds": 1274966635, "microseconds": 776813}, "event": "RESUME"}
> {"return": {}, "id": "luiz"}
>
>  But it doesn't detect duplicates, this is something I think it's up
> to the client to do, do you agree?
>
>> This is how the info balloon command works, BTW.
>
>  I won't remember the details now, but that interface has some issues and it
> has to be reviewed.
>
>> Since there's a clear correlation between the request and the result of 
>> the request, an asynchronous command is what makes the most sense.  It 
>> eliminates the problem of how to pass QErrors via an event which is one 
>> of the problems with the current event proposal.
>
>  Not exactly, this is a problem with QError not the event proposal. We'll
> have the same issue if we decide to include errno in the migrate errors and
> the problem still exists with the BLOCK_IO_ERROR event.
>
>  That said, I do agree that migrate should be asynchronous. This yet another
> thing we may want to fix before 0.13.

How difficult is that?

> [...]
>
>> >> For tcp: and unix:, a CONNECTED event absolutely makes sense (every
>> >> socket server should emit a CONNECTED event).  Unfortunately, after
>> >> CONNECTED you lose the monitor until migration is complete.  If
>> >> something bad happens, you have to exit qemu so once the monitor
>> >> returns, migration has completed successfully.
>> >>
>> >> If we introduce live incoming migration, we'll need to rethink things.
>> >> I would actually suggest that we deprecate the incoming command if we do
>> >> that and make incoming migration a monitor command.  I would think it
>> >> should have the same semantics as migrate (as an asynchronous command).
>> >> A CONNECTED event still makes sense for tcp and unix protocols but I
>> >> don't think events make sense for start stop vs. an asynchronous command
>> >> completion.
>> >>      
>> > Do you actually mean 'deprecate -incoming arg' here ?
>> >    
>> 
>> Yes.  And by deprecate, I really mean that -incoming just becomes 
>> syntactic sugar for executing a monitor command immediately.
>
>  But we can't change -incoming itself, since our command-line is supposed
> to be stable, right?
>
>  Also, Juan has said that replacing that arg with a monitor command
> doesn't work, as qemu would have to be started in paused monitor for this
> to work.
>
>  So, what about introducing a -incoming-monitor command, which puts qemu
> in the right state for migration, but requires a migrate_incoming command
> to actually start migration?

this -incoming-monitor is called -S, that should have a long name of
-no-autostart

that is what it does, and what we need for incoming migration as monitor
command.  Nothing new to see here.

Later, Juan.
Luiz Capitulino May 27, 2010, 4:07 p.m. UTC | #26
On Thu, 27 May 2010 17:58:03 +0200
Juan Quintela <quintela@redhat.com> wrote:

> Luiz Capitulino <lcapitulino@redhat.com> wrote:
> > On Wed, 26 May 2010 11:55:31 -0500
> > Anthony Liguori <anthony@codemonkey.ws> wrote:
> 
> >> That's exactly how the protocol is designed.  That was one of the major 
> >> improvements of QMP over the human monior.
> >
> >  Yes and it already has 'id' support:
> >
> > { "execute": "cont", "id": "luiz" }
> > {"timestamp": {"seconds": 1274966635, "microseconds": 776813}, "event": "RESUME"}
> > {"return": {}, "id": "luiz"}
> >
> >  But it doesn't detect duplicates, this is something I think it's up
> > to the client to do, do you agree?
> >
> >> This is how the info balloon command works, BTW.
> >
> >  I won't remember the details now, but that interface has some issues and it
> > has to be reviewed.
> >
> >> Since there's a clear correlation between the request and the result of 
> >> the request, an asynchronous command is what makes the most sense.  It 
> >> eliminates the problem of how to pass QErrors via an event which is one 
> >> of the problems with the current event proposal.
> >
> >  Not exactly, this is a problem with QError not the event proposal. We'll
> > have the same issue if we decide to include errno in the migrate errors and
> > the problem still exists with the BLOCK_IO_ERROR event.
> >
> >  That said, I do agree that migrate should be asynchronous. This yet another
> > thing we may want to fix before 0.13.
> 
> How difficult is that?

 Anthony is working on this and should have patches soon.

[...]

> >> Yes.  And by deprecate, I really mean that -incoming just becomes 
> >> syntactic sugar for executing a monitor command immediately.
> >
> >  But we can't change -incoming itself, since our command-line is supposed
> > to be stable, right?
> >
> >  Also, Juan has said that replacing that arg with a monitor command
> > doesn't work, as qemu would have to be started in paused monitor for this
> > to work.
> >
> >  So, what about introducing a -incoming-monitor command, which puts qemu
> > in the right state for migration, but requires a migrate_incoming command
> > to actually start migration?
> 
> this -incoming-monitor is called -S, that should have a long name of
> -no-autostart
> 
> that is what it does, and what we need for incoming migration as monitor
> command.  Nothing new to see here.

 Ok, I thought -S alone wasn't enough but if it's, let's go for it then.
Anthony Liguori May 27, 2010, 4:07 p.m. UTC | #27
On 05/27/2010 10:58 AM, Juan Quintela wrote:
> Luiz Capitulino<lcapitulino@redhat.com>  wrote:
>    
>> On Wed, 26 May 2010 11:55:31 -0500
>> Anthony Liguori<anthony@codemonkey.ws>  wrote:
>>      
>    
>>> That's exactly how the protocol is designed.  That was one of the major
>>> improvements of QMP over the human monior.
>>>        
>>   Yes and it already has 'id' support:
>>
>> { "execute": "cont", "id": "luiz" }
>> {"timestamp": {"seconds": 1274966635, "microseconds": 776813}, "event": "RESUME"}
>> {"return": {}, "id": "luiz"}
>>
>>   But it doesn't detect duplicates, this is something I think it's up
>> to the client to do, do you agree?
>>
>>      
>>> This is how the info balloon command works, BTW.
>>>        
>>   I won't remember the details now, but that interface has some issues and it
>> has to be reviewed.
>>
>>      
>>> Since there's a clear correlation between the request and the result of
>>> the request, an asynchronous command is what makes the most sense.  It
>>> eliminates the problem of how to pass QErrors via an event which is one
>>> of the problems with the current event proposal.
>>>        
>>   Not exactly, this is a problem with QError not the event proposal. We'll
>> have the same issue if we decide to include errno in the migrate errors and
>> the problem still exists with the BLOCK_IO_ERROR event.
>>
>>   That said, I do agree that migrate should be asynchronous. This yet another
>> thing we may want to fix before 0.13.
>>      
> How difficult is that?
>    

Easy.  I've got a patch locally and will submit an RFC this afternoon.  
I'm currently converting the migration error's to QError to make sure 
than when an error occurs, we can report it in a meaningful way.

We'll still need the MIGRATION_CONNECTED event though and I'm not 
currently planning on working on that.

>> [...]
>>
>>      
>>>>> For tcp: and unix:, a CONNECTED event absolutely makes sense (every
>>>>> socket server should emit a CONNECTED event).  Unfortunately, after
>>>>> CONNECTED you lose the monitor until migration is complete.  If
>>>>> something bad happens, you have to exit qemu so once the monitor
>>>>> returns, migration has completed successfully.
>>>>>
>>>>> If we introduce live incoming migration, we'll need to rethink things.
>>>>> I would actually suggest that we deprecate the incoming command if we do
>>>>> that and make incoming migration a monitor command.  I would think it
>>>>> should have the same semantics as migrate (as an asynchronous command).
>>>>> A CONNECTED event still makes sense for tcp and unix protocols but I
>>>>> don't think events make sense for start stop vs. an asynchronous command
>>>>> completion.
>>>>>
>>>>>            
>>>> Do you actually mean 'deprecate -incoming arg' here ?
>>>>
>>>>          
>>> Yes.  And by deprecate, I really mean that -incoming just becomes
>>> syntactic sugar for executing a monitor command immediately.
>>>        
>>   But we can't change -incoming itself, since our command-line is supposed
>> to be stable, right?
>>
>>   Also, Juan has said that replacing that arg with a monitor command
>> doesn't work, as qemu would have to be started in paused monitor for this
>> to work.
>>
>>   So, what about introducing a -incoming-monitor command, which puts qemu
>> in the right state for migration, but requires a migrate_incoming command
>> to actually start migration?
>>      
> this -incoming-monitor is called -S, that should have a long name of
> -no-autostart
>
> that is what it does, and what we need for incoming migration as monitor
> command.  Nothing new to see here.
>    

Agreed.  We could introduce a migrate_incoming command but it wouldn't 
be possible to poll it's results while the command executed.  I'd rather 
not do that though because that's potentially ugly for clients to deal with.

Regards,

Anthony Liguori

> Later, Juan.
>
diff mbox

Patch

diff --git a/QMP/qmp-events.txt b/QMP/qmp-events.txt
index 01ec85f..93caa4d 100644
--- a/QMP/qmp-events.txt
+++ b/QMP/qmp-events.txt
@@ -26,6 +26,56 @@  Example:
 Note: If action is "stop", a STOP event will eventually follow the
 BLOCK_IO_ERROR event.

+MIGRATION_CANCELED
+------------------
+
+Emitted when migration is canceled.  This is emitted in the source.
+Target will emit MIGRATION_FAILED (no way to differentiate a FAILED
+and CANCELED migration for target).
+
+Data: None
+
+Example:
+
+{ "event": "MIGRATION_CANCELED",
+    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
+
+MIGRATION_ENDED
+---------------
+
+Emitted when migration ends (both in source and target)
+
+Data: None
+
+Example:
+
+{ "event": "MIGRATION_ENDED",
+    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
+
+MIGRATION_FAILED
+----------------
+
+Emitted when migration fails (both is source and target).
+
+Data: None
+
+Example:
+
+{ "event": "MIGRATION_FAILED",
+    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
+
+MIGRATION_STARTED
+-----------------
+
+Emitted when migration starts (both in source and target).
+
+Data: None
+
+Example:
+
+{ "event": "MIGRATION_STARTED",
+    "timestamp": {"seconds": 1274687575, "microseconds": 592483} }
+
 RESET
 -----

diff --git a/monitor.c b/monitor.c
index ad50f12..5158780 100644
--- a/monitor.c
+++ b/monitor.c
@@ -444,6 +444,18 @@  void monitor_protocol_event(MonitorEvent event, QObject *data)
         case QEVENT_WATCHDOG:
             event_name = "WATCHDOG";
             break;
+        case QEVENT_MIGRATION_STARTED:
+            event_name = "MIGRATION_STARTED";
+            break;
+        case QEVENT_MIGRATION_ENDED:
+            event_name = "MIGRATION_ENDED";
+            break;
+        case QEVENT_MIGRATION_FAILED:
+            event_name = "MIGRATION_FAILED";
+            break;
+        case QEVENT_MIGRATION_CANCELED:
+            event_name = "MIGRATION_CANCELED";
+            break;
         default:
             abort();
             break;
diff --git a/monitor.h b/monitor.h
index ea15469..34bcd38 100644
--- a/monitor.h
+++ b/monitor.h
@@ -28,6 +28,10 @@  typedef enum MonitorEvent {
     QEVENT_BLOCK_IO_ERROR,
     QEVENT_RTC_CHANGE,
     QEVENT_WATCHDOG,
+    QEVENT_MIGRATION_STARTED,
+    QEVENT_MIGRATION_ENDED,
+    QEVENT_MIGRATION_FAILED,
+    QEVENT_MIGRATION_CANCELED,
     QEVENT_MAX,
 } MonitorEvent;