
[v11,14/15] rdma: introduce MIG_STATE_NONE and change MIG_STATE_SETUP state transition

Message ID 1372125485-11795-15-git-send-email-mrhines@linux.vnet.ibm.com
State New

Commit Message

mrhines@linux.vnet.ibm.com June 25, 2013, 1:58 a.m. UTC
From: "Michael R. Hines" <mrhines@us.ibm.com>

As described in the previous patch, the MIG_STATE_SETUP state has not, until
now, been a 'formal' state. It has served as a 'zero' state (what we are
calling 'NONE' here), and QEMU has unconditionally transitioned into it when
the QMP migrate command was issued. Instead we introduce MIG_STATE_NONE as the
starting state of the state machine, and transition immediately into
MIG_STATE_SETUP when the QMP migrate command is issued.

In order to do this, we must delay the transition into MIG_STATE_ACTIVE until
later in migration_thread(), so that the time spent in the SETUP state can be
timestamped and properly reported to the user during an RDMA migration.

Furthermore, management software has so far never been aware of the existence
of the SETUP state at all. This must change: timing this state implies that
the state actually exists.

These two patches cannot be separated, because the 'query-migrate' QMP
switch statement needs to know how to handle this new state transition.

Tested-by: Michael R. Hines <mrhines@us.ibm.com>
Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 migration.c |   22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

Comments

Juan Quintela June 25, 2013, 9:49 a.m. UTC | #1
mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> [...]

> @@ -316,6 +321,7 @@ static void migrate_fd_cancel(MigrationState *s)
>  {
>      DPRINTF("cancelling migration\n");
>  
> +    migrate_set_state(s, MIG_STATE_SETUP, MIG_STATE_CANCELLED);
>      migrate_set_state(s, MIG_STATE_ACTIVE, MIG_STATE_CANCELLED);
>  }

This chunk is wrong.

We can call qmp_migrate_cancel() at any point, and it is normally going
to be called from MIG_STATE_ACTIVE.


    migrate_set_state(s, s->state, MIG_STATE_CANCELLED)

should do the trick. Or something like that; what do you think?

Later,  Juan.
Paolo Bonzini June 25, 2013, 10:13 a.m. UTC | #2
Il 25/06/2013 11:49, Juan Quintela ha scritto:
> mrhines@linux.vnet.ibm.com wrote:
>> [...]
> 
>> @@ -316,6 +321,7 @@ static void migrate_fd_cancel(MigrationState *s)
>>  {
>>      DPRINTF("cancelling migration\n");
>>  
>> +    migrate_set_state(s, MIG_STATE_SETUP, MIG_STATE_CANCELLED);
>>      migrate_set_state(s, MIG_STATE_ACTIVE, MIG_STATE_CANCELLED);
>>  }
> 
> This chunk is wrong.
> 
> we can call qmp_migrate_cancel() at any point, and it is normally going
> to be called from MIG_STATE_ACTIVE.
> 
>     migrate_set_state(s, s->state, MIG_STATE_CANCELLED)
> 
> should do the trick.  Or something like that,  what do you think?

I don't like the three-arguments migrate_set_state, but I don't have any
better idea.

With Juan's modification, it is fine (but not reviewed-by me :)).  While
you resend, the first 13 patches of v10 can be merged (pull request).
You can then rebase the last three on top.

Michael, did you look at the "debugging/getting the protocol ready" mode
where all pages are unpinned?

Paolo
mrhines@linux.vnet.ibm.com June 25, 2013, 1:40 p.m. UTC | #3
On 06/25/2013 05:49 AM, Juan Quintela wrote:
> mrhines@linux.vnet.ibm.com wrote:
>> [...]
> This chunk is wrong.
>
> we can call qmp_migrate_cancel() at any point, and it is normally going
> to be called from MIG_STATE_ACTIVE.
>
>      migrate_set_state(s, s->state, MIG_STATE_CANCELLED)
>
> should do the trick. Or something like that; what do you think?
>
> Later,  Juan.

Great idea, thank you.

- Michael
mrhines@linux.vnet.ibm.com June 25, 2013, 1:44 p.m. UTC | #4
On 06/25/2013 06:13 AM, Paolo Bonzini wrote:
> Il 25/06/2013 11:49, Juan Quintela ha scritto:
>> [...]
> I don't like the three-arguments migrate_set_state, but I don't have any
> better idea.
>
> With Juan's modification, it is fine (but not reviewed-by me :)).  While
> you resend, the first 13 patches of v10 can be merged (pull request).
> You can then rebase the last three on top.

So, I should send a v11 before Juan applies and generates the pull request?

Did I understand correctly?

> Michael, did you look at the "debugging/getting the protocol ready" mode
> where all pages are unpinned?
>
> Paolo
>


We did not come up with a design for how such a mode would be
exposed to the user. Implementing this in migration-rdma.c is
just a few lines of code, but without a clear use case in this patch
series, I've chosen not to expose it to the user yet without some kind
of agreement from you guys.

Recommendations?
Paolo Bonzini June 25, 2013, 1:53 p.m. UTC | #5
Il 25/06/2013 15:44, Michael R. Hines ha scritto:
>>>
>> I don't like the three-arguments migrate_set_state, but I don't have any
>> better idea.
>>
>> With Juan's modification, it is fine (but not reviewed-by me :)).  While
>> you resend, the first 13 patches of v10 can be merged (pull request).
>> You can then rebase the last three on top.
> 
> So, I should send a v11 before Juan applies and generates the pull request?
> 
> Did I understand correctly?

Juan found another small nit, so please do send a v12.  He liked (with
the small change in cancellation) your last three patches, so there is
no need to split out patch 13 (as you had it in v10).

>> Michael, did you look at the "debugging/getting the protocol ready" mode
>> where all pages are unpinned?
> 
> We did not come up with a design on how such a mode would be
> exposed to the user. Implementing this in migration-rdma.c is
> just a few lines of code, but without a clear use case in this patch
> series, I've not chosen expose it to the user yet without some kind
> of agreement with you guys.

Do those few lines of code change the protocol?  If yes, I'll go against
all my previous recommendations and suggest a #define.  If no, it is
fine to leave it for later, but I would still suggest posting the patch
on the list just for information.

Paolo
Juan Quintela June 25, 2013, 2:17 p.m. UTC | #6
"Michael R. Hines" <mrhines@linux.vnet.ibm.com> wrote:
> On 06/25/2013 06:13 AM, Paolo Bonzini wrote:
>> Il 25/06/2013 11:49, Juan Quintela ha scritto:
>>> [...]
>
> So, I should send a v11 before Juan applies and generates the pull request?
>
> Did I understand correctly?
>
>> Michael, did you look at the "debugging/getting the protocol ready" mode
>> where all pages are unpinned?
>>
>> Paolo
>>
>
>
> We did not come up with a design on how such a mode would be
> exposed to the user. Implementing this in migration-rdma.c is
> just a few lines of code, but without a clear use case in this patch
> series, I've not chosen expose it to the user yet without some kind
> of agreement with you guys.
>
> Recommendations?

Dunno.  I have tried to use the thing:

I am using -incoming x-rdma:deus:4444

(qemu) char device redirected to /dev/pts/2 (label serial0)
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
RDMA ERROR: could not create rdma event channel
-incoming x-rdma:deus:4444: RDMA ERROR: could not create rdma event channel


I also used "0" instead of deus (the machine name).

The only thing that I did apart from installing librdmacm-devel was:


# cat /etc/udev/rules.d/80-infiniband.rules
KERNEL="rdma_cm", NAME="infiniband/%k", MODE="0666"

As per the README. Is there something else I need to do to use the TCP
implementation?

Later,  Juan.
mrhines@linux.vnet.ibm.com June 25, 2013, 2:54 p.m. UTC | #7
On 06/25/2013 09:53 AM, Paolo Bonzini wrote:
> Il 25/06/2013 15:44, Michael R. Hines ha scritto:
>>> I don't like the three-arguments migrate_set_state, but I don't have any
>>> better idea.
>>>
>>> With Juan's modification, it is fine (but not reviewed-by me :)).  While
>>> you resend, the first 13 patches of v10 can be merged (pull request).
>>> You can then rebase the last three on top.
>> So, I should send a v11 before Juan applies and generates the pull request?
>>
>> Did I understand correctly?
> Juan found another small nit, so please do send a v12.  He liked (with
> the small change in cancellation) your last three patches, so there is
> no need to split out patch 13 (as you had it in v10).

Acknowledged.

>>> Michael, did you look at the "debugging/getting the protocol ready" mode
>>> where all pages are unpinned?
>> We did not come up with a design on how such a mode would be
>> exposed to the user. Implementing this in migration-rdma.c is
>> just a few lines of code, but without a clear use case in this patch
>> series, I've not chosen expose it to the user yet without some kind
>> of agreement with you guys.
> Do those few lines of code change the protocol?  If yes, I'll go against
> all my previous recommendations and suggest a #define.  If no, it is
> fine to leave it for later, but I would still suggest posting the patch
> on the list just for information.
>
> Paolo
>

Ok, you got it - no it does not change the protocol.
I'll use a #define and send it out with a new version for review.

- Michael
Paolo Bonzini June 25, 2013, 2:55 p.m. UTC | #8
Il 25/06/2013 16:54, Michael R. Hines ha scritto:
>>>
>> Do those few lines of code change the protocol?  If yes, I'll go against
>> all my previous recommendations and suggest a #define.  If no, it is
>> fine to leave it for later, but I would still suggest posting the patch
>> on the list just for information.
> 
> Ok, you got it - no it does not change the protocol.
> I'll use a #define and send it out with a new version for review.

Make sure you send an additional patch on top of these 15.

Paolo
mrhines@linux.vnet.ibm.com June 25, 2013, 4:57 p.m. UTC | #9
On 06/25/2013 10:55 AM, Paolo Bonzini wrote:
> Il 25/06/2013 16:54, Michael R. Hines ha scritto:
>>> Do those few lines of code change the protocol?  If yes, I'll go against
>>> all my previous recommendations and suggest a #define.  If no, it is
>>> fine to leave it for later, but I would still suggest posting the patch
>>> on the list just for information.
>> Ok, you got it - no it does not change the protocol.
>> I'll use a #define and send it out with a new version for review.
> Make sure you send an additional patch on top of these 15.
>
> Paolo
>
Understood.

- Michael
mrhines@linux.vnet.ibm.com June 25, 2013, 5:02 p.m. UTC | #10
On 06/25/2013 10:17 AM, Juan Quintela wrote:
> [...]
> Dunno.  I have tried to use the thing:
>
> I am using -incoming x-rdma:deus:4444
>
> (qemu) char device redirected to /dev/pts/2 (label serial0)
> librdmacm: Warning: couldn't read ABI version.
> librdmacm: Warning: assuming: 4
> librdmacm: Fatal: unable to get RDMA device list
> RDMA ERROR: could not create rdma event channel
> -incoming x-rdma:deus:4444: RDMA ERROR: could not create rdma event channel
>

This means that your RDMA environment / IB configuration
(outside of QEMU) is not configured properly. There are several
"sanity checks" you can do to make sure it is working:

$ ibv_devices    <= you should see a list of configured IB devices
$ ibv_devinfo    <= you should see "PORT_ACTIVE" somewhere in the
                    output for each IB device
$ ifconfig ib0   <= if using standard InfiniBand
$ ifconfig eth0  <= if using RDMA over Converged Ethernet

Furthermore, the OpenFabrics software stack ships several
latency/throughput utilities you can run to make sure your two
hosts can communicate with each other:

$ ucmatose [args...]
$ ibv_rc_pingpong [args...]
$ rdma_server / rdma_client [args...]

mrhines@linux.vnet.ibm.com June 25, 2013, 6:48 p.m. UTC | #11
On 06/25/2013 10:17 AM, Juan Quintela wrote:
> [...]
> Dunno.  I have tried to use the thing:
>
> I am using -incoming x-rdma:deus:4444
>
> (qemu) char device redirected to /dev/pts/2 (label serial0)
> librdmacm: Warning: couldn't read ABI version.
> librdmacm: Warning: assuming: 4
> librdmacm: Fatal: unable to get RDMA device list
> RDMA ERROR: could not create rdma event channel
> -incoming x-rdma:deus:4444: RDMA ERROR: could not create rdma event channel
>
>
> I also used "0" instead of deus (the machine name).
>
> The only thing that I did apart from installing librdmacm-devel was:
>
> # cat /etc/udev/rules.d/80-infiniband.rules
> KERNEL="rdma_cm", NAME="infiniband/%k", MODE="0666"
>
> As per the README. Is there something else I need to do to use the TCP
> implementation?
>
> Later,  Juan.

There's a simple bug, found by Visilis, affecting localhost migration,
which I believe also affects using '0' as a hostname in the migration.

Both of these should be fixed in my next version, which I will send out
today. (I never tried migrating with 0.0.0.0 until now.)

- Michael
mrhines@linux.vnet.ibm.com June 25, 2013, 8:56 p.m. UTC | #12
On 06/25/2013 10:55 AM, Paolo Bonzini wrote:
> Il 25/06/2013 16:54, Michael R. Hines ha scritto:
>>> Do those few lines of code change the protocol?  If yes, I'll go against
>>> all my previous recommendations and suggest a #define.  If no, it is
>>> fine to leave it for later, but I would still suggest posting the patch
>>> on the list just for information.
>> Ok, you got it - no it does not change the protocol.
>> I'll use a #define and send it out with a new version for review.
> Make sure you send an additional patch on top of these 15.
>
> Paolo
>
I was wrong - this does require a protocol extension.

This is because the RDMA transfers are asynchronous, and thus
we cannot know in advance that it is safe to unregister the memory
associated with each individual transfer before the transfer actually
completes.

While the destination currently uses the protocol to participate in
*registering* the page, the destination does not participate in the
RDMA transfers themselves; only the source does. Unpinning would
therefore require a new exchange of messages to block and instruct the
destination to unpin the memory.

- Michael
Paolo Bonzini June 25, 2013, 9:06 p.m. UTC | #13
Il 25/06/2013 22:56, Michael R. Hines ha scritto:
>>
> I was wrong - this does require a protocol extension.
> 
> This is because the RDMA transfers are asynchronous, and thus
> we cannot know in advance that it is safe to unregister the memory
> associated with each individual transfer before the transfer actually
> completes.
> 
> While the destination currently uses the protocol to participate in
> *registering* the page, the destination does not participate in the
> RDMA transfers themselves, only the source does, and thus would
> require a new exchange of messages to block and instruct the
> destination to unpin the memory.

Yes, that's what I recalled too (really what mst told me :)).  Does it
need to be blocking though?  As long as the pinning is blocking, and
messages are processed in order, the source can proceed immediately
after sending an unpin message.  This assumes of course that the chunk
is not being transmitted, and I am not sure how easy the source can
determine that.

Paolo
mrhines@linux.vnet.ibm.com June 26, 2013, 12:31 a.m. UTC | #14
On 06/25/2013 05:06 PM, Paolo Bonzini wrote:
> Il 25/06/2013 22:56, Michael R. Hines ha scritto:
>> I was wrong - this does require a protocol extension.
>>
>> This is because the RDMA transfers are asynchronous, and thus
>> we cannot know in advance that it is safe to unregister the memory
>> associated with each individual transfer before the transfer actually
>> completes.
>>
>> While the destination currently uses the protocol to participate in
>> *registering* the page, the destination does not participate in the
>> RDMA transfers themselves, only the source does, and thus would
>> require a new exchange of messages to block and instruct the
>> destination to unpin the memory.
> Yes, that's what I recalled too (really what mst told me :)).  Does it
> need to be blocking though?  As long as the pinning is blocking, and
> messages are processed in order, the source can proceed immediately
> after sending an unpin message.  This assumes of course that the chunk
> is not being transmitted, and I am not sure how easy the source can
> determine that.
>
> Paolo
>

No, they're not processed in order. In fact, not only does the device
write out of order, but the PCI bus also writes out of order.
This was such a problem, in fact, that I fixed several bugs as a result
a few weeks ago (v7 of the patch has an in-depth description).

The destination simply cannot make any assumptions about the ordering
of the writes - that's really the whole point of using RDMA in the
first place: the software gets out of the way of the transfer process
to lower the latency of each transfer.

The only option is to send a blocking message to the other side to
request the unpinning (in addition to unpinning on the source first upon
completion of the original transfer).

As you can expect, this would be very expensive, so we must ensure
that we have *very* good a priori information that this memory will
not need to be re-registered anytime in the near future.

- Michael
Paolo Bonzini June 26, 2013, 6:37 a.m. UTC | #15
Il 26/06/2013 02:31, Michael R. Hines ha scritto:
> On 06/25/2013 05:06 PM, Paolo Bonzini wrote:
>> Il 25/06/2013 22:56, Michael R. Hines ha scritto:
>>> I was wrong - this does require a protocol extension.
>>>
>>> This is because the RDMA transfers are asynchronous, and thus
>>> we cannot know in advance that it is safe to unregister the memory
>>> associated with each individual transfer before the transfer actually
>>> completes.
>>>
>>> While the destination currently uses the protocol to participate in
>>> *registering* the page, the destination does not participate in the
>>> RDMA transfers themselves, only the source does, and thus would
>>> require a new exchange of messages to block and instruct the
>>> destination to unpin the memory.
>> Yes, that's what I recalled too (really what mst told me :)).  Does it
>> need to be blocking though?  As long as the pinning is blocking, and
>> messages are processed in order, the source can proceed immediately
>> after sending an unpin message.  This assumes of course that the chunk
>> is not being transmitted, and I am not sure how easy the source can
>> determine that.
> 
> No, they're not processed in order. In fact, not only does the device
> write out of order, but also the PCI bus writes out of order.
> This was such a problem in fact, that I fixed several bugs as a result
> a few weeks ago (v7 of the patch with an in-depth description).
> 
> The destination simply cannot assume whatsoever what the ordering
> of the writes are - that's really the whole point of using RDMA in the
> first place so that the software can get out of the way of the transfer
> process to lower the latency of each transfer.

The memory is processed out of order, but what about the messages?
Those must be in order.

Note that I wrote above "This assumes of course that the chunk is not
being transmitted".  Can the source know when an asynchronous transfer
finished, and delay the unpinning until that time?

Paolo

> 
> The only option is to send a blocking message to the other side to
> request the unpinning (in addition to unpinning on the source first upon
> completion of the original transfer).
> 
> As you can expect, this would be very expensive and we must ensure
> that we have *very* good a-priori information that this memory will
> not need to be re-registered anytime in the near future.
> 
> - Michael
> 
> 
>
mrhines@linux.vnet.ibm.com June 26, 2013, 12:37 p.m. UTC | #16
On 06/26/2013 02:37 AM, Paolo Bonzini wrote:
> Il 26/06/2013 02:31, Michael R. Hines ha scritto:
>> On 06/25/2013 05:06 PM, Paolo Bonzini wrote:
>>> Il 25/06/2013 22:56, Michael R. Hines ha scritto:
>>>> I was wrong - this does require a protocol extension.
>>>>
>>>> This is because the RDMA transfers are asynchronous, and thus
>>>> we cannot know in advance that it is safe to unregister the memory
>>>> associated with each individual transfer before the transfer actually
>>>> completes.
>>>>
>>>> While the destination currently uses the protocol to participate in
>>>> *registering* the page, the destination does not participate in the
>>>> RDMA transfers themselves, only the source does, and thus would
>>>> require a new exchange of messages to block and instruct the
>>>> destination to unpin the memory.
>>> Yes, that's what I recalled too (really what mst told me :)).  Does it
>>> need to be blocking though?  As long as the pinning is blocking, and
>>> messages are processed in order, the source can proceed immediately
>>> after sending an unpin message.  This assumes of course that the chunk
>>> is not being transmitted, and I am not sure how easy the source can
>>> determine that.
>> No, they're not processed in order. In fact, not only does the device
>> write out of order, but also the PCI bus writes out of order.
>> This was such a problem in fact, that I fixed several bugs as a result
>> a few weeks ago (v7 of the patch with an in-depth description).
>>
>> The destination simply cannot assume whatsoever what the ordering
>> of the writes are - that's really the whole point of using RDMA in the
>> first place so that the software can get out of the way of the transfer
>> process to lower the latency of each transfer.
> The memory is processed out of order, but what about the messages?
> Those must be in order.
>
> Note that I wrote above "This assumes of course that the chunk is not
> being transmitted".  Can the source know when an asynchronous transfer
> finished, and delay the unpinning until that time?
>
> Paolo

Yes, the source does know. There's no problem unpinning on the source.

But both sides must do the unpinning, not just the source.

Did I misunderstand you? Are you suggesting *only* unpinning on the source?

- Michael
Paolo Bonzini June 26, 2013, 12:39 p.m. UTC | #17
Il 26/06/2013 14:37, Michael R. Hines ha scritto:
> On 06/26/2013 02:37 AM, Paolo Bonzini wrote:
>> Il 26/06/2013 02:31, Michael R. Hines ha scritto:
>>> On 06/25/2013 05:06 PM, Paolo Bonzini wrote:
>>>> Il 25/06/2013 22:56, Michael R. Hines ha scritto:
>>>>> I was wrong - this does require a protocol extension.
>>>>>
>>>>> This is because the RDMA transfers are asynchronous, and thus
>>>>> we cannot know in advance that it is safe to unregister the memory
>>>>> associated with each individual transfer before the transfer actually
>>>>> completes.
>>>>>
>>>>> While the destination currently uses the protocol to participate in
>>>>> *registering* the page, the destination does not participate in the
>>>>> RDMA transfers themselves, only the source does, and thus would
>>>>> require a new exchange of messages to block and instruct the
>>>>> destination to unpin the memory.
>>>> Yes, that's what I recalled too (really what mst told me :)).  Does it
>>>> need to be blocking though?  As long as the pinning is blocking, and
>>>> messages are processed in order, the source can proceed immediately
>>>> after sending an unpin message.  This assumes of course that the chunk
>>>> is not being transmitted, and I am not sure how easy the source can
>>>> determine that.
>>> No, they're not processed in order. In fact, not only does the device
>>> write out of order, but also the PCI bus writes out of order.
>>> This was such a problem in fact, that I fixed several bugs as a result
>>> a few weeks ago (v7 of the patch with an in-depth description).
>>>
>>> The destination simply cannot assume whatsoever what the ordering
>>> of the writes are - that's really the whole point of using RDMA in the
>>> first place so that the software can get out of the way of the transfer
>>> process to lower the latency of each transfer.
>> The memory is processed out of order, but what about the messages?
>> Those must be in order.
>>
>> Note that I wrote above "This assumes of course that the chunk is not
>> being transmitted".  Can the source know when an asynchronous transfer
>> finished, and delay the unpinning until that time?
> 
> Yes, the source does know. There's no problem unpinning on the source.
> 
> But both sides must do the unpinning, not just the source.
> 
> Did I misunderstand you? Are you suggesting *only* unpinning on the source?

I'm suggesting (if possible) that the source only tells the destination
to unpin once it knows it is safe for the destination to do it.  As long
as unpin and pin messages are not reordered, it should be possible to do
it with an asynchronous message from the source to the destination.

Paolo
mrhines@linux.vnet.ibm.com June 26, 2013, 2:09 p.m. UTC | #18
On 06/26/2013 08:39 AM, Paolo Bonzini wrote:
> Il 26/06/2013 14:37, Michael R. Hines ha scritto:
>> On 06/26/2013 02:37 AM, Paolo Bonzini wrote:
>>> Il 26/06/2013 02:31, Michael R. Hines ha scritto:
>>>> On 06/25/2013 05:06 PM, Paolo Bonzini wrote:
>>>>> Il 25/06/2013 22:56, Michael R. Hines ha scritto:
>>>>>> I was wrong - this does require a protocol extension.
>>>>>>
>>>>>> This is because the RDMA transfers are asynchronous, and thus
>>>>>> we cannot know in advance that it is safe to unregister the memory
>>>>>> associated with each individual transfer before the transfer actually
>>>>>> completes.
>>>>>>
>>>>>> While the destination currently uses the protocol to participate in
>>>>>> *registering* the page, the destination does not participate in the
>>>>>> RDMA transfers themselves, only the source does, and thus would
>>>>>> require a new exchange of messages to block and instruct the
>>>>>> destination to unpin the memory.
>>>>> Yes, that's what I recalled too (really what mst told me :)).  Does it
>>>>> need to be blocking though?  As long as the pinning is blocking, and
>>>>> messages are processed in order, the source can proceed immediately
>>>>> after sending an unpin message.  This assumes of course that the chunk
>>>>> is not being transmitted, and I am not sure how easy the source can
>>>>> determine that.
>>>> No, they're not processed in order. In fact, not only does the device
>>>> write out of order, but also the PCI bus writes out of order.
>>>> This was such a problem in fact, that I fixed several bugs as a result
>>>> a few weeks ago (v7 of the patch with an in-depth description).
>>>>
>>>> The destination simply cannot assume whatsoever what the ordering
>>>> of the writes are - that's really the whole point of using RDMA in the
>>>> first place so that the software can get out of the way of the transfer
>>>> process to lower the latency of each transfer.
>>> The memory is processed out of order, but what about the messages?
>>> Those must be in order.
>>>
>>> Note that I wrote above "This assumes of course that the chunk is not
>>> being transmitted".  Can the source know when an asynchronous transfer
>>> finished, and delay the unpinning until that time?
>> Yes, the source does know. There's no problem unpinning on the source.
>>
>> But both sides must do the unpinning, not just the source.
>>
>> Did I misunderstand you? Are you suggesting *only* unpinning on the source?
> I'm suggesting (if possible) that the source only tells the destination
> to unpin once it knows it is safe for the destination to do it.  As long
> as unpin and pin messages are not reordered, it should be possible to do
> it with an asynchronous message from the source to the destination.
>
> Paolo
>

Oh, certainly. I agree. That's not the trivial patch we were originally
shooting for, though.

(I'll list the steps below on the QEMU wiki, for the record).

*This requires some steps:*
1. First, maintain a new data structure: something like
     "These memory ranges are 'being unpinned'" - block all potential writes
     to these addresses until the unpinning completes.
2. Once the source unpin completes, send the asynchronous control channel
     message to the other side for unpinning.
3. Mark the data structure and return, allowing the migration to continue
     with the next RDMA write.
4. Upon completion of the unpinning on the destination,
     respond to the source that it was finished.
5. Source then clears the data structure for the successfully unpinned
     memory ranges.
6. At this point, one or more writes may (or may not) be blocking on the
     unpinned memory areas and will poll the data structure and find that
     the unpinning has completed.
7. Then issue the new writes and proceed as normal.
8. Repeat from step 1.
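For the record, a minimal C sketch of the per-chunk tracking the steps above describe. All names here (begin_unpin(), unpin_acked(), may_write(), the chunk states) are hypothetical illustrations, not the actual rdma.c structures:

```c
#include <stdbool.h>

#define MAX_CHUNKS 64

/* Hypothetical per-chunk unpin state. */
enum unpin_state {
    CHUNK_PINNED,          /* registered, safe to write */
    CHUNK_UNPIN_PENDING,   /* unpin message sent, awaiting destination reply */
    CHUNK_UNPINNED,        /* destination confirmed the unpin */
};

static enum unpin_state chunks[MAX_CHUNKS]; /* zero-initialized: all PINNED */

/* Steps 1-2: mark the range and send the asynchronous unpin message;
 * writers must now block/poll before touching these addresses. */
static void begin_unpin(int chunk)
{
    chunks[chunk] = CHUNK_UNPIN_PENDING;
    /* ... send asynchronous control channel message to the destination ... */
}

/* Steps 3-4: the destination's reply clears the pending state. */
static void unpin_acked(int chunk)
{
    chunks[chunk] = CHUNK_UNPINNED;
}

/* Steps 5-6: a writer polls this before issuing an RDMA write.
 * A pending chunk must wait; an unpinned chunk would have to be
 * re-registered (re-pinned) before writing. */
static bool may_write(int chunk)
{
    return chunks[chunk] != CHUNK_UNPIN_PENDING;
}
```

This only illustrates the bookkeeping; the real work is the blocking of in-flight writers and the control-channel round trip.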
Paolo Bonzini June 26, 2013, 2:57 p.m. UTC | #19
Il 26/06/2013 16:09, Michael R. Hines ha scritto:
> *This requires some steps:*
> 1. First, maintain a new data structure: something like
>     "These memory ranges are 'being unpinned'" - block all potential writes
>     to these addresses until the unpinning completes.
> 2. Once the source unpin completes, send the asynchronous control
> channel message
>     to the other side for unpinning.
> 2. Mark the data structure and return and allow the migration to continue
>     with the next RDMA write.
> 3. Upon completion of the unpinning on the destination,
>     respond to the source that it was finished.
> 4. Source then clears the data structure for the successfully unpinned
> memory ranges.
> 5. At this point, one or more writes may (or may not) be blocking on the
>     unpinned memory areas and will poll the data structure and find that
>     the unpinning has completed.
> 6. Then issue the new writes and proceed as normal.
> 7. Repeat step 1.

After more discussion on IRC I think I understand this.  It's not
trivial, but not super-complex either...  it would really be nice to add
it to the protocol already in 1.6.

Paolo
mrhines@linux.vnet.ibm.com June 26, 2013, 7:25 p.m. UTC | #20
On 06/26/2013 10:57 AM, Paolo Bonzini wrote:
> Il 26/06/2013 16:09, Michael R. Hines ha scritto:
>> *This requires some steps:*
>> 1. First, maintain a new data structure: something like
>>      "These memory ranges are 'being unpinned'" - block all potential writes
>>      to these addresses until the unpinning completes.
>> 2. Once the source unpin completes, send the asynchronous control
>> channel message
>>      to the other side for unpinning.
>> 2. Mark the data structure and return and allow the migration to continue
>>      with the next RDMA write.
>> 3. Upon completion of the unpinning on the destination,
>>      respond to the source that it was finished.
>> 4. Source then clears the data structure for the successfully unpinned
>> memory ranges.
>> 5. At this point, one or more writes may (or may not) be blocking on the
>>      unpinned memory areas and will poll the data structure and find that
>>      the unpinning has completed.
>> 6. Then issue the new writes and proceed as normal.
>> 7. Repeat step 1.
> After more discussion on IRC I think I understand this.  It's not
> trivial, but not super-complex either...  it would really be nice to add
> it to the protocol already in 1.6.
>
> Paolo
>

I will submit this as a brand new patch series. I'm working on it right now.

- Michael
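The transition order the patch below introduces (NONE -> SETUP -> ACTIVE, with cancellation possible from either SETUP or ACTIVE) can be sketched in simplified, single-threaded form. This is an illustration only: the real migrate_set_state() operates on s->state with a compare-and-swap, and later states are omitted here.

```c
/* Illustrative copy of the enum from the patch (remaining states omitted). */
enum {
    MIG_STATE_ERROR = -1,
    MIG_STATE_NONE,
    MIG_STATE_SETUP,
    MIG_STATE_CANCELLED,
    MIG_STATE_ACTIVE,
};

static int mig_state = MIG_STATE_NONE;

/* Simplified stand-in: transition only if the current state matches. */
static void migrate_set_state(int old_state, int new_state)
{
    if (mig_state == old_state) {
        mig_state = new_state;
    }
}

/* migrate_fd_connect(): the QMP migrate command now enters SETUP. */
static void migrate_fd_connect(void)
{
    mig_state = MIG_STATE_SETUP;
}

/* migration_thread(): only after setup work does SETUP become ACTIVE,
 * so the time spent in SETUP can be accounted to the user. */
static void migration_thread_start(void)
{
    migrate_set_state(MIG_STATE_SETUP, MIG_STATE_ACTIVE);
}

/* migrate_fd_cancel(): must handle cancellation from either state. */
static void migrate_fd_cancel(void)
{
    migrate_set_state(MIG_STATE_SETUP, MIG_STATE_CANCELLED);
    migrate_set_state(MIG_STATE_ACTIVE, MIG_STATE_CANCELLED);
}
```

Cancelling during SETUP hits the first migrate_set_state() call; cancelling during ACTIVE makes the first a no-op and the second take effect.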

Patch

diff --git a/migration.c b/migration.c
index dcb99c9..758d99d 100644
--- a/migration.c
+++ b/migration.c
@@ -36,7 +36,8 @@ 
 #endif
 
 enum {
-    MIG_STATE_ERROR,
+    MIG_STATE_ERROR = -1,
+    MIG_STATE_NONE,
     MIG_STATE_SETUP,
     MIG_STATE_CANCELLED,
     MIG_STATE_ACTIVE,
@@ -63,7 +64,7 @@  static NotifierList migration_state_notifiers =
 MigrationState *migrate_get_current(void)
 {
     static MigrationState current_migration = {
-        .state = MIG_STATE_SETUP,
+        .state = MIG_STATE_NONE,
         .bandwidth_limit = MAX_THROTTLE,
         .xbzrle_cache_size = DEFAULT_MIGRATE_CACHE_SIZE,
         .mbps = -1,
@@ -184,9 +185,13 @@  MigrationInfo *qmp_query_migrate(Error **errp)
     MigrationState *s = migrate_get_current();
 
     switch (s->state) {
-    case MIG_STATE_SETUP:
+    case MIG_STATE_NONE:
         /* no migration has happened ever */
         break;
+    case MIG_STATE_SETUP:
+        info->has_status = true;
+        info->status = g_strdup("setup");
+        break;
     case MIG_STATE_ACTIVE:
         info->has_status = true;
         info->status = g_strdup("active");
@@ -257,7 +262,7 @@  void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
     MigrationState *s = migrate_get_current();
     MigrationCapabilityStatusList *cap;
 
-    if (s->state == MIG_STATE_ACTIVE) {
+    if (s->state == MIG_STATE_ACTIVE || s->state == MIG_STATE_SETUP) {
         error_set(errp, QERR_MIGRATION_ACTIVE);
         return;
     }
@@ -316,6 +321,7 @@  static void migrate_fd_cancel(MigrationState *s)
 {
     DPRINTF("cancelling migration\n");
 
+    migrate_set_state(s, MIG_STATE_SETUP, MIG_STATE_CANCELLED);
     migrate_set_state(s, MIG_STATE_ACTIVE, MIG_STATE_CANCELLED);
 }
 
@@ -393,7 +399,7 @@  void qmp_migrate(const char *uri, bool has_blk, bool blk,
     params.blk = blk;
     params.shared = inc;
 
-    if (s->state == MIG_STATE_ACTIVE) {
+    if (s->state == MIG_STATE_ACTIVE || s->state == MIG_STATE_SETUP) {
         error_set(errp, QERR_MIGRATION_ACTIVE);
         return;
     }
@@ -525,6 +531,8 @@  static void *migration_thread(void *opaque)
     DPRINTF("beginning savevm\n");
     qemu_savevm_state_begin(s->file, &s->params);
 
+    migrate_set_state(s, MIG_STATE_SETUP, MIG_STATE_ACTIVE);
+
     while (s->state == MIG_STATE_ACTIVE) {
         int64_t current_time;
         uint64_t pending_size;
@@ -604,8 +612,8 @@  static void *migration_thread(void *opaque)
 
 void migrate_fd_connect(MigrationState *s)
 {
-    s->state = MIG_STATE_ACTIVE;
-    trace_migrate_set_state(MIG_STATE_ACTIVE);
+    s->state = MIG_STATE_SETUP;
+    trace_migrate_set_state(MIG_STATE_SETUP);
 
     /* This is a best 1st approximation. ns to ms */
     s->expected_downtime = max_downtime/1000000;