Message ID: 1372125485-11795-15-git-send-email-mrhines@linux.vnet.ibm.com
State: New
mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> As described in the previous patch, until now, the MIG_STATE_SETUP
> state was not really a 'formal' state. It has been used as a 'zero' state
> (what we're calling 'NONE' here), and QEMU has been unconditionally
> transitioning into this state when the QMP migration command was called.
> Instead we want to introduce MIG_STATE_NONE, which is our starting state
> in the state machine, and then immediately transition into the
> MIG_STATE_SETUP state when the QMP migrate command is issued.
>
> In order to do this, we must delay the transition into MIG_STATE_ACTIVE
> until later in the migration_thread(). This is done to be able to
> timestamp the amount of time spent in the SETUP state, for proper
> accounting to the user during an RDMA migration.
>
> Furthermore, the management software, until now, has never been aware of
> the existence of the SETUP state whatsoever. This must change, because
> timing this state implies that the state actually exists.
>
> These two patches cannot be separated, because the 'query_migrate' QMP
> switch statement needs to know how to handle this new state transition.
>
> Tested-by: Michael R. Hines <mrhines@us.ibm.com>
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>

> @@ -316,6 +321,7 @@ static void migrate_fd_cancel(MigrationState *s)
>  {
>      DPRINTF("cancelling migration\n");
>
> +    migrate_set_state(s, MIG_STATE_SETUP, MIG_STATE_CANCELLED);
>      migrate_set_state(s, MIG_STATE_ACTIVE, MIG_STATE_CANCELLED);
>  }

This chunk is wrong.

We can call qmp_migrate_cancel() at any point, and it is going to be
called normally from MIG_STATE_ACTIVE.

    migrate_set_state(s, s->state, MIG_STATE_CANCELLED)

should do the trick. Or something like that, what do you think?

Later, Juan.
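Juan's suggestion can be sketched as a small, self-contained model of the migration state machine. This is not the actual QEMU code (the real helper performs the transition with an atomic compare-and-swap); it only illustrates why passing the current state as the "from" state makes cancellation work from both SETUP and ACTIVE:

```c
#include <assert.h>

enum MigState {
    MIG_STATE_NONE,
    MIG_STATE_SETUP,
    MIG_STATE_ACTIVE,
    MIG_STATE_CANCELLED,
    MIG_STATE_COMPLETED
};

typedef struct MigrationState {
    int state;
} MigrationState;

/* Transition only if the current state matches old_state
   (the real helper does this with an atomic cmpxchg). */
static void migrate_set_state(MigrationState *s, int old_state, int new_state)
{
    if (s->state == old_state) {
        s->state = new_state;
    }
}

/* Juan's version: cancel from whatever state we are currently in,
   instead of listing each possible "from" state explicitly. */
static void migrate_fd_cancel(MigrationState *s)
{
    migrate_set_state(s, s->state, MIG_STATE_CANCELLED);
}
```

With this shape, a cancel issued while still in SETUP and a cancel issued from ACTIVE both land in CANCELLED without needing one call per source state.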
On 25/06/2013 11:49, Juan Quintela wrote:
> mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>> [...]
>> @@ -316,6 +321,7 @@ static void migrate_fd_cancel(MigrationState *s)
>>  {
>>      DPRINTF("cancelling migration\n");
>>
>> +    migrate_set_state(s, MIG_STATE_SETUP, MIG_STATE_CANCELLED);
>>      migrate_set_state(s, MIG_STATE_ACTIVE, MIG_STATE_CANCELLED);
>>  }
>
> This chunk is wrong.
>
> We can call qmp_migrate_cancel() at any point, and it is going to be
> called normally from MIG_STATE_ACTIVE.
>
>     migrate_set_state(s, s->state, MIG_STATE_CANCELLED)
>
> should do the trick. Or something like that, what do you think?

I don't like the three-argument migrate_set_state, but I don't have any
better idea.
With Juan's modification, it is fine (but not reviewed-by me :)). While
you resend, the first 13 patches of v10 can be merged (pull request).
You can then rebase the last three on top.

Michael, did you look at the "debugging/getting the protocol ready" mode
where all pages are unpinned?

Paolo
On 06/25/2013 05:49 AM, Juan Quintela wrote:
> mrhines@linux.vnet.ibm.com wrote:
>> [...]
> This chunk is wrong.
>
> We can call qmp_migrate_cancel() at any point, and it is going to be
> called normally from MIG_STATE_ACTIVE.
>
>     migrate_set_state(s, s->state, MIG_STATE_CANCELLED)
>
> should do the trick. Or something like that, what do you think?
>
> Later, Juan.

Great idea, thank you.

- Michael
On 06/25/2013 06:13 AM, Paolo Bonzini wrote:
> On 25/06/2013 11:49, Juan Quintela wrote:
>> [...]
>>     migrate_set_state(s, s->state, MIG_STATE_CANCELLED)
>>
>> should do the trick. Or something like that, what do you think?
>
> I don't like the three-argument migrate_set_state, but I don't have any
> better idea.
>
> With Juan's modification, it is fine (but not reviewed-by me :)). While
> you resend, the first 13 patches of v10 can be merged (pull request).
> You can then rebase the last three on top.

So, I should send a v11 before Juan applies and generates the pull request?

Did I understand correctly?

> Michael, did you look at the "debugging/getting the protocol ready" mode
> where all pages are unpinned?
>
> Paolo

We did not come up with a design for how such a mode would be exposed to
the user. Implementing it in migration-rdma.c is just a few lines of
code, but without a clear use case in this patch series, I've chosen not
to expose it to the user yet without some kind of agreement with you guys.

Recommendations?
On 25/06/2013 15:44, Michael R. Hines wrote:
>> I don't like the three-argument migrate_set_state, but I don't have any
>> better idea.
>>
>> With Juan's modification, it is fine (but not reviewed-by me :)). While
>> you resend, the first 13 patches of v10 can be merged (pull request).
>> You can then rebase the last three on top.
>
> So, I should send a v11 before Juan applies and generates the pull request?
>
> Did I understand correctly?

Juan found another small nit, so please do send a v12. He liked (with the
small change in cancellation) your last three patches, so there is no
need to split out patch 13 (as you had it in v10).

>> Michael, did you look at the "debugging/getting the protocol ready" mode
>> where all pages are unpinned?
>
> We did not come up with a design for how such a mode would be exposed to
> the user. Implementing it in migration-rdma.c is just a few lines of
> code, but without a clear use case in this patch series, I've chosen not
> to expose it to the user yet without some kind of agreement with you guys.

Do those few lines of code change the protocol? If yes, I'll go against
all my previous recommendations and suggest a #define. If no, it is fine
to leave it for later, but I would still suggest posting the patch on the
list just for information.

Paolo
"Michael R. Hines" <mrhines@linux.vnet.ibm.com> wrote:
> On 06/25/2013 06:13 AM, Paolo Bonzini wrote:
>> [...]
>> Michael, did you look at the "debugging/getting the protocol ready" mode
>> where all pages are unpinned?
>
> We did not come up with a design for how such a mode would be exposed to
> the user. Implementing it in migration-rdma.c is just a few lines of
> code, but without a clear use case in this patch series, I've chosen not
> to expose it to the user yet without some kind of agreement with you guys.
>
> Recommendations?

Dunno. I have tried to use the thing:

I am using -incoming x-rdma:deus:4444

(qemu) char device redirected to /dev/pts/2 (label serial0)
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
RDMA ERROR: could not create rdma event channel
-incoming x-rdma:deus:4444: RDMA ERROR: could not create rdma event channel

I also used "0" instead of deus (the machine name).

The only thing that I did apart from installing librdmacm-devel was:

# cat /etc/udev/rules.d/80-infinibad.rules
KERNEL="rdma_cm", NAME="infiniband/%k", MODE="0666"

As per README. Something else I need to do to use the tcp
implementations?

Later, Juan.
On 06/25/2013 09:53 AM, Paolo Bonzini wrote:
> On 25/06/2013 15:44, Michael R. Hines wrote:
>>> [...]
>> So, I should send a v11 before Juan applies and generates the pull request?
>>
>> Did I understand correctly?
>
> Juan found another small nit, so please do send a v12. He liked (with
> the small change in cancellation) your last three patches, so there is
> no need to split out patch 13 (as you had it in v10).

Acknowledged.

>>> Michael, did you look at the "debugging/getting the protocol ready" mode
>>> where all pages are unpinned?
>>
>> We did not come up with a design for how such a mode would be exposed
>> to the user. [...]
>
> Do those few lines of code change the protocol? If yes, I'll go against
> all my previous recommendations and suggest a #define. If no, it is
> fine to leave it for later, but I would still suggest posting the patch
> on the list just for information.
>
> Paolo

Ok, you got it - no, it does not change the protocol.
I'll use a #define and send it out with a new version for review.

- Michael
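The compile-time switch agreed on here could look roughly like the sketch below. The macro name and helper are purely illustrative (they are not from the actual patch): a #define, normally left commented out, that flips the RDMA code into the debugging mode where chunks are unregistered after each transfer instead of staying pinned:

```c
#include <stdbool.h>

/*
 * Hypothetical debugging switch (name is an assumption, not the real
 * patch): when defined, every chunk is unregistered (unpinned) after
 * its transfer completes, instead of staying pinned for the lifetime
 * of the migration. Useful for exercising the registration protocol.
 *
 * #define RDMA_DEBUG_UNREGISTER_ALL
 */

static bool rdma_unregister_after_transfer(void)
{
#ifdef RDMA_DEBUG_UNREGISTER_ALL
    return true;    /* debugging mode: unpin everything */
#else
    return false;   /* normal mode: keep chunks pinned */
#endif
}
```

Since the mode does not change the wire protocol, the #define can live entirely inside migration-rdma.c and default to off.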
On 25/06/2013 16:54, Michael R. Hines wrote:
>> Do those few lines of code change the protocol? If yes, I'll go against
>> all my previous recommendations and suggest a #define. If no, it is
>> fine to leave it for later, but I would still suggest posting the patch
>> on the list just for information.
>
> Ok, you got it - no, it does not change the protocol.
> I'll use a #define and send it out with a new version for review.

Make sure you send an additional patch on top of these 15.

Paolo
On 06/25/2013 10:55 AM, Paolo Bonzini wrote:
> [...]
> Make sure you send an additional patch on top of these 15.
>
> Paolo

Understood.

- Michael
On 06/25/2013 10:17 AM, Juan Quintela wrote:
> "Michael R. Hines" <mrhines@linux.vnet.ibm.com> wrote:
>> [...]
>> Recommendations?
>
> Dunno. I have tried to use the thing:
>
> I am using -incoming x-rdma:deus:4444
>
> (qemu) char device redirected to /dev/pts/2 (label serial0)
> librdmacm: Warning: couldn't read ABI version.
> librdmacm: Warning: assuming: 4
> librdmacm: Fatal: unable to get RDMA device list
> RDMA ERROR: could not create rdma event channel
> -incoming x-rdma:deus:4444: RDMA ERROR: could not create rdma event channel

This means that your RDMA environment / IB configuration (outside of
QEMU) is not configured properly.

There are several sanity checks you can do to make sure it is working:

$ ibv_devices     <= You should see a list of configured IB devices
$ ibv_devinfo     <= You should see "PORT_ACTIVE" somewhere in the output
                     for each IB device
$ ifconfig ib0    <= if using standard InfiniBand
$ ifconfig eth0   <= if using RDMA over Converged Ethernet

Furthermore, there are several latency/throughput utilities that come
with the OpenFabrics software stack which you can run to make sure your
two hosts are communicating with each other:

$ ucmatose [args...]
$ ibv_rc_pingpong [args...]
$ rdma_server / rdma_client [args...]

> I also used "0" instead of deus (the machine name).
>
> The only thing that I did apart from installing librdmacm-devel was:
>
> # cat /etc/udev/rules.d/80-infinibad.rules
> KERNEL="rdma_cm", NAME="infiniband/%k", MODE="0666"
>
> As per README. Something else I need to do to use the tcp
> implementations?
>
> Later, Juan.
On 06/25/2013 10:17 AM, Juan Quintela wrote:
> "Michael R. Hines" <mrhines@linux.vnet.ibm.com> wrote:
>> [...]
> Dunno. I have tried to use the thing:
>
> I am using -incoming x-rdma:deus:4444
>
> (qemu) char device redirected to /dev/pts/2 (label serial0)
> librdmacm: Warning: couldn't read ABI version.
> librdmacm: Warning: assuming: 4
> librdmacm: Fatal: unable to get RDMA device list
> RDMA ERROR: could not create rdma event channel
> -incoming x-rdma:deus:4444: RDMA ERROR: could not create rdma event channel
>
> I also used "0" instead of deus (the machine name).
> [...]
> Later, Juan.

There's a simple bug found by Visilis affecting localhost migration,
which I believe also affects using '0' as a hostname in the migration.

Both of these should be fixed in my next version. Will send out today.
(I never tried migrating with 0.0.0.0 until now.)

- Michael
On 06/25/2013 10:55 AM, Paolo Bonzini wrote:
> On 25/06/2013 16:54, Michael R. Hines wrote:
>>> Do those few lines of code change the protocol? If yes, I'll go against
>>> all my previous recommendations and suggest a #define. [...]
>> Ok, you got it - no, it does not change the protocol.
>> I'll use a #define and send it out with a new version for review.
> Make sure you send an additional patch on top of these 15.
>
> Paolo

I was wrong - this does require a protocol extension.

This is because the RDMA transfers are asynchronous, and thus we cannot
know in advance that it is safe to unregister the memory associated with
each individual transfer before the transfer actually completes.

While the destination currently uses the protocol to participate in
*registering* the page, the destination does not participate in the RDMA
transfers themselves, only the source does, and thus a new exchange of
messages would be required to block and instruct the destination to unpin
the memory.

- Michael
On 25/06/2013 22:56, Michael R. Hines wrote:
> I was wrong - this does require a protocol extension.
> [...]
> While the destination currently uses the protocol to participate in
> *registering* the page, the destination does not participate in the
> RDMA transfers themselves, only the source does, and thus a new
> exchange of messages would be required to block and instruct the
> destination to unpin the memory.

Yes, that's what I recalled too (really what mst told me :)). Does it
need to be blocking, though? As long as the pinning is blocking, and
messages are processed in order, the source can proceed immediately after
sending an unpin message. This assumes of course that the chunk is not
being transmitted, and I am not sure how easily the source can determine
that.

Paolo
On 06/25/2013 05:06 PM, Paolo Bonzini wrote:
> On 25/06/2013 22:56, Michael R. Hines wrote:
>> [...]
> Yes, that's what I recalled too (really what mst told me :)). Does it
> need to be blocking, though? As long as the pinning is blocking, and
> messages are processed in order, the source can proceed immediately
> after sending an unpin message. This assumes of course that the chunk
> is not being transmitted, and I am not sure how easily the source can
> determine that.
>
> Paolo

No, they're not processed in order. In fact, not only does the device
write out of order, but the PCI bus also writes out of order. This was
such a problem, in fact, that I fixed several bugs as a result a few
weeks ago (v7 of the patch, with an in-depth description).

The destination simply cannot make any assumptions about the ordering of
the writes - that's really the whole point of using RDMA in the first
place, so that the software can get out of the way of the transfer
process to lower the latency of each transfer.

The only option is to send a blocking message to the other side to
request the unpinning (in addition to unpinning on the source first, upon
completion of the original transfer).

As you can expect, this would be very expensive, and we must ensure that
we have *very* good a-priori information that this memory will not need
to be re-registered anytime in the near future.

- Michael
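The constraint being discussed can be modeled in a few lines: the source may only unpin a chunk once it has observed a completion for that chunk's transfer, and the destination side additionally needs an explicit protocol message. The sketch below is a simulation with hypothetical names, not the verbs API or the actual QEMU code; it only captures the ordering rule:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct Chunk {
    bool transfer_complete;  /* set once a work completion is observed */
    bool pinned_src;         /* registered with the local HCA */
    bool pinned_dst;         /* registered on the destination side */
} Chunk;

/* Simulate polling the completion queue for this chunk's transfer. */
static void poll_completion(Chunk *c)
{
    c->transfer_complete = true;
}

/*
 * Source-side unpin: only legal once the asynchronous transfer has
 * completed, because neither the device nor the PCI bus guarantees
 * write ordering. Returns false while the chunk is still in flight.
 * Unpinning the destination copy is what requires the extra
 * (blocking) protocol message discussed in the thread.
 */
static bool try_unpin(Chunk *c)
{
    if (!c->transfer_complete) {
        return false;        /* still in flight: must stay pinned */
    }
    c->pinned_src = false;   /* safe to unregister locally */
    c->pinned_dst = false;   /* models the unpin message round-trip */
    return true;
}
```

The expensive part in practice is not this bookkeeping but the round-trip itself, which is why the thread stresses needing good a-priori information that the chunk will not be re-registered soon.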
On 26/06/2013 02:31, Michael R. Hines wrote:
> [...]
> The destination simply cannot make any assumptions about the ordering
> of the writes - that's really the whole point of using RDMA in the
> first place, so that the software can get out of the way of the
> transfer process to lower the latency of each transfer.

The memory is processed out of order, but what about the messages? Those
must be in order.

Note that I wrote above "This assumes of course that the chunk is not
being transmitted". Can the source know when an asynchronous transfer
finished, and delay the unpinning until that time?

Paolo
On 06/26/2013 02:37 AM, Paolo Bonzini wrote:
> On 26/06/2013 02:31, Michael R. Hines wrote:
>> [...]
> The memory is processed out of order, but what about the messages?
> Those must be in order.
>
> Note that I wrote above "This assumes of course that the chunk is not
> being transmitted". Can the source know when an asynchronous transfer
> finished, and delay the unpinning until that time?
>
> Paolo

Yes, the source does know. There's no problem unpinning on the source.
But both sides must do the unpinning, not just the source.

Did I misunderstand you? Are you suggesting *only* unpinning on the
source?

- Michael
On 26/06/2013 14:37, Michael R. Hines wrote:
> On 06/26/2013 02:37 AM, Paolo Bonzini wrote:
>> [...]
>> Note that I wrote above "This assumes of course that the chunk is not
>> being transmitted". Can the source know when an asynchronous transfer
>> finished, and delay the unpinning until that time?
> Yes, the source does know. There's no problem unpinning on the source.
>
> But both sides must do the unpinning, not just the source.
>
> Did I misunderstand you? Are you suggesting *only* unpinning on the source?

I'm suggesting (if possible) that the source only tells the destination
to unpin once it knows it is safe for the destination to do it. As long
as unpin and pin messages are not reordered, it should be possible to do
it with an asynchronous message from the source to the destination.

Paolo
On 06/26/2013 08:39 AM, Paolo Bonzini wrote:
> On 26/06/2013 14:37, Michael R. Hines wrote:
>> [...]
>> Did I misunderstand you? Are you suggesting *only* unpinning on the source?
> I'm suggesting (if possible) that the source only tells the destination
> to unpin once it knows it is safe for the destination to do it. As long
> as unpin and pin messages are not reordered, it should be possible to do
> it with an asynchronous message from the source to the destination.
>
> Paolo

Oh, certainly. I agree. That's not a trivial patch, though (as we were
originally shooting for). (I'll list the steps below on the QEMU wiki,
for the record).

*This requires some steps:*

1. First, maintain a new data structure: something like "These memory
   ranges are 'being unpinned'" - block all potential writes to these
   addresses until the unpinning completes.
2. Once the source unpin completes, send the asynchronous control
   channel message to the other side for unpinning.
3. Mark the data structure and return and allow the migration to
   continue with the next RDMA write.
4. Upon completion of the unpinning on the destination, respond to the
   source that it was finished.
5. Source then clears the data structure for the successfully unpinned
   memory ranges.
6. At this point, one or more writes may (or may not) be blocking on the
   unpinned memory areas and will poll the data structure and find that
   the unpinning has completed.
7. Then issue the new writes and proceed as normal.
8. Repeat step 1.
On 26/06/2013 16:09, Michael R. Hines wrote:
> *This requires some steps:*
> [...]

After more discussion on IRC I think I understand this. It's not
trivial, but not super-complex either... it would really be nice to add
it to the protocol already in 1.6.

Paolo
On 06/26/2013 10:57 AM, Paolo Bonzini wrote:
> On 26/06/2013 16:09, Michael R. Hines wrote:
>> *This requires some steps:*
>> [...]
> After more discussion on IRC I think I understand this. It's not
> trivial, but not super-complex either... it would really be nice to add
> it to the protocol already in 1.6.
>
> Paolo

I will submit this as a brand new patch series. I'm working on it right now.

- Michael
diff --git a/migration.c b/migration.c
index dcb99c9..758d99d 100644
--- a/migration.c
+++ b/migration.c
@@ -36,7 +36,8 @@
 #endif
 
 enum {
-    MIG_STATE_ERROR,
+    MIG_STATE_ERROR = -1,
+    MIG_STATE_NONE,
     MIG_STATE_SETUP,
     MIG_STATE_CANCELLED,
     MIG_STATE_ACTIVE,
@@ -63,7 +64,7 @@ static NotifierList migration_state_notifiers =
 MigrationState *migrate_get_current(void)
 {
     static MigrationState current_migration = {
-        .state = MIG_STATE_SETUP,
+        .state = MIG_STATE_NONE,
         .bandwidth_limit = MAX_THROTTLE,
         .xbzrle_cache_size = DEFAULT_MIGRATE_CACHE_SIZE,
         .mbps = -1,
@@ -184,9 +185,13 @@ MigrationInfo *qmp_query_migrate(Error **errp)
     MigrationState *s = migrate_get_current();
 
     switch (s->state) {
-    case MIG_STATE_SETUP:
+    case MIG_STATE_NONE:
         /* no migration has happened ever */
         break;
+    case MIG_STATE_SETUP:
+        info->has_status = true;
+        info->status = g_strdup("setup");
+        break;
     case MIG_STATE_ACTIVE:
         info->has_status = true;
         info->status = g_strdup("active");
@@ -257,7 +262,7 @@ void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
     MigrationState *s = migrate_get_current();
     MigrationCapabilityStatusList *cap;
 
-    if (s->state == MIG_STATE_ACTIVE) {
+    if (s->state == MIG_STATE_ACTIVE || s->state == MIG_STATE_SETUP) {
         error_set(errp, QERR_MIGRATION_ACTIVE);
         return;
     }
@@ -316,6 +321,7 @@ static void migrate_fd_cancel(MigrationState *s)
 {
     DPRINTF("cancelling migration\n");
 
+    migrate_set_state(s, MIG_STATE_SETUP, MIG_STATE_CANCELLED);
     migrate_set_state(s, MIG_STATE_ACTIVE, MIG_STATE_CANCELLED);
 }
 
@@ -393,7 +399,7 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
     params.blk = blk;
     params.shared = inc;
 
-    if (s->state == MIG_STATE_ACTIVE) {
+    if (s->state == MIG_STATE_ACTIVE || s->state == MIG_STATE_SETUP) {
         error_set(errp, QERR_MIGRATION_ACTIVE);
         return;
     }
@@ -525,6 +531,8 @@ static void *migration_thread(void *opaque)
     DPRINTF("beginning savevm\n");
     qemu_savevm_state_begin(s->file, &s->params);
 
+    migrate_set_state(s, MIG_STATE_SETUP, MIG_STATE_ACTIVE);
+
     while (s->state == MIG_STATE_ACTIVE) {
         int64_t current_time;
         uint64_t pending_size;
@@ -604,8 +612,8 @@ static void *migration_thread(void *opaque)
 
 void migrate_fd_connect(MigrationState *s)
 {
-    s->state = MIG_STATE_ACTIVE;
-    trace_migrate_set_state(MIG_STATE_ACTIVE);
+    s->state = MIG_STATE_SETUP;
+    trace_migrate_set_state(MIG_STATE_SETUP);
 
     /* This is a best 1st approximation. ns to ms */
     s->expected_downtime = max_downtime/1000000;