diff mbox series

migration/postcopy: Recognise the recovery states as 'in_postcopy'

Message ID 20190923174942.12182-1-dgilbert@redhat.com
State New
Headers show
Series migration/postcopy: Recognise the recovery states as 'in_postcopy' | expand

Commit Message

Dr. David Alan Gilbert Sept. 23, 2019, 5:49 p.m. UTC
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Various parts of the migration code do different things when they're
in postcopy mode; prior to this patch this has been 'postcopy-active'.
This patch extends 'in_postcopy' to include 'postcopy-paused' and
'postcopy-recover'.

In particular, when you set the max-postcopy-bandwidth parameter, this
only affects the current migration fd if we're 'in_postcopy';
this leads to a race in the postcopy recovery test where it increases
the speed from 4k/sec to unlimited, but that increase can get ignored
if the change is made between the point at which the reconnection
happens and it transitions back to active.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/migration.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Comments

Alex Bennée Sept. 23, 2019, 6:23 p.m. UTC | #1
Dr. David Alan Gilbert (git) <dgilbert@redhat.com> writes:

> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Various parts of the migration code do different things when they're
> in postcopy mode; prior to this patch this has been 'postcopy-active'.
> This patch extends 'in_postcopy' to include 'postcopy-paused' and
> 'postcopy-recover'.
>
> In particular, when you set the max-postcopy-bandwidth parameter, this
> only affects the current migration fd if we're 'in_postcopy';
> this leads to a race in the postcopy recovery test where it increases
> the speed from 4k/sec to unlimited, but that increase can get ignored
> if the change is made between the point at which the reconnection
> happens and it transitions back to active.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

I'm stress testing it now.

> ---
>  migration/migration.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 01863a95f5..5f7e4d15e9 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1659,7 +1659,14 @@ bool migration_in_postcopy(void)
>  {
>      MigrationState *s = migrate_get_current();
>
> -    return (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
> +    switch (s->state) {
> +    case MIGRATION_STATUS_POSTCOPY_ACTIVE:
> +    case MIGRATION_STATUS_POSTCOPY_PAUSED:
> +    case MIGRATION_STATUS_POSTCOPY_RECOVER:
> +        return true;
> +    default:
> +        return false;
> +    }
>  }
>
>  bool migration_in_postcopy_after_devices(MigrationState *s)


--
Alex Bennée
Peter Xu Sept. 24, 2019, 12:15 a.m. UTC | #2
On Mon, Sep 23, 2019 at 06:49:42PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Various parts of the migration code do different things when they're
> in postcopy mode; prior to this patch this has been 'postcopy-active'.
> This patch extends 'in_postcopy' to include 'postcopy-paused' and
> 'postcopy-recover'.
> 
> In particular, when you set the max-postcopy-bandwidth parameter, this
> only affects the current migration fd if we're 'in_postcopy';
> this leads to a race in the postcopy recovery test where it increases
> the speed from 4k/sec to unlimited, but that increase can get ignored
> if the change is made between the point at which the reconnection
> happens and it transitions back to active.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Yeh this makes quite a lot of sense to me...

Reviewed-by: Peter Xu <peterx@redhat.com>
Juan Quintela Sept. 24, 2019, 7:29 a.m. UTC | #3
"Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Various parts of the migration code do different things when they're
> in postcopy mode; prior to this patch this has been 'postcopy-active'.
> This patch extends 'in_postcopy' to include 'postcopy-paused' and
> 'postcopy-recover'.
>
> In particular, when you set the max-postcopy-bandwidth parameter, this
> only affects the current migration fd if we're 'in_postcopy';
> this leads to a race in the postcopy recovery test where it increases
> the speed from 4k/sec to unlimited, but that increase can get ignored
> if the change is made between the point at which the reconnection
> happens and it transitions back to active.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>
Alex Bennée Sept. 24, 2019, 3:39 p.m. UTC | #4
Dr. David Alan Gilbert (git) <dgilbert@redhat.com> writes:

> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Various parts of the migration code do different things when they're
> in postcopy mode; prior to this patch this has been 'postcopy-active'.
> This patch extends 'in_postcopy' to include 'postcopy-paused' and
> 'postcopy-recover'.
>
> In particular, when you set the max-postcopy-bandwidth parameter, this
> only affects the current migration fd if we're 'in_postcopy';
> this leads to a race in the postcopy recovery test where it increases
> the speed from 4k/sec to unlimited, but that increase can get ignored
> if the change is made between the point at which the reconnection
> happens and it transitions back to active.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

In my xenial stress test I run 100 times and it never triggered the 180s
timeout I set on my retry.py script:

Tested-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  migration/migration.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 01863a95f5..5f7e4d15e9 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1659,7 +1659,14 @@ bool migration_in_postcopy(void)
>  {
>      MigrationState *s = migrate_get_current();
>
> -    return (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
> +    switch (s->state) {
> +    case MIGRATION_STATUS_POSTCOPY_ACTIVE:
> +    case MIGRATION_STATUS_POSTCOPY_PAUSED:
> +    case MIGRATION_STATUS_POSTCOPY_RECOVER:
> +        return true;
> +    default:
> +        return false;
> +    }
>  }
>
>  bool migration_in_postcopy_after_devices(MigrationState *s)


--
Alex Bennée
Markus Armbruster Sept. 25, 2019, 9:21 a.m. UTC | #5
"Dr. David Alan Gilbert (git)" <dgilbert@redhat.com> writes:

> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> Various parts of the migration code do different things when they're
> in postcopy mode; prior to this patch this has been 'postcopy-active'.
> This patch extends 'in_postcopy' to include 'postcopy-paused' and
> 'postcopy-recover'.
>
> In particular, when you set the max-postcopy-bandwidth parameter, this
> only affects the current migration fd if we're 'in_postcopy';
> this leads to a race in the postcopy recovery test where it increases
> the speed from 4k/sec to unlimited, but that increase can get ignored
> if the change is made between the point at which the reconnection
> happens and it transitions back to active.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

This seems to fix the intermittent hangs I observed and bisected to
commit 8504ddeca0 "migration: Fix postcopy bw for recovery".

Tested-by: Markus Armbruster <armbru@redhat.com>
Dr. David Alan Gilbert Sept. 25, 2019, 10:37 a.m. UTC | #6
* Dr. David Alan Gilbert (git) (dgilbert@redhat.com) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Various parts of the migration code do different things when they're
> in postcopy mode; prior to this patch this has been 'postcopy-active'.
> This patch extends 'in_postcopy' to include 'postcopy-paused' and
> 'postcopy-recover'.
> 
> In particular, when you set the max-postcopy-bandwidth parameter, this
> only affects the current migration fd if we're 'in_postcopy';
> this leads to a race in the postcopy recovery test where it increases
> the speed from 4k/sec to unlimited, but that increase can get ignored
> if the change is made between the point at which the reconnection
> happens and it transitions back to active.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Queued

> ---
>  migration/migration.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 01863a95f5..5f7e4d15e9 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1659,7 +1659,14 @@ bool migration_in_postcopy(void)
>  {
>      MigrationState *s = migrate_get_current();
>  
> -    return (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
> +    switch (s->state) {
> +    case MIGRATION_STATUS_POSTCOPY_ACTIVE:
> +    case MIGRATION_STATUS_POSTCOPY_PAUSED:
> +    case MIGRATION_STATUS_POSTCOPY_RECOVER:
> +        return true;
> +    default:
> +        return false;
> +    }
>  }
>  
>  bool migration_in_postcopy_after_devices(MigrationState *s)
> -- 
> 2.21.0
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
diff mbox series

Patch

diff --git a/migration/migration.c b/migration/migration.c
index 01863a95f5..5f7e4d15e9 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1659,7 +1659,14 @@  bool migration_in_postcopy(void)
 {
     MigrationState *s = migrate_get_current();
 
-    return (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
+    switch (s->state) {
+    case MIGRATION_STATUS_POSTCOPY_ACTIVE:
+    case MIGRATION_STATUS_POSTCOPY_PAUSED:
+    case MIGRATION_STATUS_POSTCOPY_RECOVER:
+        return true;
+    default:
+        return false;
+    }
 }
 
 bool migration_in_postcopy_after_devices(MigrationState *s)