
[2/2] tests/qtest: make more migration pre-copy scenarios run non-live

Message ID 20230418133100.48799-3-berrange@redhat.com
State New
Series tests/qtest: make migration-test faster

Commit Message

Daniel P. Berrangé April 18, 2023, 1:31 p.m. UTC
There are 27 pre-copy live migration scenarios being tested. In all of
these we force non-convergence and run for one iteration, then let it
converge and wait for completion during the second (or following)
iterations. At the 3 MB/s bandwidth limit the first iteration takes a very
long time (~30 seconds).

While it is important to test the migration passes and convergence
logic, it is overkill to do this for all 27 pre-copy scenarios. The
TLS migration scenarios in particular are merely exercising different
code paths during connection establishment.

To optimize time taken, switch most of the test scenarios to run
non-live (i.e. guest CPUs paused) with no bandwidth limits. This gives
a massive speed up for most of the test scenarios.

For test coverage, the following scenarios are unchanged:

 * Precopy with UNIX sockets
 * Precopy with UNIX sockets and dirty ring tracking
 * Precopy with XBZRLE
 * Precopy with multifd

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 tests/qtest/migration-test.c | 34 +++++++++++++++++++++++++++-------
 1 file changed, 27 insertions(+), 7 deletions(-)
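
For context, the convergence knobs that migrate_ensure_non_converge() and
migrate_ensure_converge() flip look roughly like this in
tests/qtest/migration-test.c (a sketch from the tree around this series;
treat the exact values as illustrative):

    static void migrate_ensure_non_converge(QTestState *who)
    {
        /* Can't converge with 1ms downtime + 3 MB/s bandwidth limit */
        migrate_set_parameter_int(who, "max-bandwidth", 3 * 1000 * 1000);
        migrate_set_parameter_int(who, "downtime-limit", 1);
    }

    static void migrate_ensure_converge(QTestState *who)
    {
        /* Should converge with 30s downtime + 1 GB/s bandwidth limit */
        migrate_set_parameter_int(who, "max-bandwidth", 1 * 1000 * 1000 * 1000);
        migrate_set_parameter_int(who, "downtime-limit", 30 * 1000);
    }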

Comments

Fabiano Rosas April 18, 2023, 7:52 p.m. UTC | #1
Daniel P. Berrangé <berrange@redhat.com> writes:

> There are 27 pre-copy live migration scenarios being tested. In all of
> these we force non-convergence and run for one iteration, then let it
> converge and wait for completion during the second (or following)
> iterations. At the 3 MB/s bandwidth limit the first iteration takes a very
> long time (~30 seconds).
>
> While it is important to test the migration passes and convergence
> logic, it is overkill to do this for all 27 pre-copy scenarios. The
> TLS migration scenarios in particular are merely exercising different
> code paths during connection establishment.
>
> To optimize time taken, switch most of the test scenarios to run
> non-live (i.e. guest CPUs paused) with no bandwidth limits. This gives
> a massive speed up for most of the test scenarios.
>
> For test coverage, the following scenarios are unchanged:
>
>  * Precopy with UNIX sockets
>  * Precopy with UNIX sockets and dirty ring tracking
>  * Precopy with XBZRLE
>  * Precopy with multifd
>
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>

...

> -        qtest_qmp_eventwait(to, "RESUME");
> +        if (!args->live) {
> +            qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> +        }
> +        if (!got_resume) {
> +            qtest_qmp_eventwait(to, "RESUME");
> +        }

Hi Daniel,

On an aarch64 host I'm sometimes (~30%) seeing a hang here on a TLS test:

../configure --target-list=aarch64-softmmu --enable-gnutls

... ./tests/qtest/migration-test --tap -k -p /aarch64/migration/precopy/tcp/tls/psk/match

(gdb) bt
#0  0x0000fffff7b33f8c in recv () from /lib64/libpthread.so.0
#1  0x0000aaaaaaac8bf4 in recv (__flags=0, __n=1, __buf=0xffffffffe477, __fd=5) at /usr/include/bits/socket2.h:44
#2  qmp_fd_receive (fd=5) at ../tests/qtest/libqmp.c:73
#3  0x0000aaaaaaac6dbc in qtest_qmp_receive_dict (s=0xaaaaaaca7d10) at ../tests/qtest/libqtest.c:713
#4  qtest_qmp_eventwait_ref (s=0xaaaaaaca7d10, event=0xaaaaaab26ce8 "RESUME") at ../tests/qtest/libqtest.c:837
#5  0x0000aaaaaaac6e34 in qtest_qmp_eventwait (s=<optimized out>, event=<optimized out>) at ../tests/qtest/libqtest.c:850
#6  0x0000aaaaaaabbd90 in test_precopy_common (args=0xffffffffe590, args@entry=0xffffffffe5a0) at ../tests/qtest/migration-test.c:1393
#7  0x0000aaaaaaabc804 in test_precopy_tcp_tls_psk_match () at ../tests/qtest/migration-test.c:1564
#8  0x0000fffff7c89630 in ?? () from //usr/lib64/libglib-2.0.so.0
...
#15 0x0000fffff7c89a70 in g_test_run_suite () from //usr/lib64/libglib-2.0.so.0
#16 0x0000fffff7c89ae4 in g_test_run () from //usr/lib64/libglib-2.0.so.0
#17 0x0000aaaaaaab7fdc in main (argc=<optimized out>, argv=<optimized out>) at ../tests/qtest/migration-test.c:2642
Daniel P. Berrangé April 19, 2023, 5:14 p.m. UTC | #2
On Tue, Apr 18, 2023 at 04:52:32PM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > There are 27 pre-copy live migration scenarios being tested. In all of
> > these we force non-convergence and run for one iteration, then let it
> > converge and wait for completion during the second (or following)
> > iterations. At the 3 MB/s bandwidth limit the first iteration takes a very
> > long time (~30 seconds).
> >
> > While it is important to test the migration passes and convergence
> > logic, it is overkill to do this for all 27 pre-copy scenarios. The
> > TLS migration scenarios in particular are merely exercising different
> > code paths during connection establishment.
> >
> > To optimize time taken, switch most of the test scenarios to run
> > non-live (i.e. guest CPUs paused) with no bandwidth limits. This gives
> > a massive speed up for most of the test scenarios.
> >
> > For test coverage, the following scenarios are unchanged:
> >
> >  * Precopy with UNIX sockets
> >  * Precopy with UNIX sockets and dirty ring tracking
> >  * Precopy with XBZRLE
> >  * Precopy with multifd
> >
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> 
> ...
> 
> > -        qtest_qmp_eventwait(to, "RESUME");
> > +        if (!args->live) {
> > +            qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> > +        }
> > +        if (!got_resume) {
> > +            qtest_qmp_eventwait(to, "RESUME");
> > +        }
> 
> Hi Daniel,
> 
> On an aarch64 host I'm sometimes (~30%) seeing a hang here on a TLS test:
> 
> ../configure --target-list=aarch64-softmmu --enable-gnutls
> 
> ... ./tests/qtest/migration-test --tap -k -p /aarch64/migration/precopy/tcp/tls/psk/match
> 
> (gdb) bt
> #0  0x0000fffff7b33f8c in recv () from /lib64/libpthread.so.0
> #1  0x0000aaaaaaac8bf4 in recv (__flags=0, __n=1, __buf=0xffffffffe477, __fd=5) at /usr/include/bits/socket2.h:44
> #2  qmp_fd_receive (fd=5) at ../tests/qtest/libqmp.c:73
> #3  0x0000aaaaaaac6dbc in qtest_qmp_receive_dict (s=0xaaaaaaca7d10) at ../tests/qtest/libqtest.c:713
> #4  qtest_qmp_eventwait_ref (s=0xaaaaaaca7d10, event=0xaaaaaab26ce8 "RESUME") at ../tests/qtest/libqtest.c:837
> #5  0x0000aaaaaaac6e34 in qtest_qmp_eventwait (s=<optimized out>, event=<optimized out>) at ../tests/qtest/libqtest.c:850
> #6  0x0000aaaaaaabbd90 in test_precopy_common (args=0xffffffffe590, args@entry=0xffffffffe5a0) at ../tests/qtest/migration-test.c:1393
> #7  0x0000aaaaaaabc804 in test_precopy_tcp_tls_psk_match () at ../tests/qtest/migration-test.c:1564
> #8  0x0000fffff7c89630 in ?? () from //usr/lib64/libglib-2.0.so.0
> ...
> #15 0x0000fffff7c89a70 in g_test_run_suite () from //usr/lib64/libglib-2.0.so.0
> #16 0x0000fffff7c89ae4 in g_test_run () from //usr/lib64/libglib-2.0.so.0
> #17 0x0000aaaaaaab7fdc in main (argc=<optimized out>, argv=<optimized out>) at ../tests/qtest/migration-test.c:2642

Urgh, ok, there must be an unexpected race condition w.r.t. events in my
change. Thanks for the stack trace, I'll investigate.

With regards,
Daniel
Juan Quintela April 20, 2023, 12:59 p.m. UTC | #3
Daniel P. Berrangé <berrange@redhat.com> wrote:
> There are 27 pre-copy live migration scenarios being tested. In all of
> these we force non-convergence and run for one iteration, then let it
> converge and wait for completion during the second (or following)
> iterations. At the 3 MB/s bandwidth limit the first iteration takes a very
> long time (~30 seconds).
>
> While it is important to test the migration passes and convergence
> logic, it is overkill to do this for all 27 pre-copy scenarios. The
> TLS migration scenarios in particular are merely exercising different
> code paths during connection establishment.
>
> To optimize time taken, switch most of the test scenarios to run
> non-live (i.e. guest CPUs paused) with no bandwidth limits. This gives
> a massive speed up for most of the test scenarios.
>
> For test coverage, the following scenarios are unchanged:
>
>  * Precopy with UNIX sockets
>  * Precopy with UNIX sockets and dirty ring tracking
>  * Precopy with XBZRLE
>  * Precopy with multifd

Just for completeness: the other test that is still slow is
/migration/vcpu_dirty_limit.

> -    migrate_ensure_non_converge(from);
> +    if (args->live) {
> +        migrate_ensure_non_converge(from);
> +    } else {
> +        migrate_ensure_converge(from);
> +    }

Looks ... weird?
But the only way that I can think of improving it is to pass args to
migrate_ensure_*() and that is a different kind of weird.
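
For example, an (untested) sketch of that alternative, with a
hypothetical wrapper name:

    /* Hypothetical helper, purely to illustrate the alternative */
    static void migrate_ensure_convergence_for(QTestState *who,
                                               const MigrateCommon *args)
    {
        if (args->live) {
            migrate_ensure_non_converge(who);
        } else {
            migrate_ensure_converge(who);
        }
    }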

>      } else {
> -        if (args->iterations) {
> -            while (args->iterations--) {
> +        if (args->live) {
> +            if (args->iterations) {
> +                while (args->iterations--) {
> +                    wait_for_migration_pass(from);
> +                }
> +            } else {
>                  wait_for_migration_pass(from);
>              }
> +
> +            migrate_ensure_converge(from);

I think we should change iterations to be 1 when we create args, but
otherwise, treat 0 as 1 and change it to something along the lines of:

        if (args->live) {
            while (args->iterations--) {
                wait_for_migration_pass(from);
            }
            migrate_ensure_converge(from);

What do you think?


> -        qtest_qmp_eventwait(to, "RESUME");
> +        if (!args->live) {
> +            qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> +        }
> +        if (!got_resume) {
> +            qtest_qmp_eventwait(to, "RESUME");
> +        }
>  
>          wait_for_serial("dest_serial");
>      }

I was looking at the "culprit" of Lukas's problem, and it is not directly
obvious.  I see that when we expect one event, we just drop any event
that we are not interested in.  I don't know if that is the proper
behaviour or if that is what is affecting this test.
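
One way to make that deterministic would be to record the interesting
event via a callback instead of dropping it, e.g. (untested sketch,
presumably along the lines of whatever sets got_resume in patch 1/2):

    /* Untested sketch: remember RESUME even if we happen to be
     * waiting on some other event when it arrives */
    static bool watch_for_resume(QTestState *who, const char *name,
                                 QDict *event, void *opaque)
    {
        bool *seen = opaque;

        if (g_str_equal(name, "RESUME")) {
            *seen = true;
        }
        /* false: don't consume it, leave it queued for eventwait */
        return false;
    }

    ...
    qtest_qmp_set_event_callback(to, watch_for_resume, &got_resume);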

Later, Juan.
Daniel P. Berrangé April 20, 2023, 3:58 p.m. UTC | #4
On Thu, Apr 20, 2023 at 02:59:00PM +0200, Juan Quintela wrote:
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> > There are 27 pre-copy live migration scenarios being tested. In all of
> > these we force non-convergence and run for one iteration, then let it
> > converge and wait for completion during the second (or following)
> > iterations. At the 3 MB/s bandwidth limit the first iteration takes a very
> > long time (~30 seconds).
> >
> > While it is important to test the migration passes and convergence
> > logic, it is overkill to do this for all 27 pre-copy scenarios. The
> > TLS migration scenarios in particular are merely exercising different
> > code paths during connection establishment.
> >
> > To optimize time taken, switch most of the test scenarios to run
> > non-live (i.e. guest CPUs paused) with no bandwidth limits. This gives
> > a massive speed up for most of the test scenarios.
> >
> > For test coverage, the following scenarios are unchanged:
> >
> >  * Precopy with UNIX sockets
> >  * Precopy with UNIX sockets and dirty ring tracking
> >  * Precopy with XBZRLE
> >  * Precopy with multifd
> 
> Just for completeness: the other test that is still slow is
> /migration/vcpu_dirty_limit.
> 
> > -    migrate_ensure_non_converge(from);
> > +    if (args->live) {
> > +        migrate_ensure_non_converge(from);
> > +    } else {
> > +        migrate_ensure_converge(from);
> > +    }
> 
> Looks ... weird?
> But the only way that I can think of improving it is to pass args to
> migrate_ensure_*() and that is a different kind of weird.

I'm going to change this a little anyway. Currently for the non-live
case, I start the migration and then stop the CPUs. I want to reverse
that order, as we should have CPUs paused before even starting the
migration to ensure we don't have any re-dirtied pages at all.
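
I.e. roughly (sketch, not compiled):

    /* Sketch of the reordering: pause the guest first, then migrate */
    if (!args->live) {
        /* no pages can be re-dirtied once the CPUs are stopped */
        qtest_qmp_discard_response(from, "{ 'execute' : 'stop'}");
    }

    /* ... then issue the 'migrate' command as before ... */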

> 
> >      } else {
> > -        if (args->iterations) {
> > -            while (args->iterations--) {
> > +        if (args->live) {
> > +            if (args->iterations) {
> > +                while (args->iterations--) {
> > +                    wait_for_migration_pass(from);
> > +                }
> > +            } else {
> >                  wait_for_migration_pass(from);
> >              }
> > +
> > +            migrate_ensure_converge(from);
> 
> I think we should change iterations to be 1 when we create args, but
> otherwise, treat 0 as 1 and change it to something in the lines of:
> 
>         if (args->live) {
>             while (args->iterations-- >= 0) {
>                 wait_for_migration_pass(from);
>             }
>             migrate_ensure_converge(from);
> 
> What do you think?

I think in retrospect 'iterations' was overkill as we only use the
values 0 (implicitly 1) or 2. IOW we could just use a 'bool multipass'
to distinguish the two cases.
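
Something like this (sketch, not compiled), with the struct field
replaced:

    /* in MigrateCommon: */
    bool multipass;   /* wait for two migration passes before converging */

    /* in test_precopy_common(): */
    if (args->live) {
        wait_for_migration_pass(from);
        if (args->multipass) {
            wait_for_migration_pass(from);
        }
        migrate_ensure_converge(from);
    }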

> > -        qtest_qmp_eventwait(to, "RESUME");
> > +        if (!args->live) {
> > +            qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> > +        }
> > +        if (!got_resume) {
> > +            qtest_qmp_eventwait(to, "RESUME");
> > +        }
> >  
> >          wait_for_serial("dest_serial");
> >      }
> 
> I was looking at the "culprit" of Lukas problem, and it is not directly
> obvious.  I see that when we expect one event, we just drop any event
> that we are not interested in.  I don't know if that is the proper
> behaviour or if that is what affecting this test.

I've not successfully reproduced it yet, nor figured out a real
scenario where it could plausibly happen. I'm looking to add more
debugging to help us out.

With regards,
Daniel
Daniel P. Berrangé April 21, 2023, 5:20 p.m. UTC | #5
On Tue, Apr 18, 2023 at 04:52:32PM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > There are 27 pre-copy live migration scenarios being tested. In all of
> > these we force non-convergence and run for one iteration, then let it
> > converge and wait for completion during the second (or following)
> > iterations. At the 3 MB/s bandwidth limit the first iteration takes a very
> > long time (~30 seconds).
> >
> > While it is important to test the migration passes and convergence
> > logic, it is overkill to do this for all 27 pre-copy scenarios. The
> > TLS migration scenarios in particular are merely exercising different
> > code paths during connection establishment.
> >
> > To optimize time taken, switch most of the test scenarios to run
> > non-live (i.e. guest CPUs paused) with no bandwidth limits. This gives
> > a massive speed up for most of the test scenarios.
> >
> > For test coverage, the following scenarios are unchanged:
> >
> >  * Precopy with UNIX sockets
> >  * Precopy with UNIX sockets and dirty ring tracking
> >  * Precopy with XBZRLE
> >  * Precopy with multifd
> >
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> 
> ...
> 
> > -        qtest_qmp_eventwait(to, "RESUME");
> > +        if (!args->live) {
> > +            qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
> > +        }
> > +        if (!got_resume) {
> > +            qtest_qmp_eventwait(to, "RESUME");
> > +        }
> 
> Hi Daniel,
> 
> On an aarch64 host I'm sometimes (~30%) seeing a hang here on a TLS test:
> 
> ../configure --target-list=aarch64-softmmu --enable-gnutls
> 
> ... ./tests/qtest/migration-test --tap -k -p /aarch64/migration/precopy/tcp/tls/psk/match

I never came to a satisfactory understanding of why this problem hits
you. I've just sent out a new version of this series, which has quite
a few differences, so possibly I've fixed it by luck.

So if you have time, I'd appreciate any testing you can try on

  https://lists.gnu.org/archive/html/qemu-devel/2023-04/msg03688.html


With regards,
Daniel

Patch

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 3b615b0da9..cdc9635f0b 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -574,6 +574,9 @@  typedef struct {
     /* Optional: set number of migration passes to wait for */
     unsigned int iterations;
 
+    /* Whether the guest CPUs should be running during migration */
+    bool live;
+
     /* Postcopy specific fields */
     void *postcopy_data;
     bool postcopy_preempt;
@@ -1329,7 +1332,11 @@  static void test_precopy_common(MigrateCommon *args)
         return;
     }
 
-    migrate_ensure_non_converge(from);
+    if (args->live) {
+        migrate_ensure_non_converge(from);
+    } else {
+        migrate_ensure_converge(from);
+    }
 
     if (args->start_hook) {
         data_hook = args->start_hook(from, to);
@@ -1357,16 +1364,20 @@  static void test_precopy_common(MigrateCommon *args)
             qtest_set_expected_status(to, EXIT_FAILURE);
         }
     } else {
-        if (args->iterations) {
-            while (args->iterations--) {
+        if (args->live) {
+            if (args->iterations) {
+                while (args->iterations--) {
+                    wait_for_migration_pass(from);
+                }
+            } else {
                 wait_for_migration_pass(from);
             }
+
+            migrate_ensure_converge(from);
         } else {
-            wait_for_migration_pass(from);
+            qtest_qmp_discard_response(from, "{ 'execute' : 'stop'}");
         }
 
-        migrate_ensure_converge(from);
-
         /* We do this first, as it has a timeout to stop us
          * hanging forever if migration didn't converge */
         wait_for_migration_complete(from);
@@ -1375,7 +1386,12 @@  static void test_precopy_common(MigrateCommon *args)
             qtest_qmp_eventwait(from, "STOP");
         }
 
-        qtest_qmp_eventwait(to, "RESUME");
+        if (!args->live) {
+            qtest_qmp_discard_response(to, "{ 'execute' : 'cont'}");
+        }
+        if (!got_resume) {
+            qtest_qmp_eventwait(to, "RESUME");
+        }
 
         wait_for_serial("dest_serial");
     }
@@ -1393,6 +1409,7 @@  static void test_precopy_unix_plain(void)
     MigrateCommon args = {
         .listen_uri = uri,
         .connect_uri = uri,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1408,6 +1425,7 @@  static void test_precopy_unix_dirty_ring(void)
         },
         .listen_uri = uri,
         .connect_uri = uri,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1519,6 +1537,7 @@  static void test_precopy_unix_xbzrle(void)
         .start_hook = test_migrate_xbzrle_start,
 
         .iterations = 2,
+        .live = true,
     };
 
     test_precopy_common(&args);
@@ -1919,6 +1938,7 @@  static void test_multifd_tcp_none(void)
     MigrateCommon args = {
         .listen_uri = "defer",
         .start_hook = test_migrate_precopy_tcp_multifd_start,
+        .live = true,
     };
     test_precopy_common(&args);
 }