[v3,19/19] test-bdrv-drain: Test draining job source child and parent

Message ID 20180920161958.27453-20-kwolf@redhat.com
State New
Series Fix some jobs/drain/aio_poll related hangs

Commit Message

Kevin Wolf Sept. 20, 2018, 4:19 p.m. UTC
For the block job drain test, don't only test draining the source and
the target node, but create a backing chain for the source
(source_backing <- source <- source_overlay) and test draining each of
the nodes in it.

When using iothreads, the source node (and therefore the job) is in a
different AioContext than the drain, which happens from the main
thread. This way, the main thread waits in AIO_WAIT_WHILE() for the
iothread to make progress and aio_wait_kick() is required to notify it.
The test validates that calling bdrv_wakeup() for a child or a parent
node will actually notify AIO_WAIT_WHILE() instead of letting it hang.
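
As a rough sketch (not part of this patch; it assumes the single global
AioWait variant of AIO_WAIT_WHILE()/aio_wait_kick() introduced earlier in
this series, and cond_still_true() is only a placeholder, not a real API),
the handshake being exercised looks like this:

    /* Main thread: poll until the condition clears. The condition is only
     * re-evaluated when something wakes the waiter up. */
    AIO_WAIT_WHILE(bdrv_get_aio_context(bs), cond_still_true(bs));

    /* Iothread: after making progress that may have cleared the condition
     * (e.g. when bdrv_wakeup() is called for a node), kick the waiter;
     * without the kick the main thread can hang in AIO_WAIT_WHILE(). */
    aio_wait_kick();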

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 tests/test-bdrv-drain.c | 75 +++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 67 insertions(+), 8 deletions(-)

Comments

Eric Blake Sept. 21, 2018, 5:01 p.m. UTC | #1
On 9/20/18 11:19 AM, Kevin Wolf wrote:
> For the block job drain test, don't only test draining the source and
> the target node, but create a backing chain for the source
> (source_backing <- source <- source_overlay) and test draining each of
> the nodes in it.
> 
> When using iothreads, the source node (and therefore the job) is in a
> different AioContext than the drain, which happens from the main
> thread. This way, the main thread waits in AIO_WAIT_WHILE() for the
> iothread to make progress and aio_wait_kick() is required to notify it.
> The test validates that calling bdrv_wakeup() for a child or a parent
> node will actually notify AIO_WAIT_WHILE() instead of letting it hang.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   tests/test-bdrv-drain.c | 75 +++++++++++++++++++++++++++++++++++++++++++------
>   1 file changed, 67 insertions(+), 8 deletions(-)
> 

> @@ -818,12 +819,17 @@ static int coroutine_fn test_job_run(Job *job, Error **errp)
>   {
>       TestBlockJob *s = container_of(job, TestBlockJob, common.job);
>   
> +    /* We are running the actual job code past the pause point in
> +     * job_co_entry(). */
> +    s->running = true;
> +
>       job_transition_to_ready(&s->common.job);
>       while (!s->should_complete) {
>           /* Avoid job_sleep_ns() because it marks the job as !busy. We want to
>            * emulate some actual activity (probably some I/O) here so that drain
>            * has to wait for this acitivity to stop. */
> -        qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 100000);
> +        qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 1000000);

The commit message didn't mention why you had to lengthen this sleep, 
but I'm guessing it's because you are now doing enough additional things 
that you have to scale the delay to match?

Reviewed-by: Eric Blake <eblake@redhat.com>
Kevin Wolf Sept. 21, 2018, 5:34 p.m. UTC | #2
On 21.09.2018 at 19:01, Eric Blake wrote:
> On 9/20/18 11:19 AM, Kevin Wolf wrote:
> > For the block job drain test, don't only test draining the source and
> > the target node, but create a backing chain for the source
> > (source_backing <- source <- source_overlay) and test draining each of
> > the nodes in it.
> > 
> > When using iothreads, the source node (and therefore the job) is in a
> > different AioContext than the drain, which happens from the main
> > thread. This way, the main thread waits in AIO_WAIT_WHILE() for the
> > iothread to make progress and aio_wait_kick() is required to notify it.
> > The test validates that calling bdrv_wakeup() for a child or a parent
> > node will actually notify AIO_WAIT_WHILE() instead of letting it hang.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >   tests/test-bdrv-drain.c | 75 +++++++++++++++++++++++++++++++++++++++++++------
> >   1 file changed, 67 insertions(+), 8 deletions(-)
> > 
> 
> > @@ -818,12 +819,17 @@ static int coroutine_fn test_job_run(Job *job, Error **errp)
> >   {
> >       TestBlockJob *s = container_of(job, TestBlockJob, common.job);
> > +    /* We are running the actual job code past the pause point in
> > +     * job_co_entry(). */
> > +    s->running = true;
> > +
> >       job_transition_to_ready(&s->common.job);
> >       while (!s->should_complete) {
> >           /* Avoid job_sleep_ns() because it marks the job as !busy. We want to
> >            * emulate some actual activity (probably some I/O) here so that drain
> >            * has to wait for this acitivity to stop. */
> > -        qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 100000);
> > +        qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 1000000);
> 
> The commit message didn't mention why you had to lengthen this sleep, but
> I'm guessing it's because you are now doing enough additional things that
> you have to scale the delay to match?

Maybe it's actually interesting enough to add a paragraph to the commit
message:

When running under 'rr record -n' (because I wanted to debug some
behaviour), the bug suddenly didn't reproduce any more. This was
because bdrv_drain_invoke_entry() (in the main thread) was only called
after the job had already reached the pause point, so we got a
bdrv_dec_in_flight() from the main thread and the additional
aio_wait_kick() when the job becomes idle wasn't necessary any more.

I don't think this would happen outside a debugging environment (no idea
what rr does to multithreading), but I thought increasing the delay
couldn't hurt because it's still quite short (1 ms).
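
For comparison, the related startup race (the job still sitting at the
pause point in job_co_entry() when the drain starts) is handled explicitly
by the new wait loop in the patch; condensed from the hunk below:

    /* Main thread: don't start draining until the job coroutine in the
     * iothread has made it past the initial pause point in job_co_entry(),
     * i.e. until test_job_run() has set s->running. */
    while (!tjob->running) {
        aio_poll(qemu_get_aio_context(), false);
    }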

Of course, if you have an idea how to check the actual condition (only
allow a pause point after the wakeup from bdrv_drain_invoke_entry() has
already happened), I'd consider that instead. But I don't think there's
an easy way to get this information.

Kevin
Max Reitz Sept. 24, 2018, 5:23 p.m. UTC | #3
On 20.09.18 18:19, Kevin Wolf wrote:
> For the block job drain test, don't only test draining the source and
> the target node, but create a backing chain for the source
> (source_backing <- source <- source_overlay) and test draining each of
> the nodes in it.
> 
> When using iothreads, the source node (and therefore the job) is in a
> different AioContext than the drain, which happens from the main
> thread. This way, the main thread waits in AIO_WAIT_WHILE() for the
> iothread to make progress and aio_wait_kick() is required to notify it.
> The test validates that calling bdrv_wakeup() for a child or a parent
> node will actually notify AIO_WAIT_WHILE() instead of letting it hang.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  tests/test-bdrv-drain.c | 75 +++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 67 insertions(+), 8 deletions(-)

Reviewed-by: Max Reitz <mreitz@redhat.com>

Patch

diff --git a/tests/test-bdrv-drain.c b/tests/test-bdrv-drain.c
index 526978a8ac..193b88d340 100644
--- a/tests/test-bdrv-drain.c
+++ b/tests/test-bdrv-drain.c
@@ -786,6 +786,7 @@  typedef struct TestBlockJob {
     BlockJob common;
     int run_ret;
     int prepare_ret;
+    bool running;
     bool should_complete;
 } TestBlockJob;
 
@@ -818,12 +819,17 @@  static int coroutine_fn test_job_run(Job *job, Error **errp)
 {
     TestBlockJob *s = container_of(job, TestBlockJob, common.job);
 
+    /* We are running the actual job code past the pause point in
+     * job_co_entry(). */
+    s->running = true;
+
     job_transition_to_ready(&s->common.job);
     while (!s->should_complete) {
         /* Avoid job_sleep_ns() because it marks the job as !busy. We want to
          * emulate some actual activity (probably some I/O) here so that drain
          * has to wait for this acitivity to stop. */
-        qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 100000);
+        qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 1000000);
+
         job_pause_point(&s->common.job);
     }
 
@@ -856,11 +862,19 @@  enum test_job_result {
     TEST_JOB_FAIL_PREPARE,
 };
 
-static void test_blockjob_common(enum drain_type drain_type, bool use_iothread,
-                                 enum test_job_result result)
+enum test_job_drain_node {
+    TEST_JOB_DRAIN_SRC,
+    TEST_JOB_DRAIN_SRC_CHILD,
+    TEST_JOB_DRAIN_SRC_PARENT,
+};
+
+static void test_blockjob_common_drain_node(enum drain_type drain_type,
+                                            bool use_iothread,
+                                            enum test_job_result result,
+                                            enum test_job_drain_node drain_node)
 {
     BlockBackend *blk_src, *blk_target;
-    BlockDriverState *src, *target;
+    BlockDriverState *src, *src_backing, *src_overlay, *target, *drain_bs;
     BlockJob *job;
     TestBlockJob *tjob;
     IOThread *iothread = NULL;
@@ -869,8 +883,30 @@  static void test_blockjob_common(enum drain_type drain_type, bool use_iothread,
 
     src = bdrv_new_open_driver(&bdrv_test, "source", BDRV_O_RDWR,
                                &error_abort);
+    src_backing = bdrv_new_open_driver(&bdrv_test, "source-backing",
+                                       BDRV_O_RDWR, &error_abort);
+    src_overlay = bdrv_new_open_driver(&bdrv_test, "source-overlay",
+                                       BDRV_O_RDWR, &error_abort);
+
+    bdrv_set_backing_hd(src_overlay, src, &error_abort);
+    bdrv_unref(src);
+    bdrv_set_backing_hd(src, src_backing, &error_abort);
+    bdrv_unref(src_backing);
+
     blk_src = blk_new(BLK_PERM_ALL, BLK_PERM_ALL);
-    blk_insert_bs(blk_src, src, &error_abort);
+    blk_insert_bs(blk_src, src_overlay, &error_abort);
+
+    switch (drain_node) {
+    case TEST_JOB_DRAIN_SRC:
+        drain_bs = src;
+        break;
+    case TEST_JOB_DRAIN_SRC_CHILD:
+        drain_bs = src_backing;
+        break;
+    case TEST_JOB_DRAIN_SRC_PARENT:
+        drain_bs = src_overlay;
+        break;
+    }
 
     if (use_iothread) {
         iothread = iothread_new();
@@ -906,11 +942,21 @@  static void test_blockjob_common(enum drain_type drain_type, bool use_iothread,
     job_start(&job->job);
     aio_context_release(ctx);
 
+    if (use_iothread) {
+        /* job_co_entry() is run in the I/O thread, wait for the actual job
+         * code to start (we don't want to catch the job in the pause point in
+         * job_co_entry(). */
+        while (!tjob->running) {
+            aio_poll(qemu_get_aio_context(), false);
+        }
+    }
+
     g_assert_cmpint(job->job.pause_count, ==, 0);
     g_assert_false(job->job.paused);
+    g_assert_true(tjob->running);
     g_assert_true(job->job.busy); /* We're in qemu_co_sleep_ns() */
 
-    do_drain_begin_unlocked(drain_type, src);
+    do_drain_begin_unlocked(drain_type, drain_bs);
 
     if (drain_type == BDRV_DRAIN_ALL) {
         /* bdrv_drain_all() drains both src and target */
@@ -921,7 +967,7 @@  static void test_blockjob_common(enum drain_type drain_type, bool use_iothread,
     g_assert_true(job->job.paused);
     g_assert_false(job->job.busy); /* The job is paused */
 
-    do_drain_end_unlocked(drain_type, src);
+    do_drain_end_unlocked(drain_type, drain_bs);
 
     if (use_iothread) {
         /* paused is reset in the I/O thread, wait for it */
@@ -969,7 +1015,7 @@  static void test_blockjob_common(enum drain_type drain_type, bool use_iothread,
 
     blk_unref(blk_src);
     blk_unref(blk_target);
-    bdrv_unref(src);
+    bdrv_unref(src_overlay);
     bdrv_unref(target);
 
     if (iothread) {
@@ -977,6 +1023,19 @@  static void test_blockjob_common(enum drain_type drain_type, bool use_iothread,
     }
 }
 
+static void test_blockjob_common(enum drain_type drain_type, bool use_iothread,
+                                 enum test_job_result result)
+{
+    test_blockjob_common_drain_node(drain_type, use_iothread, result,
+                                    TEST_JOB_DRAIN_SRC);
+    test_blockjob_common_drain_node(drain_type, use_iothread, result,
+                                    TEST_JOB_DRAIN_SRC_CHILD);
+    if (drain_type == BDRV_SUBTREE_DRAIN) {
+        test_blockjob_common_drain_node(drain_type, use_iothread, result,
+                                        TEST_JOB_DRAIN_SRC_PARENT);
+    }
+}
+
 static void test_blockjob_drain_all(void)
 {
     test_blockjob_common(BDRV_DRAIN_ALL, false, TEST_JOB_SUCCESS);