diff mbox

main-loop: Acquire main_context lock around os_host_main_loop_wait.

Message ID 20170331205133.23906-1-rjones@redhat.com
State New
Headers show

Commit Message

Richard W.M. Jones March 31, 2017, 8:51 p.m. UTC
When running virt-rescue the serial console hangs from time to time.
Virt-rescue runs an ordinary Linux kernel "appliance", but there is
only a single idle process running inside, so the qemu main loop is
largely idle.  With virt-rescue >= 1.37 you may be able to observe the
hang by doing:

  $ virt-rescue -e ^] --scratch
  ><rescue> while true; do ls -l /usr/bin; done

The hang in virt-rescue can be resolved by pressing a key on the
serial console.

Possibly with the same root cause, we also observed hangs during very
early boot of regular Linux VMs with a serial console.  Those hangs
are extremely rare, but you may be able to observe them by running
this command on baremetal for a sufficiently long time:

  $ while libguestfs-test-tool -t 60 >& /tmp/log ; do echo -n . ; done

(Check in /tmp/log that the failure was caused by a hang during early
boot, and not some other reason)

During investigation of this bug, Paolo Bonzini wrote:

> glib is expecting QEMU to use g_main_context_acquire around accesses to
> GMainContext.  However QEMU is not doing that, instead it is taking its
> own mutex.  So we should add g_main_context_acquire and
> g_main_context_release in the two implementations of
> os_host_main_loop_wait; these should undo the effect of Frediano's
> glib patch.

This patch exactly implements Paolo's suggestion in that paragraph.

This fixes the serial console hang in my testing, across 3 different
physical machines (AMD, Intel Core i7 and Intel Xeon), over many hours
of automated testing.  I wasn't able to reproduce the early boot hangs
(but as noted above, these are extremely rare in any case).

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1435432
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
---
 util/main-loop.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

Comments

Richard W.M. Jones April 1, 2017, 12:16 p.m. UTC | #1
On Fri, Mar 31, 2017 at 09:51:33PM +0100, Richard W.M. Jones wrote:
> When running virt-rescue the serial console hangs from time to time.
> Virt-rescue runs an ordinary Linux kernel "appliance", but there is
> only a single idle process running inside, so the qemu main loop is
> largely idle.  With virt-rescue >= 1.37 you may be able to observe the
> hang by doing:
> 
>   $ virt-rescue -e ^] --scratch
>   ><rescue> while true; do ls -l /usr/bin; done

I know that people have had problems reproducing this bug.

It's easily reproducible, *if* you have the newest version of
virt-rescue.  Both the old and new versions of virt-rescue run qemu
with `-serial stdio', but in the old version the qemu chardev was
literally connected to stdio (the tty), and in the new version
virt-rescue has its own poll loop connected to the qemu chardev.  For
some reason this changes the scheduling in some way to make the bug
much more evident.

Unfortunately the new version was only available in Fedora Rawhide.

So what I have done for Fedora users is to backport the new
virt-rescue to the version of libguestfs available in Fedora 25 and
Fedora 26, in these versions:

  Fedora 25:
  libguestfs-tools-c >= 1.36.3-2.fc25
  https://bodhi.fedoraproject.org/updates/FEDORA-2017-0a9af45644
  or https://koji.fedoraproject.org/koji/taskinfo?taskID=18714553

  Fedora 26:
  libguestfs-tools-c >= 1.36.3-2.fc26
  https://bodhi.fedoraproject.org/updates/?packages=libguestfs (coming)
  or https://koji.fedoraproject.org/koji/taskinfo?taskID=18714548

You can tell if you have the old or new version of virt-rescue: The
new version has new options `-e' and `-i' (these were not present in
the old version), and also prints the following message when starting up:

  The virt-rescue escape key is ‘^]’.  Type ‘^] h’ for help.

After ensuring you have the new version of virt-rescue installed,
reproducing the bug should be as simple as:

  $ virt-rescue --scratch
  ><rescue> while true; do ls -l /usr/bin; done

If you want to run virt-rescue against a different version of qemu:

  $ LIBGUESTFS_BACKEND=direct LIBGUESTFS_HV=/path/to/qemu \
        virt-rescue --scratch

HTH,

Rich.
diff mbox

Patch

diff --git a/util/main-loop.c b/util/main-loop.c
index 4534c89..19cad6b 100644
--- a/util/main-loop.c
+++ b/util/main-loop.c
@@ -218,9 +218,12 @@  static void glib_pollfds_poll(void)
 
 static int os_host_main_loop_wait(int64_t timeout)
 {
+    GMainContext *context = g_main_context_default();
     int ret;
     static int spin_counter;
 
+    g_main_context_acquire(context);
+
     glib_pollfds_fill(&timeout);
 
     /* If the I/O thread is very busy or we are incorrectly busy waiting in
@@ -256,6 +259,9 @@  static int os_host_main_loop_wait(int64_t timeout)
     }
 
     glib_pollfds_poll();
+
+    g_main_context_release(context);
+
     return ret;
 }
 #else
@@ -412,12 +418,15 @@  static int os_host_main_loop_wait(int64_t timeout)
     fd_set rfds, wfds, xfds;
     int nfds;
 
+    g_main_context_acquire(context);
+
     /* XXX: need to suppress polling by better using win32 events */
     ret = 0;
     for (pe = first_polling_entry; pe != NULL; pe = pe->next) {
         ret |= pe->func(pe->opaque);
     }
     if (ret != 0) {
+        g_main_context_release(context);
         return ret;
     }
 
@@ -472,6 +481,8 @@  static int os_host_main_loop_wait(int64_t timeout)
         g_main_context_dispatch(context);
     }
 
+    g_main_context_release(context);
+
     return select_ret || g_poll_ret;
 }
 #endif