
use broadcast on qemu_pause_cond

Message ID 1453716498-27238-1-git-send-email-dgilbert@redhat.com
State New

Commit Message

Dr. David Alan Gilbert Jan. 25, 2016, 10:08 a.m. UTC
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Jiri saw a hang on pause_all_vcpus called from postcopy_start,
where the cpus are all apparently stopped ('stopped' flag set)
but pause_all_vcpus is still stuck on a cond_wait on qemu_pause_cond.
We suspect this is happening if a qmp_stop is called at about the
same time as the postcopy code calls that pause_all_vcpus;
although they both should have the main lock held, Paolo spotted
the cond_wait unlocks the global lock so perhaps they both
could end up waiting at the same time?

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reported-by: Jiri Denemark <jdenemar@redhat.com>
---
 cpus.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Paolo Bonzini Jan. 25, 2016, 1:18 p.m. UTC | #1
On 25/01/2016 11:08, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Jiri saw a hang on pause_all_vcpus called from postcopy_start,
> where the cpus are all apparently stopped ('stopped' flag set)
> but pause_all_vcpus is still stuck on a cond_wait on qemu_pause_cond.
> We suspect this is happening if a qmp_stop is called at about the
> same time as the postcopy code calls that pause_all_vcpus;
> although they both should have the main lock held, Paolo spotted
> the cond_wait unlocks the global lock so perhaps they both
> could end up waiting at the same time?
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reported-by: Jiri Denemark <jdenemar@redhat.com>
> ---
>  cpus.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/cpus.c b/cpus.c
> index 3efff6b..1e97cc4 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -986,7 +986,7 @@ static void qemu_wait_io_event_common(CPUState *cpu)
>      if (cpu->stop) {
>          cpu->stop = false;
>          cpu->stopped = true;
> -        qemu_cond_signal(&qemu_pause_cond);
> +        qemu_cond_broadcast(&qemu_pause_cond);
>      }
>      flush_queued_work(cpu);
>      cpu->thread_kicked = false;
> @@ -1396,7 +1396,7 @@ void cpu_stop_current(void)
>          current_cpu->stop = false;
>          current_cpu->stopped = true;
>          cpu_exit(current_cpu);
> -        qemu_cond_signal(&qemu_pause_cond);
> +        qemu_cond_broadcast(&qemu_pause_cond);
>      }
>  }
>  
> 

Thanks, queued.

Paolo
Christian Borntraeger Jan. 26, 2016, 7:41 p.m. UTC | #2
On 01/25/2016 11:08 AM, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Jiri saw a hang on pause_all_vcpus called from postcopy_start,
> where the cpus are all apparently stopped ('stopped' flag set)
> but pause_all_vcpus is still stuck on a cond_wait on qemu_pause_cond.
> We suspect this is happening if a qmp_stop is called at about the
> same time as the postcopy code calls that pause_all_vcpus;
> although they both should have the main lock held, Paolo spotted
> the cond_wait unlocks the global lock so perhaps they both
> could end up waiting at the same time?

We have been chasing a similar problem: with many guests with lots of cpus,
thread 1 sometimes waits like this:
Thread 1 (Thread 0x3fffa670c00 (LWP 15652)):
#0  0x000003fffcdf21b2 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000008023f8f2 in qemu_cond_wait ()
#2  0x0000000080060332 in pause_all_vcpus ()
#3  0x00000000800603e8 in vm_stop ()
#4  0x00000000800f9b04 in qmp_marshal_input_stop ()
#5  0x0000000080063154 in handle_qmp_command ()
#6  0x000000008023b77e in json_message_process_token ()
#7  0x000000008024ef98 in json_lexer_feed_char ()
#8  0x000000008024f056 in json_lexer_feed ()
#9  0x0000000080061756 in monitor_qmp_read ()
#10 0x00000000800e4966 in tcp_chr_read ()
#11 0x000003fffcce3fb6 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#12 0x00000000801bd18e in main_loop_wait ()
#13 0x000000008002e244 in main ()
(gdb) 

One thread was still running inside KVM and was not being kicked out into userspace.
This might actually be the same problem. I was chasing the still-running
CPU (why it does not exit; I was able to make progress with killall -SIGUSR1
qemu), but in fact the problem might have been that thread 1 did not get
notified by the last CPUs (the notification getting lost) and therefore never
kicked this CPU out.

The problem was never reproducible with qemu 2.3, so maybe the BQL avoided the
issue?
We will test if this fixes our problem as well.

> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reported-by: Jiri Denemark <jdenemar@redhat.com>
> ---
>  cpus.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/cpus.c b/cpus.c
> index 3efff6b..1e97cc4 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -986,7 +986,7 @@ static void qemu_wait_io_event_common(CPUState *cpu)
>      if (cpu->stop) {
>          cpu->stop = false;
>          cpu->stopped = true;
> -        qemu_cond_signal(&qemu_pause_cond);
> +        qemu_cond_broadcast(&qemu_pause_cond);
>      }
>      flush_queued_work(cpu);
>      cpu->thread_kicked = false;
> @@ -1396,7 +1396,7 @@ void cpu_stop_current(void)
>          current_cpu->stop = false;
>          current_cpu->stopped = true;
>          cpu_exit(current_cpu);
> -        qemu_cond_signal(&qemu_pause_cond);
> +        qemu_cond_broadcast(&qemu_pause_cond);
>      }
>  }
>
Dr. David Alan Gilbert Jan. 26, 2016, 8:07 p.m. UTC | #3
* Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> On 01/25/2016 11:08 AM, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Jiri saw a hang on pause_all_vcpus called from postcopy_start,
> > where the cpus are all apparently stopped ('stopped' flag set)
> > but pause_all_vcpus is still stuck on a cond_wait on qemu_pause_cond.
> > We suspect this is happening if a qmp_stop is called at about the
> > same time as the postcopy code calls that pause_all_vcpus;
> > although they both should have the main lock held, Paolo spotted
> > the cond_wait unlocks the global lock so perhaps they both
> > could end up waiting at the same time?
> 
> We have been chasing a similar problem: with many guests with lots of cpus,
> thread 1 sometimes waits like this:
> Thread 1 (Thread 0x3fffa670c00 (LWP 15652)):
> #0  0x000003fffcdf21b2 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x000000008023f8f2 in qemu_cond_wait ()
> #2  0x0000000080060332 in pause_all_vcpus ()
> #3  0x00000000800603e8 in vm_stop ()
> #4  0x00000000800f9b04 in qmp_marshal_input_stop ()
> #5  0x0000000080063154 in handle_qmp_command ()
> #6  0x000000008023b77e in json_message_process_token ()
> #7  0x000000008024ef98 in json_lexer_feed_char ()
> #8  0x000000008024f056 in json_lexer_feed ()
> #9  0x0000000080061756 in monitor_qmp_read ()
> #10 0x00000000800e4966 in tcp_chr_read ()
> #11 0x000003fffcce3fb6 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> #12 0x00000000801bd18e in main_loop_wait ()
> #13 0x000000008002e244 in main ()
> (gdb) 
> 
> One thread was still running inside KVM and was not being kicked out into userspace.
> This might actually be the same problem. I was chasing the still-running
> CPU (why it does not exit; I was able to make progress with killall -SIGUSR1
> qemu), but in fact the problem might have been that thread 1 did not get
> notified by the last CPUs (the notification getting lost) and therefore never
> kicked this CPU out.

I think the patch should only have helped if there was something else trying
to do a stop at the same time; if there was only one waiter then the signal
and the broadcast should behave identically. In my case it's a race between a
'stop' issued on the monitor and a 'stop' from migration; where's the second
stop in your case?

Dave

> The problem was never reproducible with qemu 2.3, so maybe the BQL avoided the
> issue?
> We will test if this fixes our problem as well.
> 
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > Reported-by: Jiri Denemark <jdenemar@redhat.com>
> > ---
> >  cpus.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/cpus.c b/cpus.c
> > index 3efff6b..1e97cc4 100644
> > --- a/cpus.c
> > +++ b/cpus.c
> > @@ -986,7 +986,7 @@ static void qemu_wait_io_event_common(CPUState *cpu)
> >      if (cpu->stop) {
> >          cpu->stop = false;
> >          cpu->stopped = true;
> > -        qemu_cond_signal(&qemu_pause_cond);
> > +        qemu_cond_broadcast(&qemu_pause_cond);
> >      }
> >      flush_queued_work(cpu);
> >      cpu->thread_kicked = false;
> > @@ -1396,7 +1396,7 @@ void cpu_stop_current(void)
> >          current_cpu->stop = false;
> >          current_cpu->stopped = true;
> >          cpu_exit(current_cpu);
> > -        qemu_cond_signal(&qemu_pause_cond);
> > +        qemu_cond_broadcast(&qemu_pause_cond);
> >      }
> >  }
> > 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Patch

diff --git a/cpus.c b/cpus.c
index 3efff6b..1e97cc4 100644
--- a/cpus.c
+++ b/cpus.c
@@ -986,7 +986,7 @@ static void qemu_wait_io_event_common(CPUState *cpu)
     if (cpu->stop) {
         cpu->stop = false;
         cpu->stopped = true;
-        qemu_cond_signal(&qemu_pause_cond);
+        qemu_cond_broadcast(&qemu_pause_cond);
     }
     flush_queued_work(cpu);
     cpu->thread_kicked = false;
@@ -1396,7 +1396,7 @@ void cpu_stop_current(void)
         current_cpu->stop = false;
         current_cpu->stopped = true;
         cpu_exit(current_cpu);
-        qemu_cond_signal(&qemu_pause_cond);
+        qemu_cond_broadcast(&qemu_pause_cond);
     }
 }