Patchwork [RFC] main-loop: Unconditionally unlock iothread

login
register
mail settings
Submitter Peter Crosthwaite
Date April 3, 2013, 2:17 a.m.
Message ID <CAEgOgz5r5FFr6Wxu08hK+mKr3-ZohEAWSau=u8cPt=W9iZFFRw@mail.gmail.com>
Download mbox | patch
Permalink /patch/233208/
State New
Headers show

Comments

Peter Crosthwaite - April 3, 2013, 2:17 a.m.
Hi Paolo,

Thanks for the clues.

On Tue, Apr 2, 2013 at 9:11 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 02/04/2013 11:04, Peter Crosthwaite ha scritto:
>> Public bug: 1154328
>> Broken Commit: a29753f8aa79a34a324afebe340182a51a5aef11
>>
>> ATM, the timeout from g_pollfds_fill is inhibiting unlocking of the
>> iothread. This is capable of causing a total deadlock when hw/serial
>> is used as a device. The bug manifests when you go -nographic -serial
>> mon:stdio and then paste 40 or more chars into the terminal.
>>
>> My knowledge of this g_foo is vague at best, but my best working
>> theory is this:
>>
>> - First 8 chars are recieved by the serial device no complaints.
>> - The next 32 chars, serial returns false for can_receive() so they
>>       are buffered by the MuxDriver object - mux_chr_read()
>> - Buffer is full, so 41st char causes false return from Muxes own
>>       can_read()
>> - This propagates all the way up to glib_pollfds_fill and manifests
>>       as a timeout
>
> I suppose you mean "manifests as timeout==0".  The question is *which*
> GSource has a timeout of zero?

None of the GSources are to blame. The timeout actually starts as zero
as set by mainloop wait when the nonblocking flag is set. This little
hack also fixes the bug (while leaving the mutex ifery in).


Not the mux's: if mux_chr_can_read()
> returns zero, the prepare function returns FALSE without touching the
> timeout at all...
>
> static gboolean io_watch_poll_prepare(GSource *source, gint *timeout_)
> {
>     IOWatchPoll *iwp = io_watch_poll_from_source(source);
>
>     iwp->max_size = iwp->fd_can_read(iwp->opaque);
>     if (iwp->max_size == 0) {
>         return FALSE;
>     }
>
>     return g_io_watch_funcs.prepare(source, timeout_);
> }
>
>> - Timeout means no unlock of IOthread. Device land never sees any more
>>       cycles so the serial port never progresses - no flushing of
>>       buffer
>
> Still, this is plausible, so the patch looks correct.
>

Ok,

Ill give it some more time and fix commit message. if we don't figure
out a better patch.

Regards,
Peter

> Paolo
>
>> - Deadlock
>>
>> Tested on petalogix_ml605 microblazeel machine model, which was faulty
>> due to 1154328.
>>
>> Fix by removing the conditions on unlocking the iothread. Don't know
>> what else this will break but the timeout is certainly the wrong
>> condition for the unlock. Probably the real solution is to have a more
>> selective unlock policy.
>>
>> I'm happy for someone to take this patch off my hands, or educate me on
>> the correct implementation. For the peeps doing automated testing on
>> nographic platforms this will get your build working again.
>>
>> Signed-off-by: Peter Crosthwaite <peter.crosthwaite@xilinx.com>
>> ---
>>  main-loop.c |    8 ++------
>>  1 files changed, 2 insertions(+), 6 deletions(-)
>>
>> diff --git a/main-loop.c b/main-loop.c
>> index eb80ff3..a376898 100644
>> --- a/main-loop.c
>> +++ b/main-loop.c
>> @@ -194,15 +194,11 @@ static int os_host_main_loop_wait(uint32_t timeout)
>>
>>      glib_pollfds_fill(&timeout);
>>
>> -    if (timeout > 0) {
>> -        qemu_mutex_unlock_iothread();
>> -    }
>> +    qemu_mutex_unlock_iothread();
>>
>>      ret = g_poll((GPollFD *)gpollfds->data, gpollfds->len, timeout);
>>
>> -    if (timeout > 0) {
>> -        qemu_mutex_lock_iothread();
>> -    }
>> +    qemu_mutex_lock_iothread();
>>
>>      glib_pollfds_poll();
>>      return ret;
>>
>
>
Paolo Bonzini - April 3, 2013, 6:35 a.m.
> ---
> Is it expected that this non-blocking condition implies lockup of the
> iothread?

No.  The idea was to make the loop cheaper when you had a qemu_notify_event()
or bottom half, basically something that causes main_loop_wait() to wake up
immediately multiple times.  When that happens, it is cheaper to avoid
releasing and taking the mutex.

Can you check why main_loop_wait() is returning a non-zero value without
making any progress?

Paolo

> Diving deeper again, I notice that this non-blocking feature isn't
> even enabled at all for KVM. Which would probably mean that this bug
> is not replicable by anyone testing with KVM. We could just make all
> the CPU backends consistent with KVM by removing the nonblocking
> altogether. Any comments from TCG people :) ?
> 
> --- a/vl.c
> +++ b/vl.c
> @@ -2030,17 +2030,15 @@ static bool main_loop_should_exit(void)
> 
>  static void main_loop(void)
>  {
> -    bool nonblocking;
>      int last_io = 0;
>  #ifdef CONFIG_PROFILER
>      int64_t ti;
>  #endif
>      do {
> -        nonblocking = !kvm_enabled() && last_io > 0;
>  #ifdef CONFIG_PROFILER
>          ti = profile_getclock();
>  #endif
> -        last_io = main_loop_wait(nonblocking);
> +        last_io = main_loop_wait(0);
>  #ifdef CONFIG_PROFILER
>          dev_time += profile_getclock() - ti;
>  #endif
> 
> Not the mux's: if mux_chr_can_read()
> > returns zero, the prepare function returns FALSE without touching the
> > timeout at all...
> >
> > static gboolean io_watch_poll_prepare(GSource *source, gint *timeout_)
> > {
> >     IOWatchPoll *iwp = io_watch_poll_from_source(source);
> >
> >     iwp->max_size = iwp->fd_can_read(iwp->opaque);
> >     if (iwp->max_size == 0) {
> >         return FALSE;
> >     }
> >
> >     return g_io_watch_funcs.prepare(source, timeout_);
> > }
> >
> >> - Timeout means no unlock of IOthread. Device land never sees any more
> >>       cycles so the serial port never progresses - no flushing of
> >>       buffer
> >
> > Still, this is plausible, so the patch looks correct.
> >
> 
> Ok,
> 
> Ill give it some more time and fix commit message. if we don't figure
> out a better patch.
> 
> Regards,
> Peter
> 
> > Paolo
> >
> >> - Deadlock
> >>
> >> Tested on petalogix_ml605 microblazeel machine model, which was faulty
> >> due to 1154328.
> >>
> >> Fix by removing the conditions on unlocking the iothread. Don't know
> >> what else this will break but the timeout is certainly the wrong
> >> condition for the unlock. Probably the real solution is to have a more
> >> selective unlock policy.
> >>
> >> I'm happy for someone to take this patch off my hands, or educate me on
> >> the correct implementation. For the peeps doing automated testing on
> >> nographic platforms this will get your build working again.
> >>
> >> Signed-off-by: Peter Crosthwaite <peter.crosthwaite@xilinx.com>
> >> ---
> >>  main-loop.c |    8 ++------
> >>  1 files changed, 2 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/main-loop.c b/main-loop.c
> >> index eb80ff3..a376898 100644
> >> --- a/main-loop.c
> >> +++ b/main-loop.c
> >> @@ -194,15 +194,11 @@ static int os_host_main_loop_wait(uint32_t timeout)
> >>
> >>      glib_pollfds_fill(&timeout);
> >>
> >> -    if (timeout > 0) {
> >> -        qemu_mutex_unlock_iothread();
> >> -    }
> >> +    qemu_mutex_unlock_iothread();
> >>
> >>      ret = g_poll((GPollFD *)gpollfds->data, gpollfds->len, timeout);
> >>
> >> -    if (timeout > 0) {
> >> -        qemu_mutex_lock_iothread();
> >> -    }
> >> +    qemu_mutex_lock_iothread();
> >>
> >>      glib_pollfds_poll();
> >>      return ret;
> >>
> >
> >
>
Peter Crosthwaite - April 3, 2013, 11:58 p.m.
Hi Paolo,

On Wed, Apr 3, 2013 at 4:35 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>> ---
>> Is it expected that this non-blocking condition implies lockup of the
>> iothread?
>
> No.  The idea was to make the loop cheaper when you had a qemu_notify_event()
> or bottom half, basically something that causes main_loop_wait() to wake up
> immediately multiple times.  When that happens, it is cheaper to avoid
> releasing and taking the mutex.
>
> Can you check why main_loop_wait() is returning a non-zero value without
> making any progress?
>
> Paolo
>

My best theory is that stdin thinks it is progressing when it is not.

Heres some gdb to back it up. I hung the system with the 40+paste and
dumped the state of the polled fds (from os_main_loop_wait()) at that
time. 4th array element (stdin) has a revent which makes sense, as
stdin is buffering my extra chars so a poll is going to return with
that descriptor as readable. g_poll returns 1 accordingly and this is
returned, propagating back up to vl.c as a non-zero return code giving
the illusion of progress.

I think there may be a flaw in that "any of the descriptors being
pollable" is not a good definition of progress. stdin is blocked by
the fact that the device and mux cannot accept their data anymore so
even though its readable, no meaningful read will happen. That leaves
us with having to devise more elaborate code to define progress, or we
simplify by just removing this nonblocking optimisation altogether
(original patch).

Regards,
Peter

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000721353 in poll ()
(gdb) jhfdg;fdhg;dfgkdfg;dfjhg;dsfjhgds;fjgsd;fjhdjhdfg^C(gdb)
(gdb) bt
#0  0x0000000000721353 in poll ()
#1  0x0000000000492ec3 in os_host_main_loop_wait (timeout=0)
    at /home/peterc/Petalogix/Internal/plgx_src/qemu-arm-microblazeel/main-loop.c:201
#2  0x0000000000492f43 in main_loop_wait (nonblocking=1)
    at /home/peterc/Petalogix/Internal/plgx_src/qemu-arm-microblazeel/main-loop.c:443
#3  0x00000000005032c5 in main_loop () at
/home/peterc/Petalogix/Internal/plgx_src/qemu-arm-microblazeel/vl.c:2043
#4  0x0000000000509783 in main (argc=20, argv=0x7fffffffcd68,
envp=0x7fffffffce10)
    at /home/peterc/Petalogix/Internal/plgx_src/qemu-arm-microblazeel/vl.c:4429
(gdb) up 1
#1  0x0000000000492ec3 in os_host_main_loop_wait (timeout=0)
    at /home/peterc/Petalogix/Internal/plgx_src/qemu-arm-microblazeel/main-loop.c:201
201	    ret = g_poll((GPollFD *)gpollfds->data, gpollfds->len, timeout);
(gdb) step
Single stepping until exit from function poll,
which has no line number information.
os_host_main_loop_wait (timeout=0) at
/home/peterc/Petalogix/Internal/plgx_src/qemu-arm-microblazeel/main-loop.c:203
203	    if (timeout > 0) {
(gdb) p/x gpollfds->len
$1 = 0x6
(gdb) p/x ((GPollFD *)gpollfds->data)[0]
$2 = {fd = 0x9, events = 0x19, revents = 0x0}
(gdb) p/x ((GPollFD *)gpollfds->data)[1]
$3 = {fd = 0x5, events = 0x19, revents = 0x0}
(gdb) p/x ((GPollFD *)gpollfds->data)[2]
$4 = {fd = 0x7, events = 0x1, revents = 0x0}
(gdb) p/x ((GPollFD *)gpollfds->data)[3]
$5 = {fd = 0xa, events = 0x1, revents = 0x0}
(gdb) p/x ((GPollFD *)gpollfds->data)[4]
$6 = {fd = 0x0, events = 0x1, revents = 0x1}
(gdb) p/x ((GPollFD *)gpollfds->data)[5]
$7 = {fd = 0x6, events = 0x1, revents = 0x0}
Paolo Bonzini - April 4, 2013, 5:44 a.m.
Il 04/04/2013 01:58, Peter Crosthwaite ha scritto:
> 
> I think there may be a flaw in that "any of the descriptors being
> pollable" is not a good definition of progress. stdin is blocked by
> the fact that the device and mux cannot accept their data anymore so
> even though its readable, no meaningful read will happen. That leaves
> us with having to devise more elaborate code to define progress, or we
> simplify by just removing this nonblocking optimisation altogether
> (original patch).

If stdin is blocked, it shouldn't be polled at all.  That is the purpose
of the can_read callback.  Unfortunately, return FALSE from the prepare
callback still leaves the poll handler.

So your original patch fixes the symptom, but leaves the busy waiting
unfixed.

The right thing to use would be g_source_add_child_source() and
g_source_remove_child_source(), but that is only present since glib 2.28
and we currently require 2.12 (2.20 on Windows).

Anthony, Amit, can you look at it?

Paolo
Anthony Liguori - April 4, 2013, 1:49 p.m.
Paolo Bonzini <pbonzini@redhat.com> writes:

> Il 04/04/2013 01:58, Peter Crosthwaite ha scritto:
>> 
>> I think there may be a flaw in that "any of the descriptors being
>> pollable" is not a good definition of progress. stdin is blocked by
>> the fact that the device and mux cannot accept their data anymore so
>> even though its readable, no meaningful read will happen. That leaves
>> us with having to devise more elaborate code to define progress, or we
>> simplify by just removing this nonblocking optimisation altogether
>> (original patch).
>
> If stdin is blocked, it shouldn't be polled at all.  That is the purpose
> of the can_read callback.  Unfortunately, return FALSE from the prepare
> callback still leaves the poll handler.
>
> So your original patch fixes the symptom, but leaves the busy waiting
> unfixed.
>
> The right thing to use would be g_source_add_child_source() and
> g_source_remove_child_source(), but that is only present since glib 2.28
> and we currently require 2.12 (2.20 on Windows).
>
> Anthony, Amit, can you look at it?

Ack.

May not be until tomorrow but I'll try to dig in here.

Regards,

Anthony Liguori

>
> Paolo
Anthony Liguori - April 4, 2013, 4:59 p.m.
Paolo Bonzini <pbonzini@redhat.com> writes:

> Il 04/04/2013 01:58, Peter Crosthwaite ha scritto:
>> 
>> I think there may be a flaw in that "any of the descriptors being
>> pollable" is not a good definition of progress. stdin is blocked by
>> the fact that the device and mux cannot accept their data anymore so
>> even though its readable, no meaningful read will happen. That leaves
>> us with having to devise more elaborate code to define progress, or we
>> simplify by just removing this nonblocking optimisation altogether
>> (original patch).
>
> If stdin is blocked, it shouldn't be polled at all.  That is the purpose
> of the can_read callback.  Unfortunately, return FALSE from the prepare
> callback still leaves the poll handler.
>
> So your original patch fixes the symptom, but leaves the busy waiting
> unfixed.
>
> The right thing to use would be g_source_add_child_source() and
> g_source_remove_child_source(), but that is only present since glib 2.28
> and we currently require 2.12 (2.20 on Windows).

I don't think a child source fixes the problem.  The backend definitely
has work to do.  What we don't know is whether the front end is capable
of processing the work.

The problem here is that we use polling on the front-end.  IOW:

1) Char backend has data and is ready to write.
2) Asks front end if it can write
3) Front end says no
4) Goto (1)

It does this without dropping the BQL which means that TCG never gets to
run.

Ideally, we would do:

1) Char backend has data and is ready to write.
2) Asks front end if it can write
3) Backend stops asking front end until front end says it can receive
   data again
4) Drop BQL because we have no more work to do.

But this requires eliminating all users of can_read() which isn't going
to happen any time soon.

The solution therefore needs to be:

1) Char backend has data and is ready to write.
2) Asks front end if it can write
3) Front end says no
3.5) Drop the BQL and give the VCPU a chance to run
4) Goto (1)

Which is essentially what Peter's patch does.  In fact, you can imagine
other scenarios where a very busy VNC client essentially blocks the
BQL.  We need to regularly give up the BQL in order to prevent
starvation regardless of the polling in the char layer.

So I think this is a long way of saying:

Reviewed-by: Anthony Liguori <aliguori@us.ibm.com>

Regards,

Anthony Liguori

>
> Anthony, Amit, can you look at it?
>
> Paolo
Peter Maydell - April 4, 2013, 5:03 p.m.
On 4 April 2013 17:59, Anthony Liguori <aliguori@us.ibm.com> wrote:
> So I think this is a long way of saying:
>
> Reviewed-by: Anthony Liguori <aliguori@us.ibm.com>

Any chance we could update the commit message to include
this more authoritative analysis?

thanks
-- PMM
Anthony Liguori - April 4, 2013, 6:17 p.m.
Peter Maydell <peter.maydell@linaro.org> writes:

> On 4 April 2013 17:59, Anthony Liguori <aliguori@us.ibm.com> wrote:
>> So I think this is a long way of saying:
>>
>> Reviewed-by: Anthony Liguori <aliguori@us.ibm.com>
>
> Any chance we could update the commit message to include
> this more authoritative analysis?

Yes, please do.  I'm also not sure that just always dropping the lock is
the best strategy either.

I think a simple counter with a nice comment that explains why we need
to periodically drop the lock is a better solution as it gives a way to
experiment with the right value to ensure fairness between the threads
without having excessing lock acquisition/release.

Regards,

Anthony Liguori

>
> thanks
> -- PMM
Paolo Bonzini - April 4, 2013, 6:57 p.m.
Il 04/04/2013 18:59, Anthony Liguori ha scritto:
>> >
>> > The right thing to use would be g_source_add_child_source() and
>> > g_source_remove_child_source(), but that is only present since glib 2.28
>> > and we currently require 2.12 (2.20 on Windows).
> I don't think a child source fixes the problem.  The backend definitely
> has work to do.  What we don't know is whether the front end is capable
> of processing the work.
> 
> The problem here is that we use polling on the front-end.  IOW:
> 
> 1) Char backend has data and is ready to write.
> 2) Asks front end if it can write
> 3) Front end says no
> 4) Goto (1)

What we used to do is this, however:

1) Char backend asks front end if it can write
2) Front end says no
3) poll() goes on without char backend's descriptor
4) Goto (1), then:
5) Front end is now available, calls qemu_notify_event()
6) Char backend asks front end if it can write
7) Front end says yes

No busy waiting here.  Though we don't really do (5) everywhere, because
for example this patch slipped through the cracks:
http://lists.gnu.org/archive/html/qemu-devel/2012-03/msg03208.html (but
for the monitor and every consumer running in iothread context it's ok).

You could achieve the same thing by making the io watch a child source
of the QEMU wrapper.  All the wrapper does is add/remove the io watch
source from its children, depending on can_read.

Perhaps it's possible without child sources though, by setting the
callbacks on the glib source directly.

Paolo
Anthony Liguori - April 4, 2013, 7:54 p.m.
Paolo Bonzini <pbonzini@redhat.com> writes:

> Il 04/04/2013 18:59, Anthony Liguori ha scritto:
>>> >
>>> > The right thing to use would be g_source_add_child_source() and
>>> > g_source_remove_child_source(), but that is only present since glib 2.28
>>> > and we currently require 2.12 (2.20 on Windows).
>> I don't think a child source fixes the problem.  The backend definitely
>> has work to do.  What we don't know is whether the front end is capable
>> of processing the work.
>> 
>> The problem here is that we use polling on the front-end.  IOW:
>> 
>> 1) Char backend has data and is ready to write.
>> 2) Asks front end if it can write
>> 3) Front end says no
>> 4) Goto (1)
>
> What we used to do is this, however:
>
> 1) Char backend asks front end if it can write
> 2) Front end says no

You're missing a step here.  Previously we would drop the data silently.

> 3) poll() goes on without char backend's descriptor

Which is what enabled this.

> 4) Goto (1), then:
> 5) Front end is now available, calls qemu_notify_event()

This still exists FWIW.

> 6) Char backend asks front end if it can write
> 7) Front end says yes
>
> No busy waiting here.  Though we don't really do (5) everywhere, because
> for example this patch slipped through the cracks:

Ack.

> http://lists.gnu.org/archive/html/qemu-devel/2012-03/msg03208.html (but
> for the monitor and every consumer running in iothread context it's ok).
>
> You could achieve the same thing by making the io watch a child source
> of the QEMU wrapper.  All the wrapper does is add/remove the io watch
> source from its children, depending on can_read.

I don't understand how this solves the problem.  There aren't two
sources.  There is only a single source (the GIOChannel source).
We override the function pointers to basically do monkey patching of the
GSource.

If I understand your suggestion, it's to add the source as a child of
itself.

>
> Perhaps it's possible without child sources though, by setting the
> callbacks on the glib source directly.

That's what we're doing...

    src = g_io_create_watch(channel, G_IO_IN | G_IO_ERR | G_IO_HUP);
    g_source_set_funcs(src, &io_watch_poll_funcs);
    g_source_set_callback(src, (GSourceFunc)fd_read, user_data, NULL);

There is only one GSource active here.

Regards,

Anthony Liguori

> Paolo

Patch

--- a/main-loop.c
+++ b/main-loop.c
@@ -428,10 +428,6 @@  int main_loop_wait(int nonblocking)
     int ret;
     uint32_t timeout = UINT32_MAX;

-    if (nonblocking) {
-        timeout = 0;
-    }
-
     /* poll any events */
     g_array_set_size(gpollfds, 0); /* reset for new iteration */
     /* XXX: separate device handlers from system ones */

---
Is it expected that this non-blocking condition implies lockup of the iothread?

Diving deeper again, I notice that this non-blocking feature isn't
even enabled at all for KVM. Which would probably mean that this bug
is not replicable by anyone testing with KVM. We could just make all
the CPU backends consistent with KVM by removing the nonblocking
altogether. Any comments from TCG people :) ?

--- a/vl.c
+++ b/vl.c
@@ -2030,17 +2030,15 @@  static bool main_loop_should_exit(void)

 static void main_loop(void)
 {
-    bool nonblocking;
     int last_io = 0;
 #ifdef CONFIG_PROFILER
     int64_t ti;
 #endif
     do {
-        nonblocking = !kvm_enabled() && last_io > 0;
 #ifdef CONFIG_PROFILER
         ti = profile_getclock();
 #endif
-        last_io = main_loop_wait(nonblocking);
+        last_io = main_loop_wait(0);
 #ifdef CONFIG_PROFILER
         dev_time += profile_getclock() - ti;
 #endif