
main-loop: Don't lock starve io-threads when main_loop_tlg has pending events

Message ID 1381259403-7386-1-git-send-email-hdegoede@redhat.com
State New

Commit Message

Hans de Goede Oct. 8, 2013, 7:10 p.m. UTC
I noticed today that current qemu master would hang as soon as Xorg starts in
the guest when using qxl + a Linux guest. This message would be printed:
main-loop: WARNING: I/O thread spun for 1000 iterations

And from then on the guest hangs and qemu consumes 100% cpu, bisecting pointed
out commit 7b595f35d89d73bc69c35bf3980a89c420e8a44b:
"aio / timers: Convert mainloop to use timeout"

After looking at that commit I had a hunch the problem might be blocking
main_loop_wait calls being turned into non-blocking ones (and thus never
releasing the io-lock), a debug printf confirmed this was happening at
the moment of the hang, so I wrote this patch which fixes the hang for me
and seems like a good idea in general.

Signed-off-by: Hans de Goede <hdegoede@redhat.com>
---
 main-loop.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Paolo Bonzini Oct. 8, 2013, 7:13 p.m. UTC | #1
On 08/10/2013 21:10, Hans de Goede wrote:
> @@ -480,6 +480,11 @@ int main_loop_wait(int nonblocking)
>                                        timerlistgroup_deadline_ns(
>                                            &main_loop_tlg));
>  
> +    /* When not non-blocking always allow io-threads to acquire the lock */
> +    if (timeout != 0 && timeout_ns == 0) {
> +        timeout_ns = 1;
> +    }
> +

This _is_ an I/O thread, so I guess this should be changed to "other
threads".  Also, perhaps timeout_ns can be changed to a higher value
such as 250 us that were used up to the offending commit?

Otherwise looks good!

Paolo
Hans de Goede Oct. 8, 2013, 7:21 p.m. UTC | #2
Hi,

On 10/08/2013 09:13 PM, Paolo Bonzini wrote:
> On 08/10/2013 21:10, Hans de Goede wrote:
>> @@ -480,6 +480,11 @@ int main_loop_wait(int nonblocking)
>>                                         timerlistgroup_deadline_ns(
>>                                             &main_loop_tlg));
>>
>> +    /* When not non-blocking always allow io-threads to acquire the lock */
>> +    if (timeout != 0 && timeout_ns == 0) {
>> +        timeout_ns = 1;
>> +    }
>> +
>
> This _is_ an I/O thread, so I guess this should be changed to "other
> threads".

Ok.

> Also, perhaps timeout_ns can be changed to a higher value
> such as 250 us that were used up to the offending commit?

Wasn't it 1 ms until the offending commit (note 250 us does
sound better to me). Moreover, I wonder how useful this is; if threads
are waiting for the lock at this point, they should all get it and do
work (and then release it) before this thread will be able to re-acquire it.

The only case I can see where going to sleep will help is when a
thread takes the lock, does something, releases it, then does something
else quite quickly (so within 250 us), and then tries to re-acquire the lock
to do more work.

Note that either solution (1 ns versus 250 us) is fine with me; I'm by no means
the expert on this. Just let me know which one you think is better and I'll
do a v2.

Regards,

Hans
Alex Bligh Oct. 8, 2013, 7:33 p.m. UTC | #3
On 8 Oct 2013, at 20:21, Hans de Goede wrote:

> Wasn't it 1 ms until the offending commit (note 250 us does
> sound better to me).

I believe you've made it 1 nanosecond not 1 millisecond;
can that be right?
Hans de Goede Oct. 8, 2013, 7:41 p.m. UTC | #4
Hi,

On 10/08/2013 09:33 PM, Alex Bligh wrote:
>
> On 8 Oct 2013, at 20:21, Hans de Goede wrote:
>
>> Wasn't it 1 ms until the offending commit (note 250 us does
>> sound better to me).
>
> I believe you've made it 1 nanosecond not 1 millisecond;

Correct, the 1 ms I referred to was before your commit which changed
things from ms to ns.

> can that be right?

The purpose of the 1 ns timeout is to cause os_host_main_loop_wait
to unlock the iothread; as $subject says, the problem I'm seeing seems
to be lock starvation, not cpu starvation.
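
(For reference, the relevant part of os_host_main_loop_wait is roughly the
following; a simplified sketch only, with the spin-counter warning and the
glib poll setup elided:)

    if (timeout) {
        spin_counter = 0;
        qemu_mutex_unlock_iothread();   /* non-zero timeout: release the io-lock */
    } else {
        spin_counter++;                 /* zero timeout: keep the io-lock held */
    }

    ret = qemu_poll_ns((GPollFD *)gpollfds->data, gpollfds->len, timeout);

    if (timeout) {
        qemu_mutex_lock_iothread();
    }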

Note, as I already indicated, I'm in no way an expert in this; if you
and/or Paolo suspect cpu starvation may happen too, then bumping
the timeout to 250 us is fine with me too.

If we go with 250 us, that does pose the question of whether we should
always keep a minimum timeout of 250 us when not non-blocking, or only
bump it to 250 us when main_loop_tlg has already expired events and
is thus causing a timeout of 0.
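
(In code, the two options would look roughly like the sketches below; these
are illustrations only, not actual patches:)

    /* Option 1: always keep a minimum timeout of 250 us when blocking
     * (leave -1, i.e. "wait forever", alone) */
    if (!nonblocking && timeout_ns >= 0 && timeout_ns < 250 * SCALE_US) {
        timeout_ns = 250 * SCALE_US;
    }

    /* Option 2: only bump the timeout when expired timers force it to 0 */
    if (!nonblocking && timeout_ns == 0) {
        timeout_ns = 250 * SCALE_US;
    }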

Regards,

Hans
Alex Bligh Oct. 8, 2013, 7:48 p.m. UTC | #5
On 8 Oct 2013, at 20:10, Hans de Goede wrote:

> I noticed today that current qemu master would hang as soon as Xorg starts in
> the guest when using qxl + a Linux guest. This message would be printed:
> main-loop: WARNING: I/O thread spun for 1000 iterations
> 
> And from then on the guest hangs and qemu consumes 100% cpu, bisecting pointed
> out commit 7b595f35d89d73bc69c35bf3980a89c420e8a44b:
> "aio / timers: Convert mainloop to use timeout"
> 
> After looking at that commit I had a hunch the problem might be blocking
> main_loop_wait calls being turned into non-blocking ones (and thus never
> releasing the io-lock), a debug printf confirmed this was happening at
> the moment of the hang, so I wrote this patch which fixes the hang for me
> and seems like a good idea in general.
> 
> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
> ---
> main-loop.c | 5 +++++
> 1 file changed, 5 insertions(+)
> 
> diff --git a/main-loop.c b/main-loop.c
> index c3c9c28..921c939 100644
> --- a/main-loop.c
> +++ b/main-loop.c
> @@ -480,6 +480,11 @@ int main_loop_wait(int nonblocking)
>                                       timerlistgroup_deadline_ns(
>                                           &main_loop_tlg));
> 
> +    /* When not non-blocking always allow io-threads to acquire the lock */
> +    if (timeout != 0 && timeout_ns == 0) {
> +        timeout_ns = 1;
> +    }
> +
>     ret = os_host_main_loop_wait(timeout_ns);
>     qemu_iohandler_poll(gpollfds, ret);
> #ifdef CONFIG_SLIRP

I /think/ you might mean "if (!blocking && timeout_ns == 0)"
as timeout can be zero on a blocking call at this stage (i.e.
when there is a timer which has already expired).

I'm not entirely sure I understand the problem from your
description - I'll answer this in your subsequent message.
Hans de Goede Oct. 8, 2013, 8:01 p.m. UTC | #6
Hi,

On 10/08/2013 09:48 PM, Alex Bligh wrote:
>
> On 8 Oct 2013, at 20:10, Hans de Goede wrote:
>
>> I noticed today that current qemu master would hang as soon as Xorg starts in
>> the guest when using qxl + a Linux guest. This message would be printed:
>> main-loop: WARNING: I/O thread spun for 1000 iterations
>>
>> And from then on the guest hangs and qemu consumes 100% cpu, bisecting pointed
>> out commit 7b595f35d89d73bc69c35bf3980a89c420e8a44b:
>> "aio / timers: Convert mainloop to use timeout"
>>
>> After looking at that commit I had a hunch the problem might be blocking
>> main_loop_wait calls being turned into non-blocking ones (and thus never
>> releasing the io-lock), a debug printf confirmed this was happening at
>> the moment of the hang, so I wrote this patch which fixes the hang for me
>> and seems like a good idea in general.
>>
>> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
>> ---
>> main-loop.c | 5 +++++
>> 1 file changed, 5 insertions(+)
>>
>> diff --git a/main-loop.c b/main-loop.c
>> index c3c9c28..921c939 100644
>> --- a/main-loop.c
>> +++ b/main-loop.c
>> @@ -480,6 +480,11 @@ int main_loop_wait(int nonblocking)
>>                                        timerlistgroup_deadline_ns(
>>                                            &main_loop_tlg));
>>
>> +    /* When not non-blocking always allow io-threads to acquire the lock */
>> +    if (timeout != 0 && timeout_ns == 0) {
>> +        timeout_ns = 1;
>> +    }
>> +
>>      ret = os_host_main_loop_wait(timeout_ns);
>>      qemu_iohandler_poll(gpollfds, ret);
>> #ifdef CONFIG_SLIRP
>
> I /think/ you might mean "if (!blocking && timeout_ns == 0)"
> as timeout can be zero on a blocking call at this stage (i.e.
> when there is a timer which has already expired.

timeout does not get modified, except by the slirp-stuff, which
won't ever set it to 0, I think. But you're right, writing

    if (!nonblocking && timeout_ns == 0)

would be much clearer, so I'll do that in v2 of the patch.
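
(So the hunk in v2 would presumably end up something like the sketch below,
also folding in Paolo's suggested comment wording; not the actual v2 patch:)

    /* When not nonblocking, always allow other threads to acquire the io-lock */
    if (!nonblocking && timeout_ns == 0) {
        timeout_ns = 1;
    }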

Regards,

Hans
Alex Bligh Oct. 8, 2013, 8:01 p.m. UTC | #7
On 8 Oct 2013, at 20:41, Hans de Goede wrote:

>>> 
>>> Wasn't it 1 ms until the offending commit (note 250 us does
>>> sound better to me).
>> 
>> I believe you've made it 1 nanosecond not 1 millisecond;
> 
> Correct, the 1 ms I referred to was before your commit which changed
> things from ms to ns.

OK I was looking at the patch as it would apply to master now.

> The purpose of the 1 ns timeout is to cause os_host_main_loop_wait
> to unlock the iothread, as $subject says the problem I'm seeing seems
> to be lock starvation not cpu starvation.
> 
> Note as I already indicated I'm in no way an expert in this, if you
> and or Paolo suspect cpu starvation may happen too, then bumping
> the timeout to 250 us is fine with me too.
> 
> If we go with 250 us that thus pose the question though if we should
> always keep a minimum timeout of 250 us when not non-blocking, or only
> bump it to 250 us when main_loop_tlg has already expired events and
> thus is causing a timeout of 0.

I am by no means an expert in the iothread bit, so let's pool our
ignorance ... :-) 

Somewhere within that patch series (7b595f35 I think) I fixed up
the spin counter bit, which made it slightly less yucky and work
with milliseconds. I hope I didn't break it but there seems
something slightly odd about the use case here.

If you are getting the spin error, this implies something is
pretty much constantly polling os_host_main_loop_wait with a
zero timeout. As you point out this is going to be main_loop_wait
and almost certainly main_loop_wait called with nonblocking
set to 1.

The comment at line 208 suggests that "the I/O thread is very busy
or we are incorrectly busy waiting in the I/O thread". Do we know
which is happening? Perhaps rather than give up the io_thread
mutex on every call (which is in practice what a 1 nanosecond
timeout does) we should give it up if we have not released
it for X nanoseconds (maybe X=250us), or on every Y calls. I think
someone other than me should consider the effect of dropping and
reacquiring a mutex so frequently under heavy I/O load, but I'm not
sure it's a great idea.

So on reflection you might be more right with 1 nanosecond than
with a timeout of 250 us, but I wonder whether a strategy
of just dropping the lock occasionally (and still using a zero
timeout) might be better.
Alex Bligh Oct. 8, 2013, 8:07 p.m. UTC | #8
On 8 Oct 2013, at 21:01, Alex Bligh wrote:

> Somewhere within that patch series (7b595f35 I think) I fixed up
> the spin counter bit, which made it slightly less yucky and work
> with milliseconds.

"with nanoseconds rather than microseconds" - oops
Hans de Goede Oct. 8, 2013, 8:16 p.m. UTC | #9
Hi,

On 10/08/2013 10:01 PM, Alex Bligh wrote:

<snip>

>> The purpose of the 1 ns timeout is to cause os_host_main_loop_wait
>> to unlock the iothread, as $subject says the problem I'm seeing seems
>> to be lock starvation not cpu starvation.
>>
>> Note as I already indicated I'm in no way an expert in this, if you
>> and or Paolo suspect cpu starvation may happen too, then bumping
>> the timeout to 250 us is fine with me too.
>>
>> If we go with 250 us that thus pose the question though if we should
>> always keep a minimum timeout of 250 us when not non-blocking, or only
>> bump it to 250 us when main_loop_tlg has already expired events and
>> thus is causing a timeout of 0.
>
> I am by no means an expert in the iothread bit, so let's pool our
> ignorance ... :-)
>
> Somewhere within that patch series (7b595f35 I think) I fixed up
> the spin counter bit, which made it slightly less yucky and work
> with milliseconds. I hope I didn't break it but there seems
> something slightly odd about the use case here.
>
> If you are getting the spin error, this implies something is
> pretty much constantly polling os_host_main_loop_wait with a
> zero timeout. As you point out this is going to be main_loop_wait
> and almost certainly main_loop_wait called with nonblocking
> set to 1.

No, it is calling main_loop_wait with nonblocking set to 0, so
normally the lock would get released. But
timerlistgroup_deadline_ns(&main_loop_tlg) is returning 0,
causing timeout_ns to be 0, and this causes the lock to not get
released.

I'm quite sure this is what is happening, because once my
bisect pointed to the "aio / timers: Convert mainloop to use timeout"
commit as a culprit, I read that commit very carefully multiple
times and that seemed like the only problem it could cause,
so I added a debug printf to test for that case, and it triggered.

What I believe is happening in my troublesome scenario is that one
thread is calling main_loop_wait(0) repeatedly, waiting for another
thread to do some work (*), but that other thread is not getting a
chance to do that work because the iothread never gets unlocked.

*) likely the spice-server thread which does a lot of work for
the qxl device


>
> The comment at line 208 suggests that "the I/O thread is very busy
> or we are incorrectly busy waiting in the I/O thread". Do we know
> which is happening? Perhaps rather than give up the io_thread
> mutex on every call (which is in practice what a 1 nanosecond
> timeout does) we should give it up if we have not released
> it for X nanoseconds (maybe X=250us), or on every Y calls. I think
> someone other than me should consider the effect of dropping and
> reacquiring a mutex so frequently under heavy I/O load, but I'm not
> sure it's a great idea.

We're only waiting that short a time if there are timers which want to run
immediately; normally we would wait a lot longer.

> So on reflection you might be more right with 1 nanosecond than
> 250us as a timeout of 250us, but I wonder whether a strategy
> of just dropping the lock occasionally (and still using a zero
> timeout) might be better.

Paolo probably has some better insights on this, but he seems to
have called it a day for today, and I'm going to do the same :)

So let's continue this tomorrow.

Regards,

Hans
Alex Bligh Oct. 8, 2013, 8:32 p.m. UTC | #10
Hans,

On 8 Oct 2013, at 21:16, Hans de Goede wrote:

> No, it is calling main_loop_wait with nonblocking set to 0, so
> normally the lock would get released. But
> timerlistgroup_deadline_ns(&main_loop_tlg) is returning 0,
> causing timeout_ns to be 0, and this causes the lock to not get
> released.

OK so my question is *why* is timerlistgroup_deadline_ns
returning 0 every time? That should only ever happen if there is an
expired timer every time timerlistgroup_deadline_ns is called.
Given the spin warning only prints (or is only meant to print)
if this happens 1,000 times consecutively, this implies to me
that something is setting a timer wrongly, as timers should
not constantly expire.

I would agree that this changeset may have introduced the symptom
(because before we didn't generate a zero timeout this way, we
broke out of the select loop), but I'd love to know WHY there
is an expired timer each time and what that timer is. Perhaps
something is using a timeout value in milliseconds not realising it
is nanoseconds (or similar).

My worry is that setting 1ns may be just hiding a bug here.

We could relatively easily store __FILE__ and __LINE__ when
timers are created to make it easier to track this sort of thing
(perhaps protected with an ifdef).
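
(Purely to illustrate that idea, with invented names that do not exist in
QEMU, it could look something like:)

    #ifdef DEBUG_TIMER_ORIGIN
    /* Hypothetical: record where a timer was created, for debugging only. */
    typedef struct TimerOrigin {
        const char *file;
        int line;
    } TimerOrigin;

    #define TIMER_ORIGIN_HERE(origin)      \
        do {                               \
            (origin)->file = __FILE__;     \
            (origin)->line = __LINE__;     \
        } while (0)
    #endif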

> I'm quite sure this is what is happing because once my
> bisect pointed to the "aio / timers: Convert mainloop to use timeout"
> commit as a culprit, I read that commit very carefully multiple
> times and that seemed like the only problem it could cause,
> so I added a debug printf to test for that case, and it triggered.
> 
> What I believe is happening in my troublesome scenario is that one
> thread is calling main_loop_wait(0) repeatedly, waiting for another
> thread to do some work (*), but that other thread is not getting a
> chance to do that work because the iothread never gets unlocked.

That's fine, but that doesn't explain why timerlistgroup_deadline_ns
thinks that a timer has always expired.

> *) likely the spice-server thread which does a lot of work for
> the qxl device
> 
> 
>> 
>> The comment at line 208 suggests that "the I/O thread is very busy
>> or we are incorrectly busy waiting in the I/O thread". Do we know
>> which is happening? Perhaps rather than give up the io_thread
>> mutex on every call (which is in practice what a 1 nanosecond
>> timeout does) we should give it up if we have not released
>> it for X nanoseconds (maybe X=250us), or on every Y calls. I think
>> someone other than me should consider the effect of dropping and
>> reacquiring a mutex so frequently under heavy I/O load, but I'm not
>> sure it's a great idea.
> 
> We're only waiting so short if there are timers which want to run
> immediately, normally we would wait a lot longer.
> 
>> So on reflection you might be more right with 1 nanosecond than
>> 250us as a timeout of 250us, but I wonder whether a strategy
>> of just dropping the lock occasionally (and still using a zero
>> timeout) might be better.
> 
> Paolo probably has some better insights on this, but he seems to
> have called it a day for today, and I'm going to do the same :)
> 
> So lets continue this tomorrow.

... the trouble with reading email from top to bottom :-)

> 
> Regards,
> 
> Hans
> 
> 
>
Paolo Bonzini Oct. 8, 2013, 8:50 p.m. UTC | #11
On 08/10/2013 22:16, Hans de Goede wrote:
> No, it is calling main_loop_wait with nonblocking set to 0, so
> normally the lock would get released. But
> timerlistgroup_deadline_ns(&main_loop_tlg) is returning 0,
> causing timeout_ns to be 0, and this causes the lock to not get
> released.

Yes, this was my understanding of the patch as well.  Before Alex's
series, this would be capped to MIN_REARM_TIMER_NS (250 us).  This is
why I mentioned 250 us.

However, I agree with Alex that it looks a bit fishy, and I'd like to
know what timer it is that is continuously set to expire in the past.

Paolo
Alex Bligh Oct. 8, 2013, 9:25 p.m. UTC | #12
On 8 Oct 2013, at 21:01, Hans de Goede wrote:

>>> +    /* When not non-blocking always allow io-threads to acquire the lock */
>>> +    if (timeout != 0 && timeout_ns == 0) {
>>> +        timeout_ns = 1;
>>> +    }
>>> +
>>>     ret = os_host_main_loop_wait(timeout_ns);
>>>     qemu_iohandler_poll(gpollfds, ret);
>>> #ifdef CONFIG_SLIRP
>> 
>> I /think/ you might mean "if (!blocking && timeout_ns == 0)"
>> as timeout can be zero on a blocking call at this stage (i.e.
>> when there is a timer which has already expired.
> 
> timeout does not get modified, except by the slirp-stuff, which
> won't ever set it to 0 I think.

Ignoring the slirp stuff (which is hopefully irrelevant):

For your patch to have any effect, timeout_ns must be zero
or your timeout_ns=1 would not execute.

timeout_ns should not be zero frequently for the reason
I described.

If the call to main_loop_wait is blocking (i.e. nonblocking==0),
timeout_ns should start at -1 - this is the case you described.
If the call is non-blocking (as I originally thought)
timeout_ns should start at zero.

Then this line happens:

    timeout_ns = qemu_soonest_timeout(timeout_ns,
                                      timerlistgroup_deadline_ns(
                                          &main_loop_tlg));

If timeout_ns is zero (i.e. non-blocking, caused by timeout=0)
this will do nothing (lack of optimisation and opacity here is
down to slirp stuff).

If timeout_ns is -1 (i.e. blocking, timeout=UINT32_MAX), then
this will use the return from
timerlistgroup_deadline_ns(). You said this was returning
zero, which would cause os_host_main_loop_wait() to
be called with a zero parameter.

This would explain the symptom, and explain why the fix
worked. But the question is why is timerlistgroup_deadline_ns
returning zero.
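
(For reference, qemu_soonest_timeout in include/qemu/timer.h is roughly the
helper below: -1 compares as a maximal value once cast to unsigned, so a
blocking -1 picks up the deadline, while an explicit 0 stays 0.)

    static inline int64_t qemu_soonest_timeout(int64_t timeout1, int64_t timeout2)
    {
        /* -1 means "infinite"; casting to unsigned makes it the largest value */
        return ((uint64_t) timeout1 < (uint64_t) timeout2) ? timeout1 : timeout2;
    }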


>  But your right writing
> 
>   if (!nonblocking && timeout_ns == 0)
> 
> would be much clearer, so I'll do that in v2 of the patch.
Hans de Goede Oct. 9, 2013, 12:58 p.m. UTC | #13
Hi,

On 10/08/2013 10:50 PM, Paolo Bonzini wrote:
> On 08/10/2013 22:16, Hans de Goede wrote:
>> No, it is calling main_loop_wait with nonblocking set to 0, so
>> normally the lock would get released. But
>> timerlistgroup_deadline_ns(&main_loop_tlg) is returning 0,
>> causing timeout_ns to be 0, and this causes the lock to not get
>> released.
>
> Yes, this was my understanding of the patch as well.  Before Alex's
> series, this would be capped to MIN_REARM_TIMER_NS (250 us).  This is
> why I mentioned 250 us.
>
> However, I agree with Alex that it looks a bit fishy and I'd like to
> know what timer is it that is continuously set to expire in the past.

I got an off-list email this morning from a kind soul informing me that
he was seeing the same issue as me, but only when using audio over
spice. I've just tried things with "-device hda -device hda-duplex"
removed from my qemu cmdline and I can confirm this is the same for me.

There seem to be more cases which trigger the timer-set-to-expire-in-the-past
case. I added a printf("bingo\n"); to my code to catch this case (while still
unlocking the lock there), and it triggers a couple of times before the hda
device becomes active too.

As soon as pulseaudio starts in the guest the screen really starts
filling with bingos though, so it definitely seems to be a spice + audio problem.

Normally the bingos stop scrolling by after a couple of seconds (as pulseaudio
stops the audio device when there is no sound), but when I try to play
music they keep scrolling.

Also, playing music in git master (with my patch) makes qemu take 100% host cpu,
while doing the same using qemu-1.5.3 takes 50% host cpu, so there definitely
is something wrong here.

Looking at the spice code in question, we have the spice-server code
setting a timer to go off every MM_TIMER_GRANULARITY_MS ms, which is every
33 ms, so not often at all.

Spice server does this by calling ui/spice-core.c timer_start() from its
timer function, which does:

qemu_mod_timer(timer->timer, qemu_get_clock_ms(rt_clock) + ms);

Note how it goes from the current time, not from some saved time, so even
if we've missed a couple of timer ticks it should catch up with the
current time after running once.

So long story short I've no idea what is going on here, and Alex is right
that there is a bigger underlying problem.

Regards,

Hans


p.s.

I still think we should add my patch in some form, since the lock starvation
caused by timers set to expire in the past could still happen in other cases,
especially for timer users who take a time stamp once and then add incremental
values to trigger the next run. These can get behind quite a bit if there
are latency spikes, and we don't want to run without ever releasing the
lock while they are catching up.
Alex Bligh Oct. 9, 2013, 1:18 p.m. UTC | #14
On 9 Oct 2013, at 13:58, Hans de Goede wrote:

> I got an off-list email this morning from a kind soul informing me that
> he was seeing the same issue as me, but only when using audio over
> spice. I've just tried things with "-device hda -device hda-duplex"
> removed from my qemu cmdline and I can confirm this is the same for me.
> 
> There seem to be more cases which trigger the timer set to expire in the
> past case, I added a printf("bingo\n"); to my code for catching this
> case and still unlocking the lock there, and it triggers a couple of
> times before the hda device becomes active too.

OK, that's useful.

> As soon as pulseaudio starts in the guest the screen really starts
> filling with bingos though, so it definitely seems a spice + audio problem,
> 
> Normally the bingo's stop scrolling by after a couple of seconds (as pulse
> audio stops the audio device when there is no sound), but when I try to play
> music they keep scrolling.
> 
> Also playing music in git master (with my patch) makes qemu take 100% host cpu,
> while doing the same using qemu-1.5.3 takes 50% host cpu, so there definetily
> is something wrong here.

Possibly in both cases!

> Looking at the spice code in question, we have the spice-server code
> setting a timer to go of every MM_TIMER_GRANULARITY_MS ms, which is every
> 33 ms, so not often at all.
> 
> Spice server does this by calling ui/spice-core.c timer_start() from its
> timer function, which does:
> 
> qemu_mod_timer(timer->timer, qemu_get_clock_ms(rt_clock) + ms);

qemu_mod_timer does not exist in master.

This line is now:

    timer_mod(timer->timer, qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + ms);

This suggests to me either timer->scale is not set to 1000000 (i.e.
it isn't in milliseconds), OR ms is zero/negative.

Or of course there might be a bug in the timer code.
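
(Roughly how the scale fits together, as a sketch: timer_new_ms() creates the
timer with scale == SCALE_MS, i.e. 1000000, and timer_mod() converts the
expire time to nanoseconds using that scale, approximately:)

    void timer_mod(QEMUTimer *ts, int64_t expire_time)
    {
        /* expire_time is in units of ts->scale; arm the timer in ns */
        timer_mod_ns(ts, expire_time * ts->scale);
    }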

> Note how it goes from the current time, not from some saved time, so even
> if we've missed a couple of timer ticks it should catch up with the
> current time after running once.

Sure. I would guess what's happening is that the timer is modified such
that it is already expired.

Alex

> 
> So long story short I've no idea what is going on here, and Alex is right
> that there is a bigger underlying problem.
Paolo Bonzini Oct. 9, 2013, 2:37 p.m. UTC | #15
On 09/10/2013 14:58, Hans de Goede wrote:
> 
> 
> I still think we should add my patch in some form, since the lock
> starvation
> caused by timers set to expire in the past could still happen in other
> cases,
> esp for timer users who take a time stamp once and then add incremental
> values to trigger the next run, these can get behind quite a bit if there
> are latency spikes, and we don't wont to run without ever releasing the
> lock while these are catching up.

I agree.  Do you also agree that the equivalent workaround, before
Alex's patch, was MIN_REARM_TIMER_NS (and thus 250 microseconds)?

Paolo
Alex Bligh Oct. 9, 2013, 4:19 p.m. UTC | #16
On 9 Oct 2013, at 15:37, Paolo Bonzini wrote:

> 
> I agree.  Do you also agree that the equivalent workaround, before
> Alex's patch, was MIN_REARM_TIMER_NS (and thus 250 microseconds)?

I don't think this was the case, as if it's a timer constantly
expiring we'd have seen select() exit as soon as it was entered
by the fd poked by the signal. That might be far more frequent.
I think the equivalent would be something like: if the 'zero'
timeout comes from the deadline calculation (and not
nonblocking=true) then release the lock anyway. I think
that would be a reasonable approach.

I would however like to get to the bottom of what's causing
this as even pre my changes playing sound was apparently taking
50% CPU, which is not good. I am completely packed until the
weekend but I propose producing a timer debug patch which
will instrument what is expiring constantly (unless the
problem with spice is obvious to someone).
Paolo Bonzini Oct. 9, 2013, 4:26 p.m. UTC | #17
On 09/10/2013 18:19, Alex Bligh wrote:
>> Do you also agree that the equivalent workaround, before
>> > Alex's patch, was MIN_REARM_TIMER_NS (and thus 250 microseconds)?
> I don't think this was the case, as if it's a timer constantly
> expiring we'd have seen select() exit as soon as it was entered
> by the fd poked by the signal.

The signal itself was clamped to be at least 250 microseconds...

> That might be far more frequent.

... it's true though that it could have been less than 250 microseconds
(more precisely, 250 microseconds minus the time from qemu_mod_timer_ns
to select).

Since the CPU usage with Hans's patch is 100% and used to be 50%, the old
effective timeout was also more than the 1 ns that Hans's patch is using.

> I think the equivalent would be something like: if the 'zero'
> timeout comes from the deadline calculation (and not
> nonblocking=true) then release the lock anyway. I think
> that would be a reasonable approach.
> 
> I would however like to get to the bottom of what's causing
> this as even pre my changes playing sound was apparently taking
> 50% CPU, which is not good. I am completely packed until the
> weekend but I propose producing a timer debug patch which
> will instrument what is expiring constantly (unless the
> problem with spice is obvious to someone).

I think Hans already debugged it to the (supposedly) 33 Hz timer that
spice audio uses.

If it turns out the bug is in the QEMU part of spice, I think it makes
sense _not_ to include this patch at all.

If it turns out to be in spice itself, then we can include it as a
workaround, but still it would be nice to tune the limit a bit and not
burn even more CPU time than before.

Paolo
Alex Bligh Oct. 9, 2013, 4:33 p.m. UTC | #18
On 9 Oct 2013, at 17:26, Paolo Bonzini wrote:

> I think Hans already debugged it to the (supposedly) 33 Hz timer that
> spice audio uses.
> 
> If it turns out the bug is in the QEMU part of spice, I think it makes
> sense _not_ to include this patch at all.
> 
> If it turns out to be in spice itself, then we can include it as a
> workaround, but still it would be nice to tune a bit the limit and not
> burn even more CPU time than before.

Or we could simply clamp the value passed to timer_mod in spice.c
which should have the same effect.
Hans de Goede Oct. 9, 2013, 5:53 p.m. UTC | #19
Hi,

On 10/09/2013 06:26 PM, Paolo Bonzini wrote:
> On 09/10/2013 18:19, Alex Bligh wrote:
>>> Do you also agree that the equivalent workaround, before
>>>> Alex's patch, was MIN_REARM_TIMER_NS (and thus 250 microseconds)?
>> I don't think this was the case, as if it's a timer constantly
>> expiring we'd have seen select() exit as soon as it was entered
>> by the fd poked by the signal.
>
> The signal itself was clamped to be at least 250 microseconds...
>
>> That might be far more frequent.
>
> ... it's true though that it could have been less than 250 microseconds
> (more precisely, 250 microseconds minus the time from qemu_mod_timer_ns
> to select).
>
> Since the CPU usage with Hans's patch is 100% and used to be 50%, it was
> also more than 1 ns that Hans's patch is using.
>
>> I think the equivalent would be something like: if the 'zero'
>> timeout comes from the deadline calculation (and not
>> nonblocking=true) then release the lock anyway. I think
>> that would be a reasonable approach.
>>
>> I would however like to get to the bottom of what's causing
>> this as even pre my changes playing sound was apparently taking
>> 50% CPU, which is not good. I am completely packed until the
>> weekend but I propose producing a timer debug patch which
>> will instrument what is expiring constantly (unless the
>> problem with spice is obvious to someone).
>
> I think Hans already debugged it to the (supposedly) 33 Hz timer that
> spice audio uses.

Correction, I've been looking at timers in the spice audio path which could
potentially cause this, and this one stood out. The real problem could be
a completely different timer!

> If it turns out the bug is in the QEMU part of spice, I think it makes
> sense _not_ to include this patch at all.

As you said yourself in a previous mail, at a minimum we should
ensure we always unlock the iolock when called in blocking mode, to give
other threads a chance to acquire it. So we need some form of my patch, or
some other patch which achieves the unlocking.

I welcome other proposals which have the same end result :)

Regards,

Hans
Hans de Goede Oct. 9, 2013, 6:03 p.m. UTC | #20
Hi,

On 10/09/2013 03:18 PM, Alex Bligh wrote:

<snip>

>> qemu_mod_timer(timer->timer, qemu_get_clock_ms(rt_clock) + ms);
>
> qemu_mod_timer does not exist in master.
>
> This line is now:
>
>      timer_mod(timer->timer, qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + ms);

Ah yes my bad, I was accidentally looking at a stable-1.5 checkout.

> This suggests to me either timer->scale is not set to 1000000 (i.e.
> it isn't in milliseconds),

The timer always gets created like this:

     timer->timer = timer_new_ms(QEMU_CLOCK_REALTIME, func, opaque);

> OR ms is zero/negative.

Nope I've added a debug printf to the spice timer code and it all works
as expected.

So I started looking for suspicious timers under audio/*.c and immediately
found this in audio/audio.c:

static void audio_reset_timer (AudioState *s)
{
     if (audio_is_timer_needed ()) {
         timer_mod (s->ts, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 1);
     }
     else {
         timer_del (s->ts);
     }
}

static void audio_timer (void *opaque)
{
     audio_run ("timer");
     audio_reset_timer (opaque);
}

Note how that is using a timer which expires every freaking nanosecond;
I think it is very likely that is the culprit.

Regards,

Hans
Hans de Goede Oct. 9, 2013, 6:09 p.m. UTC | #21
Hi,

On 10/09/2013 04:37 PM, Paolo Bonzini wrote:
> On 09/10/2013 14:58, Hans de Goede wrote:
>>
>>
>> I still think we should add my patch in some form, since the lock
>> starvation
>> caused by timers set to expire in the past could still happen in other
>> cases,
>> esp for timer users who take a time stamp once and then add incremental
>> values to trigger the next run, these can get behind quite a bit if there
>> are latency spikes, and we don't wont to run without ever releasing the
>> lock while these are catching up.
>
> I agree.  Do you also agree that the equivalent workaround, before
> Alex's patch, was MIN_REARM_TIMER_NS (and thus 250 microseconds)?

I think I agree :)

Regards,

Hans
Hans de Goede Oct. 9, 2013, 6:15 p.m. UTC | #22
Hi,

On 10/09/2013 08:03 PM, Hans de Goede wrote:

<snip>

> So I started looking for suspecious timers under audio/*.c and immediately
> found this in audio/audio.c :
>
> static void audio_reset_timer (AudioState *s)
> {
>      if (audio_is_timer_needed ()) {
>          timer_mod (s->ts, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 1);
>      }
>      else {
>          timer_del (s->ts);
>      }
> }
>
> static void audio_timer (void *opaque)
> {
>      audio_run ("timer");
>      audio_reset_timer (opaque);
> }
>
> Note how that is using a timer which expires every freaking nano second,
> I think it is very likely that is the culprit.

Yep, this is the culprit, patch coming up.

Regards,

Hans
Alex Bligh Oct. 9, 2013, 6:28 p.m. UTC | #23
On 9 Oct 2013, at 19:03, Hans de Goede wrote:

> static void audio_reset_timer (AudioState *s)
> {
>    if (audio_is_timer_needed ()) {
>        timer_mod (s->ts, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 1);
>    }
>    else {
>        timer_del (s->ts);
>    }
> }
> 
> static void audio_timer (void *opaque)
> {
>    audio_run ("timer");
>    audio_reset_timer (opaque);
> }
> 
> Note how that is using a timer which expires every freaking nano second,
> I think it is very likely that is the culprit.

Indeed. I am hoping that it is not my automated perl conversion code that
did that, because if it is, it may have broken other stuff :-/

Thanks for finding this - let me see whether the bug existed before
my automated changes commit.
Alex Bligh Oct. 9, 2013, 6:36 p.m. UTC | #24
On 9 Oct 2013, at 19:28, Alex Bligh wrote:

>> 
>> static void audio_reset_timer (AudioState *s)
>> {
>>   if (audio_is_timer_needed ()) {
>>       timer_mod (s->ts, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 1);
>>   }
>>   else {
>>       timer_del (s->ts);
>>   }
>> }
>> 
>> static void audio_timer (void *opaque)
>> {
>>   audio_run ("timer");
>>   audio_reset_timer (opaque);
>> }
>> 
>> Note how that is using a timer which expires every freaking nano second,
>> I think it is very likely that is the culprit.
> 
> Indeed. I am hoping that it is not my automated perl conversion code that
> did that, because if it is, it may have broken other stuff :-/
> 
> Thanks for finding this - let me see whether the bug existed before
> my automated changes commit.

I think this was broken prior to my changes. Here's audio/audio.c before
my changes:

static void audio_reset_timer (AudioState *s)
{
    if (audio_is_timer_needed ()) {
        qemu_mod_timer (s->ts, qemu_get_clock_ns (vm_clock) + 1);
    }
    else {
        qemu_del_timer (s->ts);
    }
}


Now qemu_get_clock_ns can only return something in nanoseconds, so it's
adding 1 nanosecond. This is thus broken because either:
  1. ts->scale is in nanoseconds, in which case that timer expires in
     one nanosecond's time; or
  2. ts->scale is not in nanoseconds, in which case the nanosecond VM clock
     value is going to be a huge time in the future, and it's never going
     to expire.

I wonder whether it's meant to be 1 millisecond or 1 microsecond?

Whatever, it looks like this is suitable for an individual patch to
audio/audio.c
Hans de Goede Oct. 9, 2013, 6:49 p.m. UTC | #25
Hi,

On 10/09/2013 08:36 PM, Alex Bligh wrote:
>
> On 9 Oct 2013, at 19:28, Alex Bligh wrote:
>
>>>
>>> static void audio_reset_timer (AudioState *s)
>>> {
>>>    if (audio_is_timer_needed ()) {
>>>        timer_mod (s->ts, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 1);
>>>    }
>>>    else {
>>>        timer_del (s->ts);
>>>    }
>>> }
>>>
>>> static void audio_timer (void *opaque)
>>> {
>>>    audio_run ("timer");
>>>    audio_reset_timer (opaque);
>>> }
>>>
>>> Note how that is using a timer which expires every freaking nano second,
>>> I think it is very likely that is the culprit.
>>
>> Indeed. I am hoping that it is not my automated perl conversion code that
>> did that, because if it is, it may have broken other stuff :-/
>>
>> Thanks for finding this - let me see whether the bug existed before
>> my automated changes commit.
>
> I think this was broken prior to my changes. Here's audio/audio.c before
> my changes:
>
> static void audio_reset_timer (AudioState *s)
> {
>      if (audio_is_timer_needed ()) {
>          qemu_mod_timer (s->ts, qemu_get_clock_ns (vm_clock) + 1);
>      }
>      else {
>          qemu_del_timer (s->ts);
>      }
> }
>
>
> Now qemu_get_clock_ns can only return something in nanoseconds, so it's
> adding 1 nanosecond. This is thus either broken because:
>    1. ts->scale is in nanoseconds, in which case that timer expires in
>       one nano-second's time; or
>    2. ts->scale is not in nanoseconds, in which case nanosecond VM clock
>       is going to be a huge time in the future, and it's never going
>       to expire.
>
> I wonder whether it's meant to be 1 millisecond or 1 microsecond?

Maybe once it was 1 ms; this code just exists to keep the buffers
of a soundcard filled / emptied in time. 100 times / second is more
than plenty for that, so that is what I'm going to use in the patch I'm
about to submit.

Regards,

Hans

p.s.

I still think we should unlock the io-thread in main_loop_wait when
called with nonblocking == 0, even if there are expired timers. Any
suggestions on how best to do that?
Paolo Bonzini Oct. 9, 2013, 7:03 p.m. UTC | #26
On 09/10/2013 20:49, Hans de Goede wrote:
>>
>> I wonder whether it's meant to be 1 millisecond or 1 microsecond?
> 
> Maybe once it was 1 ms, this code just exists to keep the buffers
> of a soundcard filled / emptied in time. 100 times / second is more
> then plenty for that, so that is what I'm going to use in the patch I'm
> about to submit.

It was 1 ms when that was the resolution of the "alarm tick" (which was
based on /dev/rtc or /dev/hpet), then it became 250 us with "dynamic
ticks", then we know how it became 0 (1 ns after timer_mod is
effectively 0 ns after select).

Paolo
Hans de Goede Oct. 9, 2013, 7:15 p.m. UTC | #27
Hi,

On 10/09/2013 09:03 PM, Paolo Bonzini wrote:
> On 09/10/2013 20:49, Hans de Goede wrote:
>>>
>>> I wonder whether it's meant to be 1 millisecond or 1 microsecond?
>>
>> Maybe once it was 1 ms, this code just exists to keep the buffers
>> of a soundcard filled / emptied in time. 100 times / second is more
>> then plenty for that, so that is what I'm going to use in the patch I'm
>> about to submit.
>
> It was 1 ms when that was the resolution of the "alarm tick" (which was
> based on /dev/rtc or /dev/hpet), then it became 250 us with "dynamic
> ticks", then we know how it became 0 (1 ns after timer_mod is
> effectively 0 ns after select).

Yep, and the weird thing is, there is a cmdline option to set the
wakeup frequency in Hz, and parsing code for it, and code to
convert it to ticks; it is just not used. So I'm throwing away my initial
patch for this and writing a new one.
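
(A sketch of what that would amount to; the name of the variable holding the
configured frequency is illustrative, this is not the actual audio.c patch:)

    static void audio_reset_timer (AudioState *s)
    {
        /* Rearm at the configured rate instead of "now + 1 ns".
         * audio_timer_hz stands in for the parsed cmdline option. */
        int64_t period_ns = 1000000000LL / audio_timer_hz;

        if (audio_is_timer_needed ()) {
            timer_mod (s->ts, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + period_ns);
        }
        else {
            timer_del (s->ts);
        }
    }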

Regards,

Hans

Patch

diff --git a/main-loop.c b/main-loop.c
index c3c9c28..921c939 100644
--- a/main-loop.c
+++ b/main-loop.c
@@ -480,6 +480,11 @@  int main_loop_wait(int nonblocking)
                                       timerlistgroup_deadline_ns(
                                           &main_loop_tlg));
 
+    /* When not non-blocking always allow io-threads to acquire the lock */
+    if (timeout != 0 && timeout_ns == 0) {
+        timeout_ns = 1;
+    }
+
     ret = os_host_main_loop_wait(timeout_ns);
     qemu_iohandler_poll(gpollfds, ret);
 #ifdef CONFIG_SLIRP