
Re: segmentation fault in qemu-kvm-0.14.0

Message ID AANLkTikAOH2nR3T7iqrKj226MrH+8O9DXC=J+7XK-Av_@mail.gmail.com
State New

Commit Message

Corentin Chary March 9, 2011, 8:50 a.m. UTC
On Wed, Mar 9, 2011 at 7:37 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
> On 2011-03-08 23:53, Peter Lieven wrote:
>> Hi,
>>
>> During testing of qemu-kvm-0.14.0 I can reproduce the following segfault. I have seen a similar crash already in 0.13.0, but had no time to debug it.
>> My guess is that this segfault is related to the threaded VNC server which was introduced in qemu 0.13.0. The bug is only triggerable if a VNC
>> client is attached. It might also be connected to a resolution change in the guest. I have a backtrace attached. The debugger is still running if someone
>> needs more output.
>>
>
> ...
>
>> Thread 1 (Thread 0x7ffff7ff0700 (LWP 29038)):
>> #0  0x0000000000000000 in ?? ()
>> No symbol table info available.
>> #1  0x000000000041d669 in main_loop_wait (nonblocking=0)
>>     at /usr/src/qemu-kvm-0.14.0/vl.c:1388
>
> So we are calling an IOHandlerRecord::fd_write handler that is NULL.
> Looking at qemu_set_fd_handler2, this may happen if that function is
> called for an existing io-handler entry with non-NULL write handler,
> passing a NULL write and a non-NULL read handler. And all this without
> the global mutex held.
>
> And there are actually calls in vnc_client_write_plain and
> vnc_client_write_locked (in contrast to vnc_write) that may generate
> this pattern. It's probably worth validating that the iothread lock is
> always held when qemu_set_fd_handler2 is invoked to confirm this race
> theory, adding something like
>
> assert(pthread_mutex_trylock(&qemu_mutex) != 0);
> (that's for qemu-kvm only)
>
> BTW, qemu with just --enable-vnc-thread, i.e. without io-thread support,
> should always run into this race as it then definitely lacks a global mutex.

I'm not sure what mutex should be locked here (qemu_global_mutex,
qemu_fair_mutex, lock_iothread).
But here is where it should be locked (other vnc_write calls in this
thread should never trigger qemu_set_fd_handler):

Comments

Jan Kiszka March 9, 2011, 9:04 a.m. UTC | #1
On 2011-03-09 09:50, Corentin Chary wrote:
> On Wed, Mar 9, 2011 at 7:37 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> On 2011-03-08 23:53, Peter Lieven wrote:
>>> Hi,
>>>
>>> During testing of qemu-kvm-0.14.0 I can reproduce the following segfault. I have seen a similar crash already in 0.13.0, but had no time to debug it.
>>> My guess is that this segfault is related to the threaded VNC server which was introduced in qemu 0.13.0. The bug is only triggerable if a VNC
>>> client is attached. It might also be connected to a resolution change in the guest. I have a backtrace attached. The debugger is still running if someone
>>> needs more output.
>>>
>>
>> ...
>>
>>> Thread 1 (Thread 0x7ffff7ff0700 (LWP 29038)):
>>> #0  0x0000000000000000 in ?? ()
>>> No symbol table info available.
>>> #1  0x000000000041d669 in main_loop_wait (nonblocking=0)
>>>     at /usr/src/qemu-kvm-0.14.0/vl.c:1388
>>
>> So we are calling an IOHandlerRecord::fd_write handler that is NULL.
>> Looking at qemu_set_fd_handler2, this may happen if that function is
>> called for an existing io-handler entry with non-NULL write handler,
>> passing a NULL write and a non-NULL read handler. And all this without
>> the global mutex held.
>>
>> And there are actually calls in vnc_client_write_plain and
>> vnc_client_write_locked (in contrast to vnc_write) that may generate
>> this pattern. It's probably worth validating that the iothread lock is
>> always held when qemu_set_fd_handler2 is invoked to confirm this race
>> theory, adding something like
>>
>> assert(pthread_mutex_trylock(&qemu_mutex) != 0);
>> (that's for qemu-kvm only)
>>
>> BTW, qemu with just --enable-vnc-thread, i.e. without io-thread support,
>> should always run into this race as it then definitely lacks a global mutex.
> 
> I'm not sure what mutex should be locked here (qemu_global_mutex,
> qemu_fair_mutex, lock_iothread).

In upstream qemu, the latter - if it exists (which is not the case in
non-io-thread mode).

In qemu-kvm, those locks are yet unused. Rather the code in qemu-kvm.c
implements the global lock.

> But here is where it should be locked (other vnc_write calls in this
> thread should never trigger qemu_set_fd_handler):
> 
> diff --git a/ui/vnc-jobs-async.c b/ui/vnc-jobs-async.c
> index 1d4c5e7..e02d891 100644
> --- a/ui/vnc-jobs-async.c
> +++ b/ui/vnc-jobs-async.c
> @@ -258,7 +258,9 @@ static int vnc_worker_thread_loop(VncJobQueue *queue)
>          goto disconnected;
>      }
> 
> +    /* lock */
>      vnc_write(job->vs, vs.output.buffer, vs.output.offset);
> +    /* unlock */
> 
>  disconnected:
>      /* Copy persistent encoding data */
> @@ -267,7 +269,9 @@ disconnected:
>      vnc_unlock_output(job->vs);
> 
>      if (flush) {
> +        /* lock */
>          vnc_flush(job->vs);
> +       /* unlock */
>      }

Particularly this call is critical, as it will trigger
vnc_client_write_locked, which may NULL'ify the write handler.

But I'm also not sure about vnc_send_framebuffer_update. Someone has to
go through the threaded vnc code again, very thoroughly. It looks fragile.

Jan
Corentin Chary March 9, 2011, 9:54 a.m. UTC | #2
Re-reading:

>> So we are calling an IOHandlerRecord::fd_write handler that is NULL.
>> Looking at qemu_set_fd_handler2, this may happen if that function is
>> called for an existing io-handler entry with non-NULL write handler,
>> passing a NULL write and a non-NULL read handler. And all this without
>> the global mutex held.

When using the vnc thread, nothing in the vnc thread will ever be
called directly by an IO handler. So I don't really see how it would
trigger this.
And since there is a lock for accessing the output buffer (and the
network actually), there should not be any race condition either.

So the real issue is that qemu_set_fd_handler2() is called outside of
the main thread by those two vnc_write() and vnc_flush() calls,
without any kind of locking.

> In upstream qemu, the latter - if it exists (which is not the case in
> non-io-thread mode).
> In qemu-kvm, those locks are yet unused. Rather the code in qemu-kvm.c
> implements the global lock.

So there is currently no lock for that when io-thread is disabled :/.
Spice also seems to protect against this kind of thing with
qemu_mutex_lock_iothread().

Maybe qemu_mutex_lock_iothread() should also be defined when
CONFIG_VNC_THREAD=y ?

> But I'm also not sure about vnc_send_framebuffer_update. Someone has to
> go through the threaded vnc code again, very thoroughly. It looks fragile.

While vnc-thread is enabled, vnc_send_framebuffer_update() will always
call vnc_write() with csock = -1 into a temporary buffer. Check
vnc_async_encoding_start() and vnc_async_encoding_end(): they provide
a kind of "sandbox" that prevents the thread from writing anything the
main thread will use. You can also see it as a "transaction": the
thread computes the update in a temporary buffer, and only sends it to
the network (real vnc_write calls with csock correctly set) once it has
successfully finished.

The only two function calls that break this isolation are the two
that I pointed out earlier.
Jan Kiszka March 9, 2011, 9:58 a.m. UTC | #3
On 2011-03-09 10:54, Corentin Chary wrote:
> Re-reading:
> 
>>> So we are calling an IOHandlerRecord::fd_write handler that is NULL.
>>> Looking at qemu_set_fd_handler2, this may happen if that function is
>>> called for an existing io-handler entry with non-NULL write handler,
>>> passing a NULL write and a non-NULL read handler. And all this without
>>> the global mutex held.
> 
> When using the vnc thread, nothing in the vnc thread will ever be
> called directly by an IO handler. So I don't really see how it would
> trigger this.
> And since there is a lock for accessing the output buffer (and the
> network actually), there should not be any race condition either.
> 
> So the real issue is that qemu_set_fd_handler2() is called outside of
> the main thread by those two vnc_write() and vnc_flush() calls,
> without any kind of locking.

Yes, that's what I was referring to.

> 
>> In upstream qemu, the latter - if it exists (which is not the case in
>> non-io-thread mode).
>> In qemu-kvm, those locks are yet unused. Rather the code in qemu-kvm.c
>> implements the global lock.
> 
> So there is currently no lock for that when io-thread is disabled :/.
> Spice also seems to protect against this kind of thing with
> qemu_mutex_lock_iothread().
> 
> Maybe qemu_mutex_lock_iothread() should also be defined when
> CONFIG_VNC_THREAD=y ?
> 
>> But I'm also not sure about vnc_send_framebuffer_update. Someone has to
>> go through the threaded vnc code again, very thoroughly. It looks fragile.
> 
> While vnc-thread is enabled, vnc_send_framebuffer_update() will always
> call vnc_write() with csock = -1 into a temporary buffer. Check
> vnc_async_encoding_start() and vnc_async_encoding_end(): they provide
> a kind of "sandbox" that prevents the thread from writing anything the
> main thread will use. You can also see it as a "transaction": the
> thread computes the update in a temporary buffer, and only sends it to
> the network (real vnc_write calls with csock correctly set) once it has
> successfully finished.
> 
> The only two function calls that break this isolation are the two
> that I pointed out earlier.

Probably the best way is to make vnc stop fiddling with
qemu_set_fd_handler2, specifically in threaded mode. Why does it need to
set/reset the write handler all the time?

Jan
Jan Kiszka March 9, 2011, 10:02 a.m. UTC | #4
On 2011-03-09 10:58, Jan Kiszka wrote:
> On 2011-03-09 10:54, Corentin Chary wrote:
>> Re-reading:
>>
>>>> So we are calling an IOHandlerRecord::fd_write handler that is NULL.
>>>> Looking at qemu_set_fd_handler2, this may happen if that function is
>>>> called for an existing io-handler entry with non-NULL write handler,
>>>> passing a NULL write and a non-NULL read handler. And all this without
>>>> the global mutex held.
>>
>> When using the vnc thread, nothing in the vnc thread will ever be
>> called directly by an IO handler. So I don't really see how it would
>> trigger this.
>> And since there is a lock for accessing the output buffer (and the
>> network actually), there should not be any race condition either.
>>
>> So the real issue is that qemu_set_fd_handler2() is called outside of
>> the main thread by those two vnc_write() and vnc_flush() calls,
>> without any kind of locking.
> 
> Yes, that's what I was referring to.
> 
>>
>>> In upstream qemu, the latter - if it exists (which is not the case in
>>> non-io-thread mode).
>>> In qemu-kvm, those locks are yet unused. Rather the code in qemu-kvm.c
>>> implements the global lock.
>>
>> So there is currently no lock for that when io-thread is disabled :/.
>> Spice also seems to protect against this kind of thing with
>> qemu_mutex_lock_iothread().
>>
>> Maybe qemu_mutex_lock_iothread() should also be defined when
>> CONFIG_VNC_THREAD=y ?
>>
>>> But I'm also not sure about vnc_send_framebuffer_update. Someone has to
>>> go through the threaded vnc code again, very thoroughly. It looks fragile.
>>
>> While vnc-thread is enabled, vnc_send_framebuffer_update() will always
>> call vnc_write() with csock = -1 into a temporary buffer. Check
>> vnc_async_encoding_start() and vnc_async_encoding_end(): they provide
>> a kind of "sandbox" that prevents the thread from writing anything the
>> main thread will use. You can also see it as a "transaction": the
>> thread computes the update in a temporary buffer, and only sends it to
>> the network (real vnc_write calls with csock correctly set) once it has
>> successfully finished.
>>
>> The only two function calls that break this isolation are the two
>> that I pointed out earlier.
> 
> Probably the best way is to make vnc stop fiddling with
> qemu_set_fd_handler2, specifically in threaded mode. Why does it need to
> set/reset the write handler all the time?

The other question is: Who's responsible for writing to the client
socket in threaded mode? Only the vnc thread(s), or also other qemu
threads? In the former case, just avoid setting a write handler at all.

Jan
Corentin Chary March 9, 2011, 10:06 a.m. UTC | #5
> Probably the best way is to make vnc stop fiddling with
> qemu_set_fd_handler2, specifically in threaded mode.
> Why does it need to set/reset the write handler all the time?

I didn't write the original code, but it's probably to avoid keeping a
write handler registered when there is no data to write. That would make
select() busy-loop when there is actually nothing to send.
There is also the vnc_disconnect_start() case, when the socket seems to
be broken and we remove all handlers.

> The other question is: Who's responsible for writing to the client
> socket in threaded mode? Only the vnc thread(s), or also other qemu
> threads? In the former case, just avoid setting a write handler at all.

Cheap stuff is done by the main thread (cursor, etc.). The thread
only does framebuffer updates.
Jan Kiszka March 9, 2011, 10:12 a.m. UTC | #6
On 2011-03-09 11:06, Corentin Chary wrote:
>> Probably the best way is to make vnc stop fiddling with
>> qemu_set_fd_handler2, specifically in threaded mode.
>> Why does it need to set/reset the write handler all the time?
> 
> I didn't write the original code, but it's probably to avoid keeping a
> write handler registered when there is no data to write. That would make
> select() busy-loop when there is actually nothing to send.
> There is also the vnc_disconnect_start() case, when the socket seems to
> be broken and we remove all handlers.
> 
>> The other question is: Who's responsible for writing to the client
>> socket in threaded mode? Only the vnc thread(s), or also other qemu
>> threads? In the former case, just avoid setting a write handler at all.
> 
> Cheap stuff is done by the main thread (cursor, etc.). The thread
> only does framebuffer updates.

And both are synchronized with a vnc-private lock only?

The problem with this model is the non-threaded qemu execution model.
Even if we acquire the global mutex to protect handler updates against
the main select loop, this won't help here. So vnc-threading likely
needs to be made dependent on --enable-io-thread.

Jan
Corentin Chary March 9, 2011, 10:14 a.m. UTC | #7
>> Cheap stuff is done by the main thread (cursor, etc.). The thread
>> only does framebuffer updates.
>
> And both are synchronized with a vnc-private lock only?

Yes

> The problem with this model is the non-threaded qemu execution model.
> Even if we acquire the global mutex to protect handler updates against
> the main select loop, this won't help here. So vnc-threading likely
> needs to be made dependent on --enable-io-thread.

Plus proper lock_iothread() calls, right?

If yes I'll send a patch for that (or you can do it, as you want).

Thanks,
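
For reference, such a patch would presumably look something like this around the two calls flagged earlier, replacing the /* lock */ placeholders (a sketch only; whether qemu_mutex_lock_iothread() is available without --enable-io-thread is exactly the open question above):

```diff
--- a/ui/vnc-jobs-async.c
+++ b/ui/vnc-jobs-async.c
@@ static int vnc_worker_thread_loop(VncJobQueue *queue)
+    qemu_mutex_lock_iothread();
     vnc_write(job->vs, vs.output.buffer, vs.output.offset);
+    qemu_mutex_unlock_iothread();
 
 disconnected:
@@ disconnected:
     if (flush) {
+        qemu_mutex_lock_iothread();
         vnc_flush(job->vs);
+        qemu_mutex_unlock_iothread();
     }
```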
Jan Kiszka March 9, 2011, 10:17 a.m. UTC | #8
On 2011-03-09 11:14, Corentin Chary wrote:
>>> Cheap stuff is done by the main thread (cursor, etc.). The thread
>>> only does framebuffer updates.
>>
>> And both are synchronized with a vnc-private lock only?
> 
> Yes
> 
>> The problem with this model is the non-threaded qemu execution model.
>> Even if we acquire the global mutex to protect handler updates against
>> the main select loop, this won't help here. So vnc-threading likely
>> needs to be made dependent on --enable-io-thread.
> 
> Plus proper lock_iothread() calls, right?

Yep.

> 
> If yes I'll send a patch for that (or you can do it, as you want).

Don't wait for me. :)

Jan

Patch

diff --git a/ui/vnc-jobs-async.c b/ui/vnc-jobs-async.c
index 1d4c5e7..e02d891 100644
--- a/ui/vnc-jobs-async.c
+++ b/ui/vnc-jobs-async.c
@@ -258,7 +258,9 @@  static int vnc_worker_thread_loop(VncJobQueue *queue)
         goto disconnected;
     }

+    /* lock */
     vnc_write(job->vs, vs.output.buffer, vs.output.offset);
+    /* unlock */

 disconnected:
     /* Copy persistent encoding data */
@@ -267,7 +269,9 @@  disconnected:
     vnc_unlock_output(job->vs);

     if (flush) {
+        /* lock */
         vnc_flush(job->vs);
+       /* unlock */
     }

     vnc_lock_queue(queue)