Message ID: AANLkTikAOH2nR3T7iqrKj226MrH+8O9DXC=J+7XK-Av_@mail.gmail.com
State: New
On 2011-03-09 09:50, Corentin Chary wrote:
> On Wed, Mar 9, 2011 at 7:37 AM, Jan Kiszka <jan.kiszka@web.de> wrote:
>> On 2011-03-08 23:53, Peter Lieven wrote:
>>> Hi,
>>>
>>> During testing of qemu-kvm-0.14.0 I can reproduce the following segfault. I have seen a
>>> similar crash already in 0.13.0, but had no time to debug. My guess is that this segfault
>>> is related to the threaded vnc server which was introduced in qemu 0.13.0. The bug is only
>>> triggerable if a vnc client is attached. It might also be connected to a resolution change
>>> in the guest. I have a backtrace attached. The debugger is still running if someone
>>> needs more output.
>>
>> ...
>>
>>> Thread 1 (Thread 0x7ffff7ff0700 (LWP 29038)):
>>> #0  0x0000000000000000 in ?? ()
>>> No symbol table info available.
>>> #1  0x000000000041d669 in main_loop_wait (nonblocking=0)
>>>     at /usr/src/qemu-kvm-0.14.0/vl.c:1388
>>
>> So we are calling an IOHandlerRecord::fd_write handler that is NULL.
>> Looking at qemu_set_fd_handler2, this may happen if that function is
>> called for an existing io-handler entry with a non-NULL write handler,
>> passing a NULL write and a non-NULL read handler. And all this without
>> the global mutex held.
>>
>> And there are actually calls in vnc_client_write_plain and
>> vnc_client_write_locked (in contrast to vnc_write) that may generate
>> this pattern. It's probably worth validating that the iothread lock is
>> always held when qemu_set_fd_handler2 is invoked to confirm this race
>> theory, adding something like
>>
>>   assert(pthread_mutex_trylock(&qemu_mutex) != 0);
>>
>> (that's for qemu-kvm only).
>>
>> BTW, qemu with just --enable-vnc-thread, i.e. without io-thread support,
>> should always run into this race as it then definitely lacks a global mutex.
>
> I'm not sure what mutex should be locked here (qemu_global_mutex,
> qemu_fair_mutex, lock_iothread).

In upstream qemu, the latter - if it exists (which is not the case in non-io-thread mode).
In qemu-kvm, those locks are as yet unused; rather, the code in qemu-kvm.c implements the global lock.

> But here is where it should be locked (other vnc_write calls in this
> thread should never trigger qemu_set_fd_handler):
>
> diff --git a/ui/vnc-jobs-async.c b/ui/vnc-jobs-async.c
> index 1d4c5e7..e02d891 100644
> --- a/ui/vnc-jobs-async.c
> +++ b/ui/vnc-jobs-async.c
> @@ -258,7 +258,9 @@ static int vnc_worker_thread_loop(VncJobQueue *queue)
>              goto disconnected;
>          }
>
> +        /* lock */
>          vnc_write(job->vs, vs.output.buffer, vs.output.offset);
> +        /* unlock */
>
>  disconnected:
>      /* Copy persistent encoding data */
> @@ -267,7 +269,9 @@ disconnected:
>      vnc_unlock_output(job->vs);
>
>      if (flush) {
> +        /* lock */
>          vnc_flush(job->vs);
> +        /* unlock */
>      }

Particularly this call is critical, as it will trigger vnc_client_write_locked, which may NULL'ify the write handler. But I'm also not sure about vnc_send_framebuffer_update. Someone has to go through the threaded vnc code again, very thoroughly. It looks fragile.

Jan
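The failure mode Jan describes - main_loop_wait() jumping through an fd_write pointer that another thread just cleared - can be modeled in a few lines. This is only an illustrative sketch: the struct and function names mirror qemu's, but none of this is the real vl.c implementation, and the stub handlers are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Heavily simplified model of vl.c's io-handler list. */
typedef void IOHandler(void *opaque);

typedef struct IOHandlerRecord {
    IOHandler *fd_read;
    IOHandler *fd_write;
    void *opaque;
} IOHandlerRecord;

/* qemu_set_fd_handler2() updates the record for an existing fd in
 * place; called from the vnc worker without the global mutex, it can
 * clear fd_write while main_loop_wait() is mid-dispatch. */
void set_fd_handler2(IOHandlerRecord *rec, IOHandler *fd_read,
                     IOHandler *fd_write)
{
    rec->fd_read = fd_read;
    rec->fd_write = fd_write;
}

void vnc_client_read_stub(void *opaque)  { (void)opaque; }
void vnc_client_write_stub(void *opaque) { (void)opaque; }

/* Main-loop side: select() reported the fd writable and fd_write was
 * non-NULL at check time; the indirect call happens a moment later. */
IOHandler *handler_at_dispatch_time(IOHandlerRecord *rec)
{
    /* <-- race window: the vnc thread may run
     *     set_fd_handler2(rec, vnc_client_read_stub, NULL) right here,
     *     e.g. from vnc_client_write_locked() once the output buffer
     *     is drained */
    return rec->fd_write;   /* NULL -> jump to address 0, frame #0 */
}
```

With an interleaving like the one in the comments, the main loop ends up calling a NULL pointer, which matches frame #0 of Peter's backtrace.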
Re-reading:

>> So we are calling an IOHandlerRecord::fd_write handler that is NULL.
>> Looking at qemu_set_fd_handler2, this may happen if that function is
>> called for an existing io-handler entry with a non-NULL write handler,
>> passing a NULL write and a non-NULL read handler. And all this without
>> the global mutex held.

When using the vnc thread, nothing in the vnc thread is ever called directly by an IO-handler, so I don't really see how it would trigger this. And since there is a lock for accessing the output buffer (and the network, actually), there should not be any race condition either.

So the real issue is that qemu_set_fd_handler2() is called outside of the main thread by those two vnc_write() and vnc_flush() calls, without any kind of locking.

> In upstream qemu, the latter - if it exists (which is not the case in
> non-io-thread mode).
> In qemu-kvm, those locks are as yet unused; rather, the code in
> qemu-kvm.c implements the global lock.

So there is currently no lock for that when io-thread is disabled :/. Spice also seems to protect this kind of thing with qemu_mutex_lock_iothread().

Maybe qemu_mutex_lock_iothread() should also be defined when CONFIG_VNC_THREAD=y?

> But I'm also not sure about vnc_send_framebuffer_update. Someone has to
> go through the threaded vnc code again, very thoroughly. It looks fragile.

While vnc-thread is enabled, vnc_send_framebuffer_update() will always call vnc_write() with csock = -1 on a temporary buffer. Check vnc_async_encoding_start() and vnc_async_encoding_end(); they provide a kind of "sandbox" that prevents the thread from writing anything the main thread will use. You can also see that as a "transaction": the thread computes the update in a temporary buffer, and only sends it to the network (real vnc_write calls with csock correctly set) once it has successfully finished. The only two function calls that break this isolation are the two that I pointed out earlier.
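The "sandbox"/"transaction" scheme Corentin describes can be sketched as follows. The types and function names here are made up for illustration (the real code is vnc_async_encoding_start()/vnc_async_encoding_end() in ui/vnc-jobs-async.c); the point is only the shape: the worker encodes into a detached scratch state with no socket attached, and a single commit step is the only place worker output reaches state the main thread also sees.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature of the vnc worker "transaction". */
typedef struct {
    int csock;                 /* -1 while sandboxed, real fd otherwise */
    unsigned char buf[256];
    size_t offset;
} MiniVncState;

/* Like vnc_async_encoding_start(): hand the worker a scratch copy. */
void sandbox_start(MiniVncState *scratch)
{
    memset(scratch, 0, sizeof(*scratch));
    scratch->csock = -1;       /* any write on this cannot touch the net */
}

/* Worker-side write: only fills the private buffer. */
void sandbox_write(MiniVncState *scratch, const void *data, size_t len)
{
    memcpy(scratch->buf + scratch->offset, data, len);
    scratch->offset += len;
}

/* Like vnc_async_encoding_end() plus the final send: the one point
 * where the worker's result reaches shared state, done under the
 * output lock in the real code. */
void sandbox_commit(MiniVncState *real, const MiniVncState *scratch)
{
    /* vnc_lock_output(real) would go here */
    memcpy(real->buf + real->offset, scratch->buf, scratch->offset);
    real->offset += scratch->offset;
    /* vnc_unlock_output(real) */
}
```

The two calls Corentin flagged, vnc_write() and vnc_flush() on the real state, are precisely the commit step: everything before it is isolated, which is why the only locking problem is around those two calls.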
On 2011-03-09 10:54, Corentin Chary wrote:
> Re-reading:
>
>>> So we are calling an IOHandlerRecord::fd_write handler that is NULL.
>>> Looking at qemu_set_fd_handler2, this may happen if that function is
>>> called for an existing io-handler entry with a non-NULL write handler,
>>> passing a NULL write and a non-NULL read handler. And all this without
>>> the global mutex held.
>
> When using the vnc thread, nothing in the vnc thread is ever called
> directly by an IO-handler, so I don't really see how it would
> trigger this.
> And since there is a lock for accessing the output buffer (and the
> network, actually), there should not be any race condition either.
>
> So the real issue is that qemu_set_fd_handler2() is called outside of
> the main thread by those two vnc_write() and vnc_flush() calls,
> without any kind of locking.

Yes, that's what I was referring to.

>> In upstream qemu, the latter - if it exists (which is not the case in
>> non-io-thread mode).
>> In qemu-kvm, those locks are as yet unused; rather, the code in
>> qemu-kvm.c implements the global lock.
>
> So there is currently no lock for that when io-thread is disabled :/.
> Spice also seems to protect this kind of thing with
> qemu_mutex_lock_iothread().
>
> Maybe qemu_mutex_lock_iothread() should also be defined when
> CONFIG_VNC_THREAD=y?
>
>> But I'm also not sure about vnc_send_framebuffer_update. Someone has to
>> go through the threaded vnc code again, very thoroughly. It looks fragile.
>
> While vnc-thread is enabled, vnc_send_framebuffer_update() will always
> call vnc_write() with csock = -1 on a temporary buffer. Check
> vnc_async_encoding_start() and vnc_async_encoding_end(); they provide
> a kind of "sandbox" that prevents the thread from writing anything the
> main thread will use.
> You can also see that as a "transaction": the
> thread computes the update in a temporary buffer, and only sends it to
> the network (real vnc_write calls with csock correctly set) once it has
> successfully finished.
>
> The only two function calls that break this isolation are the two
> that I pointed out earlier.

Probably the best way is to make vnc stop fiddling with qemu_set_fd_handler2, specifically in threaded mode. Why does it need to set/reset the write handler all the time?

Jan
On 2011-03-09 10:58, Jan Kiszka wrote:
> On 2011-03-09 10:54, Corentin Chary wrote:
>> Re-reading:
>>
>>>> So we are calling an IOHandlerRecord::fd_write handler that is NULL.
>>>> Looking at qemu_set_fd_handler2, this may happen if that function is
>>>> called for an existing io-handler entry with a non-NULL write handler,
>>>> passing a NULL write and a non-NULL read handler. And all this without
>>>> the global mutex held.
>>
>> When using the vnc thread, nothing in the vnc thread is ever called
>> directly by an IO-handler, so I don't really see how it would
>> trigger this.
>> And since there is a lock for accessing the output buffer (and the
>> network, actually), there should not be any race condition either.
>>
>> So the real issue is that qemu_set_fd_handler2() is called outside of
>> the main thread by those two vnc_write() and vnc_flush() calls,
>> without any kind of locking.
>
> Yes, that's what I was referring to.
>
>>> In upstream qemu, the latter - if it exists (which is not the case in
>>> non-io-thread mode).
>>> In qemu-kvm, those locks are as yet unused; rather, the code in
>>> qemu-kvm.c implements the global lock.
>>
>> So there is currently no lock for that when io-thread is disabled :/.
>> Spice also seems to protect this kind of thing with
>> qemu_mutex_lock_iothread().
>>
>> Maybe qemu_mutex_lock_iothread() should also be defined when
>> CONFIG_VNC_THREAD=y?
>>
>>> But I'm also not sure about vnc_send_framebuffer_update. Someone has to
>>> go through the threaded vnc code again, very thoroughly. It looks fragile.
>>
>> While vnc-thread is enabled, vnc_send_framebuffer_update() will always
>> call vnc_write() with csock = -1 on a temporary buffer. Check
>> vnc_async_encoding_start() and vnc_async_encoding_end(); they provide
>> a kind of "sandbox" that prevents the thread from writing anything the
>> main thread will use.
>> You can also see that as a "transaction": the
>> thread computes the update in a temporary buffer, and only sends it to
>> the network (real vnc_write calls with csock correctly set) once it has
>> successfully finished.
>>
>> The only two function calls that break this isolation are the two
>> that I pointed out earlier.
>
> Probably the best way is to make vnc stop fiddling with
> qemu_set_fd_handler2, specifically in threaded mode. Why does it need to
> set/reset the write handler all the time?

The other question is: who's responsible for writing to the client socket in threaded mode? Only the vnc thread(s), or also other qemu threads? In the former case, just avoid setting a write handler at all.

Jan
> Probably the best way is to make vnc stop fiddling with
> qemu_set_fd_handler2, specifically in threaded mode.
> Why does it need to set/reset the write handler all the time?

I didn't write the original code, but it's probably to avoid calling a write handler when there is no data to write. That would make select() busy-loop when there is actually no data to write. There is also the vnc_disconnect_start() case, when the socket seems to be broken and we remove all handlers.

> The other question is: who's responsible for writing to the client
> socket in threaded mode? Only the vnc thread(s), or also other qemu
> threads? In the former case, just avoid setting a write handler at all.

Cheap stuff is done by the main thread (cursor, etc.). The thread only does framebuffer updates.
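The busy-loop concern can be made concrete: if the fd stayed registered for write with nothing queued, every select() iteration would wake up immediately, since a healthy socket is almost always writable. A simplified sketch of the decision (hypothetical names, not the actual qemu code):

```c
#include <assert.h>
#include <stddef.h>

/* Register a write handler only while output is pending: this is what
 * the set/reset dance in vnc_client_write_plain / vnc_flush buys. */
typedef struct {
    size_t output_pending;     /* bytes queued in vs->output */
    int want_write;            /* is the fd placed in writefds? */
} MiniClient;

void update_write_handler(MiniClient *vs)
{
    /* mirrors: qemu_set_fd_handler2(csock, NULL, read,
     *              vs->output_pending ? write : NULL, vs) */
    vs->want_write = (vs->output_pending > 0);
}

/* One select() iteration: would it return without blocking? */
int select_would_spin(const MiniClient *vs, int fd_writable)
{
    /* a writable fd sitting in writefds means an instant wakeup */
    return vs->want_write && fd_writable;
}
```

This also shows why the handler flips so often: every time the output buffer drains or refills, the registration has to change, and in threaded mode each flip is an unlocked qemu_set_fd_handler2() call from the worker.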
On 2011-03-09 11:06, Corentin Chary wrote:
>> Probably the best way is to make vnc stop fiddling with
>> qemu_set_fd_handler2, specifically in threaded mode.
>> Why does it need to set/reset the write handler all the time?
>
> I didn't write the original code, but it's probably to avoid calling a
> write handler when there is no data to write. That would make select()
> busy-loop when there is actually no data to write.
> There is also the vnc_disconnect_start() case, when the socket seems to
> be broken and we remove all handlers.
>
>> The other question is: who's responsible for writing to the client
>> socket in threaded mode? Only the vnc thread(s), or also other qemu
>> threads? In the former case, just avoid setting a write handler at all.
>
> Cheap stuff is done by the main thread (cursor, etc.). The thread
> only does framebuffer updates.

And both are synchronized with a vnc-private lock only?

The problem with this model is the non-threaded qemu execution model. Even if we acquire the global mutex to protect handler updates against the main select loop, this won't help here. So vnc-threading likely needs to be made dependent on --enable-io-thread.

Jan
>> Cheap stuff is done by the main thread (cursor, etc.). The thread
>> only does framebuffer updates.
>
> And both are synchronized with a vnc-private lock only?

Yes.

> The problem with this model is the non-threaded qemu execution model.
> Even if we acquire the global mutex to protect handler updates against
> the main select loop, this won't help here. So vnc-threading likely
> needs to be made dependent on --enable-io-thread.

Plus proper lock_iothread() calls, right? If yes, I'll send a patch for that (or you can do it, as you want).

Thanks,
On 2011-03-09 11:14, Corentin Chary wrote:
>>> Cheap stuff is done by the main thread (cursor, etc.). The thread
>>> only does framebuffer updates.
>>
>> And both are synchronized with a vnc-private lock only?
>
> Yes.
>
>> The problem with this model is the non-threaded qemu execution model.
>> Even if we acquire the global mutex to protect handler updates against
>> the main select loop, this won't help here. So vnc-threading likely
>> needs to be made dependent on --enable-io-thread.
>
> Plus proper lock_iothread() calls, right?

Yep.

> If yes, I'll send a patch for that (or you can do it, as you want).

Don't wait for me. :)

Jan
diff --git a/ui/vnc-jobs-async.c b/ui/vnc-jobs-async.c
index 1d4c5e7..e02d891 100644
--- a/ui/vnc-jobs-async.c
+++ b/ui/vnc-jobs-async.c
@@ -258,7 +258,9 @@ static int vnc_worker_thread_loop(VncJobQueue *queue)
             goto disconnected;
         }

+        /* lock */
         vnc_write(job->vs, vs.output.buffer, vs.output.offset);
+        /* unlock */

 disconnected:
     /* Copy persistent encoding data */
@@ -267,7 +269,9 @@ disconnected:
     vnc_unlock_output(job->vs);

     if (flush) {
+        /* lock */
         vnc_flush(job->vs);
+        /* unlock */
     }
     vnc_lock_queue(queue)
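For reference, the /* lock */ and /* unlock */ markers in the hunk above would expand to something like the following. This is only a sketch under the assumptions of the thread: the lock is modeled as a plain counter rather than a real mutex, the vnc calls are stubs, and whether qemu_mutex_lock_iothread() is even available in a given build configuration is exactly the open question discussed above.

```c
#include <assert.h>

/* Stubs standing in for the real qemu/vnc functions; only the locking
 * pattern around the two dangerous calls matters here. */
int iothread_lock_depth;
int writes_under_lock;

void qemu_mutex_lock_iothread_stub(void)   { iothread_lock_depth++; }
void qemu_mutex_unlock_iothread_stub(void) { iothread_lock_depth--; }

void vnc_write_stub(void)
{
    /* the real vnc_write can end up in qemu_set_fd_handler2(), so it
     * must only run with the global lock held */
    if (iothread_lock_depth > 0)
        writes_under_lock++;
}

void vnc_flush_stub(void) { vnc_write_stub(); }

/* The tail of vnc_worker_thread_loop() with the markers filled in;
 * returns how many writes happened under the lock. */
int worker_send_update(int flush)
{
    qemu_mutex_lock_iothread_stub();
    vnc_write_stub();                 /* vnc_write(job->vs, ...) */
    qemu_mutex_unlock_iothread_stub();

    if (flush) {
        qemu_mutex_lock_iothread_stub();
        vnc_flush_stub();             /* vnc_flush(job->vs) */
        qemu_mutex_unlock_iothread_stub();
    }
    return writes_under_lock;
}
```

Note the lock is taken per call and dropped in between, so the worker never holds the global mutex across its (potentially long) encoding work, only across the two calls that can touch the io-handler list.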