From patchwork Wed Apr 3 02:17:41 2013 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Peter Crosthwaite X-Patchwork-Id: 233208 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from lists.gnu.org (lists.gnu.org [208.118.235.17]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (Client did not present a certificate) by ozlabs.org (Postfix) with ESMTPS id D58592C0128 for ; Wed, 3 Apr 2013 13:18:03 +1100 (EST) Received: from localhost ([::1]:36753 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1UNDGz-0000by-Om for incoming@patchwork.ozlabs.org; Tue, 02 Apr 2013 22:18:01 -0400 Received: from eggs.gnu.org ([208.118.235.92]:37737) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1UNDGj-0000bs-5Y for qemu-devel@nongnu.org; Tue, 02 Apr 2013 22:17:46 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1UNDGh-0000Cx-7r for qemu-devel@nongnu.org; Tue, 02 Apr 2013 22:17:45 -0400 Received: from mail-bk0-x22e.google.com ([2a00:1450:4008:c01::22e]:37833) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1UNDGg-0000Cf-Sb for qemu-devel@nongnu.org; Tue, 02 Apr 2013 22:17:43 -0400 Received: by mail-bk0-f46.google.com with SMTP id je9so516512bkc.5 for ; Tue, 02 Apr 2013 19:17:41 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:x-received:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :x-gm-message-state; bh=5Vg7njyqC5bzRJkrYj8vuBXDSesjVXnLGpsIHaT7o4U=; b=VyVp4EYUz6RDa7xdGQYDK7nnFzDHsMKZINOEXRtUSEZaoiCSnGYrcPuLeI//RsXS9m 1R2OCAqkVY262Xn+rPW+8bJwdFbsw0Fjv8+DKPnhazXitSlvRAsw4uJRVx4QliahmL1w gK1+LDG2mbvQCnPZya1TShghR4GO/2kOstwTU5LAgoAaB8apDqJuMHR8OXKc3p9UAaZe bAuq7LInP1g95L3RLozKCYy3PQ30HR4z4KTuVTLMJtfMfqn0hFftDgikf1AVvGmtZI2B KMEV/OTWO1pIMz3O93yv6PHwctrqrv4rD4Bw+epNz+Gz3WDYXUR8MkNyuJqorZyEyBWO LjiQ== MIME-Version: 1.0 X-Received: by 10.205.25.136 with SMTP id ri8mr7651354bkb.62.1364955461598; Tue, 02 Apr 2013 19:17:41 -0700 (PDT) Received: by 10.205.107.5 with HTTP; Tue, 2 Apr 2013 19:17:41 -0700 (PDT) In-Reply-To: <515ABCD1.2070008@redhat.com> References: <1364893452-10604-1-git-send-email-peter.crosthwaite@xilinx.com> <515ABCD1.2070008@redhat.com> Date: Wed, 3 Apr 2013 12:17:41 +1000 X-Google-Sender-Auth: fXJXMg0BcGCVRirxx1okSJnaRD8 Message-ID: From: Peter Crosthwaite To: Paolo Bonzini X-Gm-Message-State: ALoCoQkTKhDAdKdReuIaD9lyD1IXnow6xiktsQ30Z9bKUgjqNS8t5BphMxxqjeCnO/c6NZGPfRxD X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-Received-From: 2a00:1450:4008:c01::22e Cc: aliguori@us.ibm.com, qemu-devel@nongnu.org, aurelien@aurel32.net, gson@gson.org Subject: Re: [Qemu-devel] [RFC PATCH] main-loop: Unconditionally unlock iothread X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Hi Paolo, Thanks for the clues. On Tue, Apr 2, 2013 at 9:11 PM, Paolo Bonzini wrote: > Il 02/04/2013 11:04, Peter Crosthwaite ha scritto: >> Public bug: 1154328 >> Broken Commit: a29753f8aa79a34a324afebe340182a51a5aef11 >> >> ATM, the timeout from g_pollfds_fill is inhibiting unlocking of the >> iothread. This is capable of causing a total deadlock when hw/serial >> is used as a device. The bug manifests when you go -nographic -serial >> mon:stdio and then paste 40 or more chars into the terminal. >> >> My knowledge of this g_foo is vague at best, but my best working >> theory is this: >> >> - First 8 chars are recieved by the serial device no complaints. >> - The next 32 chars, serial returns false for can_receive() so they >> are buffered by the MuxDriver object - mux_chr_read() >> - Buffer is full, so 41st char causes false return from Muxes own >> can_read() >> - This propagates all the way up to glib_pollfds_fill and manifests >> as a timeout > > I suppose you mean "manifests as timeout==0". The question is *which* > GSource has a timeout of zero? None of the GSources are to blame. The timeout actually starts as zero as set by mainloop wait when the nonblocking flag is set. This little hack also fixes the bug (while leaving the mutex ifery in). Not the mux's: if mux_chr_can_read() > returns zero, the prepare function returns FALSE without touching the > timeout at all... > > static gboolean io_watch_poll_prepare(GSource *source, gint *timeout_) > { > IOWatchPoll *iwp = io_watch_poll_from_source(source); > > iwp->max_size = iwp->fd_can_read(iwp->opaque); > if (iwp->max_size == 0) { > return FALSE; > } > > return g_io_watch_funcs.prepare(source, timeout_); > } > >> - Timeout means no unlock of IOthread. Device land never sees any more >> cycles so the serial port never progresses - no flushing of >> buffer > > Still, this is plausible, so the patch looks correct. > Ok, Ill give it some more time and fix commit message. if we don't figure out a better patch. Regards, Peter > Paolo > >> - Deadlock >> >> Tested on petalogix_ml605 microblazeel machine model, which was faulty >> due to 1154328. >> >> Fix by removing the conditions on unlocking the iothread. Don't know >> what else this will break but the timeout is certainly the wrong >> condition for the unlock. Probably the real solution is to have a more >> selective unlock policy. >> >> I'm happy for someone to take this patch off my hands, or educate me on >> the correct implementation. For the peeps doing automated testing on >> nographic platforms this will get your build working again. >> >> Signed-off-by: Peter Crosthwaite >> --- >> main-loop.c | 8 ++------ >> 1 files changed, 2 insertions(+), 6 deletions(-) >> >> diff --git a/main-loop.c b/main-loop.c >> index eb80ff3..a376898 100644 >> --- a/main-loop.c >> +++ b/main-loop.c >> @@ -194,15 +194,11 @@ static int os_host_main_loop_wait(uint32_t timeout) >> >> glib_pollfds_fill(&timeout); >> >> - if (timeout > 0) { >> - qemu_mutex_unlock_iothread(); >> - } >> + qemu_mutex_unlock_iothread(); >> >> ret = g_poll((GPollFD *)gpollfds->data, gpollfds->len, timeout); >> >> - if (timeout > 0) { >> - qemu_mutex_lock_iothread(); >> - } >> + qemu_mutex_lock_iothread(); >> >> glib_pollfds_poll(); >> return ret; >> > > Reviewed-by: Anthony Liguori --- a/main-loop.c +++ b/main-loop.c @@ -428,10 +428,6 @@ int main_loop_wait(int nonblocking) int ret; uint32_t timeout = UINT32_MAX; - if (nonblocking) { - timeout = 0; - } - /* poll any events */ g_array_set_size(gpollfds, 0); /* reset for new iteration */ /* XXX: separate device handlers from system ones */ --- Is it expected that this non-blocking condition implies lockup of the iothread? Diving deeper again, I notice that this non-blocking feature isn't even enabled at all for KVM. Which would probably mean that this bug is not replicable by anyone testing with KVM. We could just make all the CPU backends consistent with KVM by removing the nonblocking altogether. Any comments from TCG people :) ? --- a/vl.c +++ b/vl.c @@ -2030,17 +2030,15 @@ static bool main_loop_should_exit(void) static void main_loop(void) { - bool nonblocking; int last_io = 0; #ifdef CONFIG_PROFILER int64_t ti; #endif do { - nonblocking = !kvm_enabled() && last_io > 0; #ifdef CONFIG_PROFILER ti = profile_getclock(); #endif - last_io = main_loop_wait(nonblocking); + last_io = main_loop_wait(0); #ifdef CONFIG_PROFILER dev_time += profile_getclock() - ti; #endif