From patchwork Wed Sep 11 19:09:25 2019
From: Rafael David Tinoco
To: qemu-devel@nongnu.org
Date: Wed, 11 Sep 2019 19:09:25 -0000
Subject: [Qemu-devel] [Bug 1805256] Re: qemu_futex_wait() lockups in ARM64: 2 possible issues
Message-Id: <1864070a-2f84-1d98-341e-f01ddf74ec4b@ubuntu.com>
References: <154327283728.15443.11625169757714443608.malonedeb@soybean.canonical.com>
Reply-To: Bug 1805256 <1805256@bugs.launchpad.net>

> Zhengui's theory that notify_me doesn't work properly on ARM is more
> promising, but he couldn't provide a clear explanation of why he thought
> notify_me is involved.  In particular, I would have expected notify_me to
> be wrong if the qemu_poll_ns call came from aio_ctx_dispatch, for example:
>
>     glib_pollfds_fill
>       g_main_context_prepare
>         aio_ctx_prepare
>           atomic_or(&ctx->notify_me, 1)
>     qemu_poll_ns
>     glib_pollfds_poll
>       g_main_context_check
>         aio_ctx_check
>           atomic_and(&ctx->notify_me, ~1)
>       g_main_context_dispatch
>         aio_ctx_dispatch
>           /* do something for event */
>             qemu_poll_ns
>

Paolo,

I tried confining execution to a single NUMA domain (cpu & mem) and still
faced the issue. I then added a mutex, "ctx->notify_me_lcktest", into the
context to protect "ctx->notify_me", as shown below, and it seems to have
either fixed or mitigated the problem.

Before, I was able to trigger the hang once every 3 or 4 runs; I have now
run qemu-img convert more than 30 times and couldn't reproduce it again.

Next step is to play with the barriers and check why the existing ones
aren't enough to order access to ctx->notify_me... or should I try/do
something else, in your opinion?

This arch/machine (Huawei D06):

$ lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  1
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        4
Vendor ID:           0x48
Model:               0
Stepping:            0x0
CPU max MHz:         2000.0000
CPU min MHz:         200.0000
BogoMIPS:            200.00
L1d cache:           64K
L1i cache:           64K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-23
NUMA node1 CPU(s):   24-47
NUMA node2 CPU(s):   48-71
NUMA node3 CPU(s):   72-95
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm dcpop

----

diff --git a/include/block/aio.h b/include/block/aio.h
index 0ca25dfec6..0724086d91 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -84,6 +84,7 @@ struct AioContext {
      * dispatch phase, hence a simple counter is enough for them.
      */
     uint32_t notify_me;
+    QemuMutex notify_me_lcktest;
 
     /* A lock to protect between QEMUBH and AioHandler adders and deleter,
      * and to ensure that no callbacks are removed while we're walking and
diff --git a/util/aio-posix.c b/util/aio-posix.c
index 51c41ed3c9..031d6e2997 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -529,7 +529,9 @@ static bool run_poll_handlers(AioContext *ctx, int64_t max_ns, int64_t *timeout)
     bool progress;
     int64_t start_time, elapsed_time;
 
+    qemu_mutex_lock(&ctx->notify_me_lcktest);
     assert(ctx->notify_me);
+    qemu_mutex_unlock(&ctx->notify_me_lcktest);
     assert(qemu_lockcnt_count(&ctx->list_lock) > 0);
 
     trace_run_poll_handlers_begin(ctx, max_ns, *timeout);
@@ -601,8 +603,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
      * so disable the optimization now.
      */
     if (blocking) {
+        qemu_mutex_lock(&ctx->notify_me_lcktest);
         assert(in_aio_context_home_thread(ctx));
         atomic_add(&ctx->notify_me, 2);
+        qemu_mutex_unlock(&ctx->notify_me_lcktest);
     }
 
     qemu_lockcnt_inc(&ctx->list_lock);
@@ -647,8 +651,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
     }
 
     if (blocking) {
+        qemu_mutex_lock(&ctx->notify_me_lcktest);
         atomic_sub(&ctx->notify_me, 2);
         aio_notify_accept(ctx);
+        qemu_mutex_unlock(&ctx->notify_me_lcktest);
     }
 
     /* Adjust polling time */
diff --git a/util/async.c b/util/async.c
index c10642a385..140e1e86f5 100644
--- a/util/async.c
+++ b/util/async.c
@@ -221,7 +221,9 @@ aio_ctx_prepare(GSource *source, gint *timeout)
 {
     AioContext *ctx = (AioContext *) source;
 
+    qemu_mutex_lock(&ctx->notify_me_lcktest);
     atomic_or(&ctx->notify_me, 1);
+    qemu_mutex_unlock(&ctx->notify_me_lcktest);
 
     /* We assume there is no timeout already supplied */
     *timeout = qemu_timeout_ns_to_ms(aio_compute_timeout(ctx));
@@ -239,8 +241,10 @@ aio_ctx_check(GSource *source)
     AioContext *ctx = (AioContext *) source;
     QEMUBH *bh;
 
+    qemu_mutex_lock(&ctx->notify_me_lcktest);
     atomic_and(&ctx->notify_me, ~1);
     aio_notify_accept(ctx);
+    qemu_mutex_unlock(&ctx->notify_me_lcktest);
 
     for (bh = ctx->first_bh; bh; bh = bh->next) {
         if (bh->scheduled) {
@@ -346,11 +350,13 @@ void aio_notify(AioContext *ctx)
     /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
      * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
      */
-    smp_mb();
+    //smp_mb();
+    qemu_mutex_lock(&ctx->notify_me_lcktest);
     if (ctx->notify_me) {
         event_notifier_set(&ctx->notifier);
         atomic_mb_set(&ctx->notified, true);
     }
+    qemu_mutex_unlock(&ctx->notify_me_lcktest);
 }
 
 void aio_notify_accept(AioContext *ctx)
@@ -424,6 +430,8 @@ AioContext *aio_context_new(Error **errp)
     ctx->co_schedule_bh = aio_bh_new(ctx, co_schedule_bh_cb, ctx);
     QSLIST_INIT(&ctx->scheduled_coroutines);
 
+    qemu_rec_mutex_init(&ctx->notify_me_lcktest);
+
     aio_set_event_notifier(ctx, &ctx->notifier,
                            false,
                            (EventNotifierHandler *)
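
----

For reference while digging into those barriers, below is a small standalone
litmus test (my own sketch, not QEMU code) of the Dekker-style
store-buffering pattern that ctx->notify_me and the smp_mb() in aio_notify
rely on: the poller publishes notify_me and then checks for pending work,
while the notifier publishes the work and then checks notify_me. The names
notify_me/event_pending only mirror the discussion above; the ORDER macro,
the iteration count and everything else are assumptions made for
illustration. With memory_order_seq_cst the "both sides miss each other"
outcome should never be counted; switching ORDER to memory_order_relaxed
models what a missing or too-weak barrier would permit on a weakly ordered
ARM64 machine like the D06, i.e. the lost wakeup that the mutex above
serializes away.

/* sb_litmus.c - store-buffering litmus test shaped like the notify_me
 * protocol discussed above.  Illustrative sketch only; names are borrowed
 * from the discussion, nothing here is QEMU code.
 *
 *   poller   (cf. aio_ctx_prepare/aio_poll): notify_me = 1; then read event_pending
 *   notifier (cf. aio_notify):               event_pending = 1; then read notify_me
 *
 * Build: cc -O2 -pthread sb_litmus.c -o sb_litmus
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* seq_cst is the "full barrier" case; change to memory_order_relaxed to
 * model stores and loads that are free to be reordered. */
#define ORDER      memory_order_seq_cst
#define ITERATIONS 100000

static atomic_int notify_me;       /* cf. ctx->notify_me                */
static atomic_int event_pending;   /* cf. bh->scheduled / ctx->notified */
static atomic_int poller_sleeps;   /* poller decided to block in poll() */
static atomic_int wakeup_skipped;  /* notifier decided not to notify    */

static void *poller(void *arg)
{
    (void)arg;
    atomic_store_explicit(&notify_me, 1, ORDER);
    if (!atomic_load_explicit(&event_pending, ORDER)) {
        /* No event seen: the real code would now block in qemu_poll_ns(). */
        atomic_store(&poller_sleeps, 1);
    }
    return NULL;
}

static void *notifier(void *arg)
{
    (void)arg;
    atomic_store_explicit(&event_pending, 1, ORDER);
    if (!atomic_load_explicit(&notify_me, ORDER)) {
        /* No poller seen: the real code would skip event_notifier_set(). */
        atomic_store(&wakeup_skipped, 1);
    }
    return NULL;
}

int main(void)
{
    long lost = 0;

    for (long i = 0; i < ITERATIONS; i++) {
        pthread_t a, b;

        atomic_store(&notify_me, 0);
        atomic_store(&event_pending, 0);
        atomic_store(&poller_sleeps, 0);
        atomic_store(&wakeup_skipped, 0);

        pthread_create(&a, NULL, poller, NULL);
        pthread_create(&b, NULL, notifier, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);

        /* Both sides missed each other's store: the poller sleeps and
         * nobody wakes it, i.e. the lost-wakeup/hang scenario. */
        if (atomic_load(&poller_sleeps) && atomic_load(&wakeup_skipped)) {
            lost++;
        }
    }

    printf("lost wakeups: %ld / %d\n", lost, ITERATIONS);
    return 0;
}

A non-zero count with ORDER relaxed (and zero with seq_cst) on the D06 would
at least point at ordering around notify_me rather than at the futex code
itself.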