From patchwork Fri Sep 16 17:46:52 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Richard Henderson X-Patchwork-Id: 671030 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from lists.gnu.org (lists.gnu.org [208.118.235.17]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 3sbP255mJRz9s3T for ; Sat, 17 Sep 2016 04:29:17 +1000 (AEST) Authentication-Results: ozlabs.org; dkim=fail reason="signature verification failed" (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.b=Lm0E4N5Y; dkim-atps=neutral Received: from localhost ([::1]:43153 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bkxsx-0005xj-BK for incoming@patchwork.ozlabs.org; Fri, 16 Sep 2016 14:29:15 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:41377) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bkxFY-0006Hy-1w for qemu-devel@nongnu.org; Fri, 16 Sep 2016 13:48:34 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1bkxFT-0007g2-RU for qemu-devel@nongnu.org; Fri, 16 Sep 2016 13:48:31 -0400 Received: from mail-pa0-f51.google.com ([209.85.220.51]:35224) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bkxFT-0007ff-Ix for qemu-devel@nongnu.org; Fri, 16 Sep 2016 13:48:27 -0400 Received: by mail-pa0-f51.google.com with SMTP id oz2so23466343pac.2 for ; Fri, 16 Sep 2016 10:48:27 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=sender:from:to:cc:subject:date:message-id:in-reply-to:references; bh=hEuPPJPYWaViKJtXXFewHlEi4J6Toll4BfgwuDpX+TA=; b=Lm0E4N5YXB8R0jEiC6v7QNA3VbIj4DxC0a+OpZ/9+NyvqEnCP/s2h4lmZZyCj5oSVy ZV94rNdWi5mU2xw1OsFu+DoL2vUXj96zIrfT241PzYYYKAnLvWU+K7C7/v4AN0+1+slv bb6CyIFvHMSN+qa03XmJYj5ByryEO2CKRFbbXIzAq+q6WsqlYRRlmAXaXWZ+g2+4HDi5 TnvzqRLdEymDATWoty8R09mcH7MWFWxKByMZ9AEXbfEUOaHqi1+y5ylEWxI57gfSckD2 SMHommbUNIK0iOVapnHDEhxNcUvM4JUw+TJGhMYXsYkVrGvHqWoGGPja/vhwv+o0WCAA oOpQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:sender:from:to:cc:subject:date:message-id :in-reply-to:references; bh=hEuPPJPYWaViKJtXXFewHlEi4J6Toll4BfgwuDpX+TA=; b=EHN7YQ2ssjXB4JVMiAfwhJdK3hD8ov26XMDGxfSjZWdV4fJhK90kJiuJ2jzvFRTRLW foB2x5s69DGACYJybEkAX7L588PyWovcxXqoIczGzNu/I6QuuH69/l6E1VGI2JGY4eA0 1j4A0zoHPv9syH+V6Zr5sYMVcyhNYC3OOV7u1K6E9uJQsVVIG+se/N7q1NhaxGjcmOmh mO9XHlbKFAKt13ECYWA0nR4MtZ+poH/njTuMv6G834f5hHIY1XZ5kPSfpAbLBShUZ2ii iIFYfRJhqLcjUKM2Bu+PcVEdqQPACsoew/c6iUX/Kd8OMTtjxtCl8P6SQTZmYa0mv2eV Q6ng== X-Gm-Message-State: AE9vXwMdwZoQ5vjlmjVb4uKPYFMMc9IkYh8/NcYXDOh1w73ZFk9haqyV6YdUGPPUQsj2kA== X-Received: by 10.66.129.174 with SMTP id nx14mr11210752pab.167.1474048046507; Fri, 16 Sep 2016 10:47:26 -0700 (PDT) Received: from anchor.twiddle.net (174-24-157-40.tukw.qwest.net. [174.24.157.40]) by smtp.gmail.com with ESMTPSA id e71sm27910078pfc.75.2016.09.16.10.47.25 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 16 Sep 2016 10:47:25 -0700 (PDT) From: Richard Henderson To: qemu-devel@nongnu.org Date: Fri, 16 Sep 2016 10:46:52 -0700 Message-Id: <1474048017-26696-31-git-send-email-rth@twiddle.net> X-Mailer: git-send-email 2.5.5 In-Reply-To: <1474048017-26696-1-git-send-email-rth@twiddle.net> References: <1474048017-26696-1-git-send-email-rth@twiddle.net> X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] [fuzzy] X-Received-From: 209.85.220.51 Subject: [Qemu-devel] [PATCH v4 30/35] target-arm: emulate aarch64's LL/SC using cmpxchg helpers X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: "Emilio G. Cota" Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: "Qemu-devel" From: "Emilio G. Cota" Emulating LL/SC with cmpxchg is not correct, since it can suffer from the ABA problem. Portable parallel code, however, is written assuming only cmpxchg--and not LL/SC--is available. This means that in practice emulating LL/SC with cmpxchg is a viable alternative. The appended emulates LL/SC pairs in aarch64 with cmpxchg helpers. This works in both user and system mode. In usermode, it avoids pausing all other CPUs to perform the LL/SC pair. The subsequent performance and scalability improvement is significant, as the plots below show. They plot the throughput of atomic_add-bench compiled for ARM and executed on a 64-core x86 machine. Hi-res plots: http://imgur.com/a/JVc8Y atomic_add-bench: 1000000 ops/thread, [0,1] range 18 ++---------+----------+---------+----------+----------+----------+---++ +cmpxchg +-E--+ + + + + + | 16 ++master +-H--+ ++ || | 14 ++ ++ | | | 12 ++| ++ | | | 10 ++++ ++ 8 ++E ++ |+++ | 6 ++ | ++ | | | 4 ++ | ++ | | | 2 +H++E+--- ++ + | +E++----+E+---+--+E+----++E+------+E+------+E++----+E+---+--+E| 0 ++H-H----H-+-----H----+---------+----------+----------+----------+---++ 0 10 20 30 40 50 60 Number of threads atomic_add-bench: 1000000 ops/thread, [0,2] range 18 ++---------+----------+---------+----------+----------+----------+---++ +cmpxchg +-E--+ + + + + + | 16 ++master +-H--+ ++ | | | 14 ++E ++ | | | 12 ++| ++ |+++ | 10 ++ | ++ 8 ++ | ++ | | | 6 ++ | ++ | | | 4 ++ | ++ | +E+--- | 2 +H+ +E+-----+++ +++ +++ ---+E+-----+E+------+++ +++ + +E+---+--+E+----++E+------+E+--- ++++ +++ + +E| 0 ++H-H----H-+-----H----+---------+----------+----------+----------+---++ 0 10 20 30 40 50 60 Number of threads atomic_add-bench: 1000000 ops/thread, [0,128] range 70 ++---------+----------+---------+----------+----------+----------+---++ +cmpxchg +-E--+ + + + + + | 60 ++master +-H--+ +++ ---+E+-----+E+------+E+ | +E+------E-------+E+--- | | --- +++ | 50 ++ +++--- ++ | -+E+ | 40 ++ +++---- ++ | E- | | --| | 30 ++ -- +++ ++ | +E+ | 20 ++E+ ++ |E+ | | | 10 ++ ++ + + + + + + + | 0 +HH-H----H-+-----H----+---------+----------+----------+----------+---++ 0 10 20 30 40 50 60 Number of threads atomic_add-bench: 1000000 ops/thread, [0,1024] range 160 ++---------+---------+----------+---------+----------+----------+---++ +cmpxchg +-E--+ + + + + + | 140 ++master +-H--+ +++ +++ | -+E+-----+E+-------E| 120 ++ +++ ---- +++ | +++ ----E-- | 100 ++ --E--- +++ ++ | +++ ---- +++ | 80 ++ --E-- ++ | ---- +++ | | -+E+ | 60 ++ ---- +++ ++ | +E+- | 40 ++ -- ++ | +E+ | 20 +EE+ ++ +++ + + + + + + | 0 +HH-H---H--+-----H---+----------+---------+----------+----------+---++ 0 10 20 30 40 50 60 Number of threads [rth: Rearrange 128-bit cmpxchg helper. Enforce alignment on LL.] Signed-off-by: Emilio G. Cota Message-Id: <1467054136-10430-28-git-send-email-cota@braap.org> Signed-off-by: Richard Henderson --- target-arm/helper-a64.c | 113 +++++++++++++++++++++++++++++++++++++++++++++ target-arm/helper-a64.h | 2 + target-arm/translate-a64.c | 106 +++++++++++++++++++----------------------- 3 files changed, 163 insertions(+), 58 deletions(-) diff --git a/target-arm/helper-a64.c b/target-arm/helper-a64.c index 41e48a4..35c82b4 100644 --- a/target-arm/helper-a64.c +++ b/target-arm/helper-a64.c @@ -27,6 +27,10 @@ #include "qemu/bitops.h" #include "internals.h" #include "qemu/crc32c.h" +#include "exec/exec-all.h" +#include "exec/cpu_ldst.h" +#include "qemu/int128.h" +#include "tcg.h" #include /* For crc32 */ /* C2.4.7 Multiply and divide */ @@ -444,3 +448,112 @@ uint64_t HELPER(crc32c_64)(uint64_t acc, uint64_t val, uint32_t bytes) /* Linux crc32c converts the output to one's complement. */ return crc32c(acc, buf, bytes) ^ 0xffffffff; } + +/* Returns 0 on success; 1 otherwise. */ +uint64_t HELPER(paired_cmpxchg64_le)(CPUARMState *env, uint64_t addr, + uint64_t new_lo, uint64_t new_hi) +{ + uintptr_t ra = GETPC(); + Int128 oldv, cmpv, newv; + bool success; + + cmpv = int128_make128(env->exclusive_val, env->exclusive_high); + newv = int128_make128(new_lo, new_hi); + + if (parallel_cpus) { +#ifndef CONFIG_ATOMIC128 + cpu_loop_exit_atomic(ENV_GET_CPU(env), ra); +#else + int mem_idx = cpu_mmu_index(env, false); + TCGMemOpIdx oi = make_memop_idx(MO_LEQ | MO_ALIGN_16, mem_idx); + oldv = helper_atomic_cmpxchgo_le_mmu(env, addr, cmpv, newv, oi, ra); + success = int128_eq(oldv, cmpv); +#endif + } else { + uint64_t o0, o1; + +#ifdef CONFIG_USER_ONLY + /* ??? Enforce alignment. */ + uint64_t *haddr = g2h(addr); + o0 = ldq_le_p(haddr + 0); + o1 = ldq_le_p(haddr + 1); + oldv = int128_make128(o0, o1); + + success = int128_eq(oldv, cmpv); + if (success) { + stq_le_p(haddr + 0, int128_getlo(newv)); + stq_le_p(haddr + 8, int128_gethi(newv)); + } +#else + int mem_idx = cpu_mmu_index(env, false); + TCGMemOpIdx oi0 = make_memop_idx(MO_LEQ | MO_ALIGN_16, mem_idx); + TCGMemOpIdx oi1 = make_memop_idx(MO_LEQ, mem_idx); + + o0 = helper_le_ldq_mmu(env, addr + 0, oi0, ra); + o1 = helper_le_ldq_mmu(env, addr + 8, oi1, ra); + oldv = int128_make128(o0, o1); + + success = int128_eq(oldv, cmpv); + if (success) { + helper_le_stq_mmu(env, addr + 0, int128_getlo(newv), oi1, ra); + helper_le_stq_mmu(env, addr + 8, int128_gethi(newv), oi1, ra); + } +#endif + } + + return !success; +} + +uint64_t HELPER(paired_cmpxchg64_be)(CPUARMState *env, uint64_t addr, + uint64_t new_lo, uint64_t new_hi) +{ + uintptr_t ra = GETPC(); + Int128 oldv, cmpv, newv; + bool success; + + cmpv = int128_make128(env->exclusive_val, env->exclusive_high); + newv = int128_make128(new_lo, new_hi); + + if (parallel_cpus) { +#ifndef CONFIG_ATOMIC128 + cpu_loop_exit_atomic(ENV_GET_CPU(env), ra); +#else + int mem_idx = cpu_mmu_index(env, false); + TCGMemOpIdx oi = make_memop_idx(MO_BEQ | MO_ALIGN_16, mem_idx); + oldv = helper_atomic_cmpxchgo_be_mmu(env, addr, cmpv, newv, oi, ra); + success = int128_eq(oldv, cmpv); +#endif + } else { + uint64_t o0, o1; + +#ifdef CONFIG_USER_ONLY + /* ??? Enforce alignment. */ + uint64_t *haddr = g2h(addr); + o1 = ldq_be_p(haddr + 0); + o0 = ldq_be_p(haddr + 1); + oldv = int128_make128(o0, o1); + + success = int128_eq(oldv, cmpv); + if (success) { + stq_be_p(haddr + 0, int128_gethi(newv)); + stq_be_p(haddr + 8, int128_getlo(newv)); + } +#else + int mem_idx = cpu_mmu_index(env, false); + TCGMemOpIdx oi0 = make_memop_idx(MO_BEQ | MO_ALIGN_16, mem_idx); + TCGMemOpIdx oi1 = make_memop_idx(MO_BEQ, mem_idx); + + o1 = helper_be_ldq_mmu(env, addr + 0, oi0, ra); + o0 = helper_be_ldq_mmu(env, addr + 8, oi1, ra); + oldv = int128_make128(o0, o1); + + success = int128_eq(oldv, cmpv); + if (success) { + helper_be_stq_mmu(env, addr + 0, int128_gethi(newv), oi1, ra); + helper_be_stq_mmu(env, addr + 8, int128_getlo(newv), oi1, ra); + } +#endif + } + + return !success; +} diff --git a/target-arm/helper-a64.h b/target-arm/helper-a64.h index 1d3d10f..dd32000 100644 --- a/target-arm/helper-a64.h +++ b/target-arm/helper-a64.h @@ -46,3 +46,5 @@ DEF_HELPER_FLAGS_2(frecpx_f32, TCG_CALL_NO_RWG, f32, f32, ptr) DEF_HELPER_FLAGS_2(fcvtx_f64_to_f32, TCG_CALL_NO_RWG, f32, f64, env) DEF_HELPER_FLAGS_3(crc32_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32) DEF_HELPER_FLAGS_3(crc32c_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32) +DEF_HELPER_FLAGS_4(paired_cmpxchg64_le, TCG_CALL_NO_WG, i64, env, i64, i64, i64) +DEF_HELPER_FLAGS_4(paired_cmpxchg64_be, TCG_CALL_NO_WG, i64, env, i64, i64, i64) diff --git a/target-arm/translate-a64.c b/target-arm/translate-a64.c index ddf52f5..bf7388d 100644 --- a/target-arm/translate-a64.c +++ b/target-arm/translate-a64.c @@ -1767,37 +1767,41 @@ static void disas_b_exc_sys(DisasContext *s, uint32_t insn) } } -/* - * Load/Store exclusive instructions are implemented by remembering - * the value/address loaded, and seeing if these are the same - * when the store is performed. This is not actually the architecturally - * mandated semantics, but it works for typical guest code sequences - * and avoids having to monitor regular stores. - * - * In system emulation mode only one CPU will be running at once, so - * this sequence is effectively atomic. In user emulation mode we - * throw an exception and handle the atomic operation elsewhere. - */ static void gen_load_exclusive(DisasContext *s, int rt, int rt2, TCGv_i64 addr, int size, bool is_pair) { TCGv_i64 tmp = tcg_temp_new_i64(); - TCGMemOp memop = s->be_data + size; + TCGMemOp be = s->be_data; g_assert(size <= 3); - tcg_gen_qemu_ld_i64(tmp, addr, get_mem_index(s), memop); - if (is_pair) { - TCGv_i64 addr2 = tcg_temp_new_i64(); TCGv_i64 hitmp = tcg_temp_new_i64(); - g_assert(size >= 2); - tcg_gen_addi_i64(addr2, addr, 1 << size); - tcg_gen_qemu_ld_i64(hitmp, addr2, get_mem_index(s), memop); - tcg_temp_free_i64(addr2); + if (size == 3) { + TCGv_i64 addr2 = tcg_temp_new_i64(); + + tcg_gen_qemu_ld_i64(tmp, addr, get_mem_index(s), + MO_64 | MO_ALIGN_16 | be); + tcg_gen_addi_i64(addr2, addr, 8); + tcg_gen_qemu_ld_i64(hitmp, addr2, get_mem_index(s), + MO_64 | MO_ALIGN | be); + tcg_temp_free_i64(addr2); + } else { + g_assert(size == 2); + tcg_gen_qemu_ld_i64(tmp, addr, get_mem_index(s), + MO_64 | MO_ALIGN | be); + if (be == MO_LE) { + tcg_gen_extr32_i64(tmp, hitmp, tmp); + } else { + tcg_gen_extr32_i64(hitmp, tmp, tmp); + } + } + tcg_gen_mov_i64(cpu_exclusive_high, hitmp); tcg_gen_mov_i64(cpu_reg(s, rt2), hitmp); tcg_temp_free_i64(hitmp); + } else { + tcg_gen_qemu_ld_i64(tmp, addr, get_mem_index(s), size | MO_ALIGN | be); } tcg_gen_mov_i64(cpu_exclusive_val, tmp); @@ -1807,16 +1811,6 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2, tcg_gen_mov_i64(cpu_exclusive_addr, addr); } -#ifdef CONFIG_USER_ONLY -static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2, - TCGv_i64 addr, int size, int is_pair) -{ - tcg_gen_mov_i64(cpu_exclusive_test, addr); - tcg_gen_movi_i32(cpu_exclusive_info, - size | is_pair << 2 | (rd << 4) | (rt << 9) | (rt2 << 14)); - gen_exception_internal_insn(s, 4, EXCP_STREX); -} -#else static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2, TCGv_i64 inaddr, int size, int is_pair) { @@ -1844,46 +1838,42 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2, tcg_gen_brcond_i64(TCG_COND_NE, addr, cpu_exclusive_addr, fail_label); tmp = tcg_temp_new_i64(); - tcg_gen_qemu_ld_i64(tmp, addr, get_mem_index(s), s->be_data + size); - tcg_gen_brcond_i64(TCG_COND_NE, tmp, cpu_exclusive_val, fail_label); - tcg_temp_free_i64(tmp); - - if (is_pair) { - TCGv_i64 addrhi = tcg_temp_new_i64(); - TCGv_i64 tmphi = tcg_temp_new_i64(); - - tcg_gen_addi_i64(addrhi, addr, 1 << size); - tcg_gen_qemu_ld_i64(tmphi, addrhi, get_mem_index(s), - s->be_data + size); - tcg_gen_brcond_i64(TCG_COND_NE, tmphi, cpu_exclusive_high, fail_label); - - tcg_temp_free_i64(tmphi); - tcg_temp_free_i64(addrhi); - } - - /* We seem to still have the exclusive monitor, so do the store */ - tcg_gen_qemu_st_i64(cpu_reg(s, rt), addr, get_mem_index(s), - s->be_data + size); if (is_pair) { - TCGv_i64 addrhi = tcg_temp_new_i64(); - - tcg_gen_addi_i64(addrhi, addr, 1 << size); - tcg_gen_qemu_st_i64(cpu_reg(s, rt2), addrhi, - get_mem_index(s), s->be_data + size); - tcg_temp_free_i64(addrhi); + if (size == 2) { + TCGv_i64 val = tcg_temp_new_i64(); + tcg_gen_concat32_i64(tmp, cpu_reg(s, rt), cpu_reg(s, rt2)); + tcg_gen_concat32_i64(val, cpu_exclusive_val, cpu_exclusive_high); + tcg_gen_atomic_cmpxchg_i64(tmp, addr, val, tmp, + get_mem_index(s), + size | MO_ALIGN | s->be_data); + tcg_gen_setcond_i64(TCG_COND_NE, tmp, tmp, val); + tcg_temp_free_i64(val); + } else if (s->be_data == MO_LE) { + gen_helper_paired_cmpxchg64_le(tmp, cpu_env, addr, cpu_reg(s, rt), + cpu_reg(s, rt2)); + } else { + gen_helper_paired_cmpxchg64_be(tmp, cpu_env, addr, cpu_reg(s, rt), + cpu_reg(s, rt2)); + } + } else { + TCGv_i64 val = cpu_reg(s, rt); + tcg_gen_atomic_cmpxchg_i64(tmp, addr, cpu_exclusive_val, val, + get_mem_index(s), + size | MO_ALIGN | s->be_data); + tcg_gen_setcond_i64(TCG_COND_NE, tmp, tmp, cpu_exclusive_val); } tcg_temp_free_i64(addr); - tcg_gen_movi_i64(cpu_reg(s, rd), 0); + tcg_gen_mov_i64(cpu_reg(s, rd), tmp); + tcg_temp_free_i64(tmp); tcg_gen_br(done_label); + gen_set_label(fail_label); tcg_gen_movi_i64(cpu_reg(s, rd), 1); gen_set_label(done_label); tcg_gen_movi_i64(cpu_exclusive_addr, -1); - } -#endif /* Update the Sixty-Four bit (SF) registersize. This logic is derived * from the ARMv8 specs for LDR (Shared decode for all encodings).