
MTTCG status updates, benchmark results and KVM forum plans

Message ID 20160815154626.GA8768@flamenco
State New

Commit Message

Emilio Cota Aug. 15, 2016, 3:46 p.m. UTC
On Mon, Aug 15, 2016 at 11:46:32 +0100, Alex Bennée wrote:
> As far as I'm aware the following work is still ongoing:
> 
> Emilio: cmpxchg atomics
> Alvise: LL/SC modelling

I've been tinkering with an experimental patch to do proper LL/SC. The idea
is to rely on hardware transactional memory, so that stores don't have
to be tracked. The trickiest thing is the fallback path, for which I'm
trying to (ab)use EXCP_ATOMIC to execute exclusively from the ldrex
all the way to the strex.

To test it, I'm using aarch64-linux-user running qht-bench compiled on
an aarch64 machine. I'm running on an Intel Skylake host (Skylake has
no known TSX bugs).

However, I'm finding issues that might not have to do with the
patch itself.

- On the latest MTTCG+cmpxchg tree (45c11751ed7 a.k.a.
  bennee/mttcg/base-patches-v4-with-cmpxchg-atomics-v2), QEMU loops
  forever without making progress in the instruction stream, even
  with taskset -c 0.
- On the cmpxchg tree (rth's atomic-2 branch [1]), it works more
  reliably, although tb_lock is held around tb_find_fast so parallelism isn't
  very high. Still, it sometimes triggers the assert below.
  - Applying the "remove tb_lock around hot path" patch makes it
    easier to trigger this assert in cpu-exec.c:650 (approx.):
            /* Assert that the compiler does not smash local variables. */
            g_assert(cpu == current_cpu)
    I've also seen the assert immediately after that one trigger, as well
    as the rcu_read_unlock depth assert.
  The asserts are usually triggered when all threads exit (by returning
  NULL) at roughly the same time.
  However, they cannot be triggered with taskset -c 0, which makes me
  suspect that somehow start_exclusive isn't working as intended.

Any tips would be appreciated! I'll reply with a patch that uses RTM;
the one below takes the fallback path all the way, and is the best way
to reproduce the above.

Thanks,

		Emilio

[1] https://github.com/rth7680/qemu/commits/atomic-2

From ed6af6eb364e5a36e81d7cc8143c0e9783c50587 Mon Sep 17 00:00:00 2001
From: "Emilio G. Cota" <cota@braap.org>
Date: Mon, 15 Aug 2016 00:27:42 +0200
Subject: [PATCH] aarch64: use TSX for ldrex/strex (fallback path only)

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 linux-user/main.c          |  5 +++--
 target-arm/helper-a64.c    | 23 +++++++++++++++++++++++
 target-arm/helper-a64.h    |  4 ++++
 target-arm/translate-a64.c | 15 +++++++++------
 4 files changed, 39 insertions(+), 8 deletions(-)

Comments

Alex Bennée Aug. 16, 2016, 11:16 a.m. UTC | #1
Emilio G. Cota <cota@braap.org> writes:

> On Mon, Aug 15, 2016 at 11:46:32 +0100, Alex Bennée wrote:
>> As far as I'm aware the following work is still ongoing:
>>
>> Emilio: cmpxchg atomics
>> Alvise: LL/SC modelling
>
> I've been tinkering with an experimental patch to do proper LL/SC. The idea
> is to rely on hardware transactional memory, so that stores don't have
> to be tracked. The trickiest thing is the fallback path, for which I'm
> trying to (ab)use EXCP_ATOMIC to execute exclusively from the ldrex
> all the way to the strex.
>
> To test it, I'm using aarch64-linux-user running qht-bench compiled on
> an aarch64 machine. I'm running on an Intel Skylake host (Skylake has
> no known TSX bugs)
>
> However, I'm finding issues that might not have to do with the
> patch itself.
>
> - On the latest MTTCG+cmpxchg tree (45c11751ed7 a.k.a.
>   bennee/mttcg/base-patches-v4-with-cmpxchg-atomics-v2), QEMU loops
>   forever without making progress in the instruction stream, even
>   with taskset -c 0.

Could this be a store-after-load barrier issue? I have a branch that
adds Pranith's work:

  https://github.com/stsquad/qemu/tree/mttcg/base-patches-v4-with-cmpxchg-atomics-v2-and-barriers-v4

This seems to have eliminated some of the failure modes (usually kernel
complaining about stalled tasks) but I'm still seeing my test case fail
from time to time when starting the benchmark task. Currently I'm not
seeing much information about why it's failing to start, though.

> - On the cmpxchg tree (rth's atomic-2 branch [1]), it works more
>   reliably, although tb_lock is held around tb_find_fast so parallelism isn't
>   very high. Still, it sometimes triggers the assert below.
>   - Applying the "remove tb_lock around hot path" patch makes it
>     easier to trigger this assert in cpu-exec.c:650 (approx.):
>             /* Assert that the compiler does not smash local variables. */
>             g_assert(cpu == current_cpu)
>     I've also seen the assert immediately after that one trigger, as well
>     as the rcu_read_unlock depth assert.

Odd - these asserts are remnants of debugging a dodgy compiler.

>   The asserts are usually triggered when all threads exit (by returning
>   NULL) at roughly the same time.
>   However, they cannot be triggered with taskset -c 0, which makes me
>   suspect that somehow start_exclusive isn't working as intended.
>
> Any tips would be appreciated! I'll reply with a patch that uses RTM;
> the one below takes the fallback path all the way, and is the best way
> to reproduce the above.

I'll see if I can reproduce the errors you're seeing on my setup.

>
> Thanks,
>
> 		Emilio
>
> [1] https://github.com/rth7680/qemu/commits/atomic-2
>
> From ed6af6eb364e5a36e81d7cc8143c0e9783c50587 Mon Sep 17 00:00:00 2001
> From: "Emilio G. Cota" <cota@braap.org>
> Date: Mon, 15 Aug 2016 00:27:42 +0200
> Subject: [PATCH] aarch64: use TSX for ldrex/strex (fallback path only)
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  linux-user/main.c          |  5 +++--
>  target-arm/helper-a64.c    | 23 +++++++++++++++++++++++
>  target-arm/helper-a64.h    |  4 ++++
>  target-arm/translate-a64.c | 15 +++++++++------
>  4 files changed, 39 insertions(+), 8 deletions(-)
>
> diff --git a/linux-user/main.c b/linux-user/main.c
> index 9880505..6922faa 100644
> --- a/linux-user/main.c
> +++ b/linux-user/main.c
> @@ -192,8 +192,9 @@ static void step_atomic(CPUState *cpu)
>
>      /* Since we got here, we know that parallel_cpus must be true.  */
>      parallel_cpus = false;
> -    cpu_exec_step(cpu);
> -    parallel_cpus = true;
> +    while (!parallel_cpus) {
> +        cpu_exec_step(cpu);
> +    }
>
>      end_exclusive();
>  }
> diff --git a/target-arm/helper-a64.c b/target-arm/helper-a64.c
> index 8ce518b..a97b631 100644
> --- a/target-arm/helper-a64.c
> +++ b/target-arm/helper-a64.c
> @@ -579,3 +579,26 @@ uint64_t HELPER(paired_cmpxchg64_be)(CPUARMState *env, uint64_t addr,
>
>      return !success;
>  }
> +
> +void HELPER(xbegin)(CPUARMState *env)
> +{
> +    uintptr_t ra = GETPC();
> +
> +    if (parallel_cpus) {
> +        cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
> +    }
> +}
> +
> +void HELPER(xend)(void)
> +{
> +    assert(!parallel_cpus);
> +    parallel_cpus = true;
> +}
> +
> +uint64_t HELPER(x_ok)(void)
> +{
> +    if (!parallel_cpus) {
> +        return 1;
> +    }
> +    return 0;
> +}
> diff --git a/target-arm/helper-a64.h b/target-arm/helper-a64.h
> index dd32000..e7ede43 100644
> --- a/target-arm/helper-a64.h
> +++ b/target-arm/helper-a64.h
> @@ -48,3 +48,7 @@ DEF_HELPER_FLAGS_3(crc32_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
>  DEF_HELPER_FLAGS_3(crc32c_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
>  DEF_HELPER_FLAGS_4(paired_cmpxchg64_le, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
>  DEF_HELPER_FLAGS_4(paired_cmpxchg64_be, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
> +
> +DEF_HELPER_1(xbegin, void, env)
> +DEF_HELPER_0(x_ok, i64)
> +DEF_HELPER_0(xend, void)
> diff --git a/target-arm/translate-a64.c b/target-arm/translate-a64.c
> index 450c359..cfcf440 100644
> --- a/target-arm/translate-a64.c
> +++ b/target-arm/translate-a64.c
> @@ -1760,6 +1760,8 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
>      TCGv_i64 tmp = tcg_temp_new_i64();
>      TCGMemOp be = s->be_data;
>
> +    gen_helper_xbegin(cpu_env);
> +
>      g_assert(size <= 3);
>      if (is_pair) {
>          TCGv_i64 hitmp = tcg_temp_new_i64();
> @@ -1825,6 +1827,9 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
>      tcg_gen_brcond_i64(TCG_COND_NE, addr, cpu_exclusive_addr, fail_label);
>
>      tmp = tcg_temp_new_i64();
> +    gen_helper_x_ok(tmp);
> +    tcg_gen_brcondi_i64(TCG_COND_EQ, tmp, 0, fail_label);
> +
>      if (is_pair) {
>          if (size == 2) {
>              TCGv_i64 val = tcg_temp_new_i64();
> @@ -1844,16 +1849,14 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
>          }
>      } else {
>          TCGv_i64 val = cpu_reg(s, rt);
> -        tcg_gen_atomic_cmpxchg_i64(tmp, addr, cpu_exclusive_val, val,
> -                                   get_mem_index(s),
> -                                   size | MO_ALIGN | s->be_data);
> -        tcg_gen_setcond_i64(TCG_COND_NE, tmp, tmp, cpu_exclusive_val);
> +        tcg_gen_qemu_st_i64(val, addr, get_mem_index(s), s->be_data + size);
>      }
>
>      tcg_temp_free_i64(addr);
> -
> -    tcg_gen_mov_i64(cpu_reg(s, rd), tmp);
>      tcg_temp_free_i64(tmp);
> +
> +    tcg_gen_movi_i64(cpu_reg(s, rd), 0);
> +    gen_helper_xend();
>      tcg_gen_br(done_label);
>
>      gen_set_label(fail_label);


--
Alex Bennée
Emilio Cota Aug. 16, 2016, 9:51 p.m. UTC | #2
On Tue, Aug 16, 2016 at 12:16:26 +0100, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
> > However, I'm finding issues that might not have to do with the
> > patch itself.

I had some time today to dig deeper -- turns out the issues *have*
to do with my patch, see below. (And sorry for hijacking this thread.)

> >   - Applying the "remove tb_lock around hot path" patch makes it
> >     easier to trigger this assert in cpu-exec.c:650 (approx.):
> >             /* Assert that the compiler does not smash local variables. */
> >             g_assert(cpu == current_cpu)
> >     I've also seen the assert immediately after that one trigger, as well
> >     as the rcu_read_unlock depth assert.
> 
> Odd - these asserts are remnants of debugging a dodgy compiler.

The problem is that by calling cpu_exec_step() in a loop, we don't
know what instructions we might execute. Thus, when one of those
instructions (sandwiched between an ldrex and a strex) causes an
exception (e.g. SVC in A64), we take a longjmp that lands in
cpu_exec_loop, which we did *not* come from. That explains those odd
asserts being triggered.

The reason this is only triggered when pthreads are joined is that the
code there is particularly tricky, with branches and SVCs between
ldrex/strex pairs.

The good news is that this still allows me to benchmark the TSX code
against cmpxchg (I just print out the results before joining): for 4
cores (8 HW threads), qht-bench performs just as well with TSX as with
cmpxchg (but with TSX we get full correctness). For 1 thread,
atomic_add is faster with cmpxchg, but the gap shrinks greatly as
contention increases. The gap is due to the fixed cost of calling
_xbegin/_xend, which is quite a few more instructions than just
emitting an atomic.

		Emilio

Patch

diff --git a/linux-user/main.c b/linux-user/main.c
index 9880505..6922faa 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -192,8 +192,9 @@  static void step_atomic(CPUState *cpu)
 
     /* Since we got here, we know that parallel_cpus must be true.  */
     parallel_cpus = false;
-    cpu_exec_step(cpu);
-    parallel_cpus = true;
+    while (!parallel_cpus) {
+        cpu_exec_step(cpu);
+    }
 
     end_exclusive();
 }
diff --git a/target-arm/helper-a64.c b/target-arm/helper-a64.c
index 8ce518b..a97b631 100644
--- a/target-arm/helper-a64.c
+++ b/target-arm/helper-a64.c
@@ -579,3 +579,26 @@  uint64_t HELPER(paired_cmpxchg64_be)(CPUARMState *env, uint64_t addr,
 
     return !success;
 }
+
+void HELPER(xbegin)(CPUARMState *env)
+{
+    uintptr_t ra = GETPC();
+
+    if (parallel_cpus) {
+        cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
+    }
+}
+
+void HELPER(xend)(void)
+{
+    assert(!parallel_cpus);
+    parallel_cpus = true;
+}
+
+uint64_t HELPER(x_ok)(void)
+{
+    if (!parallel_cpus) {
+        return 1;
+    }
+    return 0;
+}
diff --git a/target-arm/helper-a64.h b/target-arm/helper-a64.h
index dd32000..e7ede43 100644
--- a/target-arm/helper-a64.h
+++ b/target-arm/helper-a64.h
@@ -48,3 +48,7 @@  DEF_HELPER_FLAGS_3(crc32_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_3(crc32c_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_4(paired_cmpxchg64_le, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
 DEF_HELPER_FLAGS_4(paired_cmpxchg64_be, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
+
+DEF_HELPER_1(xbegin, void, env)
+DEF_HELPER_0(x_ok, i64)
+DEF_HELPER_0(xend, void)
diff --git a/target-arm/translate-a64.c b/target-arm/translate-a64.c
index 450c359..cfcf440 100644
--- a/target-arm/translate-a64.c
+++ b/target-arm/translate-a64.c
@@ -1760,6 +1760,8 @@  static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
     TCGv_i64 tmp = tcg_temp_new_i64();
     TCGMemOp be = s->be_data;
 
+    gen_helper_xbegin(cpu_env);
+
     g_assert(size <= 3);
     if (is_pair) {
         TCGv_i64 hitmp = tcg_temp_new_i64();
@@ -1825,6 +1827,9 @@  static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
     tcg_gen_brcond_i64(TCG_COND_NE, addr, cpu_exclusive_addr, fail_label);
 
     tmp = tcg_temp_new_i64();
+    gen_helper_x_ok(tmp);
+    tcg_gen_brcondi_i64(TCG_COND_EQ, tmp, 0, fail_label);
+
     if (is_pair) {
         if (size == 2) {
             TCGv_i64 val = tcg_temp_new_i64();
@@ -1844,16 +1849,14 @@  static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
         }
     } else {
         TCGv_i64 val = cpu_reg(s, rt);
-        tcg_gen_atomic_cmpxchg_i64(tmp, addr, cpu_exclusive_val, val,
-                                   get_mem_index(s),
-                                   size | MO_ALIGN | s->be_data);
-        tcg_gen_setcond_i64(TCG_COND_NE, tmp, tmp, cpu_exclusive_val);
+        tcg_gen_qemu_st_i64(val, addr, get_mem_index(s), s->be_data + size);
     }
 
     tcg_temp_free_i64(addr);
-
-    tcg_gen_mov_i64(cpu_reg(s, rd), tmp);
     tcg_temp_free_i64(tmp);
+
+    tcg_gen_movi_i64(cpu_reg(s, rd), 0);
+    gen_helper_xend();
     tcg_gen_br(done_label);
 
     gen_set_label(fail_label);