Message ID: 20160815154940.GA11939@flamenco
State: New
On 08/15/2016 08:49 AM, Emilio G. Cota wrote:
> +void HELPER(xbegin)(CPUARMState *env)
> +{
> +    uintptr_t ra = GETPC();
> +    int status;
> +    int retries = 100;
> +
> + retry:
> +    status = _xbegin();
> +    if (status != _XBEGIN_STARTED) {
> +        if (status && retries) {
> +            retries--;
> +            goto retry;
> +        }
> +        if (parallel_cpus) {
> +            cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
> +        }
> +    }
> +}
> +
> +void HELPER(xend)(void)
> +{
> +    if (_xtest()) {
> +        _xend();
> +    } else {
> +        assert(!parallel_cpus);
> +        parallel_cpus = true;
> +    }
> +}
> +

Interesting idea.

FWIW, there are two other extant HTM implementations: ppc64 and s390x. As I
recall, the s390 (but not the ppc64) transactions do not roll back the fp
registers, which suggests that we need special support within the TCG
prologue. Perhaps folding these operations into special TCG opcodes.

I believe that power8 has HTM, and there's one of those in the gcc compile
farm, so this should be relatively easy to try out.

We increase the chances of success of the transaction if we minimize the
amount of non-target code that's executed while the transaction is running.
That suggests two things:

(1) that it would be doubly helpful to incorporate the transaction start
directly into TCG code generation rather than as a helper, and

(2) that we should start a new TB upon encountering a load-exclusive, so
that we maximize the chance of the store-exclusive being a part of the same
TB and thus have *nothing* extra between the beginning and commit of the
transaction.

r~
On Wed, Aug 17, 2016 at 10:22:05 -0700, Richard Henderson wrote:
> On 08/15/2016 08:49 AM, Emilio G. Cota wrote:
> >+void HELPER(xbegin)(CPUARMState *env)
> >+{
> >+    uintptr_t ra = GETPC();
> >+    int status;
> >+    int retries = 100;
> >+
> >+ retry:
> >+    status = _xbegin();
> >+    if (status != _XBEGIN_STARTED) {
> >+        if (status && retries) {
> >+            retries--;
> >+            goto retry;
> >+        }
> >+        if (parallel_cpus) {
> >+            cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
> >+        }
> >+    }
> >+}
> >+
> >+void HELPER(xend)(void)
> >+{
> >+    if (_xtest()) {
> >+        _xend();
> >+    } else {
> >+        assert(!parallel_cpus);
> >+        parallel_cpus = true;
> >+    }
> >+}
> >+
>
> Interesting idea.
>
> FWIW, there are two other extant HTM implementations: ppc64 and s390x. As I
> recall, the s390 (but not the ppc64) transactions do not roll back the fp
> registers, which suggests that we need special support within the TCG
> prologue. Perhaps folding these operations into special TCG opcodes.

I'm not familiar with s390, but as long as the hardware implements 'strong
atomicity' ["strong atomicity guarantees atomicity between transactions and
non-transactional code", see
http://acg.cis.upenn.edu/papers/cal06_atomic_semantics.pdf ], then this
approach would work, in the sense that stores wouldn't have to be
instrumented. Of course, architecture issues like saving the fp registers,
as you mention for s390, would have to be taken into account.

> I believe that power8 has HTM, and there's one of those in the gcc compile
> farm, so this should be relatively easy to try out.

Good point! I had forgotten about power8. So far my tests have been on a
4-core Skylake. I have an account on the gcc compile farm, so I will make
use of it. The power8 machine in the farm has a lot of cores, so this is
pretty exciting.

> We increase the chances of success of the transaction if we minimize the
> amount of non-target code that's executed while the transaction is running.
> That suggests two things:
>
> (1) that it would be doubly helpful to incorporate the transaction start
> directly into TCG code generation rather than as a helper, and

This (and leaving the fallback path in a helper) is simple enough that even
I could do it :-)

> (2) that we should start a new TB upon encountering a load-exclusive, so
> that we maximize the chance of the store-exclusive being a part of the same
> TB and thus have *nothing* extra between the beginning and commit of the
> transaction.

I don't know how to do this. If it's easy to do, please let me know how
(for aarch64 at least, since that's the target I'm using).

I've run some more tests on the Intel machine, and noticed that failed
transactions are very common (up to 50% abort rate for some SPEC workloads;
I count these as "retrying doesn't help" kinds of aborts), so bringing that
down should definitely help.

Another thing I found out is that abusing cpu_exec_step (as is done right
now) for the fallback path is a bad idea: when there are many failed
transactions, performance drops dramatically (up to 5x overall slowdown).
It turns out that all this overhead comes from re-translating the code
between ldrex/strex.

Would it be possible to cache this step-by-step code? If not, then an
alternative would be to have a way to stop the world *without* leaving the
CPU loop for the calling thread. I'm more comfortable doing the latter due
to my glaring lack of TCG competence.

Thanks,

		Emilio
On Wed, Aug 17, 2016 at 13:58:00 -0400, Emilio G. Cota wrote:
> due to my glaring lack of TCG competence.
A related note that might be of interest.
I benchmarked an alternative implementation that *does* instrument
stores. I wrapped every tcg_gen_qemu_st_i64 (those are enough, right?
tcg_gen_st_i64 are stores for the host memory, which I presume are
not "explicit" guest stores and therefore would not go through
the soft TLB) with a pre/post pair of helpers.
These helpers first check a bitmap given a masked subset of the physical
address of the access, and if the bit is set, then check a QHT with the full
physaddr. If an entry exists, they lock/unlock the entry's spinlock around
the store, so that no race is possible with an ongoing atomic (atomics always
take their corresponding lock). Overhead is not too bad over cmpxchg, but
most of it comes from the helpers--see these numbers for SPEC:
(NB. the "QEMU" baseline does *not* include QHT for tb_htable and therefore
takes tb_lock around tb_find_fast, that's why it's so slow)
http://imgur.com/a/SoSHQ
"QHT only" means a QHT lookup is performed on every guest store. The win of
having the bitmap before hitting the QHT is quite large. I wonder
if things could be sped up further by performing the bitmap check in
TCG code. Would that be worth exploring? If so, any help on that would
be appreciated (i386 host at least)--I tried, but I'm way out of my element.
E.
On 08/17/2016 10:58 AM, Emilio G. Cota wrote:
>> (2) that we should start a new TB upon encountering a load-exclusive, so
>> that we maximize the chance of the store-exclusive being a part of the same
>> TB and thus have *nothing* extra between the beginning and commit of the
>> transaction.
>
> I don't know how to do this. If it's easy to do, please let me know how
> (for aarch64 at least, since that's the target I'm using).

It's a simple matter of peeking at the next instruction.

One way is to partially decode the insn before advancing the PC.

static void disas_a64_insn(CPUARMState *env, DisasContext *s, int num_insns)
{
    uint32_t insn = arm_ldl_code(env, s->pc, s->sctlr_b);
+
+   if (num_insns > 1 && (insn & xxx) == yyy) {
+       /* Start load-exclusive in a new TB.  */
+       s->is_jmp = DISAS_UPDATE;
+       return;
+   }
    s->insn = insn;
    s->pc += 4;
    ...

Alternately, store num_insns into DisasContext, and do pc -= 4 in
disas_ldst_excl.

r~
On 08/17/2016 11:41 AM, Richard Henderson wrote:
> On 08/17/2016 10:58 AM, Emilio G. Cota wrote:
>>> (2) that we should start a new TB upon encountering a load-exclusive, so
>>> that we maximize the chance of the store-exclusive being a part of the same
>>> TB and thus have *nothing* extra between the beginning and commit of the
>>> transaction.
>>
>> I don't know how to do this. If it's easy to do, please let me know how
>> (for aarch64 at least, since that's the target I'm using).
>
> It's a simple matter of peeking at the next instruction.
>
> One way is to partially decode the insn before advancing the PC.
>
> static void disas_a64_insn(CPUARMState *env, DisasContext *s, int num_insns)
> {
>     uint32_t insn = arm_ldl_code(env, s->pc, s->sctlr_b);
> +
> +   if (num_insns > 1 && (insn & xxx) == yyy) {
> +       /* Start load-exclusive in a new TB.  */
> +       s->is_jmp = DISAS_UPDATE;
> +       return;
> +   }
>     s->insn = insn;
>     s->pc += 4;
>     ...
>
> Alternately, store num_insns into DisasContext, and do pc -= 4 in
> disas_ldst_excl.

Actually, the mask check is the only really viable solution, and it needs to
happen before we do the tcg_gen_insn_start thing.

A couple of other notes, as I've thought about this some more.

If the start and end of the transaction are not in the same TB, the
likelihood of transaction failure should be very near 100%. Consider:

* The TB with the ldrex ends before the strex.

* Since the next TB hasn't been built yet, we'll definitely go through
  tb_find_physical, through the translator, and through the tcg compiler,

  (a) which I think we can definitely assume will exhaust any resources
      associated with the transaction,
  (b) which will abort the transaction,
  (c) which, with the current code, will retry N times, with identical
      results, failing within the compiler each time,
  (d) which, with the current code, will single-step through to the strex,
      as you saw.

* Since we proceed to (d) the first time, we'll never succeed in creating
  the next TB, so we'll always iterate compilation N times, resulting in
  the single-step. This is probably the real slow-down that you see.

Therefore, we must abort any transaction when we exit tcg-generated code,
both through cpu_loop_exit and through the tcg epilogue. We should be able
to use the software-controlled bits associated with the abort to tell what
kind of event led to the abort. However, we must bear in mind that (for
both x86 and ppc at least) we only have an 8-bit abort code, so we can't
pass back a pointer, for instance.

We should think about what kinds of limitations we should accept for
handling ll/sc via transactions.

* How do we handle unpaired ldrexd / ldxp? This is used by the compiler,
  as it's the only way to perform a double-word atomic load. This implies
  that we need some sort of counter, beyond which we stop trying to
  succeed via transaction.

* In order to make normal cmpxchg patterns work, we have to be able to
  handle a branch within an ll/sc sequence. Options:

  * The less complex way is to build a TB, including branches, with a max
    of N insns along the branch-not-taken path, searching for the strex.
    Of course this fails to handle legitimate patterns for arm (and other
    ll/sc guests). However, gcc code generation will generally annotate
    the cmpxchg failure branch as not-taken, so perhaps this will work
    well enough in practice.

  * The more complex way is to build a TB, including branches, with a max
    of N insns along *all* paths, searching for the strex. This runs into
    problems with, among other things, branches crossing pages.

  * The most complex way is to somehow get all of the TBs built, and
    linked together, preferably before we even try executing (and failing
    the transaction in) the first TB.

r~
On Thu, Aug 18, 2016 at 08:38:47 -0700, Richard Henderson wrote:
> A couple of other notes, as I've thought about this some more.
Thanks for spending time on this.
I have a new patchset (will send as a reply to this e-mail in a few
minutes) that has good performance. Its main ideas:
- Use transactions that start on ldrex and finish on strex. On
an exception, end (instead of abort) the ongoing transaction,
if any. There's little point in aborting, since the subsequent
retries will end up in the same exception anyway. This means
the translation of the corresponding blocks might happen via
the fallback path. That's OK, given that subsequent executions
of the TBs will (likely) complete via HTM.
- For the fallback path, add a stop-the-world primitive that stops
all other CPUs, without requiring the calling CPU to exit the CPU loop.
Not breaking from the loop keeps the code simple--we can just
keep translating/executing normally, with the guarantee that
no other CPU can run until we're done.
- The fallback path of the transaction stops the world and then
continues execution (from ldrex) as the only running CPU.
- Only retry when the hardware hints that we may do so. This
ends up being rare (I can only get dozens of retries under
heavy contention, for instance with 'atomic_add-bench -r 1')
Limitations: for now user-mode only, and I have paid no attention
to paired atomics. Also, I'm making no checks for unusual (undefined?)
guest code, such as stray ldrex/strex thrown in there.
Performance optimizations like you suggest (e.g. starting a TB
on ldrex, or using TCG ops for beginning/ending the transaction)
could be implemented, but at least on Intel TSX (the only one I've
tried so far[*]), the transaction buffer seems big enough to not
make these optimizations a necessity.
[*] I tried running HTM primitives on the gcc compile farm's Power8,
but I get an illegal instruction fault on tbegin. I've filed
an issue here to report it: https://gna.org/support/?3369
Some observations:
- The peak number of retries I see is for atomic_add-bench -r 1 -n 16
(on an 8-thread machine) at about ~90 retries. So I set the limit
to 100.
- The lowest success rate I've seen is ~98%, again for atomic_add-bench
under high contention.
Some numbers:
- atomic_add's performance is lower for HTM vs cmpxchg, although under
contention performance gets very similar. The reason for the perf
gap is that xbegin/xend takes more cycles than cmpxchg, especially
under little or no contention; this explains the large difference
for threads=1.
http://imgur.com/5kiT027
As a side note, contended transactions seem to scale worse than contended
cmpxchg when exploiting SMT. But anyway I wouldn't read much into
that.
- For more realistic workloads that gap goes away, as the relative impact
of cmpxchg or transaction delays is lower. For QHT, 1000 keys:
http://imgur.com/l6vcowu
And for SPEC (note that despite being single-threaded, SPEC executes
a lot of atomics, e.g. from mutexes and from forking):
http://imgur.com/W49YMhJ
Performance is essentially identical to that of cmpxchg, but of course
with HTM we get correct emulation.
Thanks for reading this far!
Emilio
diff --git a/linux-user/main.c b/linux-user/main.c
index 9880505..6922faa 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -192,8 +192,9 @@ static void step_atomic(CPUState *cpu)
 
     /* Since we got here, we know that parallel_cpus must be true.  */
     parallel_cpus = false;
-    cpu_exec_step(cpu);
-    parallel_cpus = true;
+    while (!parallel_cpus) {
+        cpu_exec_step(cpu);
+    }
     end_exclusive();
 }
diff --git a/target-arm/helper-a64.c b/target-arm/helper-a64.c
index 8ce518b..af45694 100644
--- a/target-arm/helper-a64.c
+++ b/target-arm/helper-a64.c
@@ -33,6 +33,8 @@
 #include "tcg.h"
 #include <zlib.h> /* For crc32 */
 
+#include <immintrin.h>
+
 /* C2.4.7 Multiply and divide */
 /* special cases for 0 and LLONG_MIN are mandated by the standard */
 uint64_t HELPER(udiv64)(uint64_t num, uint64_t den)
@@ -579,3 +581,43 @@ uint64_t HELPER(paired_cmpxchg64_be)(CPUARMState *env, uint64_t addr,
 
     return !success;
 }
+
+void HELPER(xbegin)(CPUARMState *env)
+{
+    uintptr_t ra = GETPC();
+    int status;
+    int retries = 100;
+
+ retry:
+    status = _xbegin();
+    if (status != _XBEGIN_STARTED) {
+        if (status && retries) {
+            retries--;
+            goto retry;
+        }
+        if (parallel_cpus) {
+            cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
+        }
+    }
+}
+
+void HELPER(xend)(void)
+{
+    if (_xtest()) {
+        _xend();
+    } else {
+        assert(!parallel_cpus);
+        parallel_cpus = true;
+    }
+}
+
+uint64_t HELPER(x_ok)(void)
+{
+    if (_xtest()) {
+        return 1;
+    }
+    if (!parallel_cpus) {
+        return 1;
+    }
+    return 0;
+}
diff --git a/target-arm/helper-a64.h b/target-arm/helper-a64.h
index dd32000..e7ede43 100644
--- a/target-arm/helper-a64.h
+++ b/target-arm/helper-a64.h
@@ -48,3 +48,7 @@ DEF_HELPER_FLAGS_3(crc32_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_3(crc32c_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_4(paired_cmpxchg64_le, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
 DEF_HELPER_FLAGS_4(paired_cmpxchg64_be, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
+
+DEF_HELPER_1(xbegin, void, env)
+DEF_HELPER_0(x_ok, i64)
+DEF_HELPER_0(xend, void)
diff --git a/target-arm/translate-a64.c b/target-arm/translate-a64.c
index 450c359..cfcf440 100644
--- a/target-arm/translate-a64.c
+++ b/target-arm/translate-a64.c
@@ -1760,6 +1760,8 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
     TCGv_i64 tmp = tcg_temp_new_i64();
     TCGMemOp be = s->be_data;
 
+    gen_helper_xbegin(cpu_env);
+
     g_assert(size <= 3);
     if (is_pair) {
         TCGv_i64 hitmp = tcg_temp_new_i64();
@@ -1825,6 +1827,9 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
     tcg_gen_brcond_i64(TCG_COND_NE, addr, cpu_exclusive_addr, fail_label);
 
     tmp = tcg_temp_new_i64();
+    gen_helper_x_ok(tmp);
+    tcg_gen_brcondi_i64(TCG_COND_EQ, tmp, 0, fail_label);
+
     if (is_pair) {
         if (size == 2) {
             TCGv_i64 val = tcg_temp_new_i64();
@@ -1844,16 +1849,14 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
         }
     } else {
         TCGv_i64 val = cpu_reg(s, rt);
-        tcg_gen_atomic_cmpxchg_i64(tmp, addr, cpu_exclusive_val, val,
-                                   get_mem_index(s),
-                                   size | MO_ALIGN | s->be_data);
-        tcg_gen_setcond_i64(TCG_COND_NE, tmp, tmp, cpu_exclusive_val);
+        tcg_gen_qemu_st_i64(val, addr, get_mem_index(s), s->be_data + size);
     }
 
     tcg_temp_free_i64(addr);
-
-    tcg_gen_mov_i64(cpu_reg(s, rd), tmp);
     tcg_temp_free_i64(tmp);
+
+    tcg_gen_movi_i64(cpu_reg(s, rd), 0);
+    gen_helper_xend();
 
     tcg_gen_br(done_label);
 
     gen_set_label(fail_label);
Configure with --extra-cflags="-mrtm"

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 linux-user/main.c          |  5 +++--
 target-arm/helper-a64.c    | 42 ++++++++++++++++++++++++++++++++++++++++++
 target-arm/helper-a64.h    |  4 ++++
 target-arm/translate-a64.c | 15 +++++++++------
 4 files changed, 58 insertions(+), 8 deletions(-)