Message ID: 20160815154940.GA11939@flamenco
State: New
On 08/15/2016 08:49 AM, Emilio G. Cota wrote:
> +void HELPER(xbegin)(CPUARMState *env)
> +{
> +    uintptr_t ra = GETPC();
> +    int status;
> +    int retries = 100;
> +
> + retry:
> +    status = _xbegin();
> +    if (status != _XBEGIN_STARTED) {
> +        if (status && retries) {
> +            retries--;
> +            goto retry;
> +        }
> +        if (parallel_cpus) {
> +            cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
> +        }
> +    }
> +}
> +
> +void HELPER(xend)(void)
> +{
> +    if (_xtest()) {
> +        _xend();
> +    } else {
> +        assert(!parallel_cpus);
> +        parallel_cpus = true;
> +    }
> +}
> +

Interesting idea.

FWIW, there are two other extant HTM implementations: ppc64 and s390x. As I
recall, the s390 (but not the ppc64) transactions do not roll back the fp
registers, which suggests that we need special support within the TCG
prologue. Perhaps folding these operations into special TCG opcodes.

I believe that power8 has HTM, and there's one of those in the gcc compile
farm, so this should be relatively easy to try out.

We increase the chances of success of the transaction if we minimize the
amount of non-target code that's executed while the transaction is running.
That suggests two things:

(1) that it would be doubly helpful to incorporate the transaction start
directly into TCG code generation rather than as a helper, and

(2) that we should start a new TB upon encountering a load-exclusive, so
that we maximize the chance of the store-exclusive being a part of the same
TB and thus have *nothing* extra between the beginning and commit of the
transaction.

r~
On Wed, Aug 17, 2016 at 10:22:05 -0700, Richard Henderson wrote:
> On 08/15/2016 08:49 AM, Emilio G. Cota wrote:
> >+void HELPER(xbegin)(CPUARMState *env)
> >+{
> >+    uintptr_t ra = GETPC();
> >+    int status;
> >+    int retries = 100;
> >+
> >+ retry:
> >+    status = _xbegin();
> >+    if (status != _XBEGIN_STARTED) {
> >+        if (status && retries) {
> >+            retries--;
> >+            goto retry;
> >+        }
> >+        if (parallel_cpus) {
> >+            cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
> >+        }
> >+    }
> >+}
> >+
> >+void HELPER(xend)(void)
> >+{
> >+    if (_xtest()) {
> >+        _xend();
> >+    } else {
> >+        assert(!parallel_cpus);
> >+        parallel_cpus = true;
> >+    }
> >+}
> >+
>
> Interesting idea.
>
> FWIW, there are two other extant HTM implementations: ppc64 and s390x. As I
> recall, the s390 (but not the ppc64) transactions do not roll back the fp
> registers, which suggests that we need special support within the TCG
> prologue. Perhaps folding these operations into special TCG opcodes.

I'm not familiar with s390, but as long as the hardware implements 'strong
atomicity' ["strong atomicity guarantees atomicity between transactions and
non-transactional code", see
http://acg.cis.upenn.edu/papers/cal06_atomic_semantics.pdf ], then this
approach would work, in the sense that stores wouldn't have to be
instrumented. Of course, architecture issues like saving the fp registers,
as you mention for s390, would have to be taken into account.

> I believe that power8 has HTM, and there's one of those in the gcc compile
> farm, so this should be relatively easy to try out.

Good point! I had forgotten about power8. So far my tests have been on a
4-core Skylake. I have an account on the gcc compile farm, so I will make
use of it. The power8 machine in the farm has a lot of cores, so this is
pretty exciting.

> We increase the chances of success of the transaction if we minimize the
> amount of non-target code that's executed while the transaction is running.
> That suggests two things:
>
> (1) that it would be doubly helpful to incorporate the transaction start
> directly into TCG code generation rather than as a helper, and

This (and leaving the fallback path in a helper) is simple enough that even
I could do it :-)

> (2) that we should start a new TB upon encountering a load-exclusive, so
> that we maximize the chance of the store-exclusive being a part of the same
> TB and thus have *nothing* extra between the beginning and commit of the
> transaction.

I don't know how to do this. If it's easy to do, please let me know how
(for aarch64 at least, since that's the target I'm using).

I've run some more tests on the Intel machine, and noticed that failed
transactions are very common (up to 50% abort rate for some SPEC workloads;
I count these as "retrying doesn't help" kinds of aborts), so bringing that
down should definitely help.

Another thing I found out is that abusing cpu_exec_step (as is done right
now) for the fallback path is a bad idea: when there are many failed
transactions, performance drops dramatically (up to 5x overall slowdown).
It turns out that all this overhead comes from re-translating the code
between ldrex/strex.

Would it be possible to cache this step-by-step code? If not, then an
alternative would be to have a way to stop the world *without* leaving the
CPU loop for the calling thread. I'm more comfortable doing the latter due
to my glaring lack of TCG competence.

Thanks,

		Emilio
On Wed, Aug 17, 2016 at 13:58:00 -0400, Emilio G. Cota wrote:
> due to my glaring lack of TCG competence.
A related note that might be of interest.
I benchmarked an alternative implementation that *does* instrument
stores. I wrapped every tcg_gen_qemu_st_i64 (those are enough, right?
tcg_gen_st_i64 are stores for the host memory, which I presume are
not "explicit" guest stores and therefore would not go through
the soft TLB) with a pre/post pair of helpers.
These helpers first check a bitmap given a masked subset of the physical
address of the access, and if the bit is set, then check a QHT with the full
physaddr. If an entry exists, they lock/unlock the entry's spinlock around
the store, so that no race is possible with an ongoing atomic (atomics always
take their corresponding lock). Overhead is not too bad over cmpxchg, but
most of it comes from the helpers--see these numbers for SPEC:
(NB. the "QEMU" baseline does *not* include QHT for tb_htable and therefore
takes tb_lock around tb_find_fast, that's why it's so slow)
http://imgur.com/a/SoSHQ
"QHT only" means a QHT lookup is performed on every guest store. The win of
having the bitmap before hitting the QHT is quite large. I wonder
if things could be sped up further by performing the bitmap check in
TCG code. Would that be worth exploring? If so, any help on that would
be appreciated (i386 host at least)--I tried, but I'm way out of my element.
E.
On 08/17/2016 10:58 AM, Emilio G. Cota wrote:
>> (2) that we should start a new TB upon encountering a load-exclusive, so
>> that we maximize the chance of the store-exclusive being a part of the same
>> TB and thus have *nothing* extra between the beginning and commit of the
>> transaction.
>
> I don't know how to do this. If it's easy to do, please let me know how
> (for aarch64 at least, since that's the target I'm using).

It's a simple matter of peeking at the next instruction.

One way is to partially decode the insn before advancing the PC.

static void disas_a64_insn(CPUARMState *env, DisasContext *s, int num_insns)
{
    uint32_t insn = arm_ldl_code(env, s->pc, s->sctlr_b);
+
+   if (num_insns > 1 && (insn & xxx) == yyy) {
+       /* Start load-exclusive in a new TB.  */
+       s->is_jmp = DISAS_UPDATE;
+       return;
+   }
    s->insn = insn;
    s->pc += 4;
    ...

Alternately, store num_insns into DisasContext, and do pc -= 4 in
disas_ldst_excl.

r~
On 08/17/2016 11:41 AM, Richard Henderson wrote:
> On 08/17/2016 10:58 AM, Emilio G. Cota wrote:
>>> (2) that we should start a new TB upon encountering a load-exclusive, so
>>> that we maximize the chance of the store-exclusive being a part of the same
>>> TB and thus have *nothing* extra between the beginning and commit of the
>>> transaction.
>>
>> I don't know how to do this. If it's easy to do, please let me know how
>> (for aarch64 at least, since that's the target I'm using).
>
> It's a simple matter of peeking at the next instruction.
>
> One way is to partially decode the insn before advancing the PC.
>
> static void disas_a64_insn(CPUARMState *env, DisasContext *s, int num_insns)
> {
>     uint32_t insn = arm_ldl_code(env, s->pc, s->sctlr_b);
> +
> +   if (num_insns > 1 && (insn & xxx) == yyy) {
> +       /* Start load-exclusive in a new TB.  */
> +       s->is_jmp = DISAS_UPDATE;
> +       return;
> +   }
>     s->insn = insn;
>     s->pc += 4;
>     ...
>
> Alternately, store num_insns into DisasContext, and do pc -= 4 in
> disas_ldst_excl.

Actually, the mask check is the only really viable solution, and it needs to
happen before we do the tcg_gen_insn_start thing.

A couple of other notes, as I've thought about this some more.

If the start and end of the transaction are not in the same TB, the
likelihood of transaction failure should be very near 100%. Consider:

* The TB with the ldrex ends before the strex.

* Since the next TB hasn't been built yet, we'll definitely go through
  tb_find_physical, through the translator, and through the tcg compiler,

  (a) which I think we can definitely assume will exhaust any resources
      associated with the transaction,
  (b) which will abort the transaction,
  (c) which, with the current code, will retry N times, with identical
      results, failing within the compiler each time,
  (d) which, with the current code, will single-step through to the strex,
      as you saw.

* Since we proceed to (d) the first time, we'll never succeed in creating
  the next TB, so we'll always iterate compilation N times, resulting in
  the single-step. This is probably the real slow-down that you see.

Therefore, we must abort any transaction when we exit tcg-generated code,
both through cpu_loop_exit and through the tcg epilogue. We should be able
to use the software-controlled bits associated with the abort to tell what
kind of event led to the abort. However, we must bear in mind that (for
both x86 and ppc at least) we only have an 8-bit abort code, so we can't
pass back a pointer, for instance.

We should think about what kinds of limitations we should accept for
handling ll/sc via transactions.

* How do we handle unpaired ldrexd / ldxp? This is used by the compiler,
  as it's the only way to perform a double-word atomic load. This implies
  that we need some sort of counter, beyond which we stop trying to
  succeed via transaction.

* In order to make normal cmpxchg patterns work, we have to be able to
  handle a branch within an ll/sc sequence. Options:

  * The less complex way is to build a TB, including branches, with a max
    of N insns along the branch-not-taken path, searching for the strex.
    Of course this fails to handle legitimate patterns for arm (and other
    ll/sc guests). However, gcc code generation will generally annotate
    the cmpxchg failure branch as not-taken, so perhaps this will work
    well enough in practice.

  * The more complex way is to build a TB, including branches, with a max
    of N insns along *all* paths, searching for the strex. This runs into
    problems with, among other things, branches crossing pages.

  * The most complex way is to somehow get all of the TBs built, and
    linked together, preferably before we even try executing (and failing
    the transaction in) the first TB.

r~
On Thu, Aug 18, 2016 at 08:38:47 -0700, Richard Henderson wrote:
> A couple of other notes, as I've thought about this some more.
Thanks for spending time on this.
I have a new patchset (will send as a reply to this e-mail in a few
minutes) that has good performance. Its main ideas:
- Use transactions that start on ldrex and finish on strex. On
an exception, end (instead of abort) the ongoing transaction,
if any. There's little point in aborting, since the subsequent
retries will end up in the same exception anyway. This means
the translation of the corresponding blocks might happen via
the fallback path. That's OK, given that subsequent executions
of the TBs will (likely) complete via HTM.
- For the fallback path, add a stop-the-world primitive that stops
all other CPUs, without requiring the calling CPU to exit the CPU loop.
Not breaking from the loop keeps the code simple--we can just
keep translating/executing normally, with the guarantee that
no other CPU can run until we're done.
- The fallback path of the transaction stops the world and then
continues execution (from ldrex) as the only running CPU.
- Only retry when the hardware hints that we may do so. This
ends up being rare (I can only get dozens of retries under
heavy contention, for instance with 'atomic_add-bench -r 1')
Limitations: for now user-mode only, and I have paid no attention
to paired atomics. Also, I'm making no checks for unusual (undefined?)
guest code, such as stray ldrex/strex thrown in there.
Performance optimizations like you suggest (e.g. starting a TB
on ldrex, or using TCG ops for beginning/ending the transaction)
could be implemented, but at least on Intel TSX (the only one I've
tried so far[*]), the transaction buffer seems big enough to not
make these optimizations a necessity.
[*] I tried running HTM primitives on the gcc compile farm's Power8,
but I get an illegal instruction fault on tbegin. I've filed
an issue here to report it: https://gna.org/support/?3369
Some observations:
- The peak number of retries I see is for atomic_add-bench -r 1 -n 16
(on an 8-thread machine) at about ~90 retries. So I set the limit
to 100.
- The lowest success rate I've seen is ~98%, again for atomic_add-bench
under high contention.
Some numbers:
- atomic_add's performance is lower for HTM vs cmpxchg, although under
contention performance gets very similar. The reason for the perf
gap is that xbegin/xend takes more cycles than cmpxchg, especially
under little or no contention; this explains the large difference
for threads=1.
http://imgur.com/5kiT027
As a side note, contended transactions seem to scale worse than contended
cmpxchg when exploiting SMT. But anyway I wouldn't read much into
that.
- For more realistic workloads that gap goes away, as the relative impact
of cmpxchg or transaction delays is lower. For QHT, 1000 keys:
http://imgur.com/l6vcowu
And for SPEC (note that despite being single-threaded, SPEC executes
a lot of atomics, e.g. from mutexes and from forking):
http://imgur.com/W49YMhJ
Performance is essentially identical to that of cmpxchg, but of course
with HTM we get correct emulation.
Thanks for reading this far!
Emilio
diff --git a/linux-user/main.c b/linux-user/main.c
index 9880505..6922faa 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -192,8 +192,9 @@ static void step_atomic(CPUState *cpu)
 
     /* Since we got here, we know that parallel_cpus must be true.  */
     parallel_cpus = false;
-    cpu_exec_step(cpu);
-    parallel_cpus = true;
+    while (!parallel_cpus) {
+        cpu_exec_step(cpu);
+    }
     end_exclusive();
 }
diff --git a/target-arm/helper-a64.c b/target-arm/helper-a64.c
index 8ce518b..af45694 100644
--- a/target-arm/helper-a64.c
+++ b/target-arm/helper-a64.c
@@ -33,6 +33,8 @@
 #include "tcg.h"
 #include <zlib.h> /* For crc32 */
 
+#include <immintrin.h>
+
 /* C2.4.7 Multiply and divide */
 /* special cases for 0 and LLONG_MIN are mandated by the standard */
 uint64_t HELPER(udiv64)(uint64_t num, uint64_t den)
@@ -579,3 +581,43 @@ uint64_t HELPER(paired_cmpxchg64_be)(CPUARMState *env, uint64_t addr,
 
     return !success;
 }
+
+void HELPER(xbegin)(CPUARMState *env)
+{
+    uintptr_t ra = GETPC();
+    int status;
+    int retries = 100;
+
+ retry:
+    status = _xbegin();
+    if (status != _XBEGIN_STARTED) {
+        if (status && retries) {
+            retries--;
+            goto retry;
+        }
+        if (parallel_cpus) {
+            cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
+        }
+    }
+}
+
+void HELPER(xend)(void)
+{
+    if (_xtest()) {
+        _xend();
+    } else {
+        assert(!parallel_cpus);
+        parallel_cpus = true;
+    }
+}
+
+uint64_t HELPER(x_ok)(void)
+{
+    if (_xtest()) {
+        return 1;
+    }
+    if (!parallel_cpus) {
+        return 1;
+    }
+    return 0;
+}
diff --git a/target-arm/helper-a64.h b/target-arm/helper-a64.h
index dd32000..e7ede43 100644
--- a/target-arm/helper-a64.h
+++ b/target-arm/helper-a64.h
@@ -48,3 +48,7 @@ DEF_HELPER_FLAGS_3(crc32_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_3(crc32c_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_4(paired_cmpxchg64_le, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
 DEF_HELPER_FLAGS_4(paired_cmpxchg64_be, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
+
+DEF_HELPER_1(xbegin, void, env)
+DEF_HELPER_0(x_ok, i64)
+DEF_HELPER_0(xend, void)
diff --git a/target-arm/translate-a64.c b/target-arm/translate-a64.c
index 450c359..cfcf440 100644
--- a/target-arm/translate-a64.c
+++ b/target-arm/translate-a64.c
@@ -1760,6 +1760,8 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
     TCGv_i64 tmp = tcg_temp_new_i64();
     TCGMemOp be = s->be_data;
 
+    gen_helper_xbegin(cpu_env);
+
     g_assert(size <= 3);
     if (is_pair) {
         TCGv_i64 hitmp = tcg_temp_new_i64();
@@ -1825,6 +1827,9 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
     tcg_gen_brcond_i64(TCG_COND_NE, addr, cpu_exclusive_addr, fail_label);
 
     tmp = tcg_temp_new_i64();
+    gen_helper_x_ok(tmp);
+    tcg_gen_brcondi_i64(TCG_COND_EQ, tmp, 0, fail_label);
+
     if (is_pair) {
         if (size == 2) {
             TCGv_i64 val = tcg_temp_new_i64();
@@ -1844,16 +1849,14 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
         }
     } else {
         TCGv_i64 val = cpu_reg(s, rt);
-        tcg_gen_atomic_cmpxchg_i64(tmp, addr, cpu_exclusive_val, val,
-                                   get_mem_index(s),
-                                   size | MO_ALIGN | s->be_data);
-        tcg_gen_setcond_i64(TCG_COND_NE, tmp, tmp, cpu_exclusive_val);
+        tcg_gen_qemu_st_i64(val, addr, get_mem_index(s), s->be_data + size);
     }
 
     tcg_temp_free_i64(addr);
-
-    tcg_gen_mov_i64(cpu_reg(s, rd), tmp);
     tcg_temp_free_i64(tmp);
+
+    tcg_gen_movi_i64(cpu_reg(s, rd), 0);
+    gen_helper_xend();
 
     tcg_gen_br(done_label);
 
     gen_set_label(fail_label);
Configure with --extra-cflags="-mrtm"

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 linux-user/main.c          |  5 +++--
 target-arm/helper-a64.c    | 42 ++++++++++++++++++++++++++++++++++++++++++
 target-arm/helper-a64.h    |  4 ++++
 target-arm/translate-a64.c | 15 +++++++++------
 4 files changed, 58 insertions(+), 8 deletions(-)