
[RFC,38/48] translator: implement 2-pass translation

Message ID: 20181025172057.20414-39-cota@braap.org
State: New
Series: Plugin support

Commit Message

Emilio Cota Oct. 25, 2018, 5:20 p.m. UTC
The second pass only occurs when a plugin has subscribed to
TB translation events.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/tcg.h              |  8 ++++
 accel/tcg/translator.c | 91 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 97 insertions(+), 2 deletions(-)

Comments

Alex Bennée Nov. 26, 2018, 3:16 p.m. UTC | #1
Emilio G. Cota <cota@braap.org> writes:

> The second pass only occurs when a plugin has subscribed to
> TB translation events.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  tcg/tcg.h              |  8 ++++
>  accel/tcg/translator.c | 91 +++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 97 insertions(+), 2 deletions(-)
>
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index d5afe25c97..479b57d65f 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -720,6 +720,14 @@ struct TCGContext {
>
>      TCGLabel *exitreq_label;
>
<snip>
>      }
>
> +    if (tb_trans_cb && first_pass) {
> +        qemu_plugin_tb_trans_cb(cpu, plugin_tb);
> +        first_pass = false;
> +        goto translate;
> +    }
> +

So the only reason we are doing this two pass tango is to ensure the
plugin can insert TCG ops before the actual translation has occurred?

I think we can do better, especially as the internal structures of
TCGops are implemented as a list, so ops can be inserted before and
after other ops. This is currently only done by the optimiser; see:

  TCGOp *tcg_op_insert_before(TCGContext *s, TCGOp *op, TCGOpcode opc, int narg);
  TCGOp *tcg_op_insert_after(TCGContext *s, TCGOp *op, TCGOpcode opc, int narg);

and all the base tcg ops end up going to tcg_emit_op, which just appends
to the tail. But if we can come up with a neater way to track the op
used before the current translated expression, we could do away with
two-pass translation completely.

>      /* Emit code to exit the TB, as indicated by db->is_jmp.  */
>      ops->tb_stop(db, cpu);
>      gen_tb_end(db->tb, db->num_insns - bp_insn);


--
Alex Bennée
Emilio Cota Nov. 27, 2018, 2:16 a.m. UTC | #2
On Mon, Nov 26, 2018 at 15:16:00 +0000, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
(snip)
> > +    if (tb_trans_cb && first_pass) {
> > +        qemu_plugin_tb_trans_cb(cpu, plugin_tb);
> > +        first_pass = false;
> > +        goto translate;
> > +    }
> 
> So the only reason we are doing this two pass tango is to ensure the
> plugin can insert TCG ops before the actual translation has occurred?

Not only. The idea is to provide plugins with well-defined TBs,
i.e. the instruction sizes and contents can be queried by the plugin
before the plugin decides how/where to instrument the TB.

Since in the targets we generate TCG code and also generate
host code in a single shot (TranslatorOps.translate_insn),
the 2-pass approach is a workaround to first get the
well-defined TB, and in the second pass inject the instrumentation
in the appropriate places.

This is a bit of a waste but given that it only happens at
translate time, it can have negligible performance impact --
I measured a 0.13% gmean slowdown for SPEC06int.

> I think we can do better, especially as the internal structures of
> TCGops are implemented as a list, so ops can be inserted before and
> after other ops. This is currently only done by the optimiser; see:
> 
> 
>   TCGOp *tcg_op_insert_before(TCGContext *s, TCGOp *op, TCGOpcode opc, int narg);
>   TCGOp *tcg_op_insert_after(TCGContext *s, TCGOp *op, TCGOpcode opc, int narg);
> 
> and all the base tcg ops end up going to tcg_emit_op, which just appends
> to the tail. But if we can come up with a neater way to track the op
> used before the current translated expression, we could do away with
> two-pass translation completely.

This list of ops is generated via TranslatorOps.translate_insn.
Unfortunately, this function also defines the guest insns that form the TB.
Decoupling the two actions ("define the TB" and "translate to TCG ops")
would be ideal, but I didn't want to rewrite all the target translators
in QEMU, and opted instead for the 2-pass approach as a compromise.

Thanks,

		Emilio
Alex Bennée Nov. 27, 2018, 2:48 p.m. UTC | #3
Emilio G. Cota <cota@braap.org> writes:

> On Mon, Nov 26, 2018 at 15:16:00 +0000, Alex Bennée wrote:
>> Emilio G. Cota <cota@braap.org> writes:
> (snip)
>> > +    if (tb_trans_cb && first_pass) {
>> > +        qemu_plugin_tb_trans_cb(cpu, plugin_tb);
>> > +        first_pass = false;
>> > +        goto translate;
>> > +    }
>>
>> So the only reason we are doing this two pass tango is to ensure the
>> plugin can insert TCG ops before the actual translation has occurred?
>
> Not only. The idea is to provide plugins with well-defined TBs,
> i.e. the instruction sizes and contents can be queried by the plugin
> before the plugin decides how/where to instrument the TB.

Hmmm, this seems a little too close to internal knowledge of the TCG. Is
the idea that a plugin might make a different decision based on the
number of a particular type of instruction in the translation block?

This seems like it would get broken if we wanted to implement other
types of TranslationBlock (e.g. hot-blocks with multiple exits for the
non-hot case).

That said, looking at the examples using it so far, they don't seem to
do more than look at the total number of instructions for a given
translation block.

> Since in the targets we generate TCG code and also generate
> host code in a single shot (TranslatorOps.translate_insn),
> the 2-pass approach is a workaround to first get the
> well-defined TB, and in the second pass inject the instrumentation
> in the appropriate places.

Richard's suggestion of providing a central translator_ldl_code could
keep the bookkeeping of each instruction location and contents in the
core translator. With a little tweaking to the TCG we could then insert
our instrumentation at the end of the pass with all the knowledge we
want to export to the plugin.

Inserting instrumentation after instructions have executed will be
trickier though due to reasons Peter mentioned on IRC.

> This is a bit of a waste but given that it only happens at
> translate time, it can have negligible performance impact --
> I measured a 0.13% gmean slowdown for SPEC06int.

I'm less concerned about efficiency than about complicating the code, especially
if we are baking in concepts that restrict our freedom to change code
generation around going forward.

>
>> I think we can do better, especially as the internal structures of
>> TCGops are implemented as a list, so ops can be inserted before and
>> after other ops. This is currently only done by the optimiser; see:
>>
>>
>>   TCGOp *tcg_op_insert_before(TCGContext *s, TCGOp *op, TCGOpcode opc, int narg);
>>   TCGOp *tcg_op_insert_after(TCGContext *s, TCGOp *op, TCGOpcode opc, int narg);
>>
>> and all the base tcg ops end up going to tcg_emit_op, which just appends
>> to the tail. But if we can come up with a neater way to track the op
>> used before the current translated expression, we could do away with
>> two-pass translation completely.
>
> This list of ops is generated via TranslatorOps.translate_insn.
> Unfortunately, this function also defines the guest insns that form the TB.
> Decoupling the two actions ("define the TB" and "translate to TCG ops")
> would be ideal, but I didn't want to rewrite all the target translators
> in QEMU, and opted instead for the 2-pass approach as a compromise.

I don't quite follow. When we've done all our translation into TCG ops
haven't we by definition defined the TB?

Maybe the interface shouldn't be per-insn and per-TB but just an
arbitrary chunk of instructions. We could call the plugin with a list of
instructions with some opaque data that can be passed back to the plugin
APIs to allow insertion of instrumentation at the appropriate points.
The initial implementation would be a single-pass and called after the
TCG op generation. An instruction counter plugin would then be free to
insert counter instrumentation as frequently or infrequently as it
wants. These chunks wouldn't have to be tied to the internals of TCG and
in the worst case we could just inform the plugin in 1 insn chunks
without having to change the API?

What do you think?


--
Alex Bennée
Emilio Cota Nov. 27, 2018, 7:06 p.m. UTC | #4
On Tue, Nov 27, 2018 at 14:48:11 +0000, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
> > On Mon, Nov 26, 2018 at 15:16:00 +0000, Alex Bennée wrote:
> >> Emilio G. Cota <cota@braap.org> writes:
> > (snip)
> >> > +    if (tb_trans_cb && first_pass) {
> >> > +        qemu_plugin_tb_trans_cb(cpu, plugin_tb);
> >> > +        first_pass = false;
> >> > +        goto translate;
> >> > +    }
> >>
> >> So the only reason we are doing this two pass tango is to ensure the
> >> plugin can insert TCG ops before the actual translation has occurred?
> >
> > Not only. The idea is to provide plugins with well-defined TBs,
> > i.e. the instruction sizes and contents can be queried by the plugin
> > before the plugin decides how/where to instrument the TB.
> 
> Hmmm, this seems a little too close to internal knowledge of the TCG.

As far as plugins are concerned, a "TB" is a sequence of instructions
that will (unless there are exceptions) execute in sequence. That is,
single-entry and single-exit.
QEMU is free to cut those in any way it wants, and there's no need
for a 1:1 mapping between the "TBs" exported to plugins and
the "TranslationBlock"s we manage internally in QEMU.

I thought about calling them "basic blocks", but that could confuse
users because not all TBs meet the definition of a basic block: TBs
might end in a non-branch instruction, whereas basic blocks don't.

So I kept the TB name, but note that all plugins can assume about
TBs is that they are single-entry and single-exit; that's it.
Different QEMU releases will cut TBs differently, and plugins
will cope with that perfectly fine. IOW, this imposes no
restrictions on TCG's implementation.

> Is the idea that a plugin might make a different decision based on the
> number of a particular type of instruction in the translation block?

Plugins will make their decisions based on the TB's contents.
For that, they need to know what instructions form the TB,
and be able to disassemble them.

> This seems like it would get broken if we wanted to implement other
> types of TranslationBlock (e.g. hot-blocks with multiple exits for the
> non-hot case).

Again, let's dissociate struct TranslationBlock vs. what we export;
let's call the latter "plugin TB's" for discussion's sake.

If we implemented single-entry, multiple-exit traces, we could implement
that in any way we wanted (e.g. expanding TranslationBlock, or grouping
them into TranslationTraces or whatever we called them). Plugins
would then be exposed to an interface similar to what Pin/DynamoRIO
offer, that is, plugins can subscribe to "Trace" translation events,
where Traces are lists of "plugin TB's".

Besides, I'm OK with having an API that we can break in the future.
(Pin/DynamoRIO do it all the time.)

> That said, looking at the examples using it so far, they don't seem to
> do more than look at the total number of instructions for a given
> translation block.

OK, so I'm appending a more complicated example, where we use capstone
to look at the instructions in a TB at translation time. (Just for
illustration purposes; we then register an empty callback.)

> > Since in the targets we generate TCG code and also generate
> > host code in a single shot (TranslatorOps.translate_insn),
> > the 2-pass approach is a workaround to first get the
> > well-defined TB, and in the second pass inject the instrumentation
> > in the appropriate places.
> 
> Richard's suggestion of providing a central translator_ldl_code could
> keep the bookkeeping of each instruction location and contents in the
> core translator.

Yes, at least for most targets I think that will work.

> With a little tweaking to the TCG we could then insert
> our instrumentation at the end of the pass with all the knowledge we
> want to export to the plugin.

After .translate_insn has returned for the last instruction, how
do we insert the instrumentation that the plugin wants--say, a TB
callback at the beginning of the TB, memory callbacks for the
2nd instruction, and an insn callback before the 3rd instruction
executes?

I don't see how we could achieve that with "a little tweaking"
instead of a 2nd pass, but I'd love to be wrong =)

> Inserting instrumentation after instructions have executed will be
> trickier though due to reasons Peter mentioned on IRC.

Particularly the last instruction in a TB; by the time we return
from .translate_insn, all code we insert will most likely be dead.

> > This is a bit of a waste but given that it only happens at
> > translate time, it can have negligible performance impact --
> > I measured a 0.13% gmean slowdown for SPEC06int.
> 
> I'm less concerned about efficiency than about complicating the code, especially
> if we are baking in concepts that restrict our freedom to change code
> generation around going forward.

Agreed. I don't think exposing "plugin TBs" for now, i.e. single-entry,
single-exit blocks of insns, restricts our future designs. And if all
else fails, we should reserve the right to break the API (e.g. via
new version numbers).

> >> I think we can do better, especially as the internal structures of
> >> TCGops are implemented as a list, so ops can be inserted before and
> >> after other ops. This is currently only done by the optimiser; see:
> >>
> >>
> >>   TCGOp *tcg_op_insert_before(TCGContext *s, TCGOp *op, TCGOpcode opc, int narg);
> >>   TCGOp *tcg_op_insert_after(TCGContext *s, TCGOp *op, TCGOpcode opc, int narg);
> >>
> >> and all the base tcg ops end up going to tcg_emit_op, which just appends
> >> to the tail. But if we can come up with a neater way to track the op
> >> used before the current translated expression, we could do away with
> >> two-pass translation completely.
> >
> > This list of ops is generated via TranslatorOps.translate_insn.
> > Unfortunately, this function also defines the guest insns that form the TB.
> > Decoupling the two actions ("define the TB" and "translate to TCG ops")
> > would be ideal, but I didn't want to rewrite all the target translators
> > in QEMU, and opted instead for the 2-pass approach as a compromise.
> 
> I don't quite follow. When we've done all our translation into TCG ops
> haven't we by definition defined the TB?

Yes, that's precisely my point.

The part that's missing is that once the TB is defined, we want to
insert instrumentation. Unfortunately, the "TCG ops" we get after
the 1st pass (no instrumentation) are very different from the list
of "TCG ops" that we get after the 2nd pass (after having injected
instrumentation). Could we get the same output as the 2nd pass,
just by using the output of the 1st and the list of injection points?
It's probably possible, but it seems very hard to do. (Think for
instance of memory callbacks, and further the complication when
they use helpers.)

The only reasonable way to do this, I think, would be to leave behind
"placeholder" TCG ops that we could then scan to add further TCG ops.
But you'll agree with me that the 2nd pass is simpler :P

> Maybe the interface shouldn't be per-insn and per-TB but just an
> arbitrary chunk of instructions. We could call the plugin with a list of
> instructions with some opaque data that can be passed back to the plugin
> APIs to allow insertion of instrumentation at the appropriate points.

This is what this series implements. It just happens that these
chunks match our internal translation blocks, but there's no need for
that (and for now, no good reason for them not to match.)

> The initial implementation would be a single-pass and called after the
> TCG op generation. An instruction counter plugin would then be free to
> insert counter instrumentation as frequently or infrequently as it
> wants. These chunks wouldn't have to be tied to the internals of TCG and
> in the worst case we could just inform the plugin in 1 insn chunks
> without having to change the API?
> 
> What do you think?

With a single pass, all you can do is add a callback with a descriptor
of what just executed. So the "instruction counter" example would work OK.

But what about a plugin that needed only memory accesses performed
by, say, xchg instructions? It'd have to subscribe to *all* memory
accesses, because after a TB is generated we cannot go back to
instrument something (due to the single pass), and then somehow figure
out a way to discard non-xchg-generated memory accesses at run-time.

So having instruction-grained injection is very valuable, and it's
not surprising that Pin/DynamoRIO provide that in their API.
With this series I'm trying to provide something similar.

Thanks,

		Emilio

---
#include <inttypes.h>
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>

#include <capstone/capstone.h>
#include <qemu-plugin.h>

struct tb {
	size_t n_insns;
};

static csh cap_handle;
static cs_insn *cap_insn;

/* Execution-time callback; deliberately empty, registered only to show
 * that translation-time data (udata) can be handed to it. */
static void vcpu_tb_exec(unsigned int cpu_index, void *udata)
{ }

static void vcpu_tb_trans(qemu_plugin_id_t id, unsigned int cpu_index, struct qemu_plugin_tb *tb)
{
	struct tb *desc;
	size_t n = qemu_plugin_tb_n_insns(tb);
	size_t i;

	for (i = 0; i < n; i++) {
		struct qemu_plugin_insn *insn = qemu_plugin_tb_get_insn(tb, i);
		size_t size = qemu_plugin_insn_size(insn);
		const uint8_t *code = qemu_plugin_insn_data(insn);
		uint64_t offset = 0;
		bool success;

		success = cs_disasm_iter(cap_handle, &code, &size, &offset, cap_insn);
		assert(success);
	}
	desc = malloc(sizeof(*desc));
	assert(desc);
	desc->n_insns = n;

	qemu_plugin_register_vcpu_tb_exec_cb(tb, vcpu_tb_exec,
					     QEMU_PLUGIN_CB_NO_REGS, desc);
}

QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id, int argc, char **argv)
{
	if (cs_open(CS_ARCH_X86, CS_MODE_64, &cap_handle) != CS_ERR_OK) {
		return -1;
	}
	cap_insn = cs_malloc(cap_handle);
	if (cap_insn == NULL) {
		return -1;
	}
	qemu_plugin_register_vcpu_tb_trans_cb(id, vcpu_tb_trans);
	return 0;
}
Emilio Cota Nov. 28, 2018, 2:30 a.m. UTC | #5
On Tue, Nov 27, 2018 at 14:06:57 -0500, Emilio G. Cota wrote:
> On Tue, Nov 27, 2018 at 14:48:11 +0000, Alex Bennée wrote:
> > With a little tweaking to the TCG we could then insert
> > our instrumentation at the end of the pass with all the knowledge we
> > want to export to the plugin.
> 
> After .translate_insn has returned for the last instruction, how
> do we insert the instrumentation that the plugin wants--say, a TB
> callback at the beginning of the TB, memory callbacks for the
> 2nd instruction, and an insn callback before the 3rd instruction
> executes?
> 
> I don't see how we could achieve that with "a little tweaking"
> instead of a 2nd pass, but I'd love to be wrong =)

(snip)
> > I don't quite follow. When we've done all our translation into TCG ops
> > haven't we by definition defined the TB?
> 
> Yes, that's precisely my point.
> 
> The part that's missing is that once the TB is defined, we want to
> insert instrumentation. Unfortunately, the "TCG ops" we get after
> the 1st pass (no instrumentation) are very different from the list
> of "TCG ops" that we get after the 2nd pass (after having injected
> instrumentation). Could we get the same output of the 2nd pass,
> just by using the output of the 1st and the list of injection points?
> It's probably possible, but it seems very hard to do. (Think for
> instance of memory callbacks, and further the complication of when
> they use helpers).
> 
> The only reasonable way to do this I think would be to leave behind
> "placeholder" TCG ops, that then we could scan to add further TCG ops.
> But you'll agree with me that the 2nd pass is simpler :P

It might not be that much simpler after all!

I am exploring the approach you suggested, that is IIUC to do a
single pass and then walk the list of Ops, adding (and
reordering) Ops based on what the plugins have requested.

Contrary to what I wrote earlier today (quoted above), this
approach seems quite promising, and certainly less ugly
than doing the 2 passes.

I just wrote some code to go over the list and add TB callbacks,
which go right before the first insn_start Op. The code is hack-ish
in that we first generate the TCG ops we need, which get added to
the end of the ops list, and then we go over those and move them
to where we want them to be (before insn_start, in this case).
But it works and it's less of a hack than doing the whole 2nd pass.

Insn callbacks will be trivial to implement this way; memory
callbacks should be harder because there are several qemu_ld/st
opcodes, but it should be doable; last, memory instrumentation
of helpers might actually be easier than with the 2 passes, because here
we just have to look for a Call TCG op to know whether a guest
instruction uses helpers, and if it does we can wrap the call
with the helpers to generate the setting/resetting of
CPUState.plugin_mem_cbs.

I'll try to do what's in the previous paragraph tomorrow -- I'm
appending what I did so far.

Thanks,

		Emilio
---
diff --git a/accel/tcg/translator.c b/accel/tcg/translator.c
index ee9e179e14..232f645cd4 100644
--- a/accel/tcg/translator.c
+++ b/accel/tcg/translator.c
@@ -18,6 +18,7 @@
 #include "exec/gen-icount.h"
 #include "exec/log.h"
 #include "exec/translator.h"
+#include "exec/plugin-gen.h"
 
 /* Pairs with tcg_clear_temp_count.
    To be called by #TranslatorOps.{translate_insn,tb_stop} if
@@ -142,6 +143,11 @@ void translator_loop(const TranslatorOps *ops, DisasContextBase *db,
     gen_tb_end(db->tb, db->num_insns - bp_insn);
 
     if (plugin_enabled) {
+        /* collect the plugins' instrumentation */
+        qemu_plugin_tb_trans_cb(cpu, &tcg_ctx->plugin_tb);
+        /* inject instrumentation */
+        qemu_plugin_gen_inject();
+        /* done */
         tcg_ctx->plugin_insn = NULL;
     }
 
diff --git a/accel/tcg/plugin-gen.c b/accel/tcg/plugin-gen.c
index 75f182be37..cb5dbadc3c 100644
--- a/accel/tcg/plugin-gen.c
+++ b/accel/tcg/plugin-gen.c
@@ -1,4 +1,5 @@
 #include "qemu/osdep.h"
+#include "qemu/queue.h"
 #include "cpu.h"
 #include "tcg/tcg.h"
 #include "tcg/tcg-op.h"
@@ -169,8 +170,61 @@ static void gen_vcpu_udata_cb(struct qemu_plugin_dyn_cb *cb)
     tcg_temp_free_i32(cpu_index);
 }
 
-void qemu_plugin_gen_vcpu_udata_callbacks(struct qemu_plugin_dyn_cb_arr *arr)
+/* check that @a comes before @b */
+static inline void ops_check(const TCGOp *a, const TCGOp *b)
 {
+    const TCGOp *op;
+    bool seen_a = false;
+    bool seen_b = false;
+
+    tcg_debug_assert(a != b);
+    QTAILQ_FOREACH(op, &tcg_ctx->ops, link) {
+        if (op == a) {
+            tcg_debug_assert(!seen_b);
+            seen_a = true;
+        } else if (op == b) {
+            tcg_debug_assert(seen_a);
+            seen_b = true;
+            break;
+        }
+    }
+}
+
+/*
+ * Move ops from @from to @dest.
+ * @from must come after @dest in the list.
+ */
+static void ops_move(TCGOp *dest, TCGOp *from)
+{
+    TCGOp *curr;
+
+#ifdef CONFIG_DEBUG_TCG
+    ops_check(dest, from);
+#endif
+
+    if (QTAILQ_NEXT(dest, link) == from) {
+        /* nothing to do */
+        return;
+    }
+    curr = from;
+    do {
+        TCGOp *next = QTAILQ_NEXT(curr, link);
+
+        /*
+         * This could be done more efficiently since (@from,end) will
+         * remain linked together, but most likely we just have a few
+         * elements, so this is simple enough.
+         */
+        QTAILQ_REMOVE(&tcg_ctx->ops, curr, link);
+        QTAILQ_INSERT_AFTER(&tcg_ctx->ops, dest, curr, link);
+        dest = curr;
+        curr = next;
+    } while (curr);
+}
+
+static void inject_vcpu_udata_callbacks(struct qemu_plugin_dyn_cb_arr *arr, TCGOp *dest)
+{
+    TCGOp *last_op = tcg_last_op();
     size_t i;
 
     for (i = 0; i < arr->n; i++) {
@@ -187,6 +241,10 @@ void qemu_plugin_gen_vcpu_udata_callbacks(struct qemu_plugin_dyn_cb_arr *arr)
             g_assert_not_reached();
         }
     }
+    g_assert(tcg_last_op() != last_op);
+    last_op = QTAILQ_NEXT(last_op, link);
+    g_assert(last_op);
+    ops_move(dest, last_op);
 }
 
 /*
@@ -228,3 +286,26 @@ void qemu_plugin_gen_disable_mem_helpers(void)
                                                         plugin_mem_cbs));
     tcg_temp_free_ptr(ptr);
 }
+
+void qemu_plugin_gen_inject(void)
+{
+    struct qemu_plugin_tb *plugin_tb = &tcg_ctx->plugin_tb;
+
+    /* TB exec callbacks */
+    if (plugin_tb->cbs.n) {
+        TCGOp *op;
+
+        /* Insert the callbacks right before the first insn_start */
+        QTAILQ_FOREACH(op, &tcg_ctx->ops, link) {
+            if (op->opc == INDEX_op_insn_start) {
+                break;
+            }
+        }
+        /* a TB without insn_start? Something went wrong */
+        g_assert(op);
+        op = QTAILQ_PREV(op, TCGOpHead, link);
+        /* we should have called gen_tb_start before the 1st insn */
+        g_assert(op);
+        inject_vcpu_udata_callbacks(&plugin_tb->cbs, op);
+    }
+}
Alex Bennée Nov. 28, 2018, 12:50 p.m. UTC | #6
Emilio G. Cota <cota@braap.org> writes:

> On Tue, Nov 27, 2018 at 14:06:57 -0500, Emilio G. Cota wrote:
>> On Tue, Nov 27, 2018 at 14:48:11 +0000, Alex Bennée wrote:
>> > With a little tweaking to the TCG we could then insert
>> > our instrumentation at the end of the pass with all the knowledge we
>> > want to export to the plugin.
>>
>> After .translate_insn has returned for the last instruction, how
>> do we insert the instrumentation that the plugin wants--say, a TB
>> callback at the beginning of the TB, memory callbacks for the
>> 2nd instruction, and an insn callback before the 3rd instruction
>> executes?
>>
>> I don't see how we could achieve that with "a little tweaking"
>> instead of a 2nd pass, but I'd love to be wrong =)
>
> (snip)
>> > I don't quite follow. When we've done all our translation into TCG ops
>> > haven't we by definition defined the TB?
>>
>> Yes, that's precisely my point.
>>
>> The part that's missing is that once the TB is defined, we want to
>> insert instrumentation. Unfortunately, the "TCG ops" we get after
>> the 1st pass (no instrumentation) are very different from the list
>> of "TCG ops" that we get after the 2nd pass (after having injected
>> instrumentation). Could we get the same output of the 2nd pass,
>> just by using the output of the 1st and the list of injection points?
>> It's probably possible, but it seems very hard to do. (Think for
>> instance of memory callbacks, and further the complication of when
>> they use helpers).
>>
>> The only reasonable way to do this I think would be to leave behind
>> "placeholder" TCG ops, that then we could scan to add further TCG ops.
>> But you'll agree with me that the 2nd pass is simpler :P
>
> It might not be that much simpler after all!
>
> I am exploring the approach you suggested, that is IIUC to do a
> single pass and then walk the list of Ops, adding (and
> reordering) Ops based on what the plugins have requested.
>
> Contrary to what I wrote earlier today (quoted above), this
> approach seems quite promising, and certainly less ugly
> than doing the 2 passes.
>
> I just wrote some code to go over the list and add TB callbacks,
> which go right before the first insn_start Op. The code is hack-ish
> in that we first generate the TCG ops we need, which get added to
> the end of the ops list, and then we go over those and move them
> to where we want them to be (before insn_start, in this case).
> But it works and it's less of a hack than doing the whole 2nd pass.

But we should be able to insert the ops directly in the right place.
That is the whole point of it being a list, right? ;-)

> Insn callbacks will be trivial to implement this way; memory
> callbacks should be harder because there are several qemu_ld/st
> opcodes, but it should be doable;

I was thinking about this last night. I wonder if we need to tag the
memory tcg ops so we can find them afterwards during the insertion
phase - but maybe the opcode itself provides enough information.

> last, memory instrumentation
> of helpers might actually be easier than with the 2 passes, because here
> we just have to look for a Call TCG op to know whether a guest
> instruction uses helpers, and if it does we can wrap the call
> with the helpers to generate the setting/resetting of
> CPUState.plugin_mem_cbs.

So merging the two helper calls into one from the target code?

>
> I'll try to do what's in the previous paragraph tomorrow -- I'm
> appending what I did so far.
>
> Thanks,
>
> 		Emilio
> ---
> diff --git a/accel/tcg/translator.c b/accel/tcg/translator.c
> index ee9e179e14..232f645cd4 100644
> --- a/accel/tcg/translator.c
> +++ b/accel/tcg/translator.c
> @@ -18,6 +18,7 @@
>  #include "exec/gen-icount.h"
>  #include "exec/log.h"
>  #include "exec/translator.h"
> +#include "exec/plugin-gen.h"
>
>  /* Pairs with tcg_clear_temp_count.
>     To be called by #TranslatorOps.{translate_insn,tb_stop} if
> @@ -142,6 +143,11 @@ void translator_loop(const TranslatorOps *ops, DisasContextBase *db,
>      gen_tb_end(db->tb, db->num_insns - bp_insn);
>
>      if (plugin_enabled) {
> +        /* collect the plugins' instrumentation */
> +        qemu_plugin_tb_trans_cb(cpu, &tcg_ctx->plugin_tb);
> +        /* inject instrumentation */
> +        qemu_plugin_gen_inject();
> +        /* done */
>          tcg_ctx->plugin_insn = NULL;
>      }
>
> diff --git a/accel/tcg/plugin-gen.c b/accel/tcg/plugin-gen.c
> index 75f182be37..cb5dbadc3c 100644
> --- a/accel/tcg/plugin-gen.c
> +++ b/accel/tcg/plugin-gen.c
> @@ -1,4 +1,5 @@
>  #include "qemu/osdep.h"
> +#include "qemu/queue.h"
>  #include "cpu.h"
>  #include "tcg/tcg.h"
>  #include "tcg/tcg-op.h"
> @@ -169,8 +170,61 @@ static void gen_vcpu_udata_cb(struct qemu_plugin_dyn_cb *cb)
>      tcg_temp_free_i32(cpu_index);
>  }
>
> -void qemu_plugin_gen_vcpu_udata_callbacks(struct qemu_plugin_dyn_cb_arr *arr)
> +/* check that @a comes before @b */
> +static inline void ops_check(const TCGOp *a, const TCGOp *b)
>  {
> +    const TCGOp *op;
> +    bool seen_a = false;
> +    bool seen_b = false;
> +
> +    tcg_debug_assert(a != b);
> +    QTAILQ_FOREACH(op, &tcg_ctx->ops, link) {
> +        if (op == a) {
> +            tcg_debug_assert(!seen_b);
> +            seen_a = true;
> +        } else if (op == b) {
> +            tcg_debug_assert(seen_a);
> +            seen_b = true;
> +            break;
> +        }
> +    }
> +}
> +
> +/*
> + * Move ops from @from to @dest.
> + * @from must come after @dest in the list.
> + */
> +static void ops_move(TCGOp *dest, TCGOp *from)
> +{
> +    TCGOp *curr;
> +
> +#ifdef CONFIG_DEBUG_TCG
> +    ops_check(dest, from);
> +#endif
> +
> +    if (QTAILQ_NEXT(dest, link) == from) {
> +        /* nothing to do */
> +        return;
> +    }
> +    curr = from;
> +    do {
> +        TCGOp *next = QTAILQ_NEXT(curr, link);
> +
> +        /*
> +         * This could be done more efficiently since (@from,end) will
> +         * remain linked together, but most likely we just have a few
> +         * elements, so this is simple enough.
> +         */
> +        QTAILQ_REMOVE(&tcg_ctx->ops, curr, link);
> +        QTAILQ_INSERT_AFTER(&tcg_ctx->ops, dest, curr, link);
> +        dest = curr;
> +        curr = next;
> +    } while (curr);
> +}
> +
> +static void inject_vcpu_udata_callbacks(struct qemu_plugin_dyn_cb_arr *arr, TCGOp *dest)
> +{
> +    TCGOp *last_op = tcg_last_op();
>      size_t i;
>
>      for (i = 0; i < arr->n; i++) {
> @@ -187,6 +241,10 @@ void qemu_plugin_gen_vcpu_udata_callbacks(struct qemu_plugin_dyn_cb_arr *arr)
>              g_assert_not_reached();
>          }
>      }
> +    g_assert(tcg_last_op() != last_op);
> +    last_op = QTAILQ_NEXT(last_op, link);
> +    g_assert(last_op);
> +    ops_move(dest, last_op);
>  }
>
>  /*
> @@ -228,3 +286,26 @@ void qemu_plugin_gen_disable_mem_helpers(void)
>                                                          plugin_mem_cbs));
>      tcg_temp_free_ptr(ptr);
>  }
> +
> +void qemu_plugin_gen_inject(void)
> +{
> +    struct qemu_plugin_tb *plugin_tb = &tcg_ctx->plugin_tb;
> +
> +    /* TB exec callbacks */
> +    if (plugin_tb->cbs.n) {
> +        TCGOp *op;
> +
> +        /* Insert the callbacks right before the first insn_start */
> +        QTAILQ_FOREACH(op, &tcg_ctx->ops, link) {
> +            if (op->opc == INDEX_op_insn_start) {
> +                break;
> +            }
> +        }
> +        /* a TB without insn_start? Something went wrong */
> +        g_assert(op);
> +        op = QTAILQ_PREV(op, TCGOpHead, link);
> +        /* we should have called gen_tb_start before the 1st insn */
> +        g_assert(op);
> +        inject_vcpu_udata_callbacks(&plugin_tb->cbs, op);
> +    }
> +}
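The ops_move() helper quoted above splices a run of ops from the tail of the list to just after an earlier anchor. Here is a standalone sketch of that splice loop using the plain BSD <sys/queue.h> TAILQ macros (QEMU's QTAILQ is a close relative); all names here (node, oplist, move_after, run_demo) are invented for illustration and are not QEMU code:

```c
#include <stddef.h>
#include <sys/queue.h>

struct node {
    int val;
    TAILQ_ENTRY(node) link;
};
TAILQ_HEAD(oplist, node);

/* Relocate the run starting at @from so that it follows @dest,
 * mirroring the remove/insert-after loop in ops_move() above. */
static void move_after(struct oplist *head, struct node *dest,
                       struct node *from)
{
    struct node *curr = from;

    if (TAILQ_NEXT(dest, link) == from) {
        return; /* already adjacent: nothing to do */
    }
    while (curr) {
        struct node *next = TAILQ_NEXT(curr, link);

        TAILQ_REMOVE(head, curr, link);
        TAILQ_INSERT_AFTER(head, dest, curr, link);
        dest = curr;
        curr = next;
    }
}

/* Build 1,2,3,4; splice the last node after the first; report order. */
static int run_demo(int *out)
{
    static struct node n[4];
    struct oplist head = TAILQ_HEAD_INITIALIZER(head);
    struct node *it;
    int i;

    for (i = 0; i < 4; i++) {
        n[i].val = i + 1;
        TAILQ_INSERT_TAIL(&head, &n[i], link);
    }
    move_after(&head, &n[0], &n[3]);
    i = 0;
    TAILQ_FOREACH(it, &head, link) {
        out[i++] = it->val;
    }
    return i;
}
```

The same pattern applies to the patch: the freshly generated callback ops sit at the tail, and a single pass of remove/insert-after walks them into place before insn_start.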


--
Alex Bennée
Emilio Cota Nov. 28, 2018, 3:03 p.m. UTC | #7
On Wed, Nov 28, 2018 at 12:50:23 +0000, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
> > I just wrote some code to go over the list and add TB callbacks,
> > which go right before the first insn_start Op. The code is hack-ish
> > in that we first generate the TCG ops we need, which get added to
> > the end of the ops list, and then we go over those and move them
> > to where we want them to be (before insn_start, in this case).
> > But it works and it's less of a hack than doing the whole 2nd pass.
> 
> But we should be able to insert the ops directly in the right place.
> That is the whole point of being a list right ;-)

Right, it's just hard sometimes to know exactly where to insert.

> > Insn callbacks will be trivial to implement this way; memory
> > callbacks should be harder because there are several qemu_ld/st
> > opcodes, but it should be doable;
> 
> I was thinking about this last night. I wonder if we need to tag the
> memory tcg ops so we can find them afterwards during the insertion
> phase - but maybe the opcode itself provides enough information.

We should be able to extract the info from the memop argument,
I think.
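To illustrate the idea of recovering access details from a packed memop-style value, here is a sketch with an invented bit layout (DEMO_SIZE_MASK, DEMO_SIGN_BIT); it does not claim to match QEMU's actual MemOp encoding, only the decoding pattern being discussed:

```c
#include <stdbool.h>

/* Invented layout: low two bits hold log2(access size), bit 2 the
 * sign. Purely illustrative, not QEMU's MemOp definition. */
enum { DEMO_SIZE_MASK = 0x3, DEMO_SIGN_BIT = 0x4 };

static unsigned demo_access_bytes(unsigned memop)
{
    return 1u << (memop & DEMO_SIZE_MASK);
}

static bool demo_is_signed(unsigned memop)
{
    return (memop & DEMO_SIGN_BIT) != 0;
}
```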

> > last, memory instrumentation
> > of helpers might actually be easier than with the 2 passes, because here
> > we just have to look for a Call TCG op to know whether a guest
> > instruction uses helpers, and if it does we can wrap the call
> > with the helpers to generate the setting/resetting of
> > CPUState.plugin_mem_cbs.
> 
> So merging the two helper calls into one from the target code?

Actually we don't need helpers to set/reset CPUState.plugin_mem_cbs;
we do that in TCG directly. So here we could just add the "set"
code right after "insn_start", and the "reset" code at the very
end of the translation (right before tb_exit/goto_tb etc).
The "reset" might still be dead code, but that is unavoidable
because the helper might do a longjmp. But we can fix that by
resetting the variable when returning from the jump.
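The longjmp case described above can be modeled with plain setjmp/longjmp: the straight-line "reset" is dead when the helper unwinds, so the pointer is cleared again after the jump returns. Everything here (demo_exec_tb, demo_mem_cbs, helper_that_unwinds) is an invented stand-in, not the proposed QEMU code:

```c
#include <setjmp.h>
#include <stddef.h>

static jmp_buf demo_env;
static const void *demo_mem_cbs;

/* Stands in for a helper that exits the TB early via longjmp. */
static void helper_that_unwinds(void)
{
    longjmp(demo_env, 1);
}

/* Emulates the generated code path: set the pointer, call the helper,
 * and (on the straight-line path) clear it before the TB exit. When
 * the helper unwinds, that clear is skipped, so it is redone after
 * setjmp returns, as suggested above. */
static int demo_exec_tb(const void *cbs)
{
    if (setjmp(demo_env)) {
        demo_mem_cbs = NULL; /* reset when returning from the jump */
        return 1;
    }
    demo_mem_cbs = cbs;      /* "set" emitted after insn_start */
    helper_that_unwinds();
    demo_mem_cbs = NULL;     /* "reset" before tb_exit; dead here */
    return 0;
}
```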

Thanks,

		E.

Patch

diff --git a/tcg/tcg.h b/tcg/tcg.h
index d5afe25c97..479b57d65f 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -720,6 +720,14 @@  struct TCGContext {
 
     TCGLabel *exitreq_label;
 
+    /*
+     * We keep one plugin_tb struct per TCGContext. Note that on every TB
+     * translation we clear but do not free its contents; this way we
+     * avoid a lot of malloc/free churn, since after a few TB's it's
+     * unlikely that we'll need to allocate either more instructions or more
+     * space for instructions (for variable-instruction-length ISAs).
+     */
+    struct qemu_plugin_tb plugin_tb;
     struct qemu_plugin_dyn_cb_arr *plugin_mem_cb;
     struct qemu_plugin_insn *plugin_insn;
 
diff --git a/accel/tcg/translator.c b/accel/tcg/translator.c
index 8591e4b72a..88f9ac62a3 100644
--- a/accel/tcg/translator.c
+++ b/accel/tcg/translator.c
@@ -17,6 +17,7 @@ 
 #include "exec/gen-icount.h"
 #include "exec/log.h"
 #include "exec/translator.h"
+#include "exec/plugin-gen.h"
 
 /* Pairs with tcg_clear_temp_count.
    To be called by #TranslatorOps.{translate_insn,tb_stop} if
@@ -35,6 +36,21 @@  void translator_loop(const TranslatorOps *ops, DisasContextBase *db,
                      CPUState *cpu, TranslationBlock *tb)
 {
     int bp_insn = 0;
+    int insn_idx = 0;
+    bool tb_trans_cb = false;
+    bool first_pass = true; /* second pass otherwise */
+    void *saved_dc = g_alloca(ops->ctx_size);
+    /* tb->plugin_mask is a u32 */
+    unsigned long plugin_mask = tb->plugin_mask;
+    struct qemu_plugin_tb *plugin_tb = &tcg_ctx->plugin_tb;
+
+    if (test_bit(QEMU_PLUGIN_EV_VCPU_TB_TRANS, &plugin_mask)) {
+        tb_trans_cb = true;
+        plugin_tb->cbs.n = 0;
+        plugin_tb->n = 0;
+        plugin_tb->vaddr = tb->pc;
+        tcg_ctx->plugin_mem_cb = NULL;
+    }
 
     /* Initialize DisasContext */
     db->tb = tb;
@@ -56,6 +72,21 @@  void translator_loop(const TranslatorOps *ops, DisasContextBase *db,
         db->max_insns = 1;
     }
 
+ translate:
+    tcg_func_start(tcg_ctx);
+
+    /* See the "2-pass translation" comment below */
+    if (tb_trans_cb) {
+        void *dc = db;
+
+        dc -= ops->ctx_base_offset;
+        if (first_pass) {
+            memcpy(saved_dc, dc, ops->ctx_size);
+        } else {
+            memcpy(dc, saved_dc, ops->ctx_size);
+        }
+    }
+
     ops->init_disas_context(db, cpu);
     tcg_debug_assert(db->is_jmp == DISAS_NEXT);  /* no early exit */
 
@@ -67,7 +98,53 @@  void translator_loop(const TranslatorOps *ops, DisasContextBase *db,
     ops->tb_start(db, cpu);
     tcg_debug_assert(db->is_jmp == DISAS_NEXT);  /* no early exit */
 
+    if (!first_pass && plugin_tb->cbs.n) {
+        qemu_plugin_gen_vcpu_udata_callbacks(&plugin_tb->cbs);
+    }
+
     while (true) {
+        struct qemu_plugin_insn *plugin_insn = NULL;
+        bool mem_helpers = false;
+
+        /*
+         * 2-pass translation.
+         *
+         * In the first pass we fully determine the TB.
+         * If no plugins have subscribed to TB translation events, we're done.
+         *
+         * If they have, we first share with plugins a TB descriptor so
+         * that plugins can subscribe to instruction-related events, e.g.
+         * memory accesses of particular instructions, or TB execution.
+         * With this info, which is kept in plugin_tb, we then do a second pass,
+         * inserting the appropriate instrumentation into the translated TB.
+         *
+         * Since all translation state is kept in DisasContext, we copy it
+         * before the first pass, and restore it before the second.
+         */
+        if (tb_trans_cb) {
+            if (first_pass) {
+                plugin_insn = qemu_plugin_tb_insn_get(plugin_tb);
+                tcg_ctx->plugin_insn = plugin_insn;
+                plugin_insn->vaddr = db->pc_next;
+                g_assert(tcg_ctx->plugin_mem_cb == NULL);
+            } else {
+                struct qemu_plugin_insn *insn = &plugin_tb->insns[insn_idx++];
+
+                tcg_ctx->plugin_insn = NULL;
+                if (unlikely(insn->exec_cbs.n)) {
+                    qemu_plugin_gen_vcpu_udata_callbacks(&insn->exec_cbs);
+                }
+                if (insn->mem_cbs.n) {
+                    tcg_ctx->plugin_mem_cb = &insn->mem_cbs;
+                    if (insn->calls_helpers) {
+                        qemu_plugin_gen_enable_mem_helpers(&insn->mem_cbs);
+                        mem_helpers = true;
+                    }
+                } else {
+                    tcg_ctx->plugin_mem_cb = NULL;
+                }
+            }
+        }
         db->num_insns++;
         ops->insn_start(db, cpu);
         tcg_debug_assert(db->is_jmp == DISAS_NEXT);  /* no early exit */
@@ -101,10 +178,14 @@  void translator_loop(const TranslatorOps *ops, DisasContextBase *db,
             && (tb_cflags(db->tb) & CF_LAST_IO)) {
             /* Accept I/O on the last instruction.  */
             gen_io_start();
-            ops->translate_insn(db, cpu, NULL);
+            ops->translate_insn(db, cpu, plugin_insn);
             gen_io_end();
         } else {
-            ops->translate_insn(db, cpu, NULL);
+            ops->translate_insn(db, cpu, plugin_insn);
+        }
+
+        if (unlikely(mem_helpers)) {
+            qemu_plugin_gen_disable_mem_helpers();
         }
 
         /* Stop translation if translate_insn so indicated.  */
@@ -120,6 +201,12 @@  void translator_loop(const TranslatorOps *ops, DisasContextBase *db,
         }
     }
 
+    if (tb_trans_cb && first_pass) {
+        qemu_plugin_tb_trans_cb(cpu, plugin_tb);
+        first_pass = false;
+        goto translate;
+    }
+
     /* Emit code to exit the TB, as indicated by db->is_jmp.  */
     ops->tb_stop(db, cpu);
     gen_tb_end(db->tb, db->num_insns - bp_insn);
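The save/restore plus goto structure of the patch can be reduced to a toy model: snapshot the context before the first pass and restore it before the second, so both passes walk the same guest instructions. struct ctx, the fixed insn count, and translate() are invented for illustration only:

```c
#include <string.h>

struct ctx {
    int pc;
    int num_insns;
};

/* Returns the number of passes taken, mimicking the first_pass /
 * goto translate control flow in translator_loop() above. */
static int translate(struct ctx *dc, int tb_trans_cb)
{
    struct ctx saved;
    int first_pass = 1;
    int passes = 0;

    memcpy(&saved, dc, sizeof(saved)); /* save before the 1st pass */
restart:
    passes++;
    while (dc->num_insns < 4) { /* pretend to decode 4 insns */
        dc->pc += 4;
        dc->num_insns++;
    }
    if (tb_trans_cb && first_pass) {
        /* a plugin subscribed: rewind the context, translate again */
        first_pass = 0;
        memcpy(dc, &saved, sizeof(*dc));
        goto restart;
    }
    return passes;
}
```

As in the patch, the second pass is taken only when a plugin subscribed to TB translation events; otherwise the single pass is the whole cost.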