[v6,01/22] instrument: Add documentation

Message ID	150529666493.10902.14830445134051381968.stgit@frigg.lan
State	New
Headers	show Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org> From: =?utf-8?b?TGx1w61z?= Vilanova <vilanova@ac.upc.edu> To: qemu-devel@nongnu.org Date: Wed, 13 Sep 2017 12:57:45 +0300 Message-Id: <150529666493.10902.14830445134051381968.stgit@frigg.lan> In-Reply-To: <150529642278.10902.18234057937634437857.stgit@frigg.lan> References: <150529642278.10902.18234057937634437857.stgit@frigg.lan> User-Agent: StGit/0.18 MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Subject: [Qemu-devel] [PATCH v6 01/22] instrument: Add documentation Precedence: list Cc: "Emilio G. Cota" <cota@braap.org>, Markus Armbruster <armbru@redhat.com>, Stefan Hajnoczi <stefanha@redhat.com>, =?utf-8?q?Llu=C3=ADs_Vilanova?= <vilanova@ac.upc.edu> Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>
Series	instrument: Add basic event instrumentation \| expand [v6,00/22] instrument: Add basic event instrumentation [v6,01/22] instrument: Add documentation [v6,02/22] instrument: Add configure-time flag [v6,03/22] instrument: Add generic library loader [v6,04/22] instrument: [linux-user] Add command line library loader [v6,05/22] instrument: [bsd-user] Add command line library loader [v6,06/22] instrument: [softmmu] Add command line library loader [v6,07/22] instrument: [qapi] Add library loader [v6,08/22] instrument: [hmp] Add library loader [v6,09/22] instrument: Add basic control interface [v6,10/22] instrument: Add support for tracing events [v6,11/22] instrument: Track vCPUs [v6,12/22] instrument: Add event 'guest_cpu_enter' [v6,13/22] instrument: Support synchronous modification of vCPU state [v6,14/22] exec: Add function to synchronously flush TB on a stopped vCPU [v6,15/22] instrument: Add event 'guest_cpu_exit' [v6,16/22] instrument: Add event 'guest_cpu_reset' [v6,17/22] trace: Introduce a proper structure to describe memory accesses [v6,18/22] instrument: Add event 'guest_mem_before_trans' [v6,19/22] instrument: Add event 'guest_mem_before_exec' [v6,20/22] instrument: Add event 'guest_user_syscall' [v6,21/22] instrument: Add event 'guest_user_syscall_ret' [v6,22/22] instrument: Add API to manipulate guest memory

Lluís Vilanova Sept. 13, 2017, 9:57 a.m. UTC

Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
---
 MAINTAINERS         |    6 ++
 docs/instrument.txt |  173 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 179 insertions(+)
 create mode 100644 docs/instrument.txt

Peter Maydell Sept. 14, 2017, 2:41 p.m. UTC | #1

On 13 September 2017 at 10:57, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
> Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
> ---
>  MAINTAINERS         |    6 ++
>  docs/instrument.txt |  173 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 179 insertions(+)
>  create mode 100644 docs/instrument.txt
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 36eeb42d19..fb0eaee06a 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1486,6 +1486,12 @@ F: scripts/tracetool/
>  F: docs/tracing.txt
>  T: git git://github.com/stefanha/qemu.git tracing
>
> +Event instrumentation
> +M: Lluís Vilanova <vilanova@ac.upc.edu>
> +M: Stefan Hajnoczi <stefanha@redhat.com>
> +S: Maintained
> +F: docs/instrument.txt
> +
>  TPM
>  S: Orphan
>  F: tpm.c
> diff --git a/docs/instrument.txt b/docs/instrument.txt
> new file mode 100644
> index 0000000000..24a0d21fc7
> --- /dev/null
> +++ b/docs/instrument.txt
> @@ -0,0 +1,173 @@
> += Event instrumentation =
> +
> +== Introduction ==
> +
> +Event instrumentation allows users to execute their own host-native code on a
> +set of pre-defined events provided by QEMU. QEMU also exposes other
> +functionality to peek/poke at the guest state (e.g., memory or registers), as
> +well as interacting with tracing events. For those familiar with the term, this
> +provides dynamic binary instrumentation, works on all QEMU-supported
> +architectures, as well as works in both 'user' (standalone application) and
> +'system' (full-system emulation) modes.
> +
> +Look at the headers installed by QEMU on the "qemu-instr" directory for further
> +information beyond this document.
> +
> +
> +== Loading an instrumentation library ==
> +
> +Instrumentation code can be bundled into a dynamic library, which can be later
> +loaded into QEMU:
> +
> +* Using the command-line "-instr" argument.
> +
> +* Using the "instr-load" and "instr-unload" commands in the HMP and QMP
> +  interfaces.
> +
> +
> +== Example ==
> +
> +1. Configure QEMU with event instrumentation:
> +
> +    # instrument guest_cpu_enter and guest_mem_before
> +    mkdir -p /path/to/qemu-build
> +    cd /path/to/qemu-build
> +    /path/to/qemu-source/configure \
> +      --enable-instrument \
> +      --prefix=/path/to/qemu-install

Ideally instrumentation should be cost-free in the case where
we're not using it, so we can default it to enabled.

> +
> +2. Build and install QEMU:
> +
> +    make install
> +
> +3. Create the "Makefile" to build the instrumentation library:
> +
> +    mkdir -p /tmp/my-instrument
> +
> +    cat > /tmp/my-instrument/Makefile <<EOF
> +    QEMU_PATH=/tmp/qemu-install/
> +
> +    CFLAGS += -g
> +    CFLAGS += -O3
> +    CFLAGS += -Werror -Wall
> +    CFLAGS += -I$(QEMU_PATH)/include

Plugins shouldn't have or need access to all of the QEMU source
tree or its include files. We want to be able to provide them
with one header file which defines all they need (and all they
get), under a suitably non-restrictive license like 2-clause-BSD.

> +
> +    all: libtrace-instrument.la
> +
> +    libtrace-instrument.la: instrument.lo
> +            libtool --mode=link --tag=CC $(CC) -module -rpath /usr/local/lib -o $@ $^

-rpath ?

> +
> +    %.lo: %.c
> +            libtool --mode=compile --tag=CC $(CC) $(CFLAGS) -c $^
> +
> +    clean:
> +            $(RM) -f *.o *.so *.lo
> +            $(RM) -Rf .libs
> +    EOF
> +
> +4. Write your instrumentation library:
> +
> +    cat > /tmp/my-instrument/instrument.c <<EOF
> +    #include <stdio.h>
> +    #include <assert.h>
> +
> +    #include <qemu-instr/control.h>         /* manipulate events */
> +    #include <qemu-instr/trace.h>           /* manipulate tracing */
> +
> +    /* the address for the memory access is not known at translation time */
> +    void guest_mem_before_trans(QICPU vcpu_trans, QITCGv_cpu vcpu_exec,
> +                                QITCGv vaddr, QIMemInfo info)
> +    {
> +        printf("%s: %p %p %p %d %d %d %d\n", __func__, vcpu_trans, vcpu_exec, vaddr,
> +               1 << info.size_shift, info.sign_extend, info.endianness, info.store);
> +        if (info.store) {
> +            /* generate at execution time only for memory writes */
> +            qi_event_gen_guest_mem_before_exec(vcpu_exec, vaddr, info);
> +        }
> +    }
> +
> +    /* called when QEMU executes a memory access */
> +    void guest_mem_before_exec(QICPU vcpu, uint64_t vaddr, QIMemInfo info)
> +    {
> +        if (info.store) {
> +            /* if called by TCG code, we'll only get writes (see above) */
> +            printf("%s: %p %lx %d %d %d %d\n", __func__, vcpu, vaddr,
> +                   1 << info.size_shift, info.sign_extend, info.endianness, info.store);
> +        }
> +    }

This looks like it's exposing too much implementation detail.
We should just provide an API for "hook to be called for
memory writes" which gets all the information when it
is called. I don't think we should expose any kind of
"this hook is called at translation time" at all.

I guess the API docs are in doc comments in a header somewhere,
so I'll go look in the other patches.

thanks
-- PMM

Lluís Vilanova Sept. 15, 2017, 1:39 p.m. UTC | #2

Peter Maydell writes:

> On 13 September 2017 at 10:57, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
>> Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
>> ---
>> MAINTAINERS         |    6 ++
>> docs/instrument.txt |  173 +++++++++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 179 insertions(+)
>> create mode 100644 docs/instrument.txt
>> 
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 36eeb42d19..fb0eaee06a 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -1486,6 +1486,12 @@ F: scripts/tracetool/
>> F: docs/tracing.txt
>> T: git git://github.com/stefanha/qemu.git tracing
>> 
>> +Event instrumentation
>> +M: Lluís Vilanova <vilanova@ac.upc.edu>
>> +M: Stefan Hajnoczi <stefanha@redhat.com>
>> +S: Maintained
>> +F: docs/instrument.txt
>> +
>> TPM
>> S: Orphan
>> F: tpm.c
>> diff --git a/docs/instrument.txt b/docs/instrument.txt
>> new file mode 100644
>> index 0000000000..24a0d21fc7
>> --- /dev/null
>> +++ b/docs/instrument.txt
>> @@ -0,0 +1,173 @@
>> += Event instrumentation =
>> +
>> +== Introduction ==
>> +
>> +Event instrumentation allows users to execute their own host-native code on a
>> +set of pre-defined events provided by QEMU. QEMU also exposes other
>> +functionality to peek/poke at the guest state (e.g., memory or registers), as
>> +well as interacting with tracing events. For those familiar with the term, this
>> +provides dynamic binary instrumentation, works on all QEMU-supported
>> +architectures, as well as works in both 'user' (standalone application) and
>> +'system' (full-system emulation) modes.
>> +
>> +Look at the headers installed by QEMU on the "qemu-instr" directory for further
>> +information beyond this document.
>> +
>> +
>> +== Loading an instrumentation library ==
>> +
>> +Instrumentation code can be bundled into a dynamic library, which can be later
>> +loaded into QEMU:
>> +
>> +* Using the command-line "-instr" argument.
>> +
>> +* Using the "instr-load" and "instr-unload" commands in the HMP and QMP
>> +  interfaces.
>> +
>> +
>> +== Example ==
>> +
>> +1. Configure QEMU with event instrumentation:
>> +
>> +    # instrument guest_cpu_enter and guest_mem_before
>> +    mkdir -p /path/to/qemu-build
>> +    cd /path/to/qemu-build
>> +    /path/to/qemu-source/configure \
>> +      --enable-instrument \
>> +      --prefix=/path/to/qemu-install

> Ideally instrumentation should be cost-free in the case where
> we're not using it, so we can default it to enabled.

I wasn't sure if everyone would want it enabled by default on the first release,
but I can easily change that.


>> +
>> +2. Build and install QEMU:
>> +
>> +    make install
>> +
>> +3. Create the "Makefile" to build the instrumentation library:
>> +
>> +    mkdir -p /tmp/my-instrument
>> +
>> +    cat > /tmp/my-instrument/Makefile <<EOF
>> +    QEMU_PATH=/tmp/qemu-install/
>> +
>> +    CFLAGS += -g
>> +    CFLAGS += -O3
>> +    CFLAGS += -Werror -Wall
>> +    CFLAGS += -I$(QEMU_PATH)/include

> Plugins shouldn't have or need access to all of the QEMU source
> tree or its include files. We want to be able to provide them
> with one header file which defines all they need (and all they
> get), under a suitably non-restrictive license like 2-clause-BSD.

Variable QEMU_PATH refers to the *installation* path of QEMU.

I can change the API headers to use some other license.


>> +
>> +    all: libtrace-instrument.la
>> +
>> +    libtrace-instrument.la: instrument.lo
>> +            libtool --mode=link --tag=CC $(CC) -module -rpath /usr/local/lib -o $@ $^

> -rpath ?

I couldn't make libtool to generate a .so file without it. I can change the
example to directly use gcc instead of libtool.


>> +
>> +    %.lo: %.c
>> +            libtool --mode=compile --tag=CC $(CC) $(CFLAGS) -c $^
>> +
>> +    clean:
>> +            $(RM) -f *.o *.so *.lo
>> +            $(RM) -Rf .libs
>> +    EOF
>> +
>> +4. Write your instrumentation library:
>> +
>> +    cat > /tmp/my-instrument/instrument.c <<EOF
>> +    #include <stdio.h>
>> +    #include <assert.h>
>> +
>> +    #include <qemu-instr/control.h>         /* manipulate events */
>> +    #include <qemu-instr/trace.h>           /* manipulate tracing */
>> +
>> +    /* the address for the memory access is not known at translation time */
>> +    void guest_mem_before_trans(QICPU vcpu_trans, QITCGv_cpu vcpu_exec,
>> +                                QITCGv vaddr, QIMemInfo info)
>> +    {
>> +        printf("%s: %p %p %p %d %d %d %d\n", __func__, vcpu_trans, vcpu_exec, vaddr,
>> +               1 << info.size_shift, info.sign_extend, info.endianness, info.store);
>> +        if (info.store) {
>> +            /* generate at execution time only for memory writes */
>> +            qi_event_gen_guest_mem_before_exec(vcpu_exec, vaddr, info);
>> +        }
>> +    }
>> +
>> +    /* called when QEMU executes a memory access */
>> +    void guest_mem_before_exec(QICPU vcpu, uint64_t vaddr, QIMemInfo info)
>> +    {
>> +        if (info.store) {
>> +            /* if called by TCG code, we'll only get writes (see above) */
>> +            printf("%s: %p %lx %d %d %d %d\n", __func__, vcpu, vaddr,
>> +                   1 << info.size_shift, info.sign_extend, info.endianness, info.store);
>> +        }
>> +    }

> This looks like it's exposing too much implementation detail.
> We should just provide an API for "hook to be called for
> memory writes" which gets all the information when it
> is called. I don't think we should expose any kind of
> "this hook is called at translation time" at all.

The differentiation between translation-time and execution-time is key to
perform certain analysis efficiently.

The simplest example is probably a BBL trace (e.g., as used by SimPoint
[1]). For each BBL (or, rather, TB) that you translate, you collect what
instructions are in it (addresses, length, opcodes, etc). Then, at execution
time you only need to collect the starting address of each executed BBL/TB,
since its information was already collected before at translation time
(therefore making it more efficient).

[1] https://cseweb.ucsd.edu/~calder/simpoint/


> I guess the API docs are in doc comments in a header somewhere,
> so I'll go look in the other patches.

Yes, you can look at headers in instrument/qemu-instr/.


Thanks,
  Lluis

Stefan Hajnoczi Sept. 18, 2017, 2:33 p.m. UTC | #3

On Wed, Sep 13, 2017 at 12:57:45PM +0300, Lluís Vilanova wrote:
> +    /* mandatory initialization function */
> +    int main(int argc, const char **argv)

Most shared library plugin interfaces I have seen do not use "main()" as
the entry point.  Instead they use a unique name that allows the host
application to differentiate between share library files that are valid
plugins (e.g. "qemu_instr_init") and non-plugin shared libraries.

Stable plugin APIs usually have a versioning or feature detection
scheme.  Versioning is simple: the host application passes a major/minor
version number to the init function.

Of course the dynamic linker already enforces compatibility somewhat: if
a plugin uses a newer API than available in the host application then
there will be a missing symbol/linker error.

So what versioning strategy should we follow?  The simplest would be to
depend 100% on the dynamic linker with no explicit checks inside the
plugin or QEMU.  In that case the API/ABI need to follow some rules like
this (not sure how oudated this information is):
http://plan99.net/~mike/writing-shared-libraries.html

> +    {
> +        int i;
> +        printf("init!\n");
> +        printf("    argc :: %d\n", argc);
> +        for (i = 0; i < argc; i++) {
> +            printf("            -> %s\n", argv[i]);
> +        }
> +    
> +        qi_set_fini(fini, NULL);
> +    
> +        /* instrument and trace events */
> +        QITraceEvent *ev;
> +    
> +        qi_event_set_guest_cpu_enter(guest_cpu_enter);
> +        ev = qi_trace_event_name("guest_cpu_enter");
> +        assert(ev);
> +        qi_trace_event_set_state_dynamic(ev, true);
> +    
> +        qi_event_set_guest_mem_before_trans(guest_mem_before_trans);
> +        ev = qi_trace_event_name("guest_mem_before_trans");
> +        assert(ev);
> +        qi_trace_event_set_state_dynamic(ev, true);
> +    
> +        qi_event_set_guest_mem_before_exec(guest_mem_before_exec);
> +        ev = qi_trace_event_name("guest_mem_before_exec");
> +        assert(ev);
> +        qi_trace_event_set_state_dynamic(ev, true);

Why are trace events being enabled in this example?

I would expect qi_event_set_guest_cpu_enter(guest_cpu_enter) to
immediately enable the callback.  The user shouldn't need to use tracing
to receive callbacks.

qi_event_set_guest_cpu_enter(NULL) should disable the callback.

Stefan

Stefan Hajnoczi Sept. 18, 2017, 2:40 p.m. UTC | #4

On Wed, Sep 13, 2017 at 12:57:45PM +0300, Lluís Vilanova wrote:
> +Event instrumentation
> +M: Lluís Vilanova <vilanova@ac.upc.edu>
> +M: Stefan Hajnoczi <stefanha@redhat.com>
> +S: Maintained
> +F: docs/instrument.txt

Thanks for including me but I'm afraid I don't have time to co-maintain
the instrumentation API.

You could maintain it yourself.  Maintainers send signed pull requests
to Peter Maydell to get commits merged into qemu.git/master.

Maybe someone else would like to co-maintain it with you?

Stefan

Peter Maydell Sept. 18, 2017, 2:41 p.m. UTC | #5

On 15 September 2017 at 14:39, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
> Peter Maydell writes:
>> This looks like it's exposing too much implementation detail.
>> We should just provide an API for "hook to be called for
>> memory writes" which gets all the information when it
>> is called. I don't think we should expose any kind of
>> "this hook is called at translation time" at all.
>
> The differentiation between translation-time and execution-time is key to
> perform certain analysis efficiently.

It's also exposing internal QEMU implementation detail.
What if in future we decide to switch from our current
setup to always interpreting guest instructions as a
first pass with JITting done only in the background for
hot code?

Sticking to instrumentation events that correspond exactly to guest
execution events means they won't break or expose internals.

thanks
-- PMM

Lluís Vilanova Sept. 18, 2017, 5:09 p.m. UTC | #6

Peter Maydell writes:

> On 15 September 2017 at 14:39, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
>> Peter Maydell writes:
>>> This looks like it's exposing too much implementation detail.
>>> We should just provide an API for "hook to be called for
>>> memory writes" which gets all the information when it
>>> is called. I don't think we should expose any kind of
>>> "this hook is called at translation time" at all.
>> 
>> The differentiation between translation-time and execution-time is key to
>> perform certain analysis efficiently.

> It's also exposing internal QEMU implementation detail.
> What if in future we decide to switch from our current
> setup to always interpreting guest instructions as a
> first pass with JITting done only in the background for
> hot code?

TCI still has a separation of translation-time (translate.c) and execution-time
(interpreting the TCG opcodes), and I don't think that's gonna go away anytime
soon.

Even if it did, I think there still will be a translation/execution separation
easy enough to hook into (even if it's a "fake" one for the cold-path
interpreted instructions).

> Sticking to instrumentation events that correspond exactly to guest
> execution events means they won't break or expose internals.

It also means we won't be able to "conditionally" instrument instructions (e.g.,
based on their opcode, address range, etc.).

Of course we can add the translation/execution differentiation later if we find
it necessary for performance, but I would rather avoid leaving "historical"
instrumentation points behind on the API.

What are the use-cases you're aiming for?

Cheers!
  Lluis

Peter Maydell Sept. 18, 2017, 5:42 p.m. UTC | #7

On 18 September 2017 at 18:09, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
> Peter Maydell writes:
>> It's also exposing internal QEMU implementation detail.
>> What if in future we decide to switch from our current
>> setup to always interpreting guest instructions as a
>> first pass with JITting done only in the background for
>> hot code?
>
> TCI still has a separation of translation-time (translate.c) and execution-time
> (interpreting the TCG opcodes), and I don't think that's gonna go away anytime
> soon.

I didn't mean TCI, which is nothing like what you'd use for
this if you did it (TCI is slower than just JITting.)

> Even if it did, I think there still will be a translation/execution separation
> easy enough to hook into (even if it's a "fake" one for the cold-path
> interpreted instructions).

But what would it mean? You don't have basic blocks any more.

>> Sticking to instrumentation events that correspond exactly to guest
>> execution events means they won't break or expose internals.
>
> It also means we won't be able to "conditionally" instrument instructions (e.g.,
> based on their opcode, address range, etc.).

You can still do that, it's just less efficient (your
condition-check happens in the callout to the instrumentation
plugin). We can add "filter" options later if we need them
(which I would rather do than have translate-time callbacks).

> Of course we can add the translation/execution differentiation later if we find
> it necessary for performance, but I would rather avoid leaving "historical"
> instrumentation points behind on the API.
>
> What are the use-cases you're aiming for?

* I want to be able to point the small stream of people who come
into qemu-devel asking "how do I trace all my guest's memory
accesses" at a clean API for it.

* I want to be able to have less ugly and confusing tracing
than our current -d output (and perhaps emit tracing in formats
that other analysis tools want as input)

* I want to keep this initial tracing API simple enough that
we can agree on it and get a first working useful version.

thanks
-- PMM

Peter Maydell Sept. 19, 2017, 1:09 p.m. UTC | #8

On 18 September 2017 at 18:09, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
> TCI still has a separation of translation-time (translate.c) and execution-time
> (interpreting the TCG opcodes), and I don't think that's gonna go away anytime
> soon.
>
> Even if it did, I think there still will be a translation/execution separation
> easy enough to hook into (even if it's a "fake" one for the cold-path
> interpreted instructions).

As a slightly more immediate and practical example, I'm currently
implementing the v8M "SG" instruction. This is a somewhat weird
corner-case of an instruction (it's the only instruction you can
execute in non-secure state from a code region that's secure).
I'm implementing it in the exception-handling code path: if we
detect "NS execute from S memory" we throw a QEMU internal exception,
and in the cpu_do_interrupt code we either (a) identify that this
is the SG instruction and execute it or (b) generate the right guest
CPU exception.

That's definitely executing an instruction, and there's no translation
time for it...

thanks
-- PMM

Emilio Cota Sept. 19, 2017, 1:50 p.m. UTC | #9

On Mon, Sep 18, 2017 at 18:42:55 +0100, Peter Maydell wrote:
> On 18 September 2017 at 18:09, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
> > It also means we won't be able to "conditionally" instrument instructions (e.g.,
> > based on their opcode, address range, etc.).
> 
> You can still do that, it's just less efficient (your
> condition-check happens in the callout to the instrumentation
> plugin). We can add "filter" options later if we need them
> (which I would rather do than have translate-time callbacks).

My initial reaction to not having translation-time hooks was that it'd
be too slow. However, thinking about it a bit more I suspect it might
not matter much. Other tools such as Pin/DynamoRIO have these hooks,
and in their case it makes sense because one can choose to, instead of
having one callback per executed instruction, get a callback per
executed "trace" (a trace is a chain of TBs). Crucially, these tools do
not go through an intermediate IR, and as a result are about 10X faster
than QEMU.

Therefore, given that QEMU is comparatively slow, we might find that
per-executed-instruction callbacks do not end up slowing things down
significantly.

I like the idea of discussing an API alone, but there is value in having
an implementation to go with it, so that we can explore the perf
trade-offs involved. I was busy with other things the last 10 days or so,
but by the end of this week I hope I'll be able to share some
perf numbers on the per-TB vs. per-instruction callback issue.

Thanks,

		Emilio

Lluís Vilanova Sept. 25, 2017, 6:03 p.m. UTC | #10

First, sorry for the late response; I was away for a few days.

Peter Maydell writes:

> On 18 September 2017 at 18:09, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
>> Peter Maydell writes:
>>> It's also exposing internal QEMU implementation detail.
>>> What if in future we decide to switch from our current
>>> setup to always interpreting guest instructions as a
>>> first pass with JITting done only in the background for
>>> hot code?
>> 
>> TCI still has a separation of translation-time (translate.c) and execution-time
>> (interpreting the TCG opcodes), and I don't think that's gonna go away anytime
>> soon.

> I didn't mean TCI, which is nothing like what you'd use for
> this if you did it (TCI is slower than just JITting.)

My point is that even on the cold path you need to decode a guest instruction
(equivalent to translating) and emulate it on the spot (equivalent to
executing).

>> Even if it did, I think there still will be a translation/execution separation
>> easy enough to hook into (even if it's a "fake" one for the cold-path
>> interpreted instructions).

> But what would it mean? You don't have basic blocks any more.

Every instruction emulated on the spot can be seen as a newly translated block
(of one instruction only), which is executed immediately after.

>>> Sticking to instrumentation events that correspond exactly to guest
>>> execution events means they won't break or expose internals.
>> 
>> It also means we won't be able to "conditionally" instrument instructions (e.g.,
>> based on their opcode, address range, etc.).

> You can still do that, it's just less efficient (your
> condition-check happens in the callout to the instrumentation
> plugin). We can add "filter" options later if we need them
> (which I would rather do than have translate-time callbacks).

Before answering, a short summary of when knowing about translate/execute makes
a difference:

* Record some information only once when an instruction is translated, instead
  of recording it on every executed instruction (e.g., a study of opcode
  distribution, which you can get from a file of per-TB opcodes - generated at
  translation time - and a list of executed TBs - generated at execution time
  -). The translate/execute separation makes this run faster *and* produces much
  smaller files with the recorded info.

  Other typical examples that benefit from this are writing a simulator that
  feeds off a stream of instruction information (a common reason why people want
  to trace memory accesses and information of executed instructions).

* Conditionally instrumenting instructions.

Adding filtering to the instrumentation API would only solve the second point,
but not the first one.

Now, do we need/want to support the first point?

>> Of course we can add the translation/execution differentiation later if we find
>> it necessary for performance, but I would rather avoid leaving "historical"
>> instrumentation points behind on the API.
>> 
>> What are the use-cases you're aiming for?

> * I want to be able to point the small stream of people who come
> into qemu-devel asking "how do I trace all my guest's memory
> accesses" at a clean API for it.

> * I want to be able to have less ugly and confusing tracing
> than our current -d output (and perhaps emit tracing in formats
> that other analysis tools want as input)

> * I want to keep this initial tracing API simple enough that
> we can agree on it and get a first working useful version.

Fair enough.

I know it's not exactly the same we're discussing, but the plot in [1] compares
a few different ways to trace memory accesses on SPEC benchmarks:

* First bar is using a Intel's tool called PIN [2].
* Second is calling into an instrumentation function on every executed memory
  access in QEMU.
* Third is embedding the hot path of writing the memory access info to an array
  into the TCG opcode stream (more or less equivalent to supporting filtering;
  when the array is full, a user's callback is called - cold path -)
* Fourth bar can be ignored.

This was working on a much older version of instrumentation for QEMU, but I can
implement something that does the first use-case point above and some filtering
example (second use-case point) to see what's the performance difference.

[1] https://filetea.me/n3wy9WwyCCZR72E9OWXHArHDw
[2] https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool

Thanks!
  Lluis

Emilio Cota Sept. 25, 2017, 7:42 p.m. UTC | #11

On Mon, Sep 25, 2017 at 21:03:39 +0300, Lluís Vilanova wrote:
> I know it's not exactly the same we're discussing, but the plot in [1] compares
> a few different ways to trace memory accesses on SPEC benchmarks:
> 
> * First bar is using a Intel's tool called PIN [2].
> * Second is calling into an instrumentation function on every executed memory
>   access in QEMU.
> * Third is embedding the hot path of writing the memory access info to an array
>   into the TCG opcode stream (more or less equivalent to supporting filtering;
>   when the array is full, a user's callback is called - cold path -)
> * Fourth bar can be ignored.
> 
> This was working on a much older version of instrumentation for QEMU, but I can
> implement something that does the first use-case point above and some filtering
> example (second use-case point) to see what's the performance difference.
> 
> [1] https://filetea.me/n3wy9WwyCCZR72E9OWXHArHDw

Interesting! Unfortunately, this URL gives me a 404.

		E.

Lluís Vilanova Sept. 26, 2017, 4:49 p.m. UTC | #12

Emilio G Cota writes:

> On Mon, Sep 25, 2017 at 21:03:39 +0300, Lluís Vilanova wrote:
>> I know it's not exactly the same we're discussing, but the plot in [1] compares
>> a few different ways to trace memory accesses on SPEC benchmarks:
>> 
>> * First bar is using a Intel's tool called PIN [2].
>> * Second is calling into an instrumentation function on every executed memory
>> access in QEMU.
>> * Third is embedding the hot path of writing the memory access info to an array
>> into the TCG opcode stream (more or less equivalent to supporting filtering;
>> when the array is full, a user's callback is called - cold path -)
>> * Fourth bar can be ignored.
>> 
>> This was working on a much older version of instrumentation for QEMU, but I can
>> implement something that does the first use-case point above and some filtering
>> example (second use-case point) to see what's the performance difference.
>> 
>> [1] https://filetea.me/n3wy9WwyCCZR72E9OWXHArHDw

> Interesting! Unfortunately, this URL gives me a 404.

Ok, I've uploade it somewhere else:

  https://people.gso.ac.upc.edu/vilanova/mtrace.pdf

There's also another one that simply counts the number of memory accesses, using
the same three approaches:

  https://people.gso.ac.upc.edu/vilanova/mcount.pdf

Cheers,
  Lluis

Lluís Vilanova Oct. 4, 2017, 11:28 p.m. UTC | #13

Emilio G Cota writes:

> On Sat, Sep 30, 2017 at 00:46:33 +0300, Lluís Vilanova wrote:
>> Emilio G Cota writes:
>> > I'm not sure I understand this concept of filtering. Are you saying that in
>> > the first case, all memory accesses are instrumented, and then in the
>> > "access helper" we only call the user's callback if it's a memory write?
>> > And in the second case, we simply just generate a "write helper" instead
>> > of an "access helper". Am I understanding this correctly?
>> 
>> In the previous case (no filtering), the user callback is always called when a
>> memory access is *executed*, and the user then checks if the access mode is a
>> write to decide whether to increment a counter.
>> 
>> In this case (with filtering), a user callback is called when a memory access is
>> *translated*, and if the access mode is a write, the user generates a call to a
>> second callback that is executed every time a memory access is executed (only
>> that it is only generated for memory writes, the ones we care about).
>> 
>> Is this clearer?

> I get it now, thanks!

>> > FWIW my experiments so far show similar numbers for instrumenting each
>> > instruction (haven't done the per-tb yet). The difference is that I'm
>> > exposing to instrumenters a copy of the guest instructions (const void *data,
>> > size_t size). These copies are kept around until TB's are flushed.
>> > Luckily there seems to be very little overhead in keeping these around,
>> > apart from the memory overhead -- but in terms of performance, the
>> > necessary allocations do not induce significant overhead.
>> 
>> To keep this use-case simpler, I added the memory access API I posted in this
>> series, where instrumenters can read guest memory (more general than passing a
>> copy of the current instruction).

> I see some potential problems with this:
> 1. Instrumenters' accesses could generate exceptions. I presume we'd want to avoid
>    this, or leave it as a debug-only kind of option.

The API takes care of telling you if the access could be performed
successfully. If you access the instruction's memory representation at
translation time, it should be able to perform the access, since QEMU's
translation loop just had to do so in order to access that instruction (I should
check what happens in the corner case where another guest CPU changes the page
table, since I'm not sure if the address translation functions I'm using in QEMU
will use the per-vCPU TLB cache or always traverse the page table).


> 2. Instrumenters won't know where the end of an instruction (for variable-length
>   ISAs) or of a TB is (TB != basic block). For instructions one could have a loop
>   where we read byte-by-byte and pass it to the decoder, something similar to
>   what we have in the capstone code recently posted to the list (v4). For TBs,
>   we really should have a way to delimit the length of the TB. This is further
>   complicated if we want instrumentation to be inserted *before* a TB is
>   translated.

> Some thoughts on the latter problem: if we want a tb_trans_pre callback, like
> Pin/DynamoRIO provide, instead of doing two passes (one to delimit the TB and
> call the tb_trans_pre callback, to then generate the translated TB), we could:
>   - have a tb_trans_pre callback. This callback inserts an exec-time callback
>     with a user-defined pointer (let's call it **tb_info). The callback has
>     no arguments, perhaps just the pc.
>   - have a tb_trans_post callback. This one passes a copy of the guest
>     instructions. The instrumenter then can allocate whatever data structure
>     to represent the TB (*tb_info), and copies this pointer to **tb_info, so
>     that at execution time, we can obtain tb_info _before_ the TB is executed.
>     After the callback returns, the copy of the guest instructions can be freed.
>   This has two disadvantages:
>   - We have an extra dereference to find tb_info
>   - If it turns out that the TB should not be instrumented, we have generated
>     a callback for nothing.

That's precisely one of the reasons why I proposed adding instrumentation points
before and after events happen (e.g., instrument right after translating an
instruction, where you know its size).

What you propose is actually a broader issue, how to allow instrumentors to pass
their own data to execution-time functions "after the fact". For this, I
implemented "promises", a kind of generalization of what gen_icount() does (you
pass a value to the execution-time callback that is computed later during
translation-time).


Cheers,
  Lluis

Emilio Cota Oct. 5, 2017, 12:50 a.m. UTC | #14

On Thu, Oct 05, 2017 at 02:28:12 +0300, Lluís Vilanova wrote:
> Emilio G Cota writes:
> > I see some potential problems with this:
> > 1. Instrumenters' accesses could generate exceptions. I presume we'd want to avoid
> >    this, or leave it as a debug-only kind of option.
> 
> The API takes care of telling you if the access could be performed
> successfully. If you access the instruction's memory representation at
> translation time, it should be able to perform the access, since QEMU's
> translation loop just had to do so in order to access that instruction (I should
> check what happens in the corner case where another guest CPU changes the page
> table, since I'm not sure if the address translation functions I'm using in QEMU
> will use the per-vCPU TLB cache or always traverse the page table).

That was my concern, I'd rather just perform the read once, that is, the read(s)
done by ops->insn_translate.

> > 2. Instrumenters won't know where the end of an instruction (for variable-length
> >   ISAs) or of a TB is (TB != basic block). For instructions one could have a loop
> >   where we read byte-by-byte and pass it to the decoder, something similar to
> >   what we have in the capstone code recently posted to the list (v4). For TBs,
> >   we really should have a way to delimit the length of the TB. This is further
> >   complicated if we want instrumentation to be inserted *before* a TB is
> >   translated.
> 
> > Some thoughts on the latter problem: if we want a tb_trans_pre callback, like
> > Pin/DynamoRIO provide, instead of doing two passes (one to delimit the TB and
> > call the tb_trans_pre callback, to then generate the translated TB), we could:
> >   - have a tb_trans_pre callback. This callback inserts an exec-time callback
> >     with a user-defined pointer (let's call it **tb_info). The callback has
> >     no arguments, perhaps just the pc.
> >   - have a tb_trans_post callback. This one passes a copy of the guest
> >     instructions. The instrumenter then can allocate whatever data structure
> >     to represent the TB (*tb_info), and copies this pointer to **tb_info, so
> >     that at execution time, we can obtain tb_info _before_ the TB is executed.
> >     After the callback returns, the copy of the guest instructions can be freed.
> >   This has two disadvantages:
> >   - We have an extra dereference to find tb_info
> >   - If it turns out that the TB should not be instrumented, we have generated
> >     a callback for nothing.
> 
> That's precisely one of the reasons why I proposed adding instrumentation points
> before and after events happen (e.g., instrument right after translating an
> instruction, where you know its size).
> 
> What you propose is actually a broader issue, how to allow instrumentors to pass
> their own data to execution-time functions "after the fact". For this, I
> implemented "promises", a kind of generalization of what gen_icount() does (you
> pass a value to the execution-time callback that is computed later during
> translation-time).

I see. I implemented what I suggested above, i.e. tb_trans_cb
(i.e. post_trans) passes an opaque descriptor of the TB (which can
be iterated over insn by insn) and the return value (void *) of this
cb will be passed by tb_exec_cb (i.e. pre_exec).  Perf-wise this
is pretty OK, turns out even if we don't end up caring about the
TB, the additional per-TB helper (which might not end up calling
a callback) does not introduce significant overhead at execution time.

The major hurdle I found is what to do when removing a plugin,
so that we avoid flushing potentially all translated code. What I ended up
doing is adding a per-TB list of "plugin_tb" descriptors, which
track these user pointers so that (1) each plugin gets the right
pointer, and (2) if n_plugins > 1, we still have a single helper
that dispatches the callbacks instead of n_plugin helpers.

If I understand correctly, with promises we directly generate
a callback, which has the promise(s) as one (or more) of its
arguments. This is neat and very flexible. However, it forces us to
retranslate the TB when the plugin is removed (if we're lazy we could
flush all TBs), and if we have several plugins, we end up with one
helper per callback, instead of a single one.
None of this is a huge deal though, I just think is worth considering.

Also, I'm not sure Peter and others would be happy with allowing
plugin code to generate arbitrary callbacks (IIRC arbitrary code
has already been ruled out). So perhaps a more restrictive option
like what I suggested above would be more palatable.

Cheers,

		Emilio

Lluís Vilanova Oct. 6, 2017, 3:07 p.m. UTC | #15

Emilio G Cota writes:

> On Thu, Oct 05, 2017 at 02:28:12 +0300, Lluís Vilanova wrote:
>> Emilio G Cota writes:
>> > I see some potential problems with this:
>> > 1. Instrumenters' accesses could generate exceptions. I presume we'd want to avoid
>> >    this, or leave it as a debug-only kind of option.
>> 
>> The API takes care of telling you if the access could be performed
>> successfully. If you access the instruction's memory representation at
>> translation time, it should be able to perform the access, since QEMU's
>> translation loop just had to do so in order to access that instruction (I should
>> check what happens in the corner case where another guest CPU changes the page
>> table, since I'm not sure if the address translation functions I'm using in QEMU
>> will use the per-vCPU TLB cache or always traverse the page table).

> That was my concern, I'd rather just perform the read once, that is, the read(s)
> done by ops->insn_translate.

If your concern is on performance, that should not be an issue, since you'd be
using the memory peek functions at translation-time. Furthermore, since others
suggested having memory peek anyway, that's a nicer way (to me) to compose APIs
(and is less complex to implement).


>> > 2. Instrumenters won't know where the end of an instruction (for variable-length
>> >   ISAs) or of a TB is (TB != basic block). For instructions one could have a loop
>> >   where we read byte-by-byte and pass it to the decoder, something similar to
>> >   what we have in the capstone code recently posted to the list (v4). For TBs,
>> >   we really should have a way to delimit the length of the TB. This is further
>> >   complicated if we want instrumentation to be inserted *before* a TB is
>> >   translated.
>> 
>> > Some thoughts on the latter problem: if we want a tb_trans_pre callback, like
>> > Pin/DynamoRIO provide, instead of doing two passes (one to delimit the TB and
>> > call the tb_trans_pre callback, to then generate the translated TB), we could:
>> >   - have a tb_trans_pre callback. This callback inserts an exec-time callback
>> >     with a user-defined pointer (let's call it **tb_info). The callback has
>> >     no arguments, perhaps just the pc.
>> >   - have a tb_trans_post callback. This one passes a copy of the guest
>> >     instructions. The instrumenter then can allocate whatever data structure
>> >     to represent the TB (*tb_info), and copies this pointer to **tb_info, so
>> >     that at execution time, we can obtain tb_info _before_ the TB is executed.
>> >     After the callback returns, the copy of the guest instructions can be freed.
>> >   This has two disadvantages:
>> >   - We have an extra dereference to find tb_info
>> >   - If it turns out that the TB should not be instrumented, we have generated
>> >     a callback for nothing.
>> 
>> That's precisely one of the reasons why I proposed adding instrumentation points
>> before and after events happen (e.g., instrument right after translating an
>> instruction, where you know its size).
>> 
>> What you propose is actually a broader issue, how to allow instrumentors to pass
>> their own data to execution-time functions "after the fact". For this, I
>> implemented "promises", a kind of generalization of what gen_icount() does (you
>> pass a value to the execution-time callback that is computed later during
>> translation-time).

> I see. I implemented what I suggested above, i.e. tb_trans_cb
> (i.e. post_trans) passes an opaque descriptor of the TB (which can
> be iterated over insn by insn) and the return value (void *) of this
> cb will be passed by tb_exec_cb (i.e. pre_exec).  Perf-wise this
> is pretty OK, turns out even if we don't end up caring about the
> TB, the additional per-TB helper (which might not end up calling
> a callback) does not introduce significant overhead at execution time.

So you build this structure after translating the whole TB, and the user can
iterate it to check the translated instructions. This is closer to other
existing tools: you iterate the structure and then decide which/how to
instrument instructions, memory accesses, etc within it.

My only concern is that this is much more complex than the simpler API I propose
(you must build the informational structures, generate calls to every possible
instrumentation call, which will be optimized-out by TCG if the user decides not
to use them, and overall pay in performance for any unused functionality),
whereas your approach can be implemented on top of it.


> The major hurdle I found is what to do when removing a plugin,
> so that we avoid flushing potentially all translated code. What I ended up
> doing is adding a per-TB list of "plugin_tb" descriptors, which
> track these user pointers so that (1) each plugin gets the right
> pointer, and (2) if n_plugins > 1, we still have a single helper
> that dispatches the callbacks instead of n_plugin helpers.

I'm not sure it's worth optimising for the multiple plugin case.


> If I understand correctly, with promises we directly generate
> a callback, which has the promise(s) as one (or more) of its
> arguments. This is neat and very flexible. However, it forces us to
> retranslate the TB when the plugin is removed (if we're lazy we could
> flush all TBs), and if we have several plugins, we end up with one
> helper per callback, instead of a single one.
> None of this is a huge deal though, I just think is worth considering.

Yes, that happens seldomly enough that the flush cost is negligible.

> Also, I'm not sure Peter and others would be happy with allowing
> plugin code to generate arbitrary callbacks (IIRC arbitrary code
> has already been ruled out). So perhaps a more restrictive option
> like what I suggested above would be more palatable.

AFAIU, arbitrary access to QEMU's structures was ruled out, but not generating
calls to arbitrary user functions.


Cheers,
  Lluis

Emilio Cota Oct. 6, 2017, 5:59 p.m. UTC | #16

On Fri, Oct 06, 2017 at 18:07:16 +0300, Lluís Vilanova wrote:
> Emilio G Cota writes:
> > On Thu, Oct 05, 2017 at 02:28:12 +0300, Lluís Vilanova wrote:
> >> The API takes care of telling you if the access could be performed
> >> successfully. If you access the instruction's memory representation at
> >> translation time, it should be able to perform the access, since QEMU's
> >> translation loop just had to do so in order to access that instruction (I should
> >> check what happens in the corner case where another guest CPU changes the page
> >> table, since I'm not sure if the address translation functions I'm using in QEMU
> >> will use the per-vCPU TLB cache or always traverse the page table).
> 
> > That was my concern, I'd rather just perform the read once, that is, the read(s)
> > done by ops->insn_translate.
> 
> If your concern is on performance, that should not be an issue, since you'd be
> using the memory peek functions at translation-time. Furthermore, since others
> suggested having memory peek anyway, that's a nicer way (to me) to compose APIs
> (and is less complex to implement).

My concern was the same as yours, correctness -- what happens if something
changes between the two reads? Because the two reads should always return
the same thing.

> > I see. I implemented what I suggested above, i.e. tb_trans_cb
> > (i.e. post_trans) passes an opaque descriptor of the TB (which can
> > be iterated over insn by insn) and the return value (void *) of this
> > cb will be passed by tb_exec_cb (i.e. pre_exec).  Perf-wise this
> > is pretty OK, turns out even if we don't end up caring about the
> > TB, the additional per-TB helper (which might not end up calling
> > a callback) does not introduce significant overhead at execution time.
> 
> So you build this structure after translating the whole TB, and the user can
> iterate it to check the translated instructions. This is closer to other
> existing tools: you iterate the structure and then decide which/how to
> instrument instructions, memory accesses, etc within it.

Correct. I suspect they went with this design because it makes sense to
do this preprocessing once, instead of having each plugin do it
themselves. I'm not sure how much we should care about supporting multiple
plugins, but the impression I get from DynamoRIO is that it seems important
to users.

> My only concern is that this is much more complex than the simpler API I propose
> (you must build the informational structures, generate calls to every possible
> instrumentation call, which will be optimized-out by TCG if the user decides not
> to use them, and overall pay in performance for any unused functionality),
> whereas your approach can be implemented on top of it.

It's pretty simple; tr_ops->translate_insn has to copy each insn.
For instance, on aarch64 (disas_a64 is called from tr_translate_insn):

-static void disas_a64_insn(CPUARMState *env, DisasContext *s)
+static void disas_a64_insn(CPUARMState *env, DisasContext *s, struct qemu_plugin_insn *q_insn)
 {
     uint32_t insn;

     insn = arm_ldl_code(env, s->pc, s->sctlr_b);
+    if (q_insn) {
+        qemu_plugin_insn_append(q_insn, &insn, sizeof(insn));
+    }

It takes some memory though (we duplicate the guest code), but perf-wise this
isn't a big deal (an empty callback on every TB execution incurs only a 10-15%
perf slowdown).

I don't understand the part where you say that the instrumentation call can
be optimized out. Is there a special value of a "TCG promise" (at tb_trans_post
time) that removes the previously generated callback (at tb_trans_pre time)?
Otherwise I don't see how selective TB instrumentation can work at tb_trans_pre
time.

Thanks,

		E.

Lluís Vilanova Oct. 15, 2017, 4:30 p.m. UTC | #17

Emilio G Cota writes:

> On Fri, Oct 06, 2017 at 18:07:16 +0300, Lluís Vilanova wrote:
>> Emilio G Cota writes:
>> > On Thu, Oct 05, 2017 at 02:28:12 +0300, Lluís Vilanova wrote:
>> >> The API takes care of telling you if the access could be performed
>> >> successfully. If you access the instruction's memory representation at
>> >> translation time, it should be able to perform the access, since QEMU's
>> >> translation loop just had to do so in order to access that instruction (I should
>> >> check what happens in the corner case where another guest CPU changes the page
>> >> table, since I'm not sure if the address translation functions I'm using in QEMU
>> >> will use the per-vCPU TLB cache or always traverse the page table).
>> 
>> > That was my concern, I'd rather just perform the read once, that is, the read(s)
>> > done by ops->insn_translate.
>> 
>> If your concern is on performance, that should not be an issue, since you'd be
>> using the memory peek functions at translation-time. Furthermore, since others
>> suggested having memory peek anyway, that's a nicer way (to me) to compose APIs
>> (and is less complex to implement).

> My concern was the same as yours, correctness -- what happens if something
> changes between the two reads? Because the two reads should always return
> the same thing.

Thinking about it, shouldn't this always be the same given QEMU's TLB/page table
consistency assurances? Otherwise, QEMU could read bytes from different physical
pages while translating an instruction from the same virtual page.

Therefore, this leads me to believe it is safe to use the memory read operations
during translation to let instrumentation libraries know what exactly they are
dealing with.



>> > I see. I implemented what I suggested above, i.e. tb_trans_cb
>> > (i.e. post_trans) passes an opaque descriptor of the TB (which can
>> > be iterated over insn by insn) and the return value (void *) of this
>> > cb will be passed by tb_exec_cb (i.e. pre_exec).  Perf-wise this
>> > is pretty OK, turns out even if we don't end up caring about the
>> > TB, the additional per-TB helper (which might not end up calling
>> > a callback) does not introduce significant overhead at execution time.
>> 
>> So you build this structure after translating the whole TB, and the user can
>> iterate it to check the translated instructions. This is closer to other
>> existing tools: you iterate the structure and then decide which/how to
>> instrument instructions, memory accesses, etc within it.

> Correct. I suspect they went with this design because it makes sense to
> do this preprocessing once, instead of having each plugin do it
> themselves. I'm not sure how much we should care about supporting multiple
> plugins, but the impression I get from DynamoRIO is that it seems important
> to users.

If that can be built into a "helper" instrumentation library / header that
others can use, I would rather keep this functionality outside QEMU.


>> My only concern is that this is much more complex than the simpler API I propose
>> (you must build the informational structures, generate calls to every possible
>> instrumentation call, which will be optimized-out by TCG if the user decides not
>> to use them, and overall pay in performance for any unused functionality),
>> whereas your approach can be implemented on top of it.

> It's pretty simple; tr_ops->translate_insn has to copy each insn.
> For instance, on aarch64 (disas_a64 is called from tr_translate_insn):

> -static void disas_a64_insn(CPUARMState *env, DisasContext *s)
> +static void disas_a64_insn(CPUARMState *env, DisasContext *s, struct qemu_plugin_insn *q_insn)
>  {
>      uint32_t insn;

>      insn = arm_ldl_code(env, s->pc, s->sctlr_b);
> +    if (q_insn) {
> +        qemu_plugin_insn_append(q_insn, &insn, sizeof(insn));
> +    }

> It takes some memory though (we duplicate the guest code), but perf-wise this
> isn't a big deal (an empty callback on every TB execution incurs only a 10-15%
> perf slowdown).

> I don't understand the part where you say that the instrumentation call can
> be optimized out. Is there a special value of a "TCG promise" (at tb_trans_post
> time) that removes the previously generated callback (at tb_trans_pre time)?
> Otherwise I don't see how selective TB instrumentation can work at tb_trans_pre
> time.

With the approach you propose, the instrumentation library is only called at
translation-time after the whole TB has been generated. If the instrumentor now
decides to instrument each instruction, this means QEMU must inject an
instrumentation call to every instruction while translating the TB *just in
case* the instrumentor will need it later. In case the instrumentor decides some
instructions don't need to be instrumented, now QEMU needs to eliminate them
from the TB before generating the host code (in order to eliminate unnecessary
overheads).

Hope it's clearer now.


Thanks!
  Lluis

Peter Maydell Oct. 15, 2017, 4:47 p.m. UTC | #18

On 15 October 2017 at 17:30, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
> Thinking about it, shouldn't this always be the same given QEMU's TLB/page table
> consistency assurances?

What TLB/page table consistency assurances? For ARM at least
we will only update (ie flush) the TLB when the guest next
executes a relevant TLB maintenance instruction. So a
misbehaving guest can set things up so the page table
is completely different from what's in QEMU's TLB if it
wants. This all falls in the realms of architecturally
unpredictable behaviour for the guest -- whether you
want the instrumentation to be confused as well is a
different question...

thanks
-- PMM

Lluís Vilanova Oct. 21, 2017, 2:05 p.m. UTC | #19

Peter Maydell writes:

> On 15 October 2017 at 17:30, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
>> Thinking about it, shouldn't this always be the same given QEMU's TLB/page table
>> consistency assurances?

> What TLB/page table consistency assurances? For ARM at least
> we will only update (ie flush) the TLB when the guest next
> executes a relevant TLB maintenance instruction. So a
> misbehaving guest can set things up so the page table
> is completely different from what's in QEMU's TLB if it
> wants. This all falls in the realms of architecturally
> unpredictable behaviour for the guest -- whether you
> want the instrumentation to be confused as well is a
> different question...

I meant that if the contents of a virtual memory page change while QEMU is
translating an instruction, it must be able to detect that and act accordingly
for correctness.

Having that in mind, the same should hold true when an instrumentor reads a
page's contents during translation (e.g., to gather information on opcodes).

Cheers,
  Lluis

Peter Maydell Oct. 21, 2017, 4:56 p.m. UTC | #20

On 21 October 2017 at 15:05, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
> Peter Maydell writes:
>
>> On 15 October 2017 at 17:30, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
>>> Thinking about it, shouldn't this always be the same given QEMU's TLB/page table
>>> consistency assurances?
>
>> What TLB/page table consistency assurances? For ARM at least
>> we will only update (ie flush) the TLB when the guest next
>> executes a relevant TLB maintenance instruction. So a
>> misbehaving guest can set things up so the page table
>> is completely different from what's in QEMU's TLB if it
>> wants. This all falls in the realms of architecturally
>> unpredictable behaviour for the guest -- whether you
>> want the instrumentation to be confused as well is a
>> different question...
>
> I meant that if the contents of a virtual memory page change while QEMU is
> translating an instruction, it must be able to detect that and act accordingly
> for correctness.

That's an interesting corner case, actually. Traditionally
it simply couldn't happen because we were strictly single
threaded and so if we were translating then we weren't
running guest code. We did need to handle "writes mean we
must invalidate an already produced translation", but not
"invalidate one we're halfway through and haven't put in
our data structures yet". Did we get that right in the MTTCG
design? How does it work?

(Did we produce a summary of the MTTCG design anywhere?
I didn't follow the development in detail as it was going
on, but it would be useful to understand the final result.)

In any case, the only assurance we provide over QEMU as a
whole is that if the guest writes to a physical address then
we don't keep hold of a now-duff translation for that physaddr.
We don't guarantee the same thing for guest changes of
the vaddr-to-physaddr mapping -- instead we let the target
specific code deal with this by invalidating QEMU's TLB
when the guest code does TLB invalidate ops.

> Having that in mind, the same should hold true when an instrumentor reads a
> page's contents during translation (e.g., to gather information on opcodes).

Basically I don't think we actually have very strong
guarantees here, and that's another reason for not
providing instrumentation callbacks at translate time.

thanks
-- PMM

Alex Bennée Oct. 21, 2017, 5:12 p.m. UTC | #21

Peter Maydell <peter.maydell@linaro.org> writes:

> On 21 October 2017 at 15:05, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
>> Peter Maydell writes:
>>
>>> On 15 October 2017 at 17:30, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
>>>> Thinking about it, shouldn't this always be the same given QEMU's TLB/page table
>>>> consistency assurances?
>>
>>> What TLB/page table consistency assurances? For ARM at least
>>> we will only update (ie flush) the TLB when the guest next
>>> executes a relevant TLB maintenance instruction. So a
>>> misbehaving guest can set things up so the page table
>>> is completely different from what's in QEMU's TLB if it
>>> wants. This all falls in the realms of architecturally
>>> unpredictable behaviour for the guest -- whether you
>>> want the instrumentation to be confused as well is a
>>> different question...
>>
>> I meant that if the contents of a virtual memory page change while QEMU is
>> translating an instruction, it must be able to detect that and act accordingly
>> for correctness.
>
> That's an interesting corner case, actually. Traditionally
> it simply couldn't happen because we were strictly single
> threaded and so if we were translating then we weren't
> running guest code. We did need to handle "writes mean we
> must invalidate an already produced translation", but not
> "invalidate one we're halfway through and haven't put in
> our data structures yet". Did we get that right in the MTTCG
> design? How does it work?

It's currently protected by locks, as you need to grab tb_lock/mmap_lock
to call:

  void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
                                   int is_cpu_write_access)

So no new blocks can be created until you've complete your invalidation
- or you are serialised until the block currently being translated is
completed. At which point the block will be immediately marked as
invalid and not be called again.

>
> (Did we produce a summary of the MTTCG design anywhere?
> I didn't follow the development in detail as it was going
> on, but it would be useful to understand the final result.)

Sure, it's in:

  docs/devel/multi-thread-tcg.txt

>
> In any case, the only assurance we provide over QEMU as a
> whole is that if the guest writes to a physical address then
> we don't keep hold of a now-duff translation for that physaddr.
> We don't guarantee the same thing for guest changes of
> the vaddr-to-physaddr mapping -- instead we let the target
> specific code deal with this by invalidating QEMU's TLB
> when the guest code does TLB invalidate ops.
>
>> Having that in mind, the same should hold true when an instrumentor reads a
>> page's contents during translation (e.g., to gather information on opcodes).
>
> Basically I don't think we actually have very strong
> guarantees here, and that's another reason for not
> providing instrumentation callbacks at translate time.
>
> thanks
> -- PMM


--
Alex Bennée

[v6,01/22] instrument: Add documentation

Commit Message

Comments

Patch