
qemu log function to print out the registers of the guest

Message ID CAMo8BfJZ8=28LP-m_uR1uTL2--yowOxgvdJDx3JzE29VPn+nJg@mail.gmail.com
State New

Commit Message

Max Filippov Aug. 17, 2012, 11:57 a.m. UTC
On Fri, Aug 17, 2012 at 3:14 PM, 陳韋任 (Wei-Ren Chen)
<chenwj@iis.sinica.edu.tw> wrote:
>> > On Thu, Aug 16, 2012 at 7:49 PM, Steven <wangwangkang@gmail.com> wrote:
>> > [...]
>> >> I want to get the guest memory address in the instruction mov
>> >> 0x4(%ebx), %eax, which is 0x4(%ebx).
>> >> Since %ebx is not resolved until the execution time, the code in
>> >> softmmu_header.h does not generate any hit or miss information.
>> >> Do you know any place that I could resolve the memory access address? Thanks.
>> >
>> > You'll have to generate code.  Look at how helpers work.
>> Hi, Laurent,
>> do you mean the target-i386/op_helper.c/helper.c or the tcg helper? Thanks.
>
>   What do you mean by "resolve the memory access address"? You want
> to get the guest virtual address for each guest memory access, right? As Max
> mentioned before (you can also read [1]), there are fast and slow paths
> in QEMU softmmu, for tlb hits and tlb misses respectively. Max provided a patch
> for the slow path. As for the fast path, take a look at tcg_out_tlb_load (tcg
> /i386/tcg-target.c). tcg_out_tlb_load will generate native code in the
> code cache to do the tlb lookup. I think you cannot use the trick Max used,
> since tcg_out_tlb_load will not be called when the fast path is executed,

That's why I've posted the following hunk that should have made all
accesses go via slow path:



> it "generates" code instead. Therefore, you might have to insert your
> instrumentation code in the code cache, perhaps by modifying tcg_out_tlb_load
> to log the value of "addrlo" (see the comments above tcg_out_tlb_load).

Comments

陳韋任 Aug. 19, 2012, 8:33 a.m. UTC | #1
On Fri, Aug 17, 2012 at 03:57:55PM +0400, Max Filippov wrote:
> On Fri, Aug 17, 2012 at 3:14 PM, 陳韋任 (Wei-Ren Chen)
> <chenwj@iis.sinica.edu.tw> wrote:
> >> > On Thu, Aug 16, 2012 at 7:49 PM, Steven <wangwangkang@gmail.com> wrote:
> >> > [...]
> >> >> I want to get the guest memory address in the instruction mov
> >> >> 0x4(%ebx), %eax, which is 0x4(%ebx).
> >> >> Since %ebx is not resolved until the execution time, the code in
> >> >> softmmu_header.h does not generate any hit or miss information.
> >> >> Do you know any place that I could resolve the memory access address? Thanks.
> >> >
> >> > You'll have to generate code.  Look at how helpers work.
> >> Hi, Laurent,
> >> do you mean the target-i386/op_helper.c/helper.c or the tcg helper? Thanks.
> >
> >   What do you mean by "resolve the memory access address"? You want
> > to get the guest virtual address for each guest memory access, right? As Max
> > mentioned before (you can also read [1]), there are fast and slow paths
> > in QEMU softmmu, for tlb hits and tlb misses respectively. Max provided a patch
> > for the slow path. As for the fast path, take a look at tcg_out_tlb_load (tcg
> > /i386/tcg-target.c). tcg_out_tlb_load will generate native code in the
> > code cache to do the tlb lookup. I think you cannot use the trick Max used,
> > since tcg_out_tlb_load will not be called when the fast path is executed,
> 
> That's why I've posted the following hunk that should have made all
> accesses go via slow path:

  Ya, I know. :) Just try to explain what Laurent want to say.

Regards,
chenwj
Steven Aug. 21, 2012, 5:40 a.m. UTC | #2
Hi, Max,
I wrote a small program to verify that your patch could catch all the load
instructions from the guest. However, I found some problems in the
results.

The guest OS and the emulated machine are both 32-bit x86. My simple
program in the guest declares a 1048576-element integer array,
initializes the elements, and loads them in a loop. It looks like this:
          int array[1048576];
          initialize the array;

          /*  region of interests */
          int temp;
          for (i=0; i < 1048576; i++) {
              temp = array[i];
          }
So ideally, the patch should catch the guest virtual addresses accessed in the
loop, right?
          In addition, the virtual addresses of the beginning and end
of the array are 0xbf68b6e0 and 0xbfa8b6e0.
          What I got is as follows:

          __ldl_mmu, vaddr=bf68b6e0
          __ldl_mmu, vaddr=bf68b6e4
          __ldl_mmu, vaddr=bf68b6e8
          .....
          These should be the virtual addresses of the accesses in the above loop. The
results look good because the gap between consecutive vaddrs is 4 bytes, which
is the length of each element.
          However, after a certain address, I got

          __ldl_mmu, vaddr=bf68bffc
          __ldl_mmu, vaddr=bf68c000
          __ldl_mmu, vaddr=bf68d000
          __ldl_mmu, vaddr=bf68e000
          __ldl_mmu, vaddr=bf68f000
          __ldl_mmu, vaddr=bf690000
          __ldl_mmu, vaddr=bf691000
          __ldl_mmu, vaddr=bf692000
          __ldl_mmu, vaddr=bf693000
          __ldl_mmu, vaddr=bf694000
          ...
          __ldl_mmu, vaddr=bf727000
          __ldl_mmu, vaddr=bf728000
          __ldl_mmu, vaddr=bfa89000
          __ldl_mmu, vaddr=bfa8a000
So the rest of the vaddrs I got have a difference of 4096 bytes, instead
of 4. I repeated the experiment several times and got the same
results. Is there anything wrong? Or could you explain this? Thanks.

steven



On Fri, Aug 17, 2012 at 7:57 AM, Max Filippov <jcmvbkbc@gmail.com> wrote:
> On Fri, Aug 17, 2012 at 3:14 PM, 陳韋任 (Wei-Ren Chen)
> <chenwj@iis.sinica.edu.tw> wrote:
>>> > On Thu, Aug 16, 2012 at 7:49 PM, Steven <wangwangkang@gmail.com> wrote:
>>> > [...]
>>> >> I want to get the guest memory address in the instruction mov
>>> >> 0x4(%ebx), %eax, which is 0x4(%ebx).
>>> >> Since %ebx is not resolved until the execution time, the code in
>>> >> softmmu_header.h does not generate any hit or miss information.
>>> >> Do you know any place that I could resolve the memory access address? Thanks.
>>> >
>>> > You'll have to generate code.  Look at how helpers work.
>>> Hi, Laurent,
>>> do you mean the target-i386/op_helper.c/helper.c or the tcg helper? Thanks.
>>
>>   What do you mean by "resolve the memory access address"? You want
>> to get the guest virtual address for each guest memory access, right? As Max
>> mentioned before (you can also read [1]), there are fast and slow paths
>> in QEMU softmmu, for tlb hits and tlb misses respectively. Max provided a patch
>> for the slow path. As for the fast path, take a look at tcg_out_tlb_load (tcg
>> /i386/tcg-target.c). tcg_out_tlb_load will generate native code in the
>> code cache to do the tlb lookup. I think you cannot use the trick Max used,
>> since tcg_out_tlb_load will not be called when the fast path is executed,
>
> That's why I've posted the following hunk that should have made all
> accesses go via slow path:
>
> diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
> index da17bba..ec68c19 100644
> --- a/tcg/i386/tcg-target.c
> +++ b/tcg/i386/tcg-target.c
> @@ -1062,7 +1062,7 @@ static inline void tcg_out_tlb_load(TCGContext
> *s, int addrlo_idx,
>      tcg_out_mov(s, type, r0, addrlo);
>
>      /* jne label1 */
> -    tcg_out8(s, OPC_JCC_short + JCC_JNE);
> +    tcg_out8(s, OPC_JMP_short);
>      label_ptr[0] = s->code_ptr;
>      s->code_ptr++;
>
>
>> it "generates" code instead. Therefore, you might have to insert your
>> instrumentation code in the code cache, perhaps by modifying tcg_out_tlb_load
>> to log the value of "addrlo" (see the comments above tcg_out_tlb_load).
>
> --
> Thanks.
> -- Max
Max Filippov Aug. 21, 2012, 7:18 a.m. UTC | #3
On Tue, Aug 21, 2012 at 9:40 AM, Steven <wangwangkang@gmail.com> wrote:
> Hi, Max,
> I wrote a small program to verify that your patch could catch all the load
> instructions from the guest. However, I found some problems in the
> results.
>
> The guest OS and the emulated machine are both 32-bit x86. My simple
> program in the guest declares a 1048576-element integer array,
> initializes the elements, and loads them in a loop. It looks like this:
>           int array[1048576];
>           initialize the array;
>
>           /*  region of interests */
>           int temp;
>           for (i=0; i < 1048576; i++) {
>               temp = array[i];
>           }
> So ideally, the patch should catch the guest virtual addresses accessed in the
> loop, right?
>           In addition, the virtual addresses of the beginning and end
> of the array are 0xbf68b6e0 and 0xbfa8b6e0.
>           What I got is as follows:
>
>           __ldl_mmu, vaddr=bf68b6e0
>           __ldl_mmu, vaddr=bf68b6e4
>           __ldl_mmu, vaddr=bf68b6e8
>           .....
>           These should be the virtual addresses of the accesses in the above loop. The
> results look good because the gap between consecutive vaddrs is 4 bytes, which
> is the length of each element.
>           However, after a certain address, I got
>
>           __ldl_mmu, vaddr=bf68bffc
>           __ldl_mmu, vaddr=bf68c000
>           __ldl_mmu, vaddr=bf68d000
>           __ldl_mmu, vaddr=bf68e000
>           __ldl_mmu, vaddr=bf68f000
>           __ldl_mmu, vaddr=bf690000
>           __ldl_mmu, vaddr=bf691000
>           __ldl_mmu, vaddr=bf692000
>           __ldl_mmu, vaddr=bf693000
>           __ldl_mmu, vaddr=bf694000
>           ...
>           __ldl_mmu, vaddr=bf727000
>           __ldl_mmu, vaddr=bf728000
>           __ldl_mmu, vaddr=bfa89000
>           __ldl_mmu, vaddr=bfa8a000
> So the rest of the vaddrs I got have a difference of 4096 bytes, instead
> of 4. I repeated the experiment several times and got the same
> results. Is there anything wrong? Or could you explain this? Thanks.

I see two possibilities here:
- maybe there are more fast path shortcuts in the QEMU code?
  in that case output of qemu -d op,out_asm would help.
- maybe your compiler had optimized that sample code?
  could you try to declare array in your sample as 'volatile int'?
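
Following up on the second point: for reference, a minimal self-contained
version of the guest test with the suggested 'volatile' qualifier could look
like the sketch below. The array size and the two loops follow Steven's
description; everything else is illustrative.

    /* Sketch of the test program; the only intended change from the
     * original is the volatile qualifier, which keeps the compiler from
     * optimizing the loads in the second loop away.  The array lives on
     * the stack (about 4 MiB), as in the original test. */
    #include <stdio.h>

    #define N 1048576

    int main(void)
    {
        volatile int array[N];
        int temp = 0;
        int i;

        for (i = 0; i < N; i++) {      /* initialize the array */
            array[i] = i;
        }

        /* region of interest: each iteration must issue a 4-byte load */
        for (i = 0; i < N; i++) {
            temp = array[i];
        }

        printf("%d\n", temp);
        return 0;
    }
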
Max Filippov Aug. 25, 2012, 8:41 p.m. UTC | #4
On Sat, Aug 25, 2012 at 9:20 PM, Steven <wangwangkang@gmail.com> wrote:
> On Tue, Aug 21, 2012 at 3:18 AM, Max Filippov <jcmvbkbc@gmail.com> wrote:
>> On Tue, Aug 21, 2012 at 9:40 AM, Steven <wangwangkang@gmail.com> wrote:
>>> Hi, Max,
>>> I wrote a small program to verify that your patch could catch all the load
>>> instructions from the guest. However, I found some problems in the
>>> results.
>>>
>>> The guest OS and the emulated machine are both 32-bit x86. My simple
>>> program in the guest declares a 1048576-element integer array,
>>> initializes the elements, and loads them in a loop. It looks like this:
>>>           int array[1048576];
>>>           initialize the array;
>>>
>>>           /*  region of interests */
>>>           int temp;
>>>           for (i=0; i < 1048576; i++) {
>>>               temp = array[i];
>>>           }
>>> So ideally, the patch should catch the guest virtual addresses accessed in the
>>> loop, right?
>>>           In addition, the virtual addresses of the beginning and end
>>> of the array are 0xbf68b6e0 and 0xbfa8b6e0.
>>>           What I got is as follows:
>>>
>>>           __ldl_mmu, vaddr=bf68b6e0
>>>           __ldl_mmu, vaddr=bf68b6e4
>>>           __ldl_mmu, vaddr=bf68b6e8
>>>           .....
>>>           These should be the virtual addresses of the accesses in the above loop. The
>>> results look good because the gap between consecutive vaddrs is 4 bytes, which
>>> is the length of each element.
>>>           However, after a certain address, I got
>>>
>>>           __ldl_mmu, vaddr=bf68bffc
>>>           __ldl_mmu, vaddr=bf68c000
>>>           __ldl_mmu, vaddr=bf68d000
>>>           __ldl_mmu, vaddr=bf68e000
>>>           __ldl_mmu, vaddr=bf68f000
>>>           __ldl_mmu, vaddr=bf690000
>>>           __ldl_mmu, vaddr=bf691000
>>>           __ldl_mmu, vaddr=bf692000
>>>           __ldl_mmu, vaddr=bf693000
>>>           __ldl_mmu, vaddr=bf694000
>>>           ...
>>>           __ldl_mmu, vaddr=bf727000
>>>           __ldl_mmu, vaddr=bf728000
>>>           __ldl_mmu, vaddr=bfa89000
>>>           __ldl_mmu, vaddr=bfa8a000
>>> So the rest of the vaddrs I got have a difference of 4096 bytes, instead
>>> of 4. I repeated the experiment several times and got the same
>>> results. Is there anything wrong? Or could you explain this? Thanks.
>>
>> I see two possibilities here:
>> - maybe there are more fast path shortcuts in the QEMU code?
>>   in that case output of qemu -d op,out_asm would help.
>> - maybe your compiler had optimized that sample code?
>>   could you try to declare array in your sample as 'volatile int'?
> After adding the "volatile" qualifier, the results are correct now.
> So your patch can trap all the guest memory data load accesses, no
> matter whether they take the slow path or the fast path.
>
> However, I found something confusing when trying to understand the instruction
> accesses. So I ran the VM with "-d in_asm" to see the program counter of
> each guest instruction. I got
>
> __ldl_cmmu,ffffffff8102ff91
> __ldl_cmmu,ffffffff8102ff9a
> ----------------
> IN:
> 0xffffffff8102ff8a:  mov    0x8(%rbx),%rax
> 0xffffffff8102ff8e:  add    0x790(%rbx),%rax
> 0xffffffff8102ff95:  xor    %edx,%edx
> 0xffffffff8102ff97:  mov    0x858(%rbx),%rcx
> 0xffffffff8102ff9e:  cmp    %rcx,%rax
> 0xffffffff8102ffa1:  je     0xffffffff8102ffb0
> .....
>
> __ldl_cmmu,00000000004005a1
> __ldl_cmmu,00000000004005a6
> ----------------
> IN:
> 0x0000000000400594:  push   %rbp
> 0x0000000000400595:  mov    %rsp,%rbp
> 0x0000000000400598:  sub    $0x20,%rsp
> 0x000000000040059c:  mov    %rdi,-0x18(%rbp)
> 0x00000000004005a0:  mov    $0x1,%edi
> 0x00000000004005a5:  callq  0x4004a0
>
> From the results, I see that the guest virtual address of the pc is
> slightly different between the __ldl_cmmu output and the tb's pc (below IN:).
> Could you help me understand this? Which one is the true pc of the memory
> access? Thanks.

Guest code is accessed at the translation time by C functions and
I guess there are other layers of address translation caching. I wouldn't
try to interpret these _cmmu printouts and would instead instrument
[cpu_]ld{{u,s}{b,w},l,q}_code macros.
Steven Aug. 27, 2012, 4:15 p.m. UTC | #5
On Sat, Aug 25, 2012 at 4:41 PM, Max Filippov <jcmvbkbc@gmail.com> wrote:
> On Sat, Aug 25, 2012 at 9:20 PM, Steven <wangwangkang@gmail.com> wrote:
>> On Tue, Aug 21, 2012 at 3:18 AM, Max Filippov <jcmvbkbc@gmail.com> wrote:
>>> On Tue, Aug 21, 2012 at 9:40 AM, Steven <wangwangkang@gmail.com> wrote:
>>>> Hi, Max,
>>>> I wrote a small program to verify that your patch could catch all the load
>>>> instructions from the guest. However, I found some problems in the
>>>> results.
>>>>
>>>> The guest OS and the emulated machine are both 32-bit x86. My simple
>>>> program in the guest declares a 1048576-element integer array,
>>>> initializes the elements, and loads them in a loop. It looks like this:
>>>>           int array[1048576];
>>>>           initialize the array;
>>>>
>>>>           /*  region of interests */
>>>>           int temp;
>>>>           for (i=0; i < 1048576; i++) {
>>>>               temp = array[i];
>>>>           }
>>>> So ideally, the patch should catch the guest virtual addresses accessed in the
>>>> loop, right?
>>>>           In addition, the virtual addresses of the beginning and end
>>>> of the array are 0xbf68b6e0 and 0xbfa8b6e0.
>>>>           What I got is as follows:
>>>>
>>>>           __ldl_mmu, vaddr=bf68b6e0
>>>>           __ldl_mmu, vaddr=bf68b6e4
>>>>           __ldl_mmu, vaddr=bf68b6e8
>>>>           .....
>>>>           These should be the virtual addresses of the accesses in the above loop. The
>>>> results look good because the gap between consecutive vaddrs is 4 bytes, which
>>>> is the length of each element.
>>>>           However, after a certain address, I got
>>>>
>>>>           __ldl_mmu, vaddr=bf68bffc
>>>>           __ldl_mmu, vaddr=bf68c000
>>>>           __ldl_mmu, vaddr=bf68d000
>>>>           __ldl_mmu, vaddr=bf68e000
>>>>           __ldl_mmu, vaddr=bf68f000
>>>>           __ldl_mmu, vaddr=bf690000
>>>>           __ldl_mmu, vaddr=bf691000
>>>>           __ldl_mmu, vaddr=bf692000
>>>>           __ldl_mmu, vaddr=bf693000
>>>>           __ldl_mmu, vaddr=bf694000
>>>>           ...
>>>>           __ldl_mmu, vaddr=bf727000
>>>>           __ldl_mmu, vaddr=bf728000
>>>>           __ldl_mmu, vaddr=bfa89000
>>>>           __ldl_mmu, vaddr=bfa8a000
>>>> So the rest of the vaddrs I got have a difference of 4096 bytes, instead
>>>> of 4. I repeated the experiment several times and got the same
>>>> results. Is there anything wrong? Or could you explain this? Thanks.
>>>
>>> I see two possibilities here:
>>> - maybe there are more fast path shortcuts in the QEMU code?
>>>   in that case output of qemu -d op,out_asm would help.
>>> - maybe your compiler had optimized that sample code?
>>>   could you try to declare array in your sample as 'volatile int'?
>> After adding the "volatile" qualifier, the results are correct now.
>> So your patch can trap all the guest memory data load accesses, no
>> matter whether they take the slow path or the fast path.
>>
>> However, I found something confusing when trying to understand the instruction
>> accesses. So I ran the VM with "-d in_asm" to see the program counter of
>> each guest instruction. I got
>>
>> __ldl_cmmu,ffffffff8102ff91
>> __ldl_cmmu,ffffffff8102ff9a
>> ----------------
>> IN:
>> 0xffffffff8102ff8a:  mov    0x8(%rbx),%rax
>> 0xffffffff8102ff8e:  add    0x790(%rbx),%rax
>> 0xffffffff8102ff95:  xor    %edx,%edx
>> 0xffffffff8102ff97:  mov    0x858(%rbx),%rcx
>> 0xffffffff8102ff9e:  cmp    %rcx,%rax
>> 0xffffffff8102ffa1:  je     0xffffffff8102ffb0
>> .....
>>
>> __ldl_cmmu,00000000004005a1
>> __ldl_cmmu,00000000004005a6
>> ----------------
>> IN:
>> 0x0000000000400594:  push   %rbp
>> 0x0000000000400595:  mov    %rsp,%rbp
>> 0x0000000000400598:  sub    $0x20,%rsp
>> 0x000000000040059c:  mov    %rdi,-0x18(%rbp)
>> 0x00000000004005a0:  mov    $0x1,%edi
>> 0x00000000004005a5:  callq  0x4004a0
>>
>> From the results, I see that the guest virtual address of the pc is
>> slightly different between the __ldl_cmmu output and the tb's pc (below IN:).
>> Could you help me understand this? Which one is the true pc of the memory
>> access? Thanks.
>
> Guest code is accessed at the translation time by C functions and
> I guess there are other layers of address translation caching. I wouldn't
> try to interpret these _cmmu printouts and would instead instrument
> [cpu_]ld{{u,s}{b,w},l,q}_code macros.
Yes, you are right.
Some ldub_code accesses in the x86 guest do not call __ldq_cmmu when the tlb hits.
By the way, when I used your patch, I saw too many log events for
kernel data _mmu accesses, i.e., the addrs were around
0x7fff ffff ffff. There were so many such mmu events that the user-mode
program could hardly make progress. So I had to set up a condition like
     if (addr < 0x800000000000)
            fprintf(stderr, "%s: %08x\n", __func__, addr);
Then my simple array-access program could finish.
I am wondering whether you have met a similar problem or have
any suggestions on this.
My final  goal is to obtain the memory access trace for a particular
process in the guest, so your patch really helps, except for too many
kernel _mmu events.

steven
>
> --
> Thanks.
> -- Max
陳韋任 Aug. 28, 2012, 3:14 a.m. UTC | #6
> My final  goal is to obtain the memory access trace for a particular
> process in the guest, so your patch really helps, except for too many
> kernel _mmu events.

  How do you know which process the guest is running, and log its memory
access trace?

Regards,
chenwj
Steven Aug. 28, 2012, 3:44 a.m. UTC | #7
I added a special opcode, which is not used by the existing x86 ISA. When the
process in the guest issues this opcode, QEMU starts to log its
mmu accesses.
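
(Purely as an illustration of the idea, not Steven's actual mechanism: a guest
program can emit an otherwise-unused byte sequence with inline asm, and a
modified QEMU can recognize it at translation time. The bytes below are an
arbitrary placeholder, not the opcode he used; outside such a modified QEMU
this would normally raise SIGILL.)

    /* Hypothetical guest-side marker.  The opcode bytes are placeholders
     * that only illustrate how a guest process could signal "start logging
     * my MMU accesses" to a modified QEMU. */
    static inline void start_mmu_trace(void)
    {
        __asm__ __volatile__(".byte 0x0f, 0x3f" ::: "memory");
    }

    /* usage: call start_mmu_trace() right before the region of interest */
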



On Mon, Aug 27, 2012 at 11:14 PM, 陳韋任 (Wei-Ren Chen)
<chenwj@iis.sinica.edu.tw> wrote:
>> My final  goal is to obtain the memory access trace for a particular
>> process in the guest, so your patch really helps, except for too many
>> kernel _mmu events.
>
>   How do you know which process the guest is running, and log its memory
> access trace?
>
> Regards,
> chenwj
>
> --
> Wei-Ren Chen (陳韋任)
> Computer Systems Lab, Institute of Information Science,
> Academia Sinica, Taiwan (R.O.C.)
> Tel:886-2-2788-3799 #1667
> Homepage: http://people.cs.nctu.edu.tw/~chenwj
Max Filippov Aug. 28, 2012, 10:48 a.m. UTC | #8
On Mon, Aug 27, 2012 at 8:15 PM, Steven <wangwangkang@gmail.com> wrote:
>> Guest code is accessed at the translation time by C functions and
>> I guess there are other layers of address translation caching. I wouldn't
>> try to interpret these _cmmu printouts and would instead instrument
>> [cpu_]ld{{u,s}{b,w},l,q}_code macros.
> Yes, you are right.
> Some ldub_code accesses in the x86 guest do not call __ldq_cmmu when the tlb hits.
> By the way, when I used your patch, I saw too many log events for
> kernel data _mmu accesses, i.e., the addrs were around
> 0x7fff ffff ffff. There were so many such mmu events that the user-mode
> program could hardly make progress. So I had to set up a condition like
>      if (addr < 0x800000000000)
>             fprintf(stderr, "%s: %08x\n", __func__, addr);
> Then my simple array-access program could finish.

You can also try to differentiate kernel/userspace by mmu_idx passed to
helpers.
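
(A hedged sketch of that kind of filter follows. The user-mode index value and
the helper's exact signature depend on the target and QEMU version, so
MMU_USER_IDX and the log_guest_load shape here are assumptions, not the real
QEMU definitions.)

    #include <inttypes.h>
    #include <stdio.h>

    #define MMU_USER_IDX 1   /* assumed value, for illustration only */

    /* Sketch: filter the log by the mmu_idx the slow-path helper receives,
     * so only user-mode accesses are printed. */
    static void log_guest_load(uint64_t vaddr, int mmu_idx)
    {
        if (mmu_idx == MMU_USER_IDX) {
            fprintf(stderr, "__ldl_mmu, vaddr=%08" PRIx64 "\n", vaddr);
        }
    }

    int main(void)
    {
        log_guest_load(0xbf68b6e0u, MMU_USER_IDX);   /* printed */
        log_guest_load(0xffffffff8102ff91u, 0);      /* filtered out */
        return 0;
    }
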

> I am wondering whether you have met a similar problem or have
> any suggestions on this.

I used simple samples (the tests/tcg/xtensa testsuite); their memory access
pattern didn't deviate from what I expected.

> My final  goal is to obtain the memory access trace for a particular
> process in the guest, so your patch really helps, except for too many
> kernel _mmu events.

Wouldn't it be easier to use qemu-user for that?

Patch

diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index da17bba..ec68c19 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -1062,7 +1062,7 @@  static inline void tcg_out_tlb_load(TCGContext
*s, int addrlo_idx,
     tcg_out_mov(s, type, r0, addrlo);

     /* jne label1 */
-    tcg_out8(s, OPC_JCC_short + JCC_JNE);
+    tcg_out8(s, OPC_JMP_short);
     label_ptr[0] = s->code_ptr;
     s->code_ptr++;
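
Note that the hunk above only diverts every access to the slow path; the
"__ldl_mmu, vaddr=..." lines quoted earlier come from a separate logging
change that is not shown in this thread. Below is a self-contained sketch of
the kind of print that change presumably adds inside the slow-path load
helpers; the function name and format string are assumptions modeled on the
quoted output.

    #include <inttypes.h>
    #include <stdio.h>

    /* Sketch of the kind of print the (unshown) logging change presumably
     * adds near the top of slow-path load helpers such as __ldl_mmu. */
    static void log_slow_path_load(const char *helper, uint32_t vaddr)
    {
        fprintf(stderr, "%s, vaddr=%08" PRIx32 "\n", helper, vaddr);
    }

    int main(void)
    {
        /* Reproduces the format quoted earlier: "__ldl_mmu, vaddr=bf68b6e0" */
        log_slow_path_load("__ldl_mmu", 0xbf68b6e0u);
        return 0;
    }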