Message ID | CAMo8BfJZ8=28LP-m_uR1uTL2--yowOxgvdJDx3JzE29VPn+nJg@mail.gmail.com |
---|---|
State | New |
On Fri, Aug 17, 2012 at 03:57:55PM +0400, Max Filippov wrote:
> On Fri, Aug 17, 2012 at 3:14 PM, 陳韋任 (Wei-Ren Chen)
> <chenwj@iis.sinica.edu.tw> wrote:
> >> > On Thu, Aug 16, 2012 at 7:49 PM, Steven <wangwangkang@gmail.com> wrote:
> >> > [...]
> >> >> I want to get the guest memory address in the instruction mov
> >> >> 0x4(%ebx) %eax, which is 0x4(%ebx).
> >> >> Since %ebx is not resolved until the execution time, the code in
> >> >> softmmu_header.h does not generate any hit or miss information.
> >> >> Do you know any place that I could resolve the memory access address? Thanks.
> >> >
> >> > You'll have to generate code. Look at how helpers work.
> >> Hi, Laurent,
> >> do you mean the target-i386/op_helper.c/helper.c or the tcg helper? Thanks.
> >
> > What do you mean by "resolve the memory access address"? Do you want
> > to get guest virtual address for each guest memory access, right? As Max
> > mentioned before (you can also read [1]), there are fast and slow path
> > in QEMU softmmu, tlb hit and tlb miss respectively. Max provided patch
> > for slow path. As for fast path, take a look on tcg_out_tlb_load (tcg
> > /i386/tcg-target.c). tcg_out_tlb_load will generate native code in the
> > code cache to do tlb lookup, I think you cannot use the trick Max used
> > since tcg_out_tlb_load will not be called when the fast path executed,
>
> That's why I've posted the following hunk that should have made all
> accesses go via slow path:

Ya, I know. :) I was just trying to explain what Laurent wanted to say.

Regards,
chenwj
Hi, Max,
I wrote a small program to verify that your patch could catch all the load
instructions from the guest. However, I found some problems in the
results.

The guest OS and the emulated machine are both 32-bit x86. My simple
program in the guest declares a 1048576-element integer array,
initializes the elements, and loads them in a loop. It looks like this:

int array[1048576];
initialize the array;

/* region of interest */
int temp;
for (i = 0; i < 1048576; i++) {
    temp = array[i];
}

So ideally, the patch should catch the guest virtual addresses of the
accesses in the loop, right?
In addition, the virtual addresses of the beginning and end
of the array are 0xbf68b6e0 and 0xbfa8b6e0.
What I got is as follows:

__ldl_mmu, vaddr=bf68b6e0
__ldl_mmu, vaddr=bf68b6e4
__ldl_mmu, vaddr=bf68b6e8
.....

These should be the virtual addresses of the loads in the above loop. The
results look good because the gap between consecutive vaddrs is 4 bytes,
which is the length of each element.
However, after a certain address, I got

__ldl_mmu, vaddr=bf68bffc
__ldl_mmu, vaddr=bf68c000
__ldl_mmu, vaddr=bf68d000
__ldl_mmu, vaddr=bf68e000
__ldl_mmu, vaddr=bf68f000
__ldl_mmu, vaddr=bf690000
__ldl_mmu, vaddr=bf691000
__ldl_mmu, vaddr=bf692000
__ldl_mmu, vaddr=bf693000
__ldl_mmu, vaddr=bf694000
...
__ldl_mmu, vaddr=bf727000
__ldl_mmu, vaddr=bf728000
__ldl_mmu, vaddr=bfa89000
__ldl_mmu, vaddr=bfa8a000

So the rest of the vaddrs differ by 4096 bytes instead of 4. I repeated
the experiment several times and got the same results. Is there anything
wrong? Or could you explain this? Thanks.

steven

On Fri, Aug 17, 2012 at 7:57 AM, Max Filippov <jcmvbkbc@gmail.com> wrote:
> On Fri, Aug 17, 2012 at 3:14 PM, 陳韋任 (Wei-Ren Chen)
> <chenwj@iis.sinica.edu.tw> wrote:
>>> > On Thu, Aug 16, 2012 at 7:49 PM, Steven <wangwangkang@gmail.com> wrote:
>>> > [...]
>>> >> I want to get the guest memory address in the instruction mov
>>> >> 0x4(%ebx) %eax, which is 0x4(%ebx).
>>> >> Since %ebx is not resolved until the execution time, the code in
>>> >> softmmu_header.h does not generate any hit or miss information.
>>> >> Do you know any place that I could resolve the memory access address? Thanks.
>>> >
>>> > You'll have to generate code. Look at how helpers work.
>>> Hi, Laurent,
>>> do you mean the target-i386/op_helper.c/helper.c or the tcg helper? Thanks.
>>
>> What do you mean by "resolve the memory access address"? Do you want
>> to get guest virtual address for each guest memory access, right? As Max
>> mentioned before (you can also read [1]), there are fast and slow path
>> in QEMU softmmu, tlb hit and tlb miss respectively. Max provided patch
>> for slow path. As for fast path, take a look on tcg_out_tlb_load (tcg
>> /i386/tcg-target.c). tcg_out_tlb_load will generate native code in the
>> code cache to do tlb lookup, I think you cannot use the trick Max used
>> since tcg_out_tlb_load will not be called when the fast path executed,
>
> That's why I've posted the following hunk that should have made all
> accesses go via slow path:
>
> diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
> index da17bba..ec68c19 100644
> --- a/tcg/i386/tcg-target.c
> +++ b/tcg/i386/tcg-target.c
> @@ -1062,7 +1062,7 @@ static inline void tcg_out_tlb_load(TCGContext
> *s, int addrlo_idx,
>      tcg_out_mov(s, type, r0, addrlo);
>
>      /* jne label1 */
> -    tcg_out8(s, OPC_JCC_short + JCC_JNE);
> +    tcg_out8(s, OPC_JMP_short);
>      label_ptr[0] = s->code_ptr;
>      s->code_ptr++;
>
>
>> it "generates" code instead. Therefore, you might have to insert your
>> instrument code in the code cache, perhaps modifying tcg_out_tlb_load
>> to log value of "addrlo" (see comments above tcg_out_tlb_load).
>
> --
> Thanks.
> -- Max
On Tue, Aug 21, 2012 at 9:40 AM, Steven <wangwangkang@gmail.com> wrote:
> Hi, Max,
> I wrote a small program to verify your patch could catch all the load
> instructions from the guest. However, I found some problem from the
> results.
>
> The guest OS and the emulated machine are both 32bit x86. My simple
> program in the guest declares an 1048576-element integer array,
> initialize the elements, and load them in a loop. It looks like this
> int array[1048576];
> initialize the array;
>
> /* region of interests */
> int temp;
> for (i=0; i < 1048576; i++) {
>     temp = array[i];
> }
> So ideally, the path should catch the guest virtual address of in the
> loop, right?
> In addition, the virtual address for the beginning and end
> of the array is 0xbf68b6e0 and 0xbfa8b6e0.
> What i got is as follows
>
> __ldl_mmu, vaddr=bf68b6e0
> __ldl_mmu, vaddr=bf68b6e4
> __ldl_mmu, vaddr=bf68b6e8
> .....
> These should be the virtual address of the above loop. The
> results look good because the gap between each vaddr is 4 bypte, which
> is the length of each element.
> However, after certain address, I got
>
> __ldl_mmu, vaddr=bf68bffc
> __ldl_mmu, vaddr=bf68c000
> __ldl_mmu, vaddr=bf68d000
> __ldl_mmu, vaddr=bf68e000
> __ldl_mmu, vaddr=bf68f000
> __ldl_mmu, vaddr=bf690000
> __ldl_mmu, vaddr=bf691000
> __ldl_mmu, vaddr=bf692000
> __ldl_mmu, vaddr=bf693000
> __ldl_mmu, vaddr=bf694000
> ...
> __ldl_mmu, vaddr=bf727000
> __ldl_mmu, vaddr=bf728000
> __ldl_mmu, vaddr=bfa89000
> __ldl_mmu, vaddr=bfa8a000
> So the rest of the vaddr I got has a different of 4096 bytes, instead
> of 4. I repeated the experiment for several times and got the same
> results. Is there anything wrong? or could you explain this? Thanks.

I see two possibilities here:
- maybe there are more fast path shortcuts in the QEMU code?
  in that case output of qemu -d op,out_asm would help.
- maybe your compiler had optimized that sample code?
  could you try to declare array in your sample as 'volatile int'?
On Sat, Aug 25, 2012 at 9:20 PM, Steven <wangwangkang@gmail.com> wrote:
> On Tue, Aug 21, 2012 at 3:18 AM, Max Filippov <jcmvbkbc@gmail.com> wrote:
>> On Tue, Aug 21, 2012 at 9:40 AM, Steven <wangwangkang@gmail.com> wrote:
>>> Hi, Max,
>>> I wrote a small program to verify your patch could catch all the load
>>> instructions from the guest. However, I found some problem from the
>>> results.
>>>
>>> The guest OS and the emulated machine are both 32bit x86. My simple
>>> program in the guest declares an 1048576-element integer array,
>>> initialize the elements, and load them in a loop. It looks like this
>>> int array[1048576];
>>> initialize the array;
>>>
>>> /* region of interests */
>>> int temp;
>>> for (i=0; i < 1048576; i++) {
>>>     temp = array[i];
>>> }
>>> So ideally, the path should catch the guest virtual address of in the
>>> loop, right?
>>> In addition, the virtual address for the beginning and end
>>> of the array is 0xbf68b6e0 and 0xbfa8b6e0.
>>> What i got is as follows
>>>
>>> __ldl_mmu, vaddr=bf68b6e0
>>> __ldl_mmu, vaddr=bf68b6e4
>>> __ldl_mmu, vaddr=bf68b6e8
>>> .....
>>> These should be the virtual address of the above loop. The
>>> results look good because the gap between each vaddr is 4 bypte, which
>>> is the length of each element.
>>> However, after certain address, I got
>>>
>>> __ldl_mmu, vaddr=bf68bffc
>>> __ldl_mmu, vaddr=bf68c000
>>> __ldl_mmu, vaddr=bf68d000
>>> __ldl_mmu, vaddr=bf68e000
>>> __ldl_mmu, vaddr=bf68f000
>>> __ldl_mmu, vaddr=bf690000
>>> __ldl_mmu, vaddr=bf691000
>>> __ldl_mmu, vaddr=bf692000
>>> __ldl_mmu, vaddr=bf693000
>>> __ldl_mmu, vaddr=bf694000
>>> ...
>>> __ldl_mmu, vaddr=bf727000
>>> __ldl_mmu, vaddr=bf728000
>>> __ldl_mmu, vaddr=bfa89000
>>> __ldl_mmu, vaddr=bfa8a000
>>> So the rest of the vaddr I got has a different of 4096 bytes, instead
>>> of 4. I repeated the experiment for several times and got the same
>>> results. Is there anything wrong? or could you explain this? Thanks.
>>
>> I see two possibilities here:
>> - maybe there are more fast path shortcuts in the QEMU code?
>>   in that case output of qemu -d op,out_asm would help.
>> - maybe your compiler had optimized that sample code?
>>   could you try to declare array in your sample as 'volatile int'?
> After adding the "volatile" qualifier, the results are correct now.
> So your patch can trap all the guest memory data load accesses, no
> matter slow path or fast path.
>
> However, I found some problems when trying to understand the instruction
> accesses. So I ran the VM with "-d in_asm" to see the program counter of
> each piece of guest code. I got
>
> __ldl_cmmu,ffffffff8102ff91
> __ldl_cmmu,ffffffff8102ff9a
> ----------------
> IN:
> 0xffffffff8102ff8a: mov 0x8(%rbx),%rax
> 0xffffffff8102ff8e: add 0x790(%rbx),%rax
> 0xffffffff8102ff95: xor %edx,%edx
> 0xffffffff8102ff97: mov 0x858(%rbx),%rcx
> 0xffffffff8102ff9e: cmp %rcx,%rax
> 0xffffffff8102ffa1: je 0xffffffff8102ffb0
> .....
>
> __ldl_cmmu,00000000004005a1
> __ldl_cmmu,00000000004005a6
> ----------------
> IN:
> 0x0000000000400594: push %rbp
> 0x0000000000400595: mov %rsp,%rbp
> 0x0000000000400598: sub $0x20,%rsp
> 0x000000000040059c: mov %rdi,-0x18(%rbp)
> 0x00000000004005a0: mov $0x1,%edi
> 0x00000000004005a5: callq 0x4004a0
>
> From the results, I see that the guest virtual address of the pc is
> slightly different between the __ldl_cmmu and the tb's pc (below IN:).
> Could you help me understand this? Which one is the true pc of the
> memory access? Thanks.

Guest code is accessed at translation time by C functions, and I guess
there are other layers of address translation caching. I wouldn't try to
interpret these _cmmu printouts and would instead instrument the
[cpu_]ld{{u,s}{b,w},l,q}_code macros.
On Sat, Aug 25, 2012 at 4:41 PM, Max Filippov <jcmvbkbc@gmail.com> wrote:
> On Sat, Aug 25, 2012 at 9:20 PM, Steven <wangwangkang@gmail.com> wrote:
>> On Tue, Aug 21, 2012 at 3:18 AM, Max Filippov <jcmvbkbc@gmail.com> wrote:
>>> On Tue, Aug 21, 2012 at 9:40 AM, Steven <wangwangkang@gmail.com> wrote:
>>>> Hi, Max,
>>>> I wrote a small program to verify your patch could catch all the load
>>>> instructions from the guest. However, I found some problem from the
>>>> results.
>>>>
>>>> The guest OS and the emulated machine are both 32bit x86. My simple
>>>> program in the guest declares an 1048576-element integer array,
>>>> initialize the elements, and load them in a loop. It looks like this
>>>> int array[1048576];
>>>> initialize the array;
>>>>
>>>> /* region of interests */
>>>> int temp;
>>>> for (i=0; i < 1048576; i++) {
>>>>     temp = array[i];
>>>> }
>>>> So ideally, the path should catch the guest virtual address of in the
>>>> loop, right?
>>>> In addition, the virtual address for the beginning and end
>>>> of the array is 0xbf68b6e0 and 0xbfa8b6e0.
>>>> What i got is as follows
>>>>
>>>> __ldl_mmu, vaddr=bf68b6e0
>>>> __ldl_mmu, vaddr=bf68b6e4
>>>> __ldl_mmu, vaddr=bf68b6e8
>>>> .....
>>>> These should be the virtual address of the above loop. The
>>>> results look good because the gap between each vaddr is 4 bypte, which
>>>> is the length of each element.
>>>> However, after certain address, I got
>>>>
>>>> __ldl_mmu, vaddr=bf68bffc
>>>> __ldl_mmu, vaddr=bf68c000
>>>> __ldl_mmu, vaddr=bf68d000
>>>> __ldl_mmu, vaddr=bf68e000
>>>> __ldl_mmu, vaddr=bf68f000
>>>> __ldl_mmu, vaddr=bf690000
>>>> __ldl_mmu, vaddr=bf691000
>>>> __ldl_mmu, vaddr=bf692000
>>>> __ldl_mmu, vaddr=bf693000
>>>> __ldl_mmu, vaddr=bf694000
>>>> ...
>>>> __ldl_mmu, vaddr=bf727000
>>>> __ldl_mmu, vaddr=bf728000
>>>> __ldl_mmu, vaddr=bfa89000
>>>> __ldl_mmu, vaddr=bfa8a000
>>>> So the rest of the vaddr I got has a different of 4096 bytes, instead
>>>> of 4.
>>>> I repeated the experiment for several times and got the same
>>>> results. Is there anything wrong? or could you explain this? Thanks.
>>>
>>> I see two possibilities here:
>>> - maybe there are more fast path shortcuts in the QEMU code?
>>>   in that case output of qemu -d op,out_asm would help.
>>> - maybe your compiler had optimized that sample code?
>>>   could you try to declare array in your sample as 'volatile int'?
>> After adding the "volatile" qualifier, the results are correct now.
>> So your patch can trap all the guest memory data load access, no
>> matter slow path or fast path.
>>
>> However, I found some problem when I try understanding the instruction
>> access. So I run the VM with "-d in_asm" to see program counter of
>> each guest code. I got
>>
>> __ldl_cmmu,ffffffff8102ff91
>> __ldl_cmmu,ffffffff8102ff9a
>> ----------------
>> IN:
>> 0xffffffff8102ff8a: mov 0x8(%rbx),%rax
>> 0xffffffff8102ff8e: add 0x790(%rbx),%rax
>> 0xffffffff8102ff95: xor %edx,%edx
>> 0xffffffff8102ff97: mov 0x858(%rbx),%rcx
>> 0xffffffff8102ff9e: cmp %rcx,%rax
>> 0xffffffff8102ffa1: je 0xffffffff8102ffb0
>> .....
>>
>> __ldl_cmmu,00000000004005a1
>> __ldl_cmmu,00000000004005a6
>> ----------------
>> IN:
>> 0x0000000000400594: push %rbp
>> 0x0000000000400595: mov %rsp,%rbp
>> 0x0000000000400598: sub $0x20,%rsp
>> 0x000000000040059c: mov %rdi,-0x18(%rbp)
>> 0x00000000004005a0: mov $0x1,%edi
>> 0x00000000004005a5: callq 0x4004a0
>>
>> From the results, I see that the guest virtual address of the pc is
>> slightly different between the __ldl_cmmu and the tb's pc(below IN:).
>> Could you help to understand this? Which one is the true pc memory
>> access? Thanks.
>
> Guest code is accessed at the translation time by C functions and
> I guess there are other layers of address translation caching. I wouldn't
> try to interpret these _cmmu printouts and would instead instrument
> [cpu_]ld{{u,s}{b,w},l,q}_code macros.

yes, you are right.
Some ldub_code in the x86 guest does not call __ldq_cmmu when the tlb hits.

By the way, when I use your patch, I see too many log events for kernel
data in the _mmu helpers, i.e., addrs above 0x7fff ffff ffff. There are so
many such mmu events that the user-mode code can hardly make progress. So
I had to set up a condition like

    if (addr < 0x8000 0000 0000)
        fprintf(stderr, "%s: %08x\n", __func__, addr);

Then my simple array-access program can finish. I am wondering whether you
have met a similar problem or have any suggestions on this.

My final goal is to obtain the memory access trace for a particular
process in the guest, so your patch really helps, except for too many
kernel _mmu events.

steven

>
> --
> Thanks.
> -- Max
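The cutoff Steven describes can be factored into a small predicate. The sketch below is only an illustration under stated assumptions: the 0x8000 0000 0000 boundary comes from the mail, while the function names (`should_log`, `log_access`) are made up and do not match any actual QEMU helper:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical cutoff: on an x86-64 Linux guest, canonical user-space
 * addresses lie below 0x8000 0000 0000, so dropping everything at or
 * above it silences the kernel _mmu flood described above. */
#define GUEST_USER_TOP UINT64_C(0x800000000000)

static int should_log(uint64_t vaddr)
{
    return vaddr < GUEST_USER_TOP;
}

/* How the predicate would sit in a load helper's logging path. */
static void log_access(const char *func, uint64_t vaddr)
{
    if (should_log(vaddr))
        fprintf(stderr, "%s: %012" PRIx64 "\n", func, vaddr);
}
```

With this filter, a kernel address such as 0xffffffff8102ff91 is dropped while the user-space array addresses from the test program are kept.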
> My final goal is to obtain the memory access trace for a particular
> process in the guest, so your patch really helps, except for too many
> kernel _mmu events.

How do you know which process the guest is running, so that you can log
its memory access trace?

Regards,
chenwj
I added a special opcode, which is not used by existing x86. When the
process in the guest issues this opcode, QEMU starts to log its mmu
access.

On Mon, Aug 27, 2012 at 11:14 PM, 陳韋任 (Wei-Ren Chen)
<chenwj@iis.sinica.edu.tw> wrote:
>> My final goal is to obtain the memory access trace for a particular
>> process in the guest, so your patch really helps, except for too many
>> kernel _mmu events.
>
> How do you know guest is running which process, and log it's memory
> access trace?
>
> Regards,
> chenwj
>
> --
> Wei-Ren Chen (陳韋任)
> Computer Systems Lab, Institute of Information Science,
> Academia Sinica, Taiwan (R.O.C.)
> Tel:886-2-2788-3799 #1667
> Homepage: http://people.cs.nctu.edu.tw/~chenwj
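One way to picture the mechanism Steven describes is a global flag that a (hypothetical) decoder hook toggles when it sees the magic opcode, with the mmu logging gated on that flag. Everything here, names included, is an illustrative simulation of the gating logic, not QEMU code; in QEMU the opcode decoding itself would live in the target front end:

```c
#include <stdbool.h>
#include <stdio.h>

/* Flipped when the translator decodes the reserved "magic" opcode that
 * the traced guest process issues (the opcode choice and decoder hook
 * are hypothetical). */
static bool trace_enabled = false;

static void magic_opcode_seen(void)
{
    trace_enabled = !trace_enabled;  /* a second use stops tracing */
}

/* The _mmu logging from the patch, gated on the flag so that only
 * accesses made while the traced process runs are printed. */
static void log_mmu_access(unsigned long vaddr)
{
    if (trace_enabled)
        fprintf(stderr, "__ldl_mmu, vaddr=%08lx\n", vaddr);
}
```

The guest-side half would be a short inline-assembly stub emitting the chosen opcode at the start and end of the region of interest.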
On Mon, Aug 27, 2012 at 8:15 PM, Steven <wangwangkang@gmail.com> wrote:
>> Guest code is accessed at the translation time by C functions and
>> I guess there are other layers of address translation caching. I wouldn't
>> try to interpret these _cmmu printouts and would instead instrument
>> [cpu_]ld{{u,s}{b,w},l,q}_code macros.
> yes, you are right.
> Some ldub_code in x86 guest does not call __ldq_cmmu when the tlb hits.
> By the way, when I use your patch, I saw too many log event for the
> kernel data _mmu, ie., the addrs is
> 0x7fff ffff ffff. There are too many such mmu event that the user mode
> data can not be executed. So I have to setup a condition like
> if (addr < 0x8000 0000 0000)
>     fprintf(stderr, "%s: %08x\n", __func__, addr);
> Then my simple array access program can be finished.

You can also try to differentiate kernel/userspace by mmu_idx passed
to helpers.

> I am wondering whether you have met the similar problem or you have
> any suggestion on this.

I used simple samples (tests/tcg/xtensa testsuite), their memory access
pattern didn't deviate from what I expected.

> My final goal is to obtain the memory access trace for a particular
> process in the guest, so your patch really helps, except for too many
> kernel _mmu events.

Wouldn't it be easier to use qemu-user for that?
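Max's mmu_idx suggestion avoids hard-coding an address boundary: the softmmu load/store helpers already receive the MMU index of each access, so kernel-mode accesses can simply be skipped. In the sketch below the index values and names are assumptions for illustration; the real values for a target come from its cpu_mmu_index() implementation:

```c
#include <stdio.h>

/* Illustrative MMU indexes only: each QEMU target defines its own
 * indexes via cpu_mmu_index(), so these two names are assumptions. */
enum { MMU_KERNEL_IDX = 0, MMU_USER_IDX = 1 };

static int is_user_access(int mmu_idx)
{
    return mmu_idx == MMU_USER_IDX;
}

/* A load helper receives mmu_idx with each access, so it can gate its
 * logging on the index instead of comparing the virtual address
 * against a hard-coded user/kernel boundary. */
static void log_if_user(unsigned long vaddr, int mmu_idx)
{
    if (is_user_access(mmu_idx))
        fprintf(stderr, "__ldl_mmu, vaddr=%08lx\n", vaddr);
}
```

Compared with the address cutoff, this approach also works for guests whose user/kernel split is not at a fixed address.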
diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index da17bba..ec68c19 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -1062,7 +1062,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, int addrlo_idx,
     tcg_out_mov(s, type, r0, addrlo);
 
     /* jne label1 */
-    tcg_out8(s, OPC_JCC_short + JCC_JNE);
+    tcg_out8(s, OPC_JMP_short);
     label_ptr[0] = s->code_ptr;
     s->code_ptr++;