diff mbox series

[v3] uprobes/x86: emulate push insns for uprobe on x86

Message ID 20171113221139.1516536-1-yhs@fb.com
State Not Applicable, archived
Delegated to: David Miller
Headers show
Series [v3] uprobes/x86: emulate push insns for uprobe on x86 | expand

Commit Message

Yonghong Song Nov. 13, 2017, 10:11 p.m. UTC
Uprobe is a tracing mechanism for userspace programs.
Typical uprobe will incur overhead of two traps.
First trap is caused by replaced trap insn, and
the second trap is to execute the original displaced
insn in user space.

To reduce the overhead, kernel provides hooks
for architectures to emulate the original insn
and skip the second trap. In x86, emulation
is done for certain branch insns.

This patch extends the emulation to "push <reg>"
insns. These insns are typical in the beginning
of the function. For example, bcc
in https://github.com/iovisor/bcc repo provides
tools to measure funclantency, detect memleak, etc.
The tools will place uprobes in the beginning of
function and possibly uretprobes at the end of function.
This patch is able to reduce the trap overhead for
uprobe from 2 to 1.

Without this patch, uretprobe will typically incur
three traps. With this patch, if the function starts
with "push" insn, the number of traps can be
reduced from 3 to 2.

An experiment was conducted on two local VMs,
fedora 26 64-bit VM and 32-bit VM, both 4 processors
and 4GB memory, booted with latest tip repo (and this patch).
The host is MacBook with intel i7 processor.

The test program looks like
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>
  #include <sys/time.h>

  static void test() __attribute__((noinline));
  void test() {}
  int main() {
    struct timeval start, end;

    gettimeofday(&start, NULL);
    for (int i = 0; i < 1000000; i++) {
      test();
    }
    gettimeofday(&end, NULL);

    printf("%ld\n", ((end.tv_sec * 1000000 + end.tv_usec)
                     - (start.tv_sec * 1000000 + start.tv_usec)));
    return 0;
  }

The program is compiled without optimization, and
the first insn for function "test" is "push %rbp".
The host is relatively idle.

Before the test run, the uprobe is inserted as below for uprobe:
  echo 'p <binary>:<test_func_offset>' > /sys/kernel/debug/tracing/uprobe_events
  echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
and for uretprobe:
  echo 'r <binary>:<test_func_offset>' > /sys/kernel/debug/tracing/uprobe_events
  echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable

Unit: microsecond(usec) per loop iteration

x86_64          W/ this patch   W/O this patch
uprobe          1.55            3.1
uretprobe       2.0             3.6

x86_32          W/ this patch   W/O this patch
uprobe          1.41            3.5
uretprobe       1.75            4.0

You can see that this patch significantly reduced the overhead,
50% for uprobe and 44% for uretprobe on x86_64, and even more
on x86_32.

Signed-off-by: Yonghong Song <yhs@fb.com>
---
 arch/x86/include/asm/uprobes.h |   4 ++
 arch/x86/kernel/uprobes.c      | 111 +++++++++++++++++++++++++++++++++++++++--
 2 files changed, 111 insertions(+), 4 deletions(-)

Changelogs:
v2 -> v3:
  . Do not emulate 32bit application on x86_64 platforms
v1 -> v2:
  . Address Oleg's comments

Comments

Oleg Nesterov Nov. 14, 2017, 3:51 p.m. UTC | #1
On 11/13, Yonghong Song wrote:
>
> +static int push_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
> +{
> +	u8 opc1 = OPCODE1(insn), reg_offset = 0;
> +
> +	if (opc1 < 0x50 || opc1 > 0x57)
> +		return -ENOSYS;
> +
> +	if (insn->length > 2)
> +		return -ENOSYS;
> +#ifdef CONFIG_X86_64
> +	if (test_thread_flag(TIF_ADDR32))
> +		return -ENOSYS;
> +#endif

No, this doesn't look right, see my previous email. You should do this
check in the "if (insn->length == 2)" branch below, "push bp" should be
emulated correctly.

And test_thread_flag(TIF_ADDR32) is not right too. The caller is not
necessarily the probed task. See is_64bit_mm(mm) in arch_uprobe_analyze_insn().

And again... please check if uprobe_init_insn() fails or not in this case
(32bit task does, say, "push r8"). If it fails, your V2 should be fine.


To remind, uprobes && 32-bit is broken, let me quote my another email:

	The 3rd bug means that you simply can't uprobe a 32bit task on a 64bit
	system, the in_compat_syscall() logic in get_unmapped_area() looks very
	wrong although I need to re-check.

I didn't have time for this problem so far. But emulation should work, so
you can hopefully test your patch.

Oleg.
Oleg Nesterov Nov. 14, 2017, 4:03 p.m. UTC | #2
On 11/14, Oleg Nesterov wrote:
>
> > +#ifdef CONFIG_X86_64
> > +	if (test_thread_flag(TIF_ADDR32))
> > +		return -ENOSYS;
> > +#endif
>
> No, this doesn't look right, see my previous email. You should do this
> check in the "if (insn->length == 2)" branch below, "push bp" should be
> emulated correctly.
>
> And test_thread_flag(TIF_ADDR32) is not right too. The caller is not
> necessarily the probed task. See is_64bit_mm(mm) in arch_uprobe_analyze_insn().
>
> And again... please check if uprobe_init_insn() fails or not in this case
> (32bit task does, say, "push r8"). If it fails, your V2 should be fine.
>
>
> To remind, uprobes && 32-bit is broken, let me quote my another email:
>
> 	The 3rd bug means that you simply can't uprobe a 32bit task on a 64bit
> 	system, the in_compat_syscall() logic in get_unmapped_area() looks very
> 	wrong although I need to re-check.

Yes,

> I didn't have time for this problem so far. But emulation should work, so
> you can hopefully test your patch.

Ah, no, sizeof_long() is broken by the same reason, so you can't test it...

OK, I'll try to do something tomorrow, then we will see what can we do
with your patch...

But it would be nice if you can check what uprobe_init_insn() does in this
case, see above.

Oleg.
Yonghong Song Nov. 14, 2017, 10:35 p.m. UTC | #3
On 11/14/17 7:51 AM, Oleg Nesterov wrote:
> On 11/13, Yonghong Song wrote:
>>
>> +static int push_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
>> +{
>> +	u8 opc1 = OPCODE1(insn), reg_offset = 0;
>> +
>> +	if (opc1 < 0x50 || opc1 > 0x57)
>> +		return -ENOSYS;
>> +
>> +	if (insn->length > 2)
>> +		return -ENOSYS;
>> +#ifdef CONFIG_X86_64
>> +	if (test_thread_flag(TIF_ADDR32))
>> +		return -ENOSYS;
>> +#endif
> 
> No, this doesn't look right, see my previous email. You should do this
> check in the "if (insn->length == 2)" branch below, "push bp" should be
> emulated correctly.
> 
> And test_thread_flag(TIF_ADDR32) is not right too. The caller is not
> necessarily the probed task. See is_64bit_mm(mm) in arch_uprobe_analyze_insn().

I printed out some statistics. On x86_64 platform, for 32bit 
application, test_thread_flag(TIF_ADDR32) returns true and 
is_64bit_mm(mm) returns false. For 64bit application,
test_thread_flag(TIF_ADDR32) returns false and is_64bit_mm(mm) return 
true. So that is why my patch works fine.

I did not fully understand how to trigger "the caller is not necessarily 
the probed task." So in the next revision, I will use is_64bit_mm(mm) 
instead.

> 
> And again... please check if uprobe_init_insn() fails or not in this case
> (32bit task does, say, "push r8"). If it fails, your V2 should be fine.

The compiler won't generated "push r8" for 32bit task since register 
"r8" is not available on 32bit instruction.

> 
> 
> To remind, uprobes && 32-bit is broken, let me quote my another email:
> 
> 	The 3rd bug means that you simply can't uprobe a 32bit task on a 64bit
> 	system, the in_compat_syscall() logic in get_unmapped_area() looks very
> 	wrong although I need to re-check.
> 
> I didn't have time for this problem so far. But emulation should work, so
> you can hopefully test your patch.
> 
> Oleg.
>
Yonghong Song Nov. 14, 2017, 10:43 p.m. UTC | #4
On 11/14/17 8:03 AM, Oleg Nesterov wrote:
> On 11/14, Oleg Nesterov wrote:
>>
>>> +#ifdef CONFIG_X86_64
>>> +	if (test_thread_flag(TIF_ADDR32))
>>> +		return -ENOSYS;
>>> +#endif
>>
>> No, this doesn't look right, see my previous email. You should do this
>> check in the "if (insn->length == 2)" branch below, "push bp" should be
>> emulated correctly.
>>
>> And test_thread_flag(TIF_ADDR32) is not right too. The caller is not
>> necessarily the probed task. See is_64bit_mm(mm) in arch_uprobe_analyze_insn().
>>
>> And again... please check if uprobe_init_insn() fails or not in this case
>> (32bit task does, say, "push r8"). If it fails, your V2 should be fine.
>>
>>
>> To remind, uprobes && 32-bit is broken, let me quote my another email:
>>
>> 	The 3rd bug means that you simply can't uprobe a 32bit task on a 64bit
>> 	system, the in_compat_syscall() logic in get_unmapped_area() looks very
>> 	wrong although I need to re-check.
> 
> Yes,
> 
>> I didn't have time for this problem so far. But emulation should work, so
>> you can hopefully test your patch.
> 
> Ah, no, sizeof_long() is broken by the same reason, so you can't test it...

Right. I hacked the emulate_push_stack (original name: push_ret_address) 
with sizeof_long = 4, and 32bit binary uprobe works fine on x86_64 
platform then... But that will involve a bigger change to propogate
the is_64bit_mm() along the call graph.

> 
> OK, I'll try to do something tomorrow, then we will see what can we do
> with your patch...

Thanks for reviewing! I will wait for your further comments/direction
before next step.

> 
> But it would be nice if you can check what uprobe_init_insn() does in this
> case, see above.

As mentioned in my previous email, for 32bit application,
compiler won't generate "push %r8" as "%r8" is only available on
x86_64 platform. For 32bit app, I see "push %bp" etc which does not
have rex_prefix. They cannot be emulated right now due to 
emulate_push_stack needs change.

> 
> Oleg.
>
Oleg Nesterov Nov. 15, 2017, 3:38 p.m. UTC | #5
On 11/14, Yonghong Song wrote:
>
> On 11/14/17 7:51 AM, Oleg Nesterov wrote:
> >
> >And test_thread_flag(TIF_ADDR32) is not right too. The caller is not
> >necessarily the probed task. See is_64bit_mm(mm) in arch_uprobe_analyze_insn().
>
> I printed out some statistics. On x86_64 platform, for 32bit application,
> test_thread_flag(TIF_ADDR32) returns true

See above. The caller can be 64-bit even if the probed task is 32bit. Or
vice versa.

> and is_64bit_mm(mm) returns false.

This is what we need. Again, see its usage in arch_uprobe_analyze_insn()
and note that mm != current->mm.

> So that is why my patch works fine.

test_thread_flag() can't work in general, see above.

> I did not fully understand how to trigger "the caller is not necessarily the
> probed task." So in the next revision, I will use is_64bit_mm(mm) instead.

register_for_each_vma() paths can call arch_uprobe_analyze_insn(), the task
which calls register_ has is not necessarily the task(s) we want to probe.

> >And again... please check if uprobe_init_insn() fails or not in this case
> >(32bit task does, say, "push r8"). If it fails, your V2 should be fine.
>
> The compiler won't generated "push r8" for 32bit task since register "r8" is
> not available on 32bit instruction.

And?

uprobes should be transparent even when it comes to user-space bugs. If
a 32bit app does asm(".byte 0x41, 0x50") for any reason we should either
deny to probe this insn (uprobe_init_insn() should fail), or we should
execute it out-of-line so that it will be trapped correctly.

Oleg.
Oleg Nesterov Nov. 15, 2017, 3:47 p.m. UTC | #6
On 11/14, Yonghong Song wrote:
>
>
> On 11/14/17 8:03 AM, Oleg Nesterov wrote:
> >Ah, no, sizeof_long() is broken by the same reason, so you can't test it...
>
> Right. I hacked the emulate_push_stack (original name: push_ret_address)
> with sizeof_long = 4, and 32bit binary uprobe works fine on x86_64 platform
> then...

OK,

> >OK, I'll try to do something tomorrow, then we will see what can we do
> >with your patch...
>
> Thanks for reviewing! I will wait for your further comments/direction
> before next step.

Oh. tomorrow, I promise. Sorry I was bit busy today...

> >
> >But it would be nice if you can check what uprobe_init_insn() does in this
> >case, see above.
>
> As mentioned in my previous email, for 32bit application,
> compiler won't generate "push %r8" as "%r8" is only available on
> x86_64 platform.

But this is irrelevant, see my previous email.

So please, check if uprobe_init_insn() fails or not in this case. After that
we will know whether your patch needs the additional is_64bit_mm() check in
push_setup_xol_ops() or not.

Oleg.
Oleg Nesterov Nov. 15, 2017, 5:07 p.m. UTC | #7
On 11/15, Oleg Nesterov wrote:
>
> So please, check if uprobe_init_insn() fails or not in this case. After that
> we will know whether your patch needs the additional is_64bit_mm() check in
> push_setup_xol_ops() or not.

OK, I did the check for you.

uprobe_init_insn() doesn't fail but insn_init(x86_64 => 0) parse it as
single-byte insn with OPCODE1 == 0x41, so push_setup_xol_ops() doesn't
need to worry about compat tasks.

In short, your "V2" should be fine except you can factor out
auprobe->push.ilen initialization (as you did in V3). Please send V4.

Oleg.
Yonghong Song Nov. 15, 2017, 5:25 p.m. UTC | #8
On 11/15/17 9:07 AM, Oleg Nesterov wrote:
> On 11/15, Oleg Nesterov wrote:
>>
>> So please, check if uprobe_init_insn() fails or not in this case. After that
>> we will know whether your patch needs the additional is_64bit_mm() check in
>> push_setup_xol_ops() or not.
> 
> OK, I did the check for you.
> 
> uprobe_init_insn() doesn't fail but insn_init(x86_64 => 0) parse it as
> single-byte insn with OPCODE1 == 0x41, so push_setup_xol_ops() doesn't
> need to worry about compat tasks.
> 
> In short, your "V2" should be fine except you can factor out
> auprobe->push.ilen initialization (as you did in V3). Please send V4.

Thanks a lot! I am just about to use inline asm or binary rewriter to 
create such a code for testing...

I will send V4 shortly.

> 
> Oleg.
>
diff mbox series

Patch

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 74f4c2f..d8bfa98 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -53,6 +53,10 @@  struct arch_uprobe {
 			u8	fixups;
 			u8	ilen;
 		} 			defparam;
+		struct {
+			u8	reg_offset;	/* to the start of pt_regs */
+			u8	ilen;
+		}			push;
 	};
 };
 
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index a3755d2..007d8e77 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -528,11 +528,11 @@  static int default_pre_xol_op(struct arch_uprobe *auprobe, struct pt_regs *regs)
 	return 0;
 }
 
-static int push_ret_address(struct pt_regs *regs, unsigned long ip)
+static int emulate_push_stack(struct pt_regs *regs, unsigned long val)
 {
 	unsigned long new_sp = regs->sp - sizeof_long();
 
-	if (copy_to_user((void __user *)new_sp, &ip, sizeof_long()))
+	if (copy_to_user((void __user *)new_sp, &val, sizeof_long()))
 		return -EFAULT;
 
 	regs->sp = new_sp;
@@ -566,7 +566,7 @@  static int default_post_xol_op(struct arch_uprobe *auprobe, struct pt_regs *regs
 		regs->ip += correction;
 	} else if (auprobe->defparam.fixups & UPROBE_FIX_CALL) {
 		regs->sp += sizeof_long(); /* Pop incorrect return address */
-		if (push_ret_address(regs, utask->vaddr + auprobe->defparam.ilen))
+		if (emulate_push_stack(regs, utask->vaddr + auprobe->defparam.ilen))
 			return -ERESTART;
 	}
 	/* popf; tell the caller to not touch TF */
@@ -655,7 +655,7 @@  static bool branch_emulate_op(struct arch_uprobe *auprobe, struct pt_regs *regs)
 		 *
 		 * But there is corner case, see the comment in ->post_xol().
 		 */
-		if (push_ret_address(regs, new_ip))
+		if (emulate_push_stack(regs, new_ip))
 			return false;
 	} else if (!check_jmp_cond(auprobe, regs)) {
 		offs = 0;
@@ -665,6 +665,16 @@  static bool branch_emulate_op(struct arch_uprobe *auprobe, struct pt_regs *regs)
 	return true;
 }
 
+static bool push_emulate_op(struct arch_uprobe *auprobe, struct pt_regs *regs)
+{
+	unsigned long *src_ptr = (void *)regs + auprobe->push.reg_offset;
+
+	if (emulate_push_stack(regs, *src_ptr))
+		return false;
+	regs->ip += auprobe->push.ilen;
+	return true;
+}
+
 static int branch_post_xol_op(struct arch_uprobe *auprobe, struct pt_regs *regs)
 {
 	BUG_ON(!branch_is_call(auprobe));
@@ -703,6 +713,10 @@  static const struct uprobe_xol_ops branch_xol_ops = {
 	.post_xol = branch_post_xol_op,
 };
 
+static const struct uprobe_xol_ops push_xol_ops = {
+	.emulate  = push_emulate_op,
+};
+
 /* Returns -ENOSYS if branch_xol_ops doesn't handle this insn */
 static int branch_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
 {
@@ -750,6 +764,91 @@  static int branch_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
 	return 0;
 }
 
+/* Returns -ENOSYS if push_xol_ops doesn't handle this insn */
+static int push_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
+{
+	u8 opc1 = OPCODE1(insn), reg_offset = 0;
+
+	if (opc1 < 0x50 || opc1 > 0x57)
+		return -ENOSYS;
+
+	if (insn->length > 2)
+		return -ENOSYS;
+#ifdef CONFIG_X86_64
+	if (test_thread_flag(TIF_ADDR32))
+		return -ENOSYS;
+#endif
+	if (insn->length == 2) {
+		/* only support rex_prefix 0x41 (x64 only) */
+#ifdef CONFIG_X86_64
+		if (insn->rex_prefix.nbytes != 1 ||
+		    insn->rex_prefix.bytes[0] != 0x41)
+			return -ENOSYS;
+
+		switch (opc1) {
+		case 0x50:
+			reg_offset = offsetof(struct pt_regs, r8);
+			break;
+		case 0x51:
+			reg_offset = offsetof(struct pt_regs, r9);
+			break;
+		case 0x52:
+			reg_offset = offsetof(struct pt_regs, r10);
+			break;
+		case 0x53:
+			reg_offset = offsetof(struct pt_regs, r11);
+			break;
+		case 0x54:
+			reg_offset = offsetof(struct pt_regs, r12);
+			break;
+		case 0x55:
+			reg_offset = offsetof(struct pt_regs, r13);
+			break;
+		case 0x56:
+			reg_offset = offsetof(struct pt_regs, r14);
+			break;
+		case 0x57:
+			reg_offset = offsetof(struct pt_regs, r15);
+			break;
+		}
+#else
+		return -ENOSYS;
+#endif
+	} else {
+		switch (opc1) {
+		case 0x50:
+			reg_offset = offsetof(struct pt_regs, ax);
+			break;
+		case 0x51:
+			reg_offset = offsetof(struct pt_regs, cx);
+			break;
+		case 0x52:
+			reg_offset = offsetof(struct pt_regs, dx);
+			break;
+		case 0x53:
+			reg_offset = offsetof(struct pt_regs, bx);
+			break;
+		case 0x54:
+			reg_offset = offsetof(struct pt_regs, sp);
+			break;
+		case 0x55:
+			reg_offset = offsetof(struct pt_regs, bp);
+			break;
+		case 0x56:
+			reg_offset = offsetof(struct pt_regs, si);
+			break;
+		case 0x57:
+			reg_offset = offsetof(struct pt_regs, di);
+			break;
+		}
+	}
+
+	auprobe->push.ilen = insn->length;
+	auprobe->push.reg_offset = reg_offset;
+	auprobe->ops = &push_xol_ops;
+	return 0;
+}
+
 /**
  * arch_uprobe_analyze_insn - instruction analysis including validity and fixups.
  * @mm: the probed address space.
@@ -771,6 +870,10 @@  int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	if (ret != -ENOSYS)
 		return ret;
 
+	ret = push_setup_xol_ops(auprobe, &insn);
+	if (ret != -ENOSYS)
+		return ret;
+
 	/*
 	 * Figure out which fixups default_post_xol_op() will need to perform,
 	 * and annotate defparam->fixups accordingly.