mbox series

[v9,0/5] Add the ability to do BPF directed error injection

Message ID 1513010210-30594-1-git-send-email-josef@toxicpanda.com
Headers show
Series Add the ability to do BPF directed error injection | expand

Message

Josef Bacik Dec. 11, 2017, 4:36 p.m. UTC
This is the same as v8, just rebased onto the bpf tree.

v8->v9:
- rebased onto the bpf tree.

v7->v8:
- removed the _ASM_KPROBE_ERROR_INJECT since it was not needed.

v6->v7:
- moved the opt-in macro to bpf.h out of kprobes.h.

v5->v6:
- add BPF_ALLOW_ERROR_INJECTION() tagging for functions that will support this
  feature.  This way only functions that opt-in will be allowed to be
  overridden.
- added a btrfs patch to allow error injection for open_ctree() so that the bpf
  sample actually works.

v4->v5:
- disallow kprobe_override programs from being put in the prog map array so we
  don't tail call into something we didn't check.  This allows us to make the
  normal path still fast without a bunch of percpu operations.

v3->v4:
- fix a build error found by kbuild test bot (I didn't wait long enough
  apparently.)
- Added a warning message as per Daniels suggestion.

v2->v3:
- added a ->kprobe_override flag to bpf_prog.
- added some sanity checks to disallow attaching bpf progs that have
  ->kprobe_override set that aren't for ftrace kprobes.
- added the trace_kprobe_ftrace helper to check if the trace_event_call is a
  ftrace kprobe.
- renamed bpf_kprobe_state to bpf_kprobe_override, fixed it so we only read this
  value in the kprobe path, and thus only write to it if we're overriding or
  clearing the override.

v1->v2:
- moved things around to make sure that bpf_override_return could really only be
  used for an ftrace kprobe.
- killed the special return values from trace_call_bpf.
- renamed pc_modified to bpf_kprobe_state so bpf_override_return could tell if
  it was being called from an ftrace kprobe context.
- reworked the logic in kprobe_perf_func to take advantage of bpf_kprobe_state.
- updated the test as per Alexei's review.

- Original message -

A lot of our error paths are not well tested because we have no good way of
injecting errors generically.  Some subystems (block, memory) have ways to
inject errors, but they are random so it's hard to get reproduceable results.

With BPF we can add determinism to our error injection.  We can use kprobes and
other things to verify we are injecting errors at the exact case we are trying
to test.  This patch gives us the tool to actual do the error injection part.
It is very simple, we just set the return value of the pt_regs we're given to
whatever we provide, and then override the PC with a dummy function that simply
returns.

Right now this only works on x86, but it would be simple enough to expand to
other architectures.  Thanks,

Josef

Comments

Alexei Starovoitov Dec. 12, 2017, 5:37 p.m. UTC | #1
On 12/11/17 8:36 AM, Josef Bacik wrote:
> This is the same as v8, just rebased onto the bpf tree.
>
> v8->v9:
> - rebased onto the bpf tree.
>
> v7->v8:
> - removed the _ASM_KPROBE_ERROR_INJECT since it was not needed.
>
> v6->v7:
> - moved the opt-in macro to bpf.h out of kprobes.h.
>
> v5->v6:
> - add BPF_ALLOW_ERROR_INJECTION() tagging for functions that will support this
>   feature.  This way only functions that opt-in will be allowed to be
>   overridden.
> - added a btrfs patch to allow error injection for open_ctree() so that the bpf
>   sample actually works.
>
> v4->v5:
> - disallow kprobe_override programs from being put in the prog map array so we
>   don't tail call into something we didn't check.  This allows us to make the
>   normal path still fast without a bunch of percpu operations.
>
> v3->v4:
> - fix a build error found by kbuild test bot (I didn't wait long enough
>   apparently.)
> - Added a warning message as per Daniels suggestion.
>
> v2->v3:
> - added a ->kprobe_override flag to bpf_prog.
> - added some sanity checks to disallow attaching bpf progs that have
>   ->kprobe_override set that aren't for ftrace kprobes.
> - added the trace_kprobe_ftrace helper to check if the trace_event_call is a
>   ftrace kprobe.
> - renamed bpf_kprobe_state to bpf_kprobe_override, fixed it so we only read this
>   value in the kprobe path, and thus only write to it if we're overriding or
>   clearing the override.
>
> v1->v2:
> - moved things around to make sure that bpf_override_return could really only be
>   used for an ftrace kprobe.
> - killed the special return values from trace_call_bpf.
> - renamed pc_modified to bpf_kprobe_state so bpf_override_return could tell if
>   it was being called from an ftrace kprobe context.
> - reworked the logic in kprobe_perf_func to take advantage of bpf_kprobe_state.
> - updated the test as per Alexei's review.
>
> - Original message -
>
> A lot of our error paths are not well tested because we have no good way of
> injecting errors generically.  Some subystems (block, memory) have ways to
> inject errors, but they are random so it's hard to get reproduceable results.
>
> With BPF we can add determinism to our error injection.  We can use kprobes and
> other things to verify we are injecting errors at the exact case we are trying
> to test.  This patch gives us the tool to actual do the error injection part.
> It is very simple, we just set the return value of the pt_regs we're given to
> whatever we provide, and then override the PC with a dummy function that simply
> returns.
>
> Right now this only works on x86, but it would be simple enough to expand to
> other architectures.  Thanks,

Applied, thanks Josef!

While applying in the patch "bpf: add a bpf_override_function helper"
I moved ifdef CONFIG_BPF_KPROBE_OVERRIDE few lines,
so when it's not set the program will fail at load time with error
"unknown func bpf_override_return#58"
instead of returning EINVAL at run-time.
That's more standard way of adding new helpers.

Thanks
Darrick Wong Dec. 12, 2017, 11:11 p.m. UTC | #2
On Mon, Dec 11, 2017 at 11:36:45AM -0500, Josef Bacik wrote:
> This is the same as v8, just rebased onto the bpf tree.
> 
> v8->v9:
> - rebased onto the bpf tree.
> 
> v7->v8:
> - removed the _ASM_KPROBE_ERROR_INJECT since it was not needed.
> 
> v6->v7:
> - moved the opt-in macro to bpf.h out of kprobes.h.
> 
> v5->v6:
> - add BPF_ALLOW_ERROR_INJECTION() tagging for functions that will support this
>   feature.  This way only functions that opt-in will be allowed to be
>   overridden.
> - added a btrfs patch to allow error injection for open_ctree() so that the bpf
>   sample actually works.
> 
> v4->v5:
> - disallow kprobe_override programs from being put in the prog map array so we
>   don't tail call into something we didn't check.  This allows us to make the
>   normal path still fast without a bunch of percpu operations.
> 
> v3->v4:
> - fix a build error found by kbuild test bot (I didn't wait long enough
>   apparently.)
> - Added a warning message as per Daniels suggestion.
> 
> v2->v3:
> - added a ->kprobe_override flag to bpf_prog.
> - added some sanity checks to disallow attaching bpf progs that have
>   ->kprobe_override set that aren't for ftrace kprobes.
> - added the trace_kprobe_ftrace helper to check if the trace_event_call is a
>   ftrace kprobe.
> - renamed bpf_kprobe_state to bpf_kprobe_override, fixed it so we only read this
>   value in the kprobe path, and thus only write to it if we're overriding or
>   clearing the override.
> 
> v1->v2:
> - moved things around to make sure that bpf_override_return could really only be
>   used for an ftrace kprobe.
> - killed the special return values from trace_call_bpf.
> - renamed pc_modified to bpf_kprobe_state so bpf_override_return could tell if
>   it was being called from an ftrace kprobe context.
> - reworked the logic in kprobe_perf_func to take advantage of bpf_kprobe_state.
> - updated the test as per Alexei's review.
> 
> - Original message -
> 
> A lot of our error paths are not well tested because we have no good way of
> injecting errors generically.  Some subystems (block, memory) have ways to
> inject errors, but they are random so it's hard to get reproduceable results.
> 
> With BPF we can add determinism to our error injection.  We can use kprobes and
> other things to verify we are injecting errors at the exact case we are trying
> to test.  This patch gives us the tool to actual do the error injection part.
> It is very simple, we just set the return value of the pt_regs we're given to
> whatever we provide, and then override the PC with a dummy function that simply
> returns.

Heh, this looks cool.  I decided to try it to see what happens, and saw
a bunch of dmesg pasted in below.  Is that supposed to happen?  Or am I
the only fs developer still running with lockdep enabled? :)

It looks like bpf_override_return has some sort of side effect such that
we get the splat, since commenting it out makes the symptom go away.

<shrug>

--D

[ 1847.769183] BTRFS error (device (null)): open_ctree failed
[ 1847.770130] BUG: sleeping function called from invalid context at /storage/home/djwong/cdev/work/linux-xfs/kernel/locking/rwsem.c:69
[ 1847.771976] in_atomic(): 1, irqs_disabled(): 0, pid: 1524, name: mount
[ 1847.773016] 1 lock held by mount/1524:
[ 1847.773530]  #0:  (&type->s_umount_key#34/1){+.+.}, at: [<00000000653a9bb4>] sget_userns+0x302/0x4f0
[ 1847.774731] Preemption disabled at:
[ 1847.774735] [<          (null)>]           (null)
[ 1847.777009] CPU: 2 PID: 1524 Comm: mount Tainted: G        W        4.15.0-rc3-xfsx #3
[ 1847.778800] Call Trace:
[ 1847.779047]  dump_stack+0x7c/0xbe
[ 1847.779361]  ___might_sleep+0x1f7/0x260
[ 1847.779720]  down_write+0x29/0xb0
[ 1847.780046]  unregister_shrinker+0x15/0x70
[ 1847.780427]  deactivate_locked_super+0x2e/0x60
[ 1847.780935]  btrfs_mount+0xbb6/0x1000 [btrfs]
[ 1847.781353]  ? __lockdep_init_map+0x5c/0x1d0
[ 1847.781750]  ? mount_fs+0xf/0x80
[ 1847.782065]  ? alloc_vfsmnt+0x1a1/0x230
[ 1847.782429]  mount_fs+0xf/0x80
[ 1847.782733]  vfs_kern_mount+0x62/0x160
[ 1847.783128]  btrfs_mount+0x3d3/0x1000 [btrfs]
[ 1847.783493]  ? __lockdep_init_map+0x5c/0x1d0
[ 1847.783849]  ? __lockdep_init_map+0x5c/0x1d0
[ 1847.784207]  ? mount_fs+0xf/0x80
[ 1847.784502]  mount_fs+0xf/0x80
[ 1847.784835]  vfs_kern_mount+0x62/0x160
[ 1847.785235]  do_mount+0x1b1/0xd50
[ 1847.785594]  ? _copy_from_user+0x5b/0x90
[ 1847.786028]  ? memdup_user+0x4b/0x70
[ 1847.786501]  SyS_mount+0x85/0xd0
[ 1847.786835]  entry_SYSCALL_64_fastpath+0x1f/0x96
[ 1847.787311] RIP: 0033:0x7f6ebecc1b5a
[ 1847.787691] RSP: 002b:00007ffc7bd1c958 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
[ 1847.788383] RAX: ffffffffffffffda RBX: 00007f6ebefba63a RCX: 00007f6ebecc1b5a
[ 1847.789106] RDX: 0000000000bfd010 RSI: 0000000000bfa230 RDI: 0000000000bfa210
[ 1847.789807] RBP: 0000000000bfa0f0 R08: 0000000000000000 R09: 0000000000000014
[ 1847.790511] R10: 00000000c0ed0000 R11: 0000000000000202 R12: 00007f6ebf1ca83c
[ 1847.791211] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[ 1847.792029] BUG: scheduling while atomic: mount/1524/0x00000002
[ 1847.792680] 1 lock held by mount/1524:
[ 1847.793087]  #0:  (rcu_preempt_state.exp_mutex){+.+.}, at: [<00000000a6c536a9>] _synchronize_rcu_expedited+0x1ce/0x400
[ 1847.794129] Modules linked in: xfs libcrc32c btrfs xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress zlib_deflate raid6_pq dax_pmem device_dax nd_pmem sch_fq_codel af_packet [last unloaded: xfs]
[ 1847.795949] Preemption disabled at:
[ 1847.795951] [<          (null)>]           (null)
[ 1847.796844] CPU: 2 PID: 1524 Comm: mount Tainted: G        W        4.15.0-rc3-xfsx #3
[ 1847.797621] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1djwong0 04/01/2014
[ 1847.798510] Call Trace:
[ 1847.798786]  dump_stack+0x7c/0xbe
[ 1847.799134]  __schedule_bug+0x88/0xe0
[ 1847.799517]  __schedule+0x78c/0xb20
[ 1847.799890]  ? trace_hardirqs_on_caller+0x119/0x180
[ 1847.800391]  schedule+0x40/0x90
[ 1847.800729]  _synchronize_rcu_expedited+0x36b/0x400
[ 1847.801218]  ? rcu_preempt_qs+0xa0/0xa0
[ 1847.801616]  ? remove_wait_queue+0x60/0x60
[ 1847.802040]  ? rcu_preempt_qs+0xa0/0xa0
[ 1847.802433]  ? rcu_exp_wait_wake+0x630/0x630
[ 1847.802872]  ? __lock_acquire+0xfb9/0x1120
[ 1847.803302]  ? __lock_acquire+0x534/0x1120
[ 1847.803725]  ? bdi_unregister+0x57/0x1a0
[ 1847.804135]  bdi_unregister+0x5c/0x1a0
[ 1847.804519]  bdi_put+0xcb/0xe0
[ 1847.804746]  generic_shutdown_super+0xe2/0x110
[ 1847.805066]  kill_anon_super+0xe/0x20
[ 1847.805344]  btrfs_kill_super+0x12/0xa0 [btrfs]
[ 1847.805664]  deactivate_locked_super+0x34/0x60
[ 1847.806111]  btrfs_mount+0xbb6/0x1000 [btrfs]
[ 1847.806476]  ? __lockdep_init_map+0x5c/0x1d0
[ 1847.806824]  ? mount_fs+0xf/0x80
[ 1847.807104]  ? alloc_vfsmnt+0x1a1/0x230
[ 1847.807416]  mount_fs+0xf/0x80
[ 1847.807712]  vfs_kern_mount+0x62/0x160
[ 1847.808112]  btrfs_mount+0x3d3/0x1000 [btrfs]
[ 1847.808565]  ? __lockdep_init_map+0x5c/0x1d0
[ 1847.809005]  ? __lockdep_init_map+0x5c/0x1d0
[ 1847.809425]  ? mount_fs+0xf/0x80
[ 1847.809731]  mount_fs+0xf/0x80
[ 1847.810070]  vfs_kern_mount+0x62/0x160
[ 1847.810469]  do_mount+0x1b1/0xd50
[ 1847.810821]  ? _copy_from_user+0x5b/0x90
[ 1847.811237]  ? memdup_user+0x4b/0x70
[ 1847.811622]  SyS_mount+0x85/0xd0
[ 1847.811996]  entry_SYSCALL_64_fastpath+0x1f/0x96
[ 1847.812465] RIP: 0033:0x7f6ebecc1b5a
[ 1847.812840] RSP: 002b:00007ffc7bd1c958 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
[ 1847.813615] RAX: ffffffffffffffda RBX: 00007f6ebefba63a RCX: 00007f6ebecc1b5a
[ 1847.814302] RDX: 0000000000bfd010 RSI: 0000000000bfa230 RDI: 0000000000bfa210
[ 1847.814770] RBP: 0000000000bfa0f0 R08: 0000000000000000 R09: 0000000000000014
[ 1847.815246] R10: 00000000c0ed0000 R11: 0000000000000202 R12: 00007f6ebf1ca83c
[ 1847.815720] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001

--D

> 
> Right now this only works on x86, but it would be simple enough to expand to
> other architectures.  Thanks,
> 
> Josef
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josef Bacik Dec. 13, 2017, 6:03 p.m. UTC | #3
On Tue, Dec 12, 2017 at 03:11:50PM -0800, Darrick J. Wong wrote:
> On Mon, Dec 11, 2017 at 11:36:45AM -0500, Josef Bacik wrote:
> > This is the same as v8, just rebased onto the bpf tree.
> > 
> > v8->v9:
> > - rebased onto the bpf tree.
> > 
> > v7->v8:
> > - removed the _ASM_KPROBE_ERROR_INJECT since it was not needed.
> > 
> > v6->v7:
> > - moved the opt-in macro to bpf.h out of kprobes.h.
> > 
> > v5->v6:
> > - add BPF_ALLOW_ERROR_INJECTION() tagging for functions that will support this
> >   feature.  This way only functions that opt-in will be allowed to be
> >   overridden.
> > - added a btrfs patch to allow error injection for open_ctree() so that the bpf
> >   sample actually works.
> > 
> > v4->v5:
> > - disallow kprobe_override programs from being put in the prog map array so we
> >   don't tail call into something we didn't check.  This allows us to make the
> >   normal path still fast without a bunch of percpu operations.
> > 
> > v3->v4:
> > - fix a build error found by kbuild test bot (I didn't wait long enough
> >   apparently.)
> > - Added a warning message as per Daniels suggestion.
> > 
> > v2->v3:
> > - added a ->kprobe_override flag to bpf_prog.
> > - added some sanity checks to disallow attaching bpf progs that have
> >   ->kprobe_override set that aren't for ftrace kprobes.
> > - added the trace_kprobe_ftrace helper to check if the trace_event_call is a
> >   ftrace kprobe.
> > - renamed bpf_kprobe_state to bpf_kprobe_override, fixed it so we only read this
> >   value in the kprobe path, and thus only write to it if we're overriding or
> >   clearing the override.
> > 
> > v1->v2:
> > - moved things around to make sure that bpf_override_return could really only be
> >   used for an ftrace kprobe.
> > - killed the special return values from trace_call_bpf.
> > - renamed pc_modified to bpf_kprobe_state so bpf_override_return could tell if
> >   it was being called from an ftrace kprobe context.
> > - reworked the logic in kprobe_perf_func to take advantage of bpf_kprobe_state.
> > - updated the test as per Alexei's review.
> > 
> > - Original message -
> > 
> > A lot of our error paths are not well tested because we have no good way of
> > injecting errors generically.  Some subystems (block, memory) have ways to
> > inject errors, but they are random so it's hard to get reproduceable results.
> > 
> > With BPF we can add determinism to our error injection.  We can use kprobes and
> > other things to verify we are injecting errors at the exact case we are trying
> > to test.  This patch gives us the tool to actual do the error injection part.
> > It is very simple, we just set the return value of the pt_regs we're given to
> > whatever we provide, and then override the PC with a dummy function that simply
> > returns.
> 
> Heh, this looks cool.  I decided to try it to see what happens, and saw
> a bunch of dmesg pasted in below.  Is that supposed to happen?  Or am I
> the only fs developer still running with lockdep enabled? :)
> 
> It looks like bpf_override_return has some sort of side effect such that
> we get the splat, since commenting it out makes the symptom go away.
> 
> <shrug>
> 
> --D
> 
> [ 1847.769183] BTRFS error (device (null)): open_ctree failed
> [ 1847.770130] BUG: sleeping function called from invalid context at /storage/home/djwong/cdev/work/linux-xfs/kernel/locking/rwsem.c:69
> [ 1847.771976] in_atomic(): 1, irqs_disabled(): 0, pid: 1524, name: mount
> [ 1847.773016] 1 lock held by mount/1524:
> [ 1847.773530]  #0:  (&type->s_umount_key#34/1){+.+.}, at: [<00000000653a9bb4>] sget_userns+0x302/0x4f0
> [ 1847.774731] Preemption disabled at:
> [ 1847.774735] [<          (null)>]           (null)
> [ 1847.777009] CPU: 2 PID: 1524 Comm: mount Tainted: G        W        4.15.0-rc3-xfsx #3
> [ 1847.778800] Call Trace:
> [ 1847.779047]  dump_stack+0x7c/0xbe
> [ 1847.779361]  ___might_sleep+0x1f7/0x260
> [ 1847.779720]  down_write+0x29/0xb0
> [ 1847.780046]  unregister_shrinker+0x15/0x70
> [ 1847.780427]  deactivate_locked_super+0x2e/0x60
> [ 1847.780935]  btrfs_mount+0xbb6/0x1000 [btrfs]
> [ 1847.781353]  ? __lockdep_init_map+0x5c/0x1d0
> [ 1847.781750]  ? mount_fs+0xf/0x80
> [ 1847.782065]  ? alloc_vfsmnt+0x1a1/0x230
> [ 1847.782429]  mount_fs+0xf/0x80
> [ 1847.782733]  vfs_kern_mount+0x62/0x160
> [ 1847.783128]  btrfs_mount+0x3d3/0x1000 [btrfs]
> [ 1847.783493]  ? __lockdep_init_map+0x5c/0x1d0
> [ 1847.783849]  ? __lockdep_init_map+0x5c/0x1d0
> [ 1847.784207]  ? mount_fs+0xf/0x80
> [ 1847.784502]  mount_fs+0xf/0x80
> [ 1847.784835]  vfs_kern_mount+0x62/0x160
> [ 1847.785235]  do_mount+0x1b1/0xd50
> [ 1847.785594]  ? _copy_from_user+0x5b/0x90
> [ 1847.786028]  ? memdup_user+0x4b/0x70
> [ 1847.786501]  SyS_mount+0x85/0xd0
> [ 1847.786835]  entry_SYSCALL_64_fastpath+0x1f/0x96
> [ 1847.787311] RIP: 0033:0x7f6ebecc1b5a
> [ 1847.787691] RSP: 002b:00007ffc7bd1c958 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
> [ 1847.788383] RAX: ffffffffffffffda RBX: 00007f6ebefba63a RCX: 00007f6ebecc1b5a
> [ 1847.789106] RDX: 0000000000bfd010 RSI: 0000000000bfa230 RDI: 0000000000bfa210
> [ 1847.789807] RBP: 0000000000bfa0f0 R08: 0000000000000000 R09: 0000000000000014
> [ 1847.790511] R10: 00000000c0ed0000 R11: 0000000000000202 R12: 00007f6ebf1ca83c
> [ 1847.791211] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
> [ 1847.792029] BUG: scheduling while atomic: mount/1524/0x00000002
> [ 1847.792680] 1 lock held by mount/1524:
> [ 1847.793087]  #0:  (rcu_preempt_state.exp_mutex){+.+.}, at: [<00000000a6c536a9>] _synchronize_rcu_expedited+0x1ce/0x400
> [ 1847.794129] Modules linked in: xfs libcrc32c btrfs xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress zlib_deflate raid6_pq dax_pmem device_dax nd_pmem sch_fq_codel af_packet [last unloaded: xfs]
> [ 1847.795949] Preemption disabled at:
> [ 1847.795951] [<          (null)>]           (null)
> [ 1847.796844] CPU: 2 PID: 1524 Comm: mount Tainted: G        W        4.15.0-rc3-xfsx #3
> [ 1847.797621] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1djwong0 04/01/2014
> [ 1847.798510] Call Trace:
> [ 1847.798786]  dump_stack+0x7c/0xbe
> [ 1847.799134]  __schedule_bug+0x88/0xe0
> [ 1847.799517]  __schedule+0x78c/0xb20
> [ 1847.799890]  ? trace_hardirqs_on_caller+0x119/0x180
> [ 1847.800391]  schedule+0x40/0x90
> [ 1847.800729]  _synchronize_rcu_expedited+0x36b/0x400
> [ 1847.801218]  ? rcu_preempt_qs+0xa0/0xa0
> [ 1847.801616]  ? remove_wait_queue+0x60/0x60
> [ 1847.802040]  ? rcu_preempt_qs+0xa0/0xa0
> [ 1847.802433]  ? rcu_exp_wait_wake+0x630/0x630
> [ 1847.802872]  ? __lock_acquire+0xfb9/0x1120
> [ 1847.803302]  ? __lock_acquire+0x534/0x1120
> [ 1847.803725]  ? bdi_unregister+0x57/0x1a0
> [ 1847.804135]  bdi_unregister+0x5c/0x1a0
> [ 1847.804519]  bdi_put+0xcb/0xe0
> [ 1847.804746]  generic_shutdown_super+0xe2/0x110
> [ 1847.805066]  kill_anon_super+0xe/0x20
> [ 1847.805344]  btrfs_kill_super+0x12/0xa0 [btrfs]
> [ 1847.805664]  deactivate_locked_super+0x34/0x60
> [ 1847.806111]  btrfs_mount+0xbb6/0x1000 [btrfs]
> [ 1847.806476]  ? __lockdep_init_map+0x5c/0x1d0
> [ 1847.806824]  ? mount_fs+0xf/0x80
> [ 1847.807104]  ? alloc_vfsmnt+0x1a1/0x230
> [ 1847.807416]  mount_fs+0xf/0x80
> [ 1847.807712]  vfs_kern_mount+0x62/0x160
> [ 1847.808112]  btrfs_mount+0x3d3/0x1000 [btrfs]
> [ 1847.808565]  ? __lockdep_init_map+0x5c/0x1d0
> [ 1847.809005]  ? __lockdep_init_map+0x5c/0x1d0
> [ 1847.809425]  ? mount_fs+0xf/0x80
> [ 1847.809731]  mount_fs+0xf/0x80
> [ 1847.810070]  vfs_kern_mount+0x62/0x160
> [ 1847.810469]  do_mount+0x1b1/0xd50
> [ 1847.810821]  ? _copy_from_user+0x5b/0x90
> [ 1847.811237]  ? memdup_user+0x4b/0x70
> [ 1847.811622]  SyS_mount+0x85/0xd0
> [ 1847.811996]  entry_SYSCALL_64_fastpath+0x1f/0x96
> [ 1847.812465] RIP: 0033:0x7f6ebecc1b5a
> [ 1847.812840] RSP: 002b:00007ffc7bd1c958 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
> [ 1847.813615] RAX: ffffffffffffffda RBX: 00007f6ebefba63a RCX: 00007f6ebecc1b5a
> [ 1847.814302] RDX: 0000000000bfd010 RSI: 0000000000bfa230 RDI: 0000000000bfa210
> [ 1847.814770] RBP: 0000000000bfa0f0 R08: 0000000000000000 R09: 0000000000000014
> [ 1847.815246] R10: 00000000c0ed0000 R11: 0000000000000202 R12: 00007f6ebf1ca83c
> [ 1847.815720] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
i> 

Looks like this is new, Masami this is happening because of your change here

5bb4fc2d8641 ("kprobes/x86: Disable preemption in ftrace-based jprobes")

which makes it not do the preempt_enable() if the handler returns 1.  Why is
that?  Should I be doing preempt_enable_no_resched() from the handler before
returning 1?  Or is this just an oversight on your part?  Thanks,

Josef
Darrick Wong Dec. 13, 2017, 6:07 p.m. UTC | #4
On Wed, Dec 13, 2017 at 01:03:57PM -0500, Josef Bacik wrote:
> On Tue, Dec 12, 2017 at 03:11:50PM -0800, Darrick J. Wong wrote:
> > On Mon, Dec 11, 2017 at 11:36:45AM -0500, Josef Bacik wrote:
> > > This is the same as v8, just rebased onto the bpf tree.
> > > 
> > > v8->v9:
> > > - rebased onto the bpf tree.
> > > 
> > > v7->v8:
> > > - removed the _ASM_KPROBE_ERROR_INJECT since it was not needed.
> > > 
> > > v6->v7:
> > > - moved the opt-in macro to bpf.h out of kprobes.h.
> > > 
> > > v5->v6:
> > > - add BPF_ALLOW_ERROR_INJECTION() tagging for functions that will support this
> > >   feature.  This way only functions that opt-in will be allowed to be
> > >   overridden.
> > > - added a btrfs patch to allow error injection for open_ctree() so that the bpf
> > >   sample actually works.
> > > 
> > > v4->v5:
> > > - disallow kprobe_override programs from being put in the prog map array so we
> > >   don't tail call into something we didn't check.  This allows us to make the
> > >   normal path still fast without a bunch of percpu operations.
> > > 
> > > v3->v4:
> > > - fix a build error found by kbuild test bot (I didn't wait long enough
> > >   apparently.)
> > > - Added a warning message as per Daniels suggestion.
> > > 
> > > v2->v3:
> > > - added a ->kprobe_override flag to bpf_prog.
> > > - added some sanity checks to disallow attaching bpf progs that have
> > >   ->kprobe_override set that aren't for ftrace kprobes.
> > > - added the trace_kprobe_ftrace helper to check if the trace_event_call is a
> > >   ftrace kprobe.
> > > - renamed bpf_kprobe_state to bpf_kprobe_override, fixed it so we only read this
> > >   value in the kprobe path, and thus only write to it if we're overriding or
> > >   clearing the override.
> > > 
> > > v1->v2:
> > > - moved things around to make sure that bpf_override_return could really only be
> > >   used for an ftrace kprobe.
> > > - killed the special return values from trace_call_bpf.
> > > - renamed pc_modified to bpf_kprobe_state so bpf_override_return could tell if
> > >   it was being called from an ftrace kprobe context.
> > > - reworked the logic in kprobe_perf_func to take advantage of bpf_kprobe_state.
> > > - updated the test as per Alexei's review.
> > > 
> > > - Original message -
> > > 
> > > A lot of our error paths are not well tested because we have no good way of
> > > injecting errors generically.  Some subystems (block, memory) have ways to
> > > inject errors, but they are random so it's hard to get reproduceable results.
> > > 
> > > With BPF we can add determinism to our error injection.  We can use kprobes and
> > > other things to verify we are injecting errors at the exact case we are trying
> > > to test.  This patch gives us the tool to actual do the error injection part.
> > > It is very simple, we just set the return value of the pt_regs we're given to
> > > whatever we provide, and then override the PC with a dummy function that simply
> > > returns.
> > 
> > Heh, this looks cool.  I decided to try it to see what happens, and saw
> > a bunch of dmesg pasted in below.  Is that supposed to happen?  Or am I
> > the only fs developer still running with lockdep enabled? :)
> > 
> > It looks like bpf_override_return has some sort of side effect such that
> > we get the splat, since commenting it out makes the symptom go away.
> > 
> > <shrug>
> > 
> > --D
> > 
> > [ 1847.769183] BTRFS error (device (null)): open_ctree failed
> > [ 1847.770130] BUG: sleeping function called from invalid context at /storage/home/djwong/cdev/work/linux-xfs/kernel/locking/rwsem.c:69
> > [ 1847.771976] in_atomic(): 1, irqs_disabled(): 0, pid: 1524, name: mount
> > [ 1847.773016] 1 lock held by mount/1524:
> > [ 1847.773530]  #0:  (&type->s_umount_key#34/1){+.+.}, at: [<00000000653a9bb4>] sget_userns+0x302/0x4f0
> > [ 1847.774731] Preemption disabled at:
> > [ 1847.774735] [<          (null)>]           (null)
> > [ 1847.777009] CPU: 2 PID: 1524 Comm: mount Tainted: G        W        4.15.0-rc3-xfsx #3
> > [ 1847.778800] Call Trace:
> > [ 1847.779047]  dump_stack+0x7c/0xbe
> > [ 1847.779361]  ___might_sleep+0x1f7/0x260
> > [ 1847.779720]  down_write+0x29/0xb0
> > [ 1847.780046]  unregister_shrinker+0x15/0x70
> > [ 1847.780427]  deactivate_locked_super+0x2e/0x60
> > [ 1847.780935]  btrfs_mount+0xbb6/0x1000 [btrfs]
> > [ 1847.781353]  ? __lockdep_init_map+0x5c/0x1d0
> > [ 1847.781750]  ? mount_fs+0xf/0x80
> > [ 1847.782065]  ? alloc_vfsmnt+0x1a1/0x230
> > [ 1847.782429]  mount_fs+0xf/0x80
> > [ 1847.782733]  vfs_kern_mount+0x62/0x160
> > [ 1847.783128]  btrfs_mount+0x3d3/0x1000 [btrfs]
> > [ 1847.783493]  ? __lockdep_init_map+0x5c/0x1d0
> > [ 1847.783849]  ? __lockdep_init_map+0x5c/0x1d0
> > [ 1847.784207]  ? mount_fs+0xf/0x80
> > [ 1847.784502]  mount_fs+0xf/0x80
> > [ 1847.784835]  vfs_kern_mount+0x62/0x160
> > [ 1847.785235]  do_mount+0x1b1/0xd50
> > [ 1847.785594]  ? _copy_from_user+0x5b/0x90
> > [ 1847.786028]  ? memdup_user+0x4b/0x70
> > [ 1847.786501]  SyS_mount+0x85/0xd0
> > [ 1847.786835]  entry_SYSCALL_64_fastpath+0x1f/0x96
> > [ 1847.787311] RIP: 0033:0x7f6ebecc1b5a
> > [ 1847.787691] RSP: 002b:00007ffc7bd1c958 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
> > [ 1847.788383] RAX: ffffffffffffffda RBX: 00007f6ebefba63a RCX: 00007f6ebecc1b5a
> > [ 1847.789106] RDX: 0000000000bfd010 RSI: 0000000000bfa230 RDI: 0000000000bfa210
> > [ 1847.789807] RBP: 0000000000bfa0f0 R08: 0000000000000000 R09: 0000000000000014
> > [ 1847.790511] R10: 00000000c0ed0000 R11: 0000000000000202 R12: 00007f6ebf1ca83c
> > [ 1847.791211] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
> > [ 1847.792029] BUG: scheduling while atomic: mount/1524/0x00000002
> > [ 1847.792680] 1 lock held by mount/1524:
> > [ 1847.793087]  #0:  (rcu_preempt_state.exp_mutex){+.+.}, at: [<00000000a6c536a9>] _synchronize_rcu_expedited+0x1ce/0x400
> > [ 1847.794129] Modules linked in: xfs libcrc32c btrfs xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress zlib_deflate raid6_pq dax_pmem device_dax nd_pmem sch_fq_codel af_packet [last unloaded: xfs]
> > [ 1847.795949] Preemption disabled at:
> > [ 1847.795951] [<          (null)>]           (null)
> > [ 1847.796844] CPU: 2 PID: 1524 Comm: mount Tainted: G        W        4.15.0-rc3-xfsx #3
> > [ 1847.797621] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1djwong0 04/01/2014
> > [ 1847.798510] Call Trace:
> > [ 1847.798786]  dump_stack+0x7c/0xbe
> > [ 1847.799134]  __schedule_bug+0x88/0xe0
> > [ 1847.799517]  __schedule+0x78c/0xb20
> > [ 1847.799890]  ? trace_hardirqs_on_caller+0x119/0x180
> > [ 1847.800391]  schedule+0x40/0x90
> > [ 1847.800729]  _synchronize_rcu_expedited+0x36b/0x400
> > [ 1847.801218]  ? rcu_preempt_qs+0xa0/0xa0
> > [ 1847.801616]  ? remove_wait_queue+0x60/0x60
> > [ 1847.802040]  ? rcu_preempt_qs+0xa0/0xa0
> > [ 1847.802433]  ? rcu_exp_wait_wake+0x630/0x630
> > [ 1847.802872]  ? __lock_acquire+0xfb9/0x1120
> > [ 1847.803302]  ? __lock_acquire+0x534/0x1120
> > [ 1847.803725]  ? bdi_unregister+0x57/0x1a0
> > [ 1847.804135]  bdi_unregister+0x5c/0x1a0
> > [ 1847.804519]  bdi_put+0xcb/0xe0
> > [ 1847.804746]  generic_shutdown_super+0xe2/0x110
> > [ 1847.805066]  kill_anon_super+0xe/0x20
> > [ 1847.805344]  btrfs_kill_super+0x12/0xa0 [btrfs]
> > [ 1847.805664]  deactivate_locked_super+0x34/0x60
> > [ 1847.806111]  btrfs_mount+0xbb6/0x1000 [btrfs]
> > [ 1847.806476]  ? __lockdep_init_map+0x5c/0x1d0
> > [ 1847.806824]  ? mount_fs+0xf/0x80
> > [ 1847.807104]  ? alloc_vfsmnt+0x1a1/0x230
> > [ 1847.807416]  mount_fs+0xf/0x80
> > [ 1847.807712]  vfs_kern_mount+0x62/0x160
> > [ 1847.808112]  btrfs_mount+0x3d3/0x1000 [btrfs]
> > [ 1847.808565]  ? __lockdep_init_map+0x5c/0x1d0
> > [ 1847.809005]  ? __lockdep_init_map+0x5c/0x1d0
> > [ 1847.809425]  ? mount_fs+0xf/0x80
> > [ 1847.809731]  mount_fs+0xf/0x80
> > [ 1847.810070]  vfs_kern_mount+0x62/0x160
> > [ 1847.810469]  do_mount+0x1b1/0xd50
> > [ 1847.810821]  ? _copy_from_user+0x5b/0x90
> > [ 1847.811237]  ? memdup_user+0x4b/0x70
> > [ 1847.811622]  SyS_mount+0x85/0xd0
> > [ 1847.811996]  entry_SYSCALL_64_fastpath+0x1f/0x96
> > [ 1847.812465] RIP: 0033:0x7f6ebecc1b5a
> > [ 1847.812840] RSP: 002b:00007ffc7bd1c958 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
> > [ 1847.813615] RAX: ffffffffffffffda RBX: 00007f6ebefba63a RCX: 00007f6ebecc1b5a
> > [ 1847.814302] RDX: 0000000000bfd010 RSI: 0000000000bfa230 RDI: 0000000000bfa210
> > [ 1847.814770] RBP: 0000000000bfa0f0 R08: 0000000000000000 R09: 0000000000000014
> > [ 1847.815246] R10: 00000000c0ed0000 R11: 0000000000000202 R12: 00007f6ebf1ca83c
> > [ 1847.815720] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
> i> 
> 
> Looks like this is new, Masami this is happening because of your change here
> 
> 5bb4fc2d8641 ("kprobes/x86: Disable preemption in ftrace-based jprobes")
> 
> which makes it not do the preempt_enable() if the handler returns 1.  Why is
> that?  Should I be doing preempt_enable_no_resched() from the handler before
> returning 1?  Or is this just an oversight on your part?  Thanks,

FWIW I shut up the preemption imbalance warnings with the attached
coarse bandaid.  No idea if that's the correct fix...

--D

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 5db8498..fd948e3 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1215,8 +1215,10 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
 		if (__this_cpu_read(bpf_kprobe_override)) {
 			__this_cpu_write(bpf_kprobe_override, 0);
 			reset_current_kprobe();
+			preempt_enable();
 			return 1;
 		}
+		preempt_enable();
 		if (!ret)
 			return 0;
 	}
> 
> Josef
Josef Bacik Dec. 13, 2017, 6:57 p.m. UTC | #5
On Wed, Dec 13, 2017 at 10:07:32AM -0800, Darrick J. Wong wrote:
> On Wed, Dec 13, 2017 at 01:03:57PM -0500, Josef Bacik wrote:
> > On Tue, Dec 12, 2017 at 03:11:50PM -0800, Darrick J. Wong wrote:
> > > On Mon, Dec 11, 2017 at 11:36:45AM -0500, Josef Bacik wrote:
> > > > This is the same as v8, just rebased onto the bpf tree.
> > > > 
> > > > v8->v9:
> > > > - rebased onto the bpf tree.
> > > > 
> > > > v7->v8:
> > > > - removed the _ASM_KPROBE_ERROR_INJECT since it was not needed.
> > > > 
> > > > v6->v7:
> > > > - moved the opt-in macro to bpf.h out of kprobes.h.
> > > > 
> > > > v5->v6:
> > > > - add BPF_ALLOW_ERROR_INJECTION() tagging for functions that will support this
> > > >   feature.  This way only functions that opt-in will be allowed to be
> > > >   overridden.
> > > > - added a btrfs patch to allow error injection for open_ctree() so that the bpf
> > > >   sample actually works.
> > > > 
> > > > v4->v5:
> > > > - disallow kprobe_override programs from being put in the prog map array so we
> > > >   don't tail call into something we didn't check.  This allows us to make the
> > > >   normal path still fast without a bunch of percpu operations.
> > > > 
> > > > v3->v4:
> > > > - fix a build error found by kbuild test bot (I didn't wait long enough
> > > >   apparently.)
> > > > - Added a warning message as per Daniels suggestion.
> > > > 
> > > > v2->v3:
> > > > - added a ->kprobe_override flag to bpf_prog.
> > > > - added some sanity checks to disallow attaching bpf progs that have
> > > >   ->kprobe_override set that aren't for ftrace kprobes.
> > > > - added the trace_kprobe_ftrace helper to check if the trace_event_call is a
> > > >   ftrace kprobe.
> > > > - renamed bpf_kprobe_state to bpf_kprobe_override, fixed it so we only read this
> > > >   value in the kprobe path, and thus only write to it if we're overriding or
> > > >   clearing the override.
> > > > 
> > > > v1->v2:
> > > > - moved things around to make sure that bpf_override_return could really only be
> > > >   used for an ftrace kprobe.
> > > > - killed the special return values from trace_call_bpf.
> > > > - renamed pc_modified to bpf_kprobe_state so bpf_override_return could tell if
> > > >   it was being called from an ftrace kprobe context.
> > > > - reworked the logic in kprobe_perf_func to take advantage of bpf_kprobe_state.
> > > > - updated the test as per Alexei's review.
> > > > 
> > > > - Original message -
> > > > 
> > > > A lot of our error paths are not well tested because we have no good way of
> > > > injecting errors generically.  Some subystems (block, memory) have ways to
> > > > inject errors, but they are random so it's hard to get reproduceable results.
> > > > 
> > > > With BPF we can add determinism to our error injection.  We can use kprobes and
> > > > other things to verify we are injecting errors at the exact case we are trying
> > > > to test.  This patch gives us the tool to actual do the error injection part.
> > > > It is very simple, we just set the return value of the pt_regs we're given to
> > > > whatever we provide, and then override the PC with a dummy function that simply
> > > > returns.
> > > 
> > > Heh, this looks cool.  I decided to try it to see what happens, and saw
> > > a bunch of dmesg pasted in below.  Is that supposed to happen?  Or am I
> > > the only fs developer still running with lockdep enabled? :)
> > > 
> > > It looks like bpf_override_return has some sort of side effect such that
> > > we get the splat, since commenting it out makes the symptom go away.
> > > 
> > > <shrug>
> > > 
> > > --D
> > > 
> > > [ 1847.769183] BTRFS error (device (null)): open_ctree failed
> > > [ 1847.770130] BUG: sleeping function called from invalid context at /storage/home/djwong/cdev/work/linux-xfs/kernel/locking/rwsem.c:69
> > > [ 1847.771976] in_atomic(): 1, irqs_disabled(): 0, pid: 1524, name: mount
> > > [ 1847.773016] 1 lock held by mount/1524:
> > > [ 1847.773530]  #0:  (&type->s_umount_key#34/1){+.+.}, at: [<00000000653a9bb4>] sget_userns+0x302/0x4f0
> > > [ 1847.774731] Preemption disabled at:
> > > [ 1847.774735] [<          (null)>]           (null)
> > > [ 1847.777009] CPU: 2 PID: 1524 Comm: mount Tainted: G        W        4.15.0-rc3-xfsx #3
> > > [ 1847.778800] Call Trace:
> > > [ 1847.779047]  dump_stack+0x7c/0xbe
> > > [ 1847.779361]  ___might_sleep+0x1f7/0x260
> > > [ 1847.779720]  down_write+0x29/0xb0
> > > [ 1847.780046]  unregister_shrinker+0x15/0x70
> > > [ 1847.780427]  deactivate_locked_super+0x2e/0x60
> > > [ 1847.780935]  btrfs_mount+0xbb6/0x1000 [btrfs]
> > > [ 1847.781353]  ? __lockdep_init_map+0x5c/0x1d0
> > > [ 1847.781750]  ? mount_fs+0xf/0x80
> > > [ 1847.782065]  ? alloc_vfsmnt+0x1a1/0x230
> > > [ 1847.782429]  mount_fs+0xf/0x80
> > > [ 1847.782733]  vfs_kern_mount+0x62/0x160
> > > [ 1847.783128]  btrfs_mount+0x3d3/0x1000 [btrfs]
> > > [ 1847.783493]  ? __lockdep_init_map+0x5c/0x1d0
> > > [ 1847.783849]  ? __lockdep_init_map+0x5c/0x1d0
> > > [ 1847.784207]  ? mount_fs+0xf/0x80
> > > [ 1847.784502]  mount_fs+0xf/0x80
> > > [ 1847.784835]  vfs_kern_mount+0x62/0x160
> > > [ 1847.785235]  do_mount+0x1b1/0xd50
> > > [ 1847.785594]  ? _copy_from_user+0x5b/0x90
> > > [ 1847.786028]  ? memdup_user+0x4b/0x70
> > > [ 1847.786501]  SyS_mount+0x85/0xd0
> > > [ 1847.786835]  entry_SYSCALL_64_fastpath+0x1f/0x96
> > > [ 1847.787311] RIP: 0033:0x7f6ebecc1b5a
> > > [ 1847.787691] RSP: 002b:00007ffc7bd1c958 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
> > > [ 1847.788383] RAX: ffffffffffffffda RBX: 00007f6ebefba63a RCX: 00007f6ebecc1b5a
> > > [ 1847.789106] RDX: 0000000000bfd010 RSI: 0000000000bfa230 RDI: 0000000000bfa210
> > > [ 1847.789807] RBP: 0000000000bfa0f0 R08: 0000000000000000 R09: 0000000000000014
> > > [ 1847.790511] R10: 00000000c0ed0000 R11: 0000000000000202 R12: 00007f6ebf1ca83c
> > > [ 1847.791211] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
> > > [ 1847.792029] BUG: scheduling while atomic: mount/1524/0x00000002
> > > [ 1847.792680] 1 lock held by mount/1524:
> > > [ 1847.793087]  #0:  (rcu_preempt_state.exp_mutex){+.+.}, at: [<00000000a6c536a9>] _synchronize_rcu_expedited+0x1ce/0x400
> > > [ 1847.794129] Modules linked in: xfs libcrc32c btrfs xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress zlib_deflate raid6_pq dax_pmem device_dax nd_pmem sch_fq_codel af_packet [last unloaded: xfs]
> > > [ 1847.795949] Preemption disabled at:
> > > [ 1847.795951] [<          (null)>]           (null)
> > > [ 1847.796844] CPU: 2 PID: 1524 Comm: mount Tainted: G        W        4.15.0-rc3-xfsx #3
> > > [ 1847.797621] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1djwong0 04/01/2014
> > > [ 1847.798510] Call Trace:
> > > [ 1847.798786]  dump_stack+0x7c/0xbe
> > > [ 1847.799134]  __schedule_bug+0x88/0xe0
> > > [ 1847.799517]  __schedule+0x78c/0xb20
> > > [ 1847.799890]  ? trace_hardirqs_on_caller+0x119/0x180
> > > [ 1847.800391]  schedule+0x40/0x90
> > > [ 1847.800729]  _synchronize_rcu_expedited+0x36b/0x400
> > > [ 1847.801218]  ? rcu_preempt_qs+0xa0/0xa0
> > > [ 1847.801616]  ? remove_wait_queue+0x60/0x60
> > > [ 1847.802040]  ? rcu_preempt_qs+0xa0/0xa0
> > > [ 1847.802433]  ? rcu_exp_wait_wake+0x630/0x630
> > > [ 1847.802872]  ? __lock_acquire+0xfb9/0x1120
> > > [ 1847.803302]  ? __lock_acquire+0x534/0x1120
> > > [ 1847.803725]  ? bdi_unregister+0x57/0x1a0
> > > [ 1847.804135]  bdi_unregister+0x5c/0x1a0
> > > [ 1847.804519]  bdi_put+0xcb/0xe0
> > > [ 1847.804746]  generic_shutdown_super+0xe2/0x110
> > > [ 1847.805066]  kill_anon_super+0xe/0x20
> > > [ 1847.805344]  btrfs_kill_super+0x12/0xa0 [btrfs]
> > > [ 1847.805664]  deactivate_locked_super+0x34/0x60
> > > [ 1847.806111]  btrfs_mount+0xbb6/0x1000 [btrfs]
> > > [ 1847.806476]  ? __lockdep_init_map+0x5c/0x1d0
> > > [ 1847.806824]  ? mount_fs+0xf/0x80
> > > [ 1847.807104]  ? alloc_vfsmnt+0x1a1/0x230
> > > [ 1847.807416]  mount_fs+0xf/0x80
> > > [ 1847.807712]  vfs_kern_mount+0x62/0x160
> > > [ 1847.808112]  btrfs_mount+0x3d3/0x1000 [btrfs]
> > > [ 1847.808565]  ? __lockdep_init_map+0x5c/0x1d0
> > > [ 1847.809005]  ? __lockdep_init_map+0x5c/0x1d0
> > > [ 1847.809425]  ? mount_fs+0xf/0x80
> > > [ 1847.809731]  mount_fs+0xf/0x80
> > > [ 1847.810070]  vfs_kern_mount+0x62/0x160
> > > [ 1847.810469]  do_mount+0x1b1/0xd50
> > > [ 1847.810821]  ? _copy_from_user+0x5b/0x90
> > > [ 1847.811237]  ? memdup_user+0x4b/0x70
> > > [ 1847.811622]  SyS_mount+0x85/0xd0
> > > [ 1847.811996]  entry_SYSCALL_64_fastpath+0x1f/0x96
> > > [ 1847.812465] RIP: 0033:0x7f6ebecc1b5a
> > > [ 1847.812840] RSP: 002b:00007ffc7bd1c958 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
> > > [ 1847.813615] RAX: ffffffffffffffda RBX: 00007f6ebefba63a RCX: 00007f6ebecc1b5a
> > > [ 1847.814302] RDX: 0000000000bfd010 RSI: 0000000000bfa230 RDI: 0000000000bfa210
> > > [ 1847.814770] RBP: 0000000000bfa0f0 R08: 0000000000000000 R09: 0000000000000014
> > > [ 1847.815246] R10: 00000000c0ed0000 R11: 0000000000000202 R12: 00007f6ebf1ca83c
> > > [ 1847.815720] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
> > i> 
> > 
> > Looks like this is new, Masami this is happening because of your change here
> > 
> > 5bb4fc2d8641 ("kprobes/x86: Disable preemption in ftrace-based jprobes")
> > 
> > which makes it not do the preempt_enable() if the handler returns 1.  Why is
> > that?  Should I be doing preempt_enable_no_resched() from the handler before
> > returning 1?  Or is this just an oversight on your part?  Thanks,
> 
> FWIW I shut up the preemption imbalance warnings with the attached
> coarse bandaid.  No idea if that's the correct fix...
> 
> --D
> 
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index 5db8498..fd948e3 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -1215,8 +1215,10 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
>  		if (__this_cpu_read(bpf_kprobe_override)) {
>  			__this_cpu_write(bpf_kprobe_override, 0);
>  			reset_current_kprobe();
> +			preempt_enable();
>  			return 1;
>  		}
> +		preempt_enable();
>  		if (!ret)
>  			return 0;
>  	}

Yeah I'd like to avoid doing this and know why exactly we leave a unpaired
preempt_disable() in kprobe_ftrace_handler() so we don't do something like this
only to have the handler change again in the future and break us again.  Thanks,

Josef
Masami Hiramatsu (Google) Dec. 14, 2017, 6:46 a.m. UTC | #6
On Wed, 13 Dec 2017 13:57:45 -0500
Josef Bacik <josef@toxicpanda.com> wrote:

> On Wed, Dec 13, 2017 at 10:07:32AM -0800, Darrick J. Wong wrote:
> > On Wed, Dec 13, 2017 at 01:03:57PM -0500, Josef Bacik wrote:
> > > On Tue, Dec 12, 2017 at 03:11:50PM -0800, Darrick J. Wong wrote:
> > > > On Mon, Dec 11, 2017 at 11:36:45AM -0500, Josef Bacik wrote:
> > > > > This is the same as v8, just rebased onto the bpf tree.
> > > > > 
> > > > > v8->v9:
> > > > > - rebased onto the bpf tree.
> > > > > 
> > > > > v7->v8:
> > > > > - removed the _ASM_KPROBE_ERROR_INJECT since it was not needed.
> > > > > 
> > > > > v6->v7:
> > > > > - moved the opt-in macro to bpf.h out of kprobes.h.
> > > > > 
> > > > > v5->v6:
> > > > > - add BPF_ALLOW_ERROR_INJECTION() tagging for functions that will support this
> > > > >   feature.  This way only functions that opt-in will be allowed to be
> > > > >   overridden.
> > > > > - added a btrfs patch to allow error injection for open_ctree() so that the bpf
> > > > >   sample actually works.
> > > > > 
> > > > > v4->v5:
> > > > > - disallow kprobe_override programs from being put in the prog map array so we
> > > > >   don't tail call into something we didn't check.  This allows us to make the
> > > > >   normal path still fast without a bunch of percpu operations.
> > > > > 
> > > > > v3->v4:
> > > > > - fix a build error found by kbuild test bot (I didn't wait long enough
> > > > >   apparently.)
> > > > > - Added a warning message as per Daniels suggestion.
> > > > > 
> > > > > v2->v3:
> > > > > - added a ->kprobe_override flag to bpf_prog.
> > > > > - added some sanity checks to disallow attaching bpf progs that have
> > > > >   ->kprobe_override set that aren't for ftrace kprobes.
> > > > > - added the trace_kprobe_ftrace helper to check if the trace_event_call is a
> > > > >   ftrace kprobe.
> > > > > - renamed bpf_kprobe_state to bpf_kprobe_override, fixed it so we only read this
> > > > >   value in the kprobe path, and thus only write to it if we're overriding or
> > > > >   clearing the override.
> > > > > 
> > > > > v1->v2:
> > > > > - moved things around to make sure that bpf_override_return could really only be
> > > > >   used for an ftrace kprobe.
> > > > > - killed the special return values from trace_call_bpf.
> > > > > - renamed pc_modified to bpf_kprobe_state so bpf_override_return could tell if
> > > > >   it was being called from an ftrace kprobe context.
> > > > > - reworked the logic in kprobe_perf_func to take advantage of bpf_kprobe_state.
> > > > > - updated the test as per Alexei's review.
> > > > > 
> > > > > - Original message -
> > > > > 
> > > > > A lot of our error paths are not well tested because we have no good way of
> > > > > injecting errors generically.  Some subystems (block, memory) have ways to
> > > > > inject errors, but they are random so it's hard to get reproduceable results.
> > > > > 
> > > > > With BPF we can add determinism to our error injection.  We can use kprobes and
> > > > > other things to verify we are injecting errors at the exact case we are trying
> > > > > to test.  This patch gives us the tool to actual do the error injection part.
> > > > > It is very simple, we just set the return value of the pt_regs we're given to
> > > > > whatever we provide, and then override the PC with a dummy function that simply
> > > > > returns.
> > > > 
> > > > Heh, this looks cool.  I decided to try it to see what happens, and saw
> > > > a bunch of dmesg pasted in below.  Is that supposed to happen?  Or am I
> > > > the only fs developer still running with lockdep enabled? :)
> > > > 
> > > > It looks like bpf_override_return has some sort of side effect such that
> > > > we get the splat, since commenting it out makes the symptom go away.
> > > > 
> > > > <shrug>
> > > > 
> > > > --D
> > > > 
> > > > [ 1847.769183] BTRFS error (device (null)): open_ctree failed
> > > > [ 1847.770130] BUG: sleeping function called from invalid context at /storage/home/djwong/cdev/work/linux-xfs/kernel/locking/rwsem.c:69
> > > > [ 1847.771976] in_atomic(): 1, irqs_disabled(): 0, pid: 1524, name: mount
> > > > [ 1847.773016] 1 lock held by mount/1524:
> > > > [ 1847.773530]  #0:  (&type->s_umount_key#34/1){+.+.}, at: [<00000000653a9bb4>] sget_userns+0x302/0x4f0
> > > > [ 1847.774731] Preemption disabled at:
> > > > [ 1847.774735] [<          (null)>]           (null)
> > > > [ 1847.777009] CPU: 2 PID: 1524 Comm: mount Tainted: G        W        4.15.0-rc3-xfsx #3
> > > > [ 1847.778800] Call Trace:
> > > > [ 1847.779047]  dump_stack+0x7c/0xbe
> > > > [ 1847.779361]  ___might_sleep+0x1f7/0x260
> > > > [ 1847.779720]  down_write+0x29/0xb0
> > > > [ 1847.780046]  unregister_shrinker+0x15/0x70
> > > > [ 1847.780427]  deactivate_locked_super+0x2e/0x60
> > > > [ 1847.780935]  btrfs_mount+0xbb6/0x1000 [btrfs]
> > > > [ 1847.781353]  ? __lockdep_init_map+0x5c/0x1d0
> > > > [ 1847.781750]  ? mount_fs+0xf/0x80
> > > > [ 1847.782065]  ? alloc_vfsmnt+0x1a1/0x230
> > > > [ 1847.782429]  mount_fs+0xf/0x80
> > > > [ 1847.782733]  vfs_kern_mount+0x62/0x160
> > > > [ 1847.783128]  btrfs_mount+0x3d3/0x1000 [btrfs]
> > > > [ 1847.783493]  ? __lockdep_init_map+0x5c/0x1d0
> > > > [ 1847.783849]  ? __lockdep_init_map+0x5c/0x1d0
> > > > [ 1847.784207]  ? mount_fs+0xf/0x80
> > > > [ 1847.784502]  mount_fs+0xf/0x80
> > > > [ 1847.784835]  vfs_kern_mount+0x62/0x160
> > > > [ 1847.785235]  do_mount+0x1b1/0xd50
> > > > [ 1847.785594]  ? _copy_from_user+0x5b/0x90
> > > > [ 1847.786028]  ? memdup_user+0x4b/0x70
> > > > [ 1847.786501]  SyS_mount+0x85/0xd0
> > > > [ 1847.786835]  entry_SYSCALL_64_fastpath+0x1f/0x96
> > > > [ 1847.787311] RIP: 0033:0x7f6ebecc1b5a
> > > > [ 1847.787691] RSP: 002b:00007ffc7bd1c958 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
> > > > [ 1847.788383] RAX: ffffffffffffffda RBX: 00007f6ebefba63a RCX: 00007f6ebecc1b5a
> > > > [ 1847.789106] RDX: 0000000000bfd010 RSI: 0000000000bfa230 RDI: 0000000000bfa210
> > > > [ 1847.789807] RBP: 0000000000bfa0f0 R08: 0000000000000000 R09: 0000000000000014
> > > > [ 1847.790511] R10: 00000000c0ed0000 R11: 0000000000000202 R12: 00007f6ebf1ca83c
> > > > [ 1847.791211] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
> > > > [ 1847.792029] BUG: scheduling while atomic: mount/1524/0x00000002
> > > > [ 1847.792680] 1 lock held by mount/1524:
> > > > [ 1847.793087]  #0:  (rcu_preempt_state.exp_mutex){+.+.}, at: [<00000000a6c536a9>] _synchronize_rcu_expedited+0x1ce/0x400
> > > > [ 1847.794129] Modules linked in: xfs libcrc32c btrfs xor zstd_decompress zstd_compress xxhash lzo_compress lzo_decompress zlib_deflate raid6_pq dax_pmem device_dax nd_pmem sch_fq_codel af_packet [last unloaded: xfs]
> > > > [ 1847.795949] Preemption disabled at:
> > > > [ 1847.795951] [<          (null)>]           (null)
> > > > [ 1847.796844] CPU: 2 PID: 1524 Comm: mount Tainted: G        W        4.15.0-rc3-xfsx #3
> > > > [ 1847.797621] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1djwong0 04/01/2014
> > > > [ 1847.798510] Call Trace:
> > > > [ 1847.798786]  dump_stack+0x7c/0xbe
> > > > [ 1847.799134]  __schedule_bug+0x88/0xe0
> > > > [ 1847.799517]  __schedule+0x78c/0xb20
> > > > [ 1847.799890]  ? trace_hardirqs_on_caller+0x119/0x180
> > > > [ 1847.800391]  schedule+0x40/0x90
> > > > [ 1847.800729]  _synchronize_rcu_expedited+0x36b/0x400
> > > > [ 1847.801218]  ? rcu_preempt_qs+0xa0/0xa0
> > > > [ 1847.801616]  ? remove_wait_queue+0x60/0x60
> > > > [ 1847.802040]  ? rcu_preempt_qs+0xa0/0xa0
> > > > [ 1847.802433]  ? rcu_exp_wait_wake+0x630/0x630
> > > > [ 1847.802872]  ? __lock_acquire+0xfb9/0x1120
> > > > [ 1847.803302]  ? __lock_acquire+0x534/0x1120
> > > > [ 1847.803725]  ? bdi_unregister+0x57/0x1a0
> > > > [ 1847.804135]  bdi_unregister+0x5c/0x1a0
> > > > [ 1847.804519]  bdi_put+0xcb/0xe0
> > > > [ 1847.804746]  generic_shutdown_super+0xe2/0x110
> > > > [ 1847.805066]  kill_anon_super+0xe/0x20
> > > > [ 1847.805344]  btrfs_kill_super+0x12/0xa0 [btrfs]
> > > > [ 1847.805664]  deactivate_locked_super+0x34/0x60
> > > > [ 1847.806111]  btrfs_mount+0xbb6/0x1000 [btrfs]
> > > > [ 1847.806476]  ? __lockdep_init_map+0x5c/0x1d0
> > > > [ 1847.806824]  ? mount_fs+0xf/0x80
> > > > [ 1847.807104]  ? alloc_vfsmnt+0x1a1/0x230
> > > > [ 1847.807416]  mount_fs+0xf/0x80
> > > > [ 1847.807712]  vfs_kern_mount+0x62/0x160
> > > > [ 1847.808112]  btrfs_mount+0x3d3/0x1000 [btrfs]
> > > > [ 1847.808565]  ? __lockdep_init_map+0x5c/0x1d0
> > > > [ 1847.809005]  ? __lockdep_init_map+0x5c/0x1d0
> > > > [ 1847.809425]  ? mount_fs+0xf/0x80
> > > > [ 1847.809731]  mount_fs+0xf/0x80
> > > > [ 1847.810070]  vfs_kern_mount+0x62/0x160
> > > > [ 1847.810469]  do_mount+0x1b1/0xd50
> > > > [ 1847.810821]  ? _copy_from_user+0x5b/0x90
> > > > [ 1847.811237]  ? memdup_user+0x4b/0x70
> > > > [ 1847.811622]  SyS_mount+0x85/0xd0
> > > > [ 1847.811996]  entry_SYSCALL_64_fastpath+0x1f/0x96
> > > > [ 1847.812465] RIP: 0033:0x7f6ebecc1b5a
> > > > [ 1847.812840] RSP: 002b:00007ffc7bd1c958 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
> > > > [ 1847.813615] RAX: ffffffffffffffda RBX: 00007f6ebefba63a RCX: 00007f6ebecc1b5a
> > > > [ 1847.814302] RDX: 0000000000bfd010 RSI: 0000000000bfa230 RDI: 0000000000bfa210
> > > > [ 1847.814770] RBP: 0000000000bfa0f0 R08: 0000000000000000 R09: 0000000000000014
> > > > [ 1847.815246] R10: 00000000c0ed0000 R11: 0000000000000202 R12: 00007f6ebf1ca83c
> > > > [ 1847.815720] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
> > > i> 
> > > 
> > > Looks like this is new, Masami this is happening because of your change here
> > > 
> > > 5bb4fc2d8641 ("kprobes/x86: Disable preemption in ftrace-based jprobes")
> > > 
> > > which makes it not do the preempt_enable() if the handler returns 1.  Why is
> > > that?

Yes, because this (return 1) is expected to be done only by jprobe.

> > >  Should I be doing preempt_enable_no_resched() from the handler before
> > > returning 1? Or is this just an oversight on your part?  Thanks,

Yes, or you have to hook after return path and fixup preempt count as
jprobe did.

(And now jprobe is coming to an end.)

> > 
> > FWIW I shut up the preemption imbalance warnings with the attached
> > coarse bandaid.  No idea if that's the correct fix...

No, this is not correct way to fix this issue.
I guess your BPF extention is trying to change instrunction pointer to
another address (right?). If so, you have to carefully do followings
before returning to modified address.

- reset current_kprobes
- call preempt_enable_no_resched()


> > 
> > --D
> > 
> > diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> > index 5db8498..fd948e3 100644
> > --- a/kernel/trace/trace_kprobe.c
> > +++ b/kernel/trace/trace_kprobe.c
> > @@ -1215,8 +1215,10 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
> >  		if (__this_cpu_read(bpf_kprobe_override)) {
> >  			__this_cpu_write(bpf_kprobe_override, 0);
> >  			reset_current_kprobe();
> > +			preempt_enable();
> >  			return 1;
> >  		}
> > +		preempt_enable();
> >  		if (!ret)
> >  			return 0;
> >  	}
> 
> Yeah I'd like to avoid doing this and know why exactly we leave a unpaired
> preempt_disable() in kprobe_ftrace_handler() so we don't do something like this
> only to have the handler change again in the future and break us again.  Thanks,

Ah, I see. kprobe_perf_func invokes BPF and BPF changes instruction address.
In that case, only what you need is adding preempt_enable_no_resched() at
right after the reset_current_kprobe().

Anyway, Could you CC the series to me?

Thank you,