
[v2,bpf-next,2/4] bpf: introduce helper bpf_get_task_stak()

Message ID: 20200626001332.1554603-3-songliubraving@fb.com
State: Changes Requested
Delegated to: BPF Maintainers
Series: bpf: introduce bpf_get_task_stack()

Commit Message

Song Liu June 26, 2020, 12:13 a.m. UTC
Introduce helper bpf_get_task_stack(), which dumps the stack trace of a given
task. This is different from bpf_get_stack(), which gets the stack trace of
the current task. One potential use case of bpf_get_task_stack() is to call
it from bpf_iter__task and dump all /proc/<pid>/stack to a seq_file.

bpf_get_task_stack() uses stack_trace_save_tsk() instead of
get_perf_callchain() for the kernel stack. The benefit of this choice is that
stack_trace_save_tsk() doesn't require changes in arch/. The downside is that
stack_trace_save_tsk() dumps the stack trace to an unsigned long array. For
32-bit systems, we need to translate it to a u64 array.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/bpf.h            |  1 +
 include/uapi/linux/bpf.h       | 35 ++++++++++++++-
 kernel/bpf/stackmap.c          | 79 ++++++++++++++++++++++++++++++++--
 kernel/trace/bpf_trace.c       |  2 +
 scripts/bpf_helpers_doc.py     |  2 +
 tools/include/uapi/linux/bpf.h | 35 ++++++++++++++-
 6 files changed, 149 insertions(+), 5 deletions(-)
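
For illustration, a minimal bpf_iter__task program using the new helper could
look roughly like the sketch below. This is not taken from the series; the
section name, buffer size, and the use of bpf_seq_write() are my own choices,
and it assumes headers generated from a kernel that already carries this patch.

/* Hypothetical sketch: dump the stack of every task from a task iterator,
 * similar in spirit to reading /proc/<pid>/stack.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

#define MAX_DEPTH	64

/* scratch buffer; the helper always fills u64 entries, even on 32-bit */
__u64 entries[MAX_DEPTH];

SEC("iter/task")
int dump_task_stack(struct bpf_iter__task *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct task_struct *task = ctx->task;
	long retlen;

	if (!task)
		return 0;

	retlen = bpf_get_task_stack(task, entries, sizeof(entries), 0);
	if (retlen < 0)
		return 0;

	/* emit the raw ips; a real program would format them */
	bpf_seq_write(seq, entries, retlen);
	return 0;
}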

Comments

Yonghong Song June 26, 2020, 3:40 p.m. UTC | #1
On 6/25/20 5:13 PM, Song Liu wrote:
> Introduce helper bpf_get_task_stack(), which dumps stack trace of given
> task. This is different to bpf_get_stack(), which gets stack track of
> current task. One potential use case of bpf_get_task_stack() is to call
> it from bpf_iter__task and dump all /proc/<pid>/stack to a seq_file.
> 
> bpf_get_task_stack() uses stack_trace_save_tsk() instead of
> get_perf_callchain() for kernel stack. The benefit of this choice is that
> stack_trace_save_tsk() doesn't require changes in arch/. The downside of
> using stack_trace_save_tsk() is that stack_trace_save_tsk() dumps the
> stack trace to unsigned long array. For 32-bit systems, we need to
> translate it to u64 array.
> 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>   include/linux/bpf.h            |  1 +
>   include/uapi/linux/bpf.h       | 35 ++++++++++++++-
>   kernel/bpf/stackmap.c          | 79 ++++++++++++++++++++++++++++++++--
>   kernel/trace/bpf_trace.c       |  2 +
>   scripts/bpf_helpers_doc.py     |  2 +
>   tools/include/uapi/linux/bpf.h | 35 ++++++++++++++-
>   6 files changed, 149 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 07052d44bca1c..cee31ee56367b 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1607,6 +1607,7 @@ extern const struct bpf_func_proto bpf_get_current_uid_gid_proto;
>   extern const struct bpf_func_proto bpf_get_current_comm_proto;
>   extern const struct bpf_func_proto bpf_get_stackid_proto;
>   extern const struct bpf_func_proto bpf_get_stack_proto;
> +extern const struct bpf_func_proto bpf_get_task_stack_proto;
>   extern const struct bpf_func_proto bpf_sock_map_update_proto;
>   extern const struct bpf_func_proto bpf_sock_hash_update_proto;
>   extern const struct bpf_func_proto bpf_get_current_cgroup_id_proto;
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 19684813faaed..7638412987354 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3252,6 +3252,38 @@ union bpf_attr {
>    * 		case of **BPF_CSUM_LEVEL_QUERY**, the current skb->csum_level
>    * 		is returned or the error code -EACCES in case the skb is not
>    * 		subject to CHECKSUM_UNNECESSARY.
> + *
> + * int bpf_get_task_stack(struct task_struct *task, void *buf, u32 size, u64 flags)

Andrii's recent patch changed the return type to 'long' to align with
kernel u64 return type for better llvm code generation.

Please rebase and you will see the new convention.

> + *	Description
> + *		Return a user or a kernel stack in bpf program provided buffer.
> + *		To achieve this, the helper needs *task*, which is a valid
> + *		pointer to struct task_struct. To store the stacktrace, the
> + *		bpf program provides *buf* with	a nonnegative *size*.
> + *
> + *		The last argument, *flags*, holds the number of stack frames to
> + *		skip (from 0 to 255), masked with
> + *		**BPF_F_SKIP_FIELD_MASK**. The next bits can be used to set
> + *		the following flags:
> + *
> + *		**BPF_F_USER_STACK**
> + *			Collect a user space stack instead of a kernel stack.
> + *		**BPF_F_USER_BUILD_ID**
> + *			Collect buildid+offset instead of ips for user stack,
> + *			only valid if **BPF_F_USER_STACK** is also specified.
> + *
> + *		**bpf_get_task_stack**\ () can collect up to
> + *		**PERF_MAX_STACK_DEPTH** both kernel and user frames, subject
> + *		to sufficient large buffer size. Note that
> + *		this limit can be controlled with the **sysctl** program, and
> + *		that it should be manually increased in order to profile long
> + *		user stacks (such as stacks for Java programs). To do so, use:
> + *
> + *		::
> + *
> + *			# sysctl kernel.perf_event_max_stack=<new value>
> + *	Return
> + *		A non-negative value equal to or less than *size* on success,
> + *		or a negative error in case of failure.
>    */
>   #define __BPF_FUNC_MAPPER(FN)		\
>   	FN(unspec),			\
> @@ -3389,7 +3421,8 @@ union bpf_attr {
>   	FN(ringbuf_submit),		\
>   	FN(ringbuf_discard),		\
>   	FN(ringbuf_query),		\
> -	FN(csum_level),
> +	FN(csum_level),			\
> +	FN(get_task_stack),
>   
>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>    * function eBPF program intends to call
> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index 599488f25e404..64b7843057a23 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -348,6 +348,44 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
>   	}
>   }
>   
> +static struct perf_callchain_entry *
> +get_callchain_entry_for_task(struct task_struct *task, u32 init_nr)
> +{
> +	struct perf_callchain_entry *entry;
> +	int rctx;
> +
> +	entry = get_callchain_entry(&rctx);
> +
> +	if (rctx == -1)
> +		return NULL;

Is this needed? Shouldn't the !entry check below be enough?

> +
> +	if (!entry)
> +		goto exit_put;
> +
> +	entry->nr = init_nr +
> +		stack_trace_save_tsk(task, (unsigned long *)(entry->ip + init_nr),
> +				     sysctl_perf_event_max_stack - init_nr, 0);
> +
> +	/* stack_trace_save_tsk() works on unsigned long array, while
> +	 * perf_callchain_entry uses u64 array. For 32-bit systems, it is
> +	 * necessary to fix this mismatch.
> +	 */
> +	if (__BITS_PER_LONG != 64) {
> +		unsigned long *from = (unsigned long *) entry->ip;
> +		u64 *to = entry->ip;
> +		int i;
> +
> +		/* copy data from the end to avoid using extra buffer */
> +		for (i = entry->nr - 1; i >= (int)init_nr; i--)
> +			to[i] = (u64)(from[i]);
> +	}
> +
> +exit_put:
> +	put_callchain_entry(rctx);
> +
> +	return entry;
> +}
> +
>   BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
>   	   u64, flags)
>   {
> @@ -448,8 +486,8 @@ const struct bpf_func_proto bpf_get_stackid_proto = {
>   	.arg3_type	= ARG_ANYTHING,
>   };
>   
> -BPF_CALL_4(bpf_get_stack, struct pt_regs *, regs, void *, buf, u32, size,
> -	   u64, flags)
[...]
Andrii Nakryiko June 26, 2020, 8:17 p.m. UTC | #2
On Thu, Jun 25, 2020 at 5:14 PM Song Liu <songliubraving@fb.com> wrote:
>
> Introduce helper bpf_get_task_stack(), which dumps stack trace of given
> task. This is different to bpf_get_stack(), which gets stack track of
> current task. One potential use case of bpf_get_task_stack() is to call
> it from bpf_iter__task and dump all /proc/<pid>/stack to a seq_file.
>
> bpf_get_task_stack() uses stack_trace_save_tsk() instead of
> get_perf_callchain() for kernel stack. The benefit of this choice is that
> stack_trace_save_tsk() doesn't require changes in arch/. The downside of
> using stack_trace_save_tsk() is that stack_trace_save_tsk() dumps the
> stack trace to unsigned long array. For 32-bit systems, we need to
> translate it to u64 array.
>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---

Looks great. I just think that there are cases where the user doesn't
necessarily have a valid task_struct pointer, just a pid, so an extra
helper would be nice so that such cases aren't artificially restricted.

Acked-by: Andrii Nakryiko <andriin@fb.com>

>  include/linux/bpf.h            |  1 +
>  include/uapi/linux/bpf.h       | 35 ++++++++++++++-
>  kernel/bpf/stackmap.c          | 79 ++++++++++++++++++++++++++++++++--
>  kernel/trace/bpf_trace.c       |  2 +
>  scripts/bpf_helpers_doc.py     |  2 +
>  tools/include/uapi/linux/bpf.h | 35 ++++++++++++++-
>  6 files changed, 149 insertions(+), 5 deletions(-)
>

[...]

> +       /* stack_trace_save_tsk() works on unsigned long array, while
> +        * perf_callchain_entry uses u64 array. For 32-bit systems, it is
> +        * necessary to fix this mismatch.
> +        */
> +       if (__BITS_PER_LONG != 64) {
> +               unsigned long *from = (unsigned long *) entry->ip;
> +               u64 *to = entry->ip;
> +               int i;
> +
> +               /* copy data from the end to avoid using extra buffer */
> +               for (i = entry->nr - 1; i >= (int)init_nr; i--)
> +                       to[i] = (u64)(from[i]);

doing this forward would be just fine as well, no? First iteration
will cast and overwrite low 32-bits, all the subsequent iterations
won't even overlap.

> +       }
> +
> +exit_put:
> +       put_callchain_entry(rctx);
> +
> +       return entry;
> +}
> +

[...]

> +BPF_CALL_4(bpf_get_task_stack, struct task_struct *, task, void *, buf,
> +          u32, size, u64, flags)
> +{
> +       struct pt_regs *regs = task_pt_regs(task);
> +
> +       return __bpf_get_stack(regs, task, buf, size, flags);
> +}


So this takes advantage of BTF and having a direct task_struct
pointer. But for kprobes/tracepoint I think it would also be extremely
helpful to be able to request stack trace by PID. How about one more
helper which will wrap this one with get/put task by PID, e.g.,
bpf_get_pid_stack(int pid, void *buf, u32 size, u64 flags)? Would that
be a problem?

> +
> +static int bpf_get_task_stack_btf_ids[5];
> +const struct bpf_func_proto bpf_get_task_stack_proto = {
> +       .func           = bpf_get_task_stack,
> +       .gpl_only       = false,
> +       .ret_type       = RET_INTEGER,
> +       .arg1_type      = ARG_PTR_TO_BTF_ID,
> +       .arg2_type      = ARG_PTR_TO_UNINIT_MEM,
> +       .arg3_type      = ARG_CONST_SIZE_OR_ZERO,
> +       .arg4_type      = ARG_ANYTHING,
> +       .btf_id         = bpf_get_task_stack_btf_ids,
> +};
> +

[...]
Andrii Nakryiko June 26, 2020, 8:22 p.m. UTC | #3
On Fri, Jun 26, 2020 at 1:17 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Jun 25, 2020 at 5:14 PM Song Liu <songliubraving@fb.com> wrote:
> >
> > Introduce helper bpf_get_task_stack(), which dumps stack trace of given
> > task. This is different to bpf_get_stack(), which gets stack track of
> > current task. One potential use case of bpf_get_task_stack() is to call
> > it from bpf_iter__task and dump all /proc/<pid>/stack to a seq_file.
> >
> > bpf_get_task_stack() uses stack_trace_save_tsk() instead of
> > get_perf_callchain() for kernel stack. The benefit of this choice is that
> > stack_trace_save_tsk() doesn't require changes in arch/. The downside of
> > using stack_trace_save_tsk() is that stack_trace_save_tsk() dumps the
> > stack trace to unsigned long array. For 32-bit systems, we need to
> > translate it to u64 array.
> >
> > Signed-off-by: Song Liu <songliubraving@fb.com>
> > ---
>
> Looks great, I just think that there are cases where user doesn't
> necessarily has valid task_struct pointer, just pid, so would be nice
> to not artificially restrict such cases by having extra helper.
>
> Acked-by: Andrii Nakryiko <andriin@fb.com>

oh, please also fix the typo in the subject, otherwise it will make grepping
more frustrating

[...]
Song Liu June 26, 2020, 10:37 p.m. UTC | #4
> On Jun 26, 2020, at 8:40 AM, Yonghong Song <yhs@fb.com> wrote:
> 
> 
> 
> On 6/25/20 5:13 PM, Song Liu wrote:
>> Introduce helper bpf_get_task_stack(), which dumps stack trace of given
>> task. This is different to bpf_get_stack(), which gets stack track of
>> current task. One potential use case of bpf_get_task_stack() is to call
>> it from bpf_iter__task and dump all /proc/<pid>/stack to a seq_file.
>> bpf_get_task_stack() uses stack_trace_save_tsk() instead of
>> get_perf_callchain() for kernel stack. The benefit of this choice is that
>> stack_trace_save_tsk() doesn't require changes in arch/. The downside of
>> using stack_trace_save_tsk() is that stack_trace_save_tsk() dumps the
>> stack trace to unsigned long array. For 32-bit systems, we need to
>> translate it to u64 array.
>> Signed-off-by: Song Liu <songliubraving@fb.com>
>> 
[...]
>> +++ b/include/uapi/linux/bpf.h
>> @@ -3252,6 +3252,38 @@ union bpf_attr {
>>   * 		case of **BPF_CSUM_LEVEL_QUERY**, the current skb->csum_level
>>   * 		is returned or the error code -EACCES in case the skb is not
>>   * 		subject to CHECKSUM_UNNECESSARY.
>> + *
>> + * int bpf_get_task_stack(struct task_struct *task, void *buf, u32 size, u64 flags)
> 
> Andrii's recent patch changed the return type to 'long' to align with
> kernel u64 return type for better llvm code generation.
> 
> Please rebase and you will see the new convention.

Will fix. 

> 
>> + *	Description
>> 

[...]

>>  +static struct perf_callchain_entry *
>> +get_callchain_entry_for_task(struct task_struct *task, u32 init_nr)
>> +{
>> +	struct perf_callchain_entry *entry;
>> +	int rctx;
>> +
>> +	entry = get_callchain_entry(&rctx);
>> +
>> +	if (rctx == -1)
>> +		return NULL;
> 
> Is this needed? Should be below !entry enough?

It was needed before Peter's suggestion. After applying Peter's patch,
it is no longer needed.
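
For reference, my reading of how the call site simplifies once
get_callchain_entry() itself returns NULL when no recursion context is
available (sketch only, not the actual follow-up patch):

static struct perf_callchain_entry *
get_callchain_entry_for_task(struct task_struct *task, u32 init_nr)
{
	struct perf_callchain_entry *entry;
	int rctx;

	/* a failed recursion-context grab now shows up as a NULL entry */
	entry = get_callchain_entry(&rctx);
	if (!entry)
		return NULL;

	entry->nr = init_nr +
		stack_trace_save_tsk(task, (unsigned long *)(entry->ip + init_nr),
				     sysctl_perf_event_max_stack - init_nr, 0);

	/* 32-bit fixup elided, unchanged from this patch */

	put_callchain_entry(rctx);
	return entry;
}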

Thanks,
Song
Song Liu June 26, 2020, 10:45 p.m. UTC | #5
> On Jun 26, 2020, at 1:17 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> 
> On Thu, Jun 25, 2020 at 5:14 PM Song Liu <songliubraving@fb.com> wrote:
>> 
>> Introduce helper bpf_get_task_stack(), which dumps stack trace of given
>> task. This is different to bpf_get_stack(), which gets stack track of
>> current task. One potential use case of bpf_get_task_stack() is to call
>> it from bpf_iter__task and dump all /proc/<pid>/stack to a seq_file.
>> 
>> bpf_get_task_stack() uses stack_trace_save_tsk() instead of
>> get_perf_callchain() for kernel stack. The benefit of this choice is that
>> stack_trace_save_tsk() doesn't require changes in arch/. The downside of
>> using stack_trace_save_tsk() is that stack_trace_save_tsk() dumps the
>> stack trace to unsigned long array. For 32-bit systems, we need to
>> translate it to u64 array.
>> 
>> Signed-off-by: Song Liu <songliubraving@fb.com>
>> ---
> 
> Looks great, I just think that there are cases where user doesn't
> necessarily has valid task_struct pointer, just pid, so would be nice
> to not artificially restrict such cases by having extra helper.
> 
> Acked-by: Andrii Nakryiko <andriin@fb.com>

Thanks!

> 
>> include/linux/bpf.h            |  1 +
>> include/uapi/linux/bpf.h       | 35 ++++++++++++++-
>> kernel/bpf/stackmap.c          | 79 ++++++++++++++++++++++++++++++++--
>> kernel/trace/bpf_trace.c       |  2 +
>> scripts/bpf_helpers_doc.py     |  2 +
>> tools/include/uapi/linux/bpf.h | 35 ++++++++++++++-
>> 6 files changed, 149 insertions(+), 5 deletions(-)
>> 
> 
> [...]
> 
>> +       /* stack_trace_save_tsk() works on unsigned long array, while
>> +        * perf_callchain_entry uses u64 array. For 32-bit systems, it is
>> +        * necessary to fix this mismatch.
>> +        */
>> +       if (__BITS_PER_LONG != 64) {
>> +               unsigned long *from = (unsigned long *) entry->ip;
>> +               u64 *to = entry->ip;
>> +               int i;
>> +
>> +               /* copy data from the end to avoid using extra buffer */
>> +               for (i = entry->nr - 1; i >= (int)init_nr; i--)
>> +                       to[i] = (u64)(from[i]);
> 
> doing this forward would be just fine as well, no? First iteration
> will cast and overwrite low 32-bits, all the subsequent iterations
> won't even overlap.

I think the first iteration will write zeros to the higher 32 bits, no?
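
A userspace sketch of the aliasing issue, standing in for a 32-bit kernel
with uint32_t in place of unsigned long (illustration only):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* the same storage viewed as packed 32-bit entries and as u64 slots */
	uint64_t buf[4];
	uint32_t *from = (uint32_t *)buf;
	uint64_t *to = buf;
	int nr = 4, i;

	for (i = 0; i < nr; i++)
		from[i] = 0x1000 + i;	/* pretend these are saved ips */

	/* forward copy would break: to[0] occupies the bytes of from[0] and
	 * from[1], and its upper half is written as zero, wiping from[1]
	 * before it has been read.
	 *
	 * backward copy, as in the patch: by the time to[i] is written, every
	 * from[] entry it overlaps has already been converted.
	 */
	for (i = nr - 1; i >= 0; i--)
		to[i] = (uint64_t)from[i];

	for (i = 0; i < nr; i++)
		printf("to[%d] = 0x%llx\n", i, (unsigned long long)to[i]);
	return 0;
}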

> 
>> +       }
>> +
>> +exit_put:
>> +       put_callchain_entry(rctx);
>> +
>> +       return entry;
>> +}
>> +
> 
> [...]
> 
>> +BPF_CALL_4(bpf_get_task_stack, struct task_struct *, task, void *, buf,
>> +          u32, size, u64, flags)
>> +{
>> +       struct pt_regs *regs = task_pt_regs(task);
>> +
>> +       return __bpf_get_stack(regs, task, buf, size, flags);
>> +}
> 
> 
> So this takes advantage of BTF and having a direct task_struct
> pointer. But for kprobes/tracepoint I think it would also be extremely
> helpful to be able to request stack trace by PID. How about one more
> helper which will wrap this one with get/put task by PID, e.g.,
> bpf_get_pid_stack(int pid, void *buf, u32 size, u64 flags)? Would that
> be a problem?

That should work. Let me add that in a follow-up patch.
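
Roughly, such a wrapper could look like the sketch below (hypothetical, not
part of this series; it assumes find_get_task_by_vpid() is acceptable from
this context and reuses __bpf_get_stack() from this patch):

BPF_CALL_4(bpf_get_pid_stack, int, pid, void *, buf, u32, size, u64, flags)
{
	struct task_struct *task;
	int err;

	task = find_get_task_by_vpid(pid);
	if (!task)
		return -ESRCH;

	err = __bpf_get_stack(task_pt_regs(task), task, buf, size, flags);
	put_task_struct(task);
	return err;
}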
Andrii Nakryiko June 26, 2020, 10:51 p.m. UTC | #6
On Fri, Jun 26, 2020 at 3:45 PM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Jun 26, 2020, at 1:17 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > On Thu, Jun 25, 2020 at 5:14 PM Song Liu <songliubraving@fb.com> wrote:
> >>
> >> Introduce helper bpf_get_task_stack(), which dumps stack trace of given
> >> task. This is different to bpf_get_stack(), which gets stack track of
> >> current task. One potential use case of bpf_get_task_stack() is to call
> >> it from bpf_iter__task and dump all /proc/<pid>/stack to a seq_file.
> >>
> >> bpf_get_task_stack() uses stack_trace_save_tsk() instead of
> >> get_perf_callchain() for kernel stack. The benefit of this choice is that
> >> stack_trace_save_tsk() doesn't require changes in arch/. The downside of
> >> using stack_trace_save_tsk() is that stack_trace_save_tsk() dumps the
> >> stack trace to unsigned long array. For 32-bit systems, we need to
> >> translate it to u64 array.
> >>
> >> Signed-off-by: Song Liu <songliubraving@fb.com>
> >> ---
> >
> > Looks great, I just think that there are cases where user doesn't
> > necessarily has valid task_struct pointer, just pid, so would be nice
> > to not artificially restrict such cases by having extra helper.
> >
> > Acked-by: Andrii Nakryiko <andriin@fb.com>
>
> Thanks!
>
> >
> >> include/linux/bpf.h            |  1 +
> >> include/uapi/linux/bpf.h       | 35 ++++++++++++++-
> >> kernel/bpf/stackmap.c          | 79 ++++++++++++++++++++++++++++++++--
> >> kernel/trace/bpf_trace.c       |  2 +
> >> scripts/bpf_helpers_doc.py     |  2 +
> >> tools/include/uapi/linux/bpf.h | 35 ++++++++++++++-
> >> 6 files changed, 149 insertions(+), 5 deletions(-)
> >>
> >
> > [...]
> >
> >> +       /* stack_trace_save_tsk() works on unsigned long array, while
> >> +        * perf_callchain_entry uses u64 array. For 32-bit systems, it is
> >> +        * necessary to fix this mismatch.
> >> +        */
> >> +       if (__BITS_PER_LONG != 64) {
> >> +               unsigned long *from = (unsigned long *) entry->ip;
> >> +               u64 *to = entry->ip;
> >> +               int i;
> >> +
> >> +               /* copy data from the end to avoid using extra buffer */
> >> +               for (i = entry->nr - 1; i >= (int)init_nr; i--)
> >> +                       to[i] = (u64)(from[i]);
> >
> > doing this forward would be just fine as well, no? First iteration
> > will cast and overwrite low 32-bits, all the subsequent iterations
> > won't even overlap.
>
> I think first iteration will write zeros to higher 32 bits, no?

Oh, wait, I completely misread what this is doing. It up-converts from
32-bit to 64-bit, sorry. Yeah, ignore me on this :)

But then I have another question. How do you know that entry->ip has
enough space to keep the same number of 2x bigger entries?

>
> >
> >> +       }
> >> +
> >> +exit_put:
> >> +       put_callchain_entry(rctx);
> >> +
> >> +       return entry;
> >> +}
> >> +
> >
> > [...]
> >
> >> +BPF_CALL_4(bpf_get_task_stack, struct task_struct *, task, void *, buf,
> >> +          u32, size, u64, flags)
> >> +{
> >> +       struct pt_regs *regs = task_pt_regs(task);
> >> +
> >> +       return __bpf_get_stack(regs, task, buf, size, flags);
> >> +}
> >
> >
> > So this takes advantage of BTF and having a direct task_struct
> > pointer. But for kprobes/tracepoint I think it would also be extremely
> > helpful to be able to request stack trace by PID. How about one more
> > helper which will wrap this one with get/put task by PID, e.g.,
> > bpf_get_pid_stack(int pid, void *buf, u32 size, u64 flags)? Would that
> > be a problem?
>
> That should work. Let me add that in a follow up patch.
>
Song Liu June 26, 2020, 11:47 p.m. UTC | #7
> On Jun 26, 2020, at 3:51 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> 
> On Fri, Jun 26, 2020 at 3:45 PM Song Liu <songliubraving@fb.com> wrote:
>> 
>> 
>> 
>>> On Jun 26, 2020, at 1:17 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>>> 
>>> On Thu, Jun 25, 2020 at 5:14 PM Song Liu <songliubraving@fb.com> wrote:
>>>> 
>>>> Introduce helper bpf_get_task_stack(), which dumps stack trace of given
>>>> task. This is different to bpf_get_stack(), which gets stack track of
>>>> current task. One potential use case of bpf_get_task_stack() is to call
>>>> it from bpf_iter__task and dump all /proc/<pid>/stack to a seq_file.
>>>> 
>>>> bpf_get_task_stack() uses stack_trace_save_tsk() instead of
>>>> get_perf_callchain() for kernel stack. The benefit of this choice is that
>>>> stack_trace_save_tsk() doesn't require changes in arch/. The downside of
>>>> using stack_trace_save_tsk() is that stack_trace_save_tsk() dumps the
>>>> stack trace to unsigned long array. For 32-bit systems, we need to
>>>> translate it to u64 array.
>>>> 
>>>> Signed-off-by: Song Liu <songliubraving@fb.com>
>>>> ---
>>> 
>>> Looks great, I just think that there are cases where user doesn't
>>> necessarily has valid task_struct pointer, just pid, so would be nice
>>> to not artificially restrict such cases by having extra helper.
>>> 
>>> Acked-by: Andrii Nakryiko <andriin@fb.com>
>> 
>> Thanks!
>> 
>>> 
>>>> include/linux/bpf.h            |  1 +
>>>> include/uapi/linux/bpf.h       | 35 ++++++++++++++-
>>>> kernel/bpf/stackmap.c          | 79 ++++++++++++++++++++++++++++++++--
>>>> kernel/trace/bpf_trace.c       |  2 +
>>>> scripts/bpf_helpers_doc.py     |  2 +
>>>> tools/include/uapi/linux/bpf.h | 35 ++++++++++++++-
>>>> 6 files changed, 149 insertions(+), 5 deletions(-)
>>>> 
>>> 
>>> [...]
>>> 
>>>> +       /* stack_trace_save_tsk() works on unsigned long array, while
>>>> +        * perf_callchain_entry uses u64 array. For 32-bit systems, it is
>>>> +        * necessary to fix this mismatch.
>>>> +        */
>>>> +       if (__BITS_PER_LONG != 64) {
>>>> +               unsigned long *from = (unsigned long *) entry->ip;
>>>> +               u64 *to = entry->ip;
>>>> +               int i;
>>>> +
>>>> +               /* copy data from the end to avoid using extra buffer */
>>>> +               for (i = entry->nr - 1; i >= (int)init_nr; i--)
>>>> +                       to[i] = (u64)(from[i]);
>>> 
>>> doing this forward would be just fine as well, no? First iteration
>>> will cast and overwrite low 32-bits, all the subsequent iterations
>>> won't even overlap.
>> 
>> I think first iteration will write zeros to higher 32 bits, no?
> 
> Oh, wait, I completely misread what this is doing. It up-converts from
> 32-bit to 64-bit, sorry. Yeah, ignore me on this :)
> 
> But then I have another question. How do you know that entry->ip has
> enough space to keep the same number of 2x bigger entries?

The buffer is sized for sysctl_perf_event_max_stack u64 entries.
stack_trace_save_tsk() will put at most sysctl_perf_event_max_stack unsigned
longs in it (with init_nr == 0). So the buffer is big enough.
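
Spelling the arithmetic out with the default kernel.perf_event_max_stack of
127 (and init_nr == 0), numbers for illustration:

    capacity of entry->ip (at least):      127 * sizeof(u64)           = 1016 bytes
    worst case from stack_trace_save_tsk():
        64-bit:                            127 * sizeof(unsigned long) = 1016 bytes
        32-bit:                            127 * sizeof(unsigned long) =  508 bytes

The packed unsigned longs therefore always fit, and the backward widening
loop never writes past entry->ip[126].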

Thanks,
Song
Andrii Nakryiko June 27, 2020, 12:06 a.m. UTC | #8
On Fri, Jun 26, 2020 at 4:47 PM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Jun 26, 2020, at 3:51 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > On Fri, Jun 26, 2020 at 3:45 PM Song Liu <songliubraving@fb.com> wrote:
> >>
> >>
> >>
> >>> On Jun 26, 2020, at 1:17 PM, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >>>
> >>> On Thu, Jun 25, 2020 at 5:14 PM Song Liu <songliubraving@fb.com> wrote:
> >>>>
> >>>> Introduce helper bpf_get_task_stack(), which dumps stack trace of given
> >>>> task. This is different to bpf_get_stack(), which gets stack track of
> >>>> current task. One potential use case of bpf_get_task_stack() is to call
> >>>> it from bpf_iter__task and dump all /proc/<pid>/stack to a seq_file.
> >>>>
> >>>> bpf_get_task_stack() uses stack_trace_save_tsk() instead of
> >>>> get_perf_callchain() for kernel stack. The benefit of this choice is that
> >>>> stack_trace_save_tsk() doesn't require changes in arch/. The downside of
> >>>> using stack_trace_save_tsk() is that stack_trace_save_tsk() dumps the
> >>>> stack trace to unsigned long array. For 32-bit systems, we need to
> >>>> translate it to u64 array.
> >>>>
> >>>> Signed-off-by: Song Liu <songliubraving@fb.com>
> >>>> ---
> >>>
> >>> Looks great, I just think that there are cases where user doesn't
> >>> necessarily has valid task_struct pointer, just pid, so would be nice
> >>> to not artificially restrict such cases by having extra helper.
> >>>
> >>> Acked-by: Andrii Nakryiko <andriin@fb.com>
> >>
> >> Thanks!
> >>
> >>>
> >>>> include/linux/bpf.h            |  1 +
> >>>> include/uapi/linux/bpf.h       | 35 ++++++++++++++-
> >>>> kernel/bpf/stackmap.c          | 79 ++++++++++++++++++++++++++++++++--
> >>>> kernel/trace/bpf_trace.c       |  2 +
> >>>> scripts/bpf_helpers_doc.py     |  2 +
> >>>> tools/include/uapi/linux/bpf.h | 35 ++++++++++++++-
> >>>> 6 files changed, 149 insertions(+), 5 deletions(-)
> >>>>
> >>>
> >>> [...]
> >>>
> >>>> +       /* stack_trace_save_tsk() works on unsigned long array, while
> >>>> +        * perf_callchain_entry uses u64 array. For 32-bit systems, it is
> >>>> +        * necessary to fix this mismatch.
> >>>> +        */
> >>>> +       if (__BITS_PER_LONG != 64) {
> >>>> +               unsigned long *from = (unsigned long *) entry->ip;
> >>>> +               u64 *to = entry->ip;
> >>>> +               int i;
> >>>> +
> >>>> +               /* copy data from the end to avoid using extra buffer */
> >>>> +               for (i = entry->nr - 1; i >= (int)init_nr; i--)
> >>>> +                       to[i] = (u64)(from[i]);
> >>>
> >>> doing this forward would be just fine as well, no? First iteration
> >>> will cast and overwrite low 32-bits, all the subsequent iterations
> >>> won't even overlap.
> >>
> >> I think first iteration will write zeros to higher 32 bits, no?
> >
> > Oh, wait, I completely misread what this is doing. It up-converts from
> > 32-bit to 64-bit, sorry. Yeah, ignore me on this :)
> >
> > But then I have another question. How do you know that entry->ip has
> > enough space to keep the same number of 2x bigger entries?
>
> The buffer is sized for sysctl_perf_event_max_stack u64 numbers.
> stack_trace_save_tsk() will put at most stack_trace_save_tsk unsigned
> long in it (init_nr == 0). So the buffer is big enough.
>

Awesome, thanks for clarification!

> Thanks,
> Song

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 07052d44bca1c..cee31ee56367b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1607,6 +1607,7 @@  extern const struct bpf_func_proto bpf_get_current_uid_gid_proto;
 extern const struct bpf_func_proto bpf_get_current_comm_proto;
 extern const struct bpf_func_proto bpf_get_stackid_proto;
 extern const struct bpf_func_proto bpf_get_stack_proto;
+extern const struct bpf_func_proto bpf_get_task_stack_proto;
 extern const struct bpf_func_proto bpf_sock_map_update_proto;
 extern const struct bpf_func_proto bpf_sock_hash_update_proto;
 extern const struct bpf_func_proto bpf_get_current_cgroup_id_proto;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 19684813faaed..7638412987354 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3252,6 +3252,38 @@  union bpf_attr {
  * 		case of **BPF_CSUM_LEVEL_QUERY**, the current skb->csum_level
  * 		is returned or the error code -EACCES in case the skb is not
  * 		subject to CHECKSUM_UNNECESSARY.
+ *
+ * int bpf_get_task_stack(struct task_struct *task, void *buf, u32 size, u64 flags)
+ *	Description
+ *		Return a user or a kernel stack in bpf program provided buffer.
+ *		To achieve this, the helper needs *task*, which is a valid
+ *		pointer to struct task_struct. To store the stacktrace, the
+ *		bpf program provides *buf* with	a nonnegative *size*.
+ *
+ *		The last argument, *flags*, holds the number of stack frames to
+ *		skip (from 0 to 255), masked with
+ *		**BPF_F_SKIP_FIELD_MASK**. The next bits can be used to set
+ *		the following flags:
+ *
+ *		**BPF_F_USER_STACK**
+ *			Collect a user space stack instead of a kernel stack.
+ *		**BPF_F_USER_BUILD_ID**
+ *			Collect buildid+offset instead of ips for user stack,
+ *			only valid if **BPF_F_USER_STACK** is also specified.
+ *
+ *		**bpf_get_task_stack**\ () can collect up to
+ *		**PERF_MAX_STACK_DEPTH** both kernel and user frames, subject
+ *		to sufficient large buffer size. Note that
+ *		this limit can be controlled with the **sysctl** program, and
+ *		that it should be manually increased in order to profile long
+ *		user stacks (such as stacks for Java programs). To do so, use:
+ *
+ *		::
+ *
+ *			# sysctl kernel.perf_event_max_stack=<new value>
+ *	Return
+ *		A non-negative value equal to or less than *size* on success,
+ *		or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3389,7 +3421,8 @@  union bpf_attr {
 	FN(ringbuf_submit),		\
 	FN(ringbuf_discard),		\
 	FN(ringbuf_query),		\
-	FN(csum_level),
+	FN(csum_level),			\
+	FN(get_task_stack),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 599488f25e404..64b7843057a23 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -348,6 +348,44 @@  static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 	}
 }
 
+static struct perf_callchain_entry *
+get_callchain_entry_for_task(struct task_struct *task, u32 init_nr)
+{
+	struct perf_callchain_entry *entry;
+	int rctx;
+
+	entry = get_callchain_entry(&rctx);
+
+	if (rctx == -1)
+		return NULL;
+
+	if (!entry)
+		goto exit_put;
+
+	entry->nr = init_nr +
+		stack_trace_save_tsk(task, (unsigned long *)(entry->ip + init_nr),
+				     sysctl_perf_event_max_stack - init_nr, 0);
+
+	/* stack_trace_save_tsk() works on unsigned long array, while
+	 * perf_callchain_entry uses u64 array. For 32-bit systems, it is
+	 * necessary to fix this mismatch.
+	 */
+	if (__BITS_PER_LONG != 64) {
+		unsigned long *from = (unsigned long *) entry->ip;
+		u64 *to = entry->ip;
+		int i;
+
+		/* copy data from the end to avoid using extra buffer */
+		for (i = entry->nr - 1; i >= (int)init_nr; i--)
+			to[i] = (u64)(from[i]);
+	}
+
+exit_put:
+	put_callchain_entry(rctx);
+
+	return entry;
+}
+
 BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	   u64, flags)
 {
@@ -448,8 +486,8 @@  const struct bpf_func_proto bpf_get_stackid_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
-BPF_CALL_4(bpf_get_stack, struct pt_regs *, regs, void *, buf, u32, size,
-	   u64, flags)
+static int __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
+			   void *buf, u32 size, u64 flags)
 {
 	u32 init_nr, trace_nr, copy_len, elem_size, num_elem;
 	bool user_build_id = flags & BPF_F_USER_BUILD_ID;
@@ -471,13 +509,22 @@  BPF_CALL_4(bpf_get_stack, struct pt_regs *, regs, void *, buf, u32, size,
 	if (unlikely(size % elem_size))
 		goto clear;
 
+	/* cannot get valid user stack for task without user_mode regs */
+	if (task && user && !user_mode(regs))
+		goto err_fault;
+
 	num_elem = size / elem_size;
 	if (sysctl_perf_event_max_stack < num_elem)
 		init_nr = 0;
 	else
 		init_nr = sysctl_perf_event_max_stack - num_elem;
+
+	if (kernel && task)
+		trace = get_callchain_entry_for_task(task, init_nr);
+	else
 		trace = get_perf_callchain(regs, init_nr, kernel, user,
-				   sysctl_perf_event_max_stack, false, false);
+					   sysctl_perf_event_max_stack,
+					   false, false);
 	if (unlikely(!trace))
 		goto err_fault;
 
@@ -505,6 +552,12 @@  BPF_CALL_4(bpf_get_stack, struct pt_regs *, regs, void *, buf, u32, size,
 	return err;
 }
 
+BPF_CALL_4(bpf_get_stack, struct pt_regs *, regs, void *, buf, u32, size,
+	   u64, flags)
+{
+	return __bpf_get_stack(regs, NULL, buf, size, flags);
+}
+
 const struct bpf_func_proto bpf_get_stack_proto = {
 	.func		= bpf_get_stack,
 	.gpl_only	= true,
@@ -515,6 +568,26 @@  const struct bpf_func_proto bpf_get_stack_proto = {
 	.arg4_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_get_task_stack, struct task_struct *, task, void *, buf,
+	   u32, size, u64, flags)
+{
+	struct pt_regs *regs = task_pt_regs(task);
+
+	return __bpf_get_stack(regs, task, buf, size, flags);
+}
+
+static int bpf_get_task_stack_btf_ids[5];
+const struct bpf_func_proto bpf_get_task_stack_proto = {
+	.func		= bpf_get_task_stack,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_BTF_ID,
+	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg3_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg4_type	= ARG_ANYTHING,
+	.btf_id		= bpf_get_task_stack_btf_ids,
+};
+
 /* Called from eBPF program */
 static void *stack_map_lookup_elem(struct bpf_map *map, void *key)
 {
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index e729c9e587a07..65fa62723e2f8 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1134,6 +1134,8 @@  bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_ringbuf_discard_proto;
 	case BPF_FUNC_ringbuf_query:
 		return &bpf_ringbuf_query_proto;
+	case BPF_FUNC_get_task_stack:
+		return &bpf_get_task_stack_proto;
 	default:
 		return NULL;
 	}
diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
index 91fa668fa8602..a8783d668c5b7 100755
--- a/scripts/bpf_helpers_doc.py
+++ b/scripts/bpf_helpers_doc.py
@@ -421,6 +421,7 @@  class PrinterHelpers(Printer):
             'struct sockaddr',
             'struct tcphdr',
             'struct seq_file',
+            'struct task_struct',
 
             'struct __sk_buff',
             'struct sk_msg_md',
@@ -458,6 +459,7 @@  class PrinterHelpers(Printer):
             'struct sockaddr',
             'struct tcphdr',
             'struct seq_file',
+            'struct task_struct',
     }
     mapped_types = {
             'u8': '__u8',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 19684813faaed..7638412987354 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3252,6 +3252,38 @@  union bpf_attr {
  * 		case of **BPF_CSUM_LEVEL_QUERY**, the current skb->csum_level
  * 		is returned or the error code -EACCES in case the skb is not
  * 		subject to CHECKSUM_UNNECESSARY.
+ *
+ * int bpf_get_task_stack(struct task_struct *task, void *buf, u32 size, u64 flags)
+ *	Description
+ *		Return a user or a kernel stack in bpf program provided buffer.
+ *		To achieve this, the helper needs *task*, which is a valid
+ *		pointer to struct task_struct. To store the stacktrace, the
+ *		bpf program provides *buf* with	a nonnegative *size*.
+ *
+ *		The last argument, *flags*, holds the number of stack frames to
+ *		skip (from 0 to 255), masked with
+ *		**BPF_F_SKIP_FIELD_MASK**. The next bits can be used to set
+ *		the following flags:
+ *
+ *		**BPF_F_USER_STACK**
+ *			Collect a user space stack instead of a kernel stack.
+ *		**BPF_F_USER_BUILD_ID**
+ *			Collect buildid+offset instead of ips for user stack,
+ *			only valid if **BPF_F_USER_STACK** is also specified.
+ *
+ *		**bpf_get_task_stack**\ () can collect up to
+ *		**PERF_MAX_STACK_DEPTH** both kernel and user frames, subject
+ *		to sufficient large buffer size. Note that
+ *		this limit can be controlled with the **sysctl** program, and
+ *		that it should be manually increased in order to profile long
+ *		user stacks (such as stacks for Java programs). To do so, use:
+ *
+ *		::
+ *
+ *			# sysctl kernel.perf_event_max_stack=<new value>
+ *	Return
+ *		A non-negative value equal to or less than *size* on success,
+ *		or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3389,7 +3421,8 @@  union bpf_attr {
 	FN(ringbuf_submit),		\
 	FN(ringbuf_discard),		\
 	FN(ringbuf_query),		\
-	FN(csum_level),
+	FN(csum_level),			\
+	FN(get_task_stack),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call