diff mbox series

[bpf-next,v10,1/2] bpf: add new helper get_file_path for mapping a file descriptor to a pathname

Message ID e8b1281b7405eb4b6c1f094169e6efd2c8cc95da.1574162990.git.ethercflow@gmail.com
State Changes Requested
Delegated to: BPF Maintainers
Headers show
Series bpf: adding get_file_path helper | expand

Commit Message

Wenbo Zhang Nov. 19, 2019, 1:27 p.m. UTC
When people want to identify which file system files are being opened,
read, and written to, they can use this helper with file descriptor as
input to achieve this goal. Other pseudo filesystems are also supported.

This requirement is mainly discussed here:

  https://github.com/iovisor/bcc/issues/237

v9->v10: addressed Andrii's feedback
- send this patch together with the patch selftests as one patch series

v8->v9:
- format helper description

v7->v8: addressed Alexei's feedback
- use fget_raw instead of fdget_raw, as fdget_raw is only used inside fs/
- ensure we're in user context which is safe fot the help to run
- filter unmountable pseudo filesystem, because they don't have real path
- supplement the description of this helper function

v6->v7:
- fix missing signed-off-by line

v5->v6: addressed Andrii's feedback
- avoid unnecessary goto end by having two explicit returns

v4->v5: addressed Andrii and Daniel's feedback
- rename bpf_fd2path to bpf_get_file_path to be consistent with other
helper's names
- when fdget_raw fails, set ret to -EBADF instead of -EINVAL
- remove fdput from fdget_raw's error path
- use IS_ERR instead of IS_ERR_OR_NULL as d_path ether returns a pointer
into the buffer or an error code if the path was too long
- modify the normal path's return value to return copied string length
including NUL
- update this helper description's Return bits.

v3->v4: addressed Daniel's feedback
- fix missing fdput()
- move fd2path from kernel/bpf/trace.c to kernel/trace/bpf_trace.c
- move fd2path's test code to another patch
- add comment to explain why use fdget_raw instead of fdget

v2->v3: addressed Yonghong's feedback
- remove unnecessary LOCKDOWN_BPF_READ
- refactor error handling section for enhanced readability
- provide a test case in tools/testing/selftests/bpf

v1->v2: addressed Daniel's feedback
- fix backward compatibility
- add this helper description
- fix signed-off name

Signed-off-by: Wenbo Zhang <ethercflow@gmail.com>
---
 include/uapi/linux/bpf.h       | 29 +++++++++++++++-
 kernel/trace/bpf_trace.c       | 63 ++++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h | 29 +++++++++++++++-
 3 files changed, 119 insertions(+), 2 deletions(-)

Comments

Alexei Starovoitov Nov. 23, 2019, 3:18 a.m. UTC | #1
On Tue, Nov 19, 2019 at 08:27:37AM -0500, Wenbo Zhang wrote:
> When people want to identify which file system files are being opened,
> read, and written to, they can use this helper with file descriptor as
> input to achieve this goal. Other pseudo filesystems are also supported.
> 
> This requirement is mainly discussed here:
> 
>   https://github.com/iovisor/bcc/issues/237
> 
> v9->v10: addressed Andrii's feedback
> - send this patch together with the patch selftests as one patch series
> 
> v8->v9:
> - format helper description
> 
> v7->v8: addressed Alexei's feedback
> - use fget_raw instead of fdget_raw, as fdget_raw is only used inside fs/
> - ensure we're in user context which is safe fot the help to run
> - filter unmountable pseudo filesystem, because they don't have real path
> - supplement the description of this helper function
> 
> v6->v7:
> - fix missing signed-off-by line
> 
> v5->v6: addressed Andrii's feedback
> - avoid unnecessary goto end by having two explicit returns
> 
> v4->v5: addressed Andrii and Daniel's feedback
> - rename bpf_fd2path to bpf_get_file_path to be consistent with other
> helper's names
> - when fdget_raw fails, set ret to -EBADF instead of -EINVAL
> - remove fdput from fdget_raw's error path
> - use IS_ERR instead of IS_ERR_OR_NULL as d_path ether returns a pointer
> into the buffer or an error code if the path was too long
> - modify the normal path's return value to return copied string length
> including NUL
> - update this helper description's Return bits.
> 
> v3->v4: addressed Daniel's feedback
> - fix missing fdput()
> - move fd2path from kernel/bpf/trace.c to kernel/trace/bpf_trace.c
> - move fd2path's test code to another patch
> - add comment to explain why use fdget_raw instead of fdget
> 
> v2->v3: addressed Yonghong's feedback
> - remove unnecessary LOCKDOWN_BPF_READ
> - refactor error handling section for enhanced readability
> - provide a test case in tools/testing/selftests/bpf
> 
> v1->v2: addressed Daniel's feedback
> - fix backward compatibility
> - add this helper description
> - fix signed-off name
> 
> Signed-off-by: Wenbo Zhang <ethercflow@gmail.com>
...
> +BPF_CALL_3(bpf_get_file_path, char *, dst, u32, size, int, fd)
> +{
> +	struct file *f;
> +	char *p;
> +	int ret = -EBADF;
> +
> +	/* Ensure we're in user context which is safe for the helper to
> +	 * run. This helper has no business in a kthread.
> +	 */
> +	if (unlikely(in_interrupt() ||
> +		     current->flags & (PF_KTHREAD | PF_EXITING)))
> +		return -EPERM;
> +
> +	/* Use fget_raw instead of fget to support O_PATH, and it doesn't
> +	 * have any sleepable code, so it's ok to be here.
> +	 */
> +	f = fget_raw(fd);
> +	if (!f)
> +		goto error;
> +
> +	/* For unmountable pseudo filesystem, it seems to have no meaning
> +	 * to get their fake paths as they don't have path, and to be no
> +	 * way to validate this function pointer can be always safe to call
> +	 * in the current context.
> +	 */
> +	if (f->f_path.dentry->d_op && f->f_path.dentry->d_op->d_dname)
> +		return -EINVAL;
> +
> +	/* After filter unmountable pseudo filesytem, d_path won't call
> +	 * dentry->d_op->d_name(), the normally path doesn't have any
> +	 * sleepable code, and despite it uses the current macro to get
> +	 * fs_struct (current->fs), we've already ensured we're in user
> +	 * context, so it's ok to be here.
> +	 */
> +	p = d_path(&f->f_path, dst, size);
> +	if (IS_ERR(p)) {
> +		ret = PTR_ERR(p);
> +		fput(f);
> +		goto error;
> +	}
> +
> +	ret = strlen(p);
> +	memmove(dst, p, ret);
> +	dst[ret++] = '\0';
> +	fput(f);
> +	return ret;
> +
> +error:
> +	memset(dst, '0', size);
> +	return ret;
> +}

Al,

could you please review about code whether it's doing enough checks to be
called safely from preempt_disabled region?

It's been under review for many weeks and looks good from bpf pov. Essentially
tracing folks need easy way to convert FD to full path name. This feature
request first came in 2015.

Thanks!
Al Viro Nov. 23, 2019, 4:43 a.m. UTC | #2
On Fri, Nov 22, 2019 at 07:18:28PM -0800, Alexei Starovoitov wrote:
> > +	/* After filter unmountable pseudo filesytem, d_path won't call
> > +	 * dentry->d_op->d_name(), the normally path doesn't have any
> > +	 * sleepable code, and despite it uses the current macro to get
> > +	 * fs_struct (current->fs), we've already ensured we're in user
> > +	 * context, so it's ok to be here.
> > +	 */
> > +	p = d_path(&f->f_path, dst, size);
> > +	if (IS_ERR(p)) {
> > +		ret = PTR_ERR(p);
> > +		fput(f);
> > +		goto error;
> > +	}
> > +
> > +	ret = strlen(p);
> > +	memmove(dst, p, ret);
> > +	dst[ret++] = '\0';
> > +	fput(f);
> > +	return ret;
> > +
> > +error:
> > +	memset(dst, '0', size);
> > +	return ret;
> > +}
> 
> Al,
> 
> could you please review about code whether it's doing enough checks to be
> called safely from preempt_disabled region?

Depends.  Which context is it running in?  In particular, which
locks might be already held?
Al Viro Nov. 23, 2019, 4:51 a.m. UTC | #3
On Fri, Nov 22, 2019 at 07:18:28PM -0800, Alexei Starovoitov wrote:
> > +	f = fget_raw(fd);
> > +	if (!f)
> > +		goto error;
> > +
> > +	/* For unmountable pseudo filesystem, it seems to have no meaning
> > +	 * to get their fake paths as they don't have path, and to be no
> > +	 * way to validate this function pointer can be always safe to call
> > +	 * in the current context.
> > +	 */
> > +	if (f->f_path.dentry->d_op && f->f_path.dentry->d_op->d_dname)
> > +		return -EINVAL;

An obvious leak here, BTW.

Anyway, what could that be used for?  I mean, if you want to check
something about syscall arguments, that's an unfixably racy way to go.
Descriptor table can be a shared data structure, and two consequent
fdget() on the same number can bloody well yield completely unrelated
struct file references.

IOW, anything that does descriptor -> struct file * translation more than
once is an instant TOCTOU suspect.  In this particular case, the function
will produce a pathname of something that was once reachable via descriptor
with this number; quite possibly never before that function had been called
_and_ not once after it has returned.
Alexei Starovoitov Nov. 23, 2019, 5:19 a.m. UTC | #4
On Sat, Nov 23, 2019 at 04:51:51AM +0000, Al Viro wrote:
> On Fri, Nov 22, 2019 at 07:18:28PM -0800, Alexei Starovoitov wrote:
> > > +	f = fget_raw(fd);
> > > +	if (!f)
> > > +		goto error;
> > > +
> > > +	/* For unmountable pseudo filesystem, it seems to have no meaning
> > > +	 * to get their fake paths as they don't have path, and to be no
> > > +	 * way to validate this function pointer can be always safe to call
> > > +	 * in the current context.
> > > +	 */
> > > +	if (f->f_path.dentry->d_op && f->f_path.dentry->d_op->d_dname)
> > > +		return -EINVAL;
> 
> An obvious leak here, BTW.

ohh. right.

> Depends.  Which context is it running in?  In particular, which
> locks might be already held?

hard to tell. It will be run out of bpf prog that attaches to kprobe or
tracepoint. What is the concern about locking?
d_path() doesn't take any locks and doesn't depend on any locks. Above 'if'
checks that plain d_path() is used and not some specilized callback with
unknown logic.

> Anyway, what could that be used for?  I mean, if you want to check
> something about syscall arguments, that's an unfixably racy way to go.
> Descriptor table can be a shared data structure, and two consequent
> fdget() on the same number can bloody well yield completely unrelated
> struct file references.

yes. It is racy. There are no guarantees on correctness of FD.
The program can pass arbitrary integer into this helper.

> IOW, anything that does descriptor -> struct file * translation more than
> once is an instant TOCTOU suspect.  In this particular case, the function
> will produce a pathname of something that was once reachable via descriptor
> with this number; quite possibly never before that function had been called
> _and_ not once after it has returned.

Right. TOCTOU is not a concern here. It's tracing. It's ok for full path to be
'one time deal'. Right now people use bpf_probe_read() to replicate what
d_path() does.
See https://github.com/iovisor/bcc/issues/237#issuecomment-547564661
It sort of works, but calling d_path() is simpler and more accurate.
The key thing that bpf helpers need to make sure is that regardless of
how they're called and what integer is passed in as an FD the helper
must not crash or lockup the kernel or cause it to misbehave.
Hence above in_interrupt() and other checks to limit the context.
Al Viro Nov. 23, 2019, 5:35 a.m. UTC | #5
On Fri, Nov 22, 2019 at 09:19:21PM -0800, Alexei Starovoitov wrote:

> hard to tell. It will be run out of bpf prog that attaches to kprobe or
> tracepoint. What is the concern about locking?
> d_path() doesn't take any locks and doesn't depend on any locks. Above 'if'
> checks that plain d_path() is used and not some specilized callback with
> unknown logic.

It sure as hell does.  It might end up taking rename_lock and/or mount_lock
spinlock components.  It'll try not to, but if the first pass ends up with
seqlock mismatch, it will just grab the spinlock the second time around.

> > with this number; quite possibly never before that function had been called
> > _and_ not once after it has returned.
> 
> Right. TOCTOU is not a concern here. It's tracing. It's ok for full path to be
> 'one time deal'.

It might very well be a full path of something completely unrelated to what
the syscall ends up operating upon.  It's not that the file might've been
moved; it might be a different file.  IOW, results of that tracing might be
misleading.
Alexei Starovoitov Nov. 23, 2019, 6:04 a.m. UTC | #6
On Sat, Nov 23, 2019 at 05:35:14AM +0000, Al Viro wrote:
> On Fri, Nov 22, 2019 at 09:19:21PM -0800, Alexei Starovoitov wrote:
> 
> > hard to tell. It will be run out of bpf prog that attaches to kprobe or
> > tracepoint. What is the concern about locking?
> > d_path() doesn't take any locks and doesn't depend on any locks. Above 'if'
> > checks that plain d_path() is used and not some specilized callback with
> > unknown logic.
> 
> It sure as hell does.  It might end up taking rename_lock and/or mount_lock
> spinlock components.  It'll try not to, but if the first pass ends up with
> seqlock mismatch, it will just grab the spinlock the second time around.

ohh. got it. I missed _or_lock() part in there.
The need_seqretry() logic is tricky. afaics there is no way for the checks
outside of prepend_path() to prevent spin_lock to happen. And adding a flag to
prepend_path() to return early if retry is needed is too ugly. So this helper
won't be safe to be run out of kprobe. But if we allow it for tracepoints only
it should be ok. I think. There are no tracepoints in inner guts of vfs and I
don't think they will ever be. So running in tracepoint->bpf_prog->d_path we
will be sure that rename_lock+mount_lock can be safely spinlocked. Am I missing
something?

> > > with this number; quite possibly never before that function had been called
> > > _and_ not once after it has returned.
> > 
> > Right. TOCTOU is not a concern here. It's tracing. It's ok for full path to be
> > 'one time deal'.
> 
> It might very well be a full path of something completely unrelated to what
> the syscall ends up operating upon.  It's not that the file might've been
> moved; it might be a different file.  IOW, results of that tracing might be
> misleading.

That is correct. Tracing is fine with such limitation. Still better than probe_read.
Brendan Gregg Dec. 13, 2019, 7:51 p.m. UTC | #7
On Fri, Nov 22, 2019 at 10:05 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sat, Nov 23, 2019 at 05:35:14AM +0000, Al Viro wrote:
> > On Fri, Nov 22, 2019 at 09:19:21PM -0800, Alexei Starovoitov wrote:
> >
> > > hard to tell. It will be run out of bpf prog that attaches to kprobe or
> > > tracepoint. What is the concern about locking?
> > > d_path() doesn't take any locks and doesn't depend on any locks. Above 'if'
> > > checks that plain d_path() is used and not some specilized callback with
> > > unknown logic.
> >
> > It sure as hell does.  It might end up taking rename_lock and/or mount_lock
> > spinlock components.  It'll try not to, but if the first pass ends up with
> > seqlock mismatch, it will just grab the spinlock the second time around.
>
> ohh. got it. I missed _or_lock() part in there.
> The need_seqretry() logic is tricky. afaics there is no way for the checks
> outside of prepend_path() to prevent spin_lock to happen. And adding a flag to
> prepend_path() to return early if retry is needed is too ugly. So this helper
> won't be safe to be run out of kprobe. But if we allow it for tracepoints only
> it should be ok. I think. There are no tracepoints in inner guts of vfs and I
> don't think they will ever be. So running in tracepoint->bpf_prog->d_path we
> will be sure that rename_lock+mount_lock can be safely spinlocked. Am I missing
> something?

It seems rather restrictive to only allow tracepoints (especially
without VFS tracepoints), although I'll use it to improve my syscall
tracepoint tools, so I'd be happy to see this merged even with that
restriction.

Just a thought: if *buffer is in BPF memory, can prepend_path() check
it's memory location and not try to grab the lock based on that? This
would be to avoid adding a flag.

>
> > > > with this number; quite possibly never before that function had been called
> > > > _and_ not once after it has returned.
> > >
> > > Right. TOCTOU is not a concern here. It's tracing. It's ok for full path to be
> > > 'one time deal'.
> >
> > It might very well be a full path of something completely unrelated to what
> > the syscall ends up operating upon.  It's not that the file might've been
> > moved; it might be a different file.  IOW, results of that tracing might be
> > misleading.
>
> That is correct. Tracing is fine with such limitation. Still better than probe_read.
>

+1

Tracing is observability tools and we document these caveats, and this
won't be the first time I've published tools where the printed path
may not be the one you think (e.g., the case of hard links.)

Brendan
diff mbox series

Patch

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index dbbcf0b02970..71d9705df120 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2821,6 +2821,32 @@  union bpf_attr {
  * 	Return
  * 		On success, the strictly positive length of the string,	including
  * 		the trailing NUL character. On error, a negative value.
+ *
+ * int bpf_get_file_path(char *path, u32 size, int fd)
+ *	Description
+ *		Get **file** atrribute from the current task by *fd*, then call
+ *		**d_path** to get it's absolute path and copy it as string into
+ *		*path* of *size*. Notice the **path** don't support unmountable
+ *		pseudo filesystems as they don't have path (eg: SOCKFS, PIPEFS).
+ *		The *size* must be strictly positive. On success, the helper
+ *		makes sure that the *path* is NUL-terminated, and the buffer
+ *		could be:
+ *		- a regular full path (include mountable fs eg: /proc, /sys)
+ *		- a regular full path with "(deleted)" at the end.
+ *		On failure, it is filled with zeroes.
+ *	Return
+ *		On success, returns the length of the copied string INCLUDING
+ *		the trailing NUL.
+ *
+ *		On failure, the returned value is one of the following:
+ *
+ *		**-EPERM** if no permission to get the path (eg: in irq ctx).
+ *
+ *		**-EBADF** if *fd* is invalid.
+ *
+ *		**-EINVAL** if *fd* corresponds to a unmountable pseudo fs
+ *
+ *		**-ENAMETOOLONG** if full path is longer than *size*
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2938,7 +2964,8 @@  union bpf_attr {
 	FN(probe_read_user),		\
 	FN(probe_read_kernel),		\
 	FN(probe_read_user_str),	\
-	FN(probe_read_kernel_str),
+	FN(probe_read_kernel_str),	\
+	FN(get_file_path),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ffc91d4935ac..c77e55418f1e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -762,6 +762,67 @@  static const struct bpf_func_proto bpf_send_signal_proto = {
 	.arg1_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_get_file_path, char *, dst, u32, size, int, fd)
+{
+	struct file *f;
+	char *p;
+	int ret = -EBADF;
+
+	/* Ensure we're in user context which is safe for the helper to
+	 * run. This helper has no business in a kthread.
+	 */
+	if (unlikely(in_interrupt() ||
+		     current->flags & (PF_KTHREAD | PF_EXITING)))
+		return -EPERM;
+
+	/* Use fget_raw instead of fget to support O_PATH, and it doesn't
+	 * have any sleepable code, so it's ok to be here.
+	 */
+	f = fget_raw(fd);
+	if (!f)
+		goto error;
+
+	/* For unmountable pseudo filesystem, it seems to have no meaning
+	 * to get their fake paths as they don't have path, and to be no
+	 * way to validate this function pointer can be always safe to call
+	 * in the current context.
+	 */
+	if (f->f_path.dentry->d_op && f->f_path.dentry->d_op->d_dname)
+		return -EINVAL;
+
+	/* After filter unmountable pseudo filesytem, d_path won't call
+	 * dentry->d_op->d_name(), the normally path doesn't have any
+	 * sleepable code, and despite it uses the current macro to get
+	 * fs_struct (current->fs), we've already ensured we're in user
+	 * context, so it's ok to be here.
+	 */
+	p = d_path(&f->f_path, dst, size);
+	if (IS_ERR(p)) {
+		ret = PTR_ERR(p);
+		fput(f);
+		goto error;
+	}
+
+	ret = strlen(p);
+	memmove(dst, p, ret);
+	dst[ret++] = '\0';
+	fput(f);
+	return ret;
+
+error:
+	memset(dst, '0', size);
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_get_file_path_proto = {
+	.func       = bpf_get_file_path,
+	.gpl_only   = true,
+	.ret_type   = RET_INTEGER,
+	.arg1_type  = ARG_PTR_TO_UNINIT_MEM,
+	.arg2_type  = ARG_CONST_SIZE,
+	.arg3_type  = ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *
 tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -822,6 +883,8 @@  tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 #endif
 	case BPF_FUNC_send_signal:
 		return &bpf_send_signal_proto;
+	case BPF_FUNC_get_file_path:
+		return &bpf_get_file_path_proto;
 	default:
 		return NULL;
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index dbbcf0b02970..71d9705df120 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2821,6 +2821,32 @@  union bpf_attr {
  * 	Return
  * 		On success, the strictly positive length of the string,	including
  * 		the trailing NUL character. On error, a negative value.
+ *
+ * int bpf_get_file_path(char *path, u32 size, int fd)
+ *	Description
+ *		Get **file** atrribute from the current task by *fd*, then call
+ *		**d_path** to get it's absolute path and copy it as string into
+ *		*path* of *size*. Notice the **path** don't support unmountable
+ *		pseudo filesystems as they don't have path (eg: SOCKFS, PIPEFS).
+ *		The *size* must be strictly positive. On success, the helper
+ *		makes sure that the *path* is NUL-terminated, and the buffer
+ *		could be:
+ *		- a regular full path (include mountable fs eg: /proc, /sys)
+ *		- a regular full path with "(deleted)" at the end.
+ *		On failure, it is filled with zeroes.
+ *	Return
+ *		On success, returns the length of the copied string INCLUDING
+ *		the trailing NUL.
+ *
+ *		On failure, the returned value is one of the following:
+ *
+ *		**-EPERM** if no permission to get the path (eg: in irq ctx).
+ *
+ *		**-EBADF** if *fd* is invalid.
+ *
+ *		**-EINVAL** if *fd* corresponds to a unmountable pseudo fs
+ *
+ *		**-ENAMETOOLONG** if full path is longer than *size*
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2938,7 +2964,8 @@  union bpf_attr {
 	FN(probe_read_user),		\
 	FN(probe_read_kernel),		\
 	FN(probe_read_user_str),	\
-	FN(probe_read_kernel_str),
+	FN(probe_read_kernel_str),	\
+	FN(get_file_path),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call