diff mbox series

[RFC,bpf-next,4/8] bpf: support GET_FD_BY_ID and GET_NEXT_ID for bpf_link

Message ID 20200404000948.3980903-5-andriin@fb.com
State RFC
Delegated to: BPF Maintainers
Headers show
Series bpf_link observability APIs | expand

Commit Message

Andrii Nakryiko April 4, 2020, 12:09 a.m. UTC
Add support to look up bpf_link by ID and iterate over all existing bpf_links
in the system. GET_FD_BY_ID code handles not-yet-ready bpf_link by checking
that its ID hasn't been set to non-zero value yet. Setting bpf_link's ID is
done as the very last step in finalizing bpf_link, together with installing
FD. This approach allows users of bpf_link in kernel code to not worry about
races between user-space and kernel code that hasn't finished attaching and
initializing bpf_link.

Further, it's critical that BPF_LINK_GET_FD_BY_ID only ever allows to create
bpf_link FD that's O_RDONLY. This is to protect processes owning bpf_link and
thus allowed to perform modifications on them (like LINK_UPDATE), from other
processes that got bpf_link ID from GET_NEXT_ID API. In the latter case, only
querying bpf_link information (implemented later in the series) will be
allowed.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 include/uapi/linux/bpf.h |  2 ++
 kernel/bpf/syscall.c     | 56 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+)

Comments

Toke Høiland-Jørgensen April 6, 2020, 11:34 a.m. UTC | #1
Andrii Nakryiko <andriin@fb.com> writes:

> Add support to look up bpf_link by ID and iterate over all existing bpf_links
> in the system. GET_FD_BY_ID code handles not-yet-ready bpf_link by checking
> that its ID hasn't been set to non-zero value yet. Setting bpf_link's ID is
> done as the very last step in finalizing bpf_link, together with installing
> FD. This approach allows users of bpf_link in kernel code to not worry about
> races between user-space and kernel code that hasn't finished attaching and
> initializing bpf_link.
>
> Further, it's critical that BPF_LINK_GET_FD_BY_ID only ever allows to create
> bpf_link FD that's O_RDONLY. This is to protect processes owning bpf_link and
> thus allowed to perform modifications on them (like LINK_UPDATE), from other
> processes that got bpf_link ID from GET_NEXT_ID API. In the latter case, only
> querying bpf_link information (implemented later in the series) will be
> allowed.

I must admit I remain sceptical about this model of restricting access
without any of the regular override mechanisms (for instance, enforcing
read-only mode regardless of CAP_DAC_OVERRIDE in this series). Since you
keep saying there would be 'some' override mechanism, I think it would
be helpful if you could just include that so we can see the full
mechanism in context.

-Toke
Andrii Nakryiko April 6, 2020, 7:06 p.m. UTC | #2
On Mon, Apr 6, 2020 at 4:34 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andriin@fb.com> writes:
>
> > Add support to look up bpf_link by ID and iterate over all existing bpf_links
> > in the system. GET_FD_BY_ID code handles not-yet-ready bpf_link by checking
> > that its ID hasn't been set to non-zero value yet. Setting bpf_link's ID is
> > done as the very last step in finalizing bpf_link, together with installing
> > FD. This approach allows users of bpf_link in kernel code to not worry about
> > races between user-space and kernel code that hasn't finished attaching and
> > initializing bpf_link.
> >
> > Further, it's critical that BPF_LINK_GET_FD_BY_ID only ever allows to create
> > bpf_link FD that's O_RDONLY. This is to protect processes owning bpf_link and
> > thus allowed to perform modifications on them (like LINK_UPDATE), from other
> > processes that got bpf_link ID from GET_NEXT_ID API. In the latter case, only
> > querying bpf_link information (implemented later in the series) will be
> > allowed.
>
> I must admit I remain sceptical about this model of restricting access
> without any of the regular override mechanisms (for instance, enforcing
> read-only mode regardless of CAP_DAC_OVERRIDE in this series). Since you
> keep saying there would be 'some' override mechanism, I think it would
> be helpful if you could just include that so we can see the full
> mechanism in context.

I wasn't aware of CAP_DAC_OVERRIDE, thanks for bringing this up.

One way to go about this is to allow creating writable bpf_link for
GET_FD_BY_ID if CAP_DAC_OVERRIDE is set. Then we can allow LINK_DETACH
operation on writable links, same as we do with LINK_UPDATE here.
LINK_DETACH will do the same as cgroup bpf_link auto-detachment on
cgroup dying: it will detach bpf_link, but will leave it alive until
last FD is closed.

We need to consider, though, if CAP_DAC_OVERRIDE is something that can
be disabled for majority of real-life applications to prevent them
from doing this. If every realistic application has/needs
CAP_DAC_OVERRIDE, then that's essentially just saying that anyone can
get writable bpf_link and do anything with it.

>
> -Toke
>
Toke Høiland-Jørgensen April 8, 2020, 3:14 p.m. UTC | #3
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Mon, Apr 6, 2020 at 4:34 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Andrii Nakryiko <andriin@fb.com> writes:
>>
>> > Add support to look up bpf_link by ID and iterate over all existing bpf_links
>> > in the system. GET_FD_BY_ID code handles not-yet-ready bpf_link by checking
>> > that its ID hasn't been set to non-zero value yet. Setting bpf_link's ID is
>> > done as the very last step in finalizing bpf_link, together with installing
>> > FD. This approach allows users of bpf_link in kernel code to not worry about
>> > races between user-space and kernel code that hasn't finished attaching and
>> > initializing bpf_link.
>> >
>> > Further, it's critical that BPF_LINK_GET_FD_BY_ID only ever allows to create
>> > bpf_link FD that's O_RDONLY. This is to protect processes owning bpf_link and
>> > thus allowed to perform modifications on them (like LINK_UPDATE), from other
>> > processes that got bpf_link ID from GET_NEXT_ID API. In the latter case, only
>> > querying bpf_link information (implemented later in the series) will be
>> > allowed.
>>
>> I must admit I remain sceptical about this model of restricting access
>> without any of the regular override mechanisms (for instance, enforcing
>> read-only mode regardless of CAP_DAC_OVERRIDE in this series). Since you
>> keep saying there would be 'some' override mechanism, I think it would
>> be helpful if you could just include that so we can see the full
>> mechanism in context.
>
> I wasn't aware of CAP_DAC_OVERRIDE, thanks for bringing this up.
>
> One way to go about this is to allow creating writable bpf_link for
> GET_FD_BY_ID if CAP_DAC_OVERRIDE is set. Then we can allow LINK_DETACH
> operation on writable links, same as we do with LINK_UPDATE here.
> LINK_DETACH will do the same as cgroup bpf_link auto-detachment on
> cgroup dying: it will detach bpf_link, but will leave it alive until
> last FD is closed.

Yup, I think this would be a reasonable way to implement the override
mechanism - it would ensure 'full root' users (like a root shell) can
remove attachments, while still preventing applications from doing so by
limiting their capabilities.

Extending on the concept of RO/RW bpf_link attachments, maybe it should
even be possible for an application to choose which mode it wants to pin
its fd in? With the same capability being able to override it of
course...

> We need to consider, though, if CAP_DAC_OVERRIDE is something that can
> be disabled for majority of real-life applications to prevent them
> from doing this. If every realistic application has/needs
> CAP_DAC_OVERRIDE, then that's essentially just saying that anyone can
> get writable bpf_link and do anything with it.

I poked around a bit, and looking at the sandboxing configurations
shipped with various daemons in their systemd unit files, it appears
that the main case where daemons are granted CAP_DAC_OVERRIDE is if they
have to be able to read /etc/shadow (which is installed as chmod 0). If
this is really the case, that would indicate it's not a widely needed
capability; but I wouldn't exactly say that I've done a comprehensive
survey, so probably a good idea for you to check your users as well :)

-Toke
Andrii Nakryiko April 8, 2020, 8:23 p.m. UTC | #4
On Wed, Apr 8, 2020 at 8:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Mon, Apr 6, 2020 at 4:34 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Andrii Nakryiko <andriin@fb.com> writes:
> >>
> >> > Add support to look up bpf_link by ID and iterate over all existing bpf_links
> >> > in the system. GET_FD_BY_ID code handles not-yet-ready bpf_link by checking
> >> > that its ID hasn't been set to non-zero value yet. Setting bpf_link's ID is
> >> > done as the very last step in finalizing bpf_link, together with installing
> >> > FD. This approach allows users of bpf_link in kernel code to not worry about
> >> > races between user-space and kernel code that hasn't finished attaching and
> >> > initializing bpf_link.
> >> >
> >> > Further, it's critical that BPF_LINK_GET_FD_BY_ID only ever allows to create
> >> > bpf_link FD that's O_RDONLY. This is to protect processes owning bpf_link and
> >> > thus allowed to perform modifications on them (like LINK_UPDATE), from other
> >> > processes that got bpf_link ID from GET_NEXT_ID API. In the latter case, only
> >> > querying bpf_link information (implemented later in the series) will be
> >> > allowed.
> >>
> >> I must admit I remain sceptical about this model of restricting access
> >> without any of the regular override mechanisms (for instance, enforcing
> >> read-only mode regardless of CAP_DAC_OVERRIDE in this series). Since you
> >> keep saying there would be 'some' override mechanism, I think it would
> >> be helpful if you could just include that so we can see the full
> >> mechanism in context.
> >
> > I wasn't aware of CAP_DAC_OVERRIDE, thanks for bringing this up.
> >
> > One way to go about this is to allow creating writable bpf_link for
> > GET_FD_BY_ID if CAP_DAC_OVERRIDE is set. Then we can allow LINK_DETACH
> > operation on writable links, same as we do with LINK_UPDATE here.
> > LINK_DETACH will do the same as cgroup bpf_link auto-detachment on
> > cgroup dying: it will detach bpf_link, but will leave it alive until
> > last FD is closed.
>
> Yup, I think this would be a reasonable way to implement the override
> mechanism - it would ensure 'full root' users (like a root shell) can
> remove attachments, while still preventing applications from doing so by
> limiting their capabilities.

So I did some experiments and I think I want to keep GET_FD_BY_ID for
bpf_link to return only read-only bpf_links. After that, one can pin
bpf_link temporarily and re-open it as writable one, provided
CAP_DAC_OVERRIDE capability is present. All that works already,
because pinned bpf_link is just a file, so one can do fchmod on it and
all that will go through normal file access permission check code
path. Unfortunately, just re-opening same FD as writable (which would
be possible if fcntl(fd, F_SETFL, S_IRUSR
 S_IWUSR) was supported on Linux) without pinning is not possible.
Opening link from /proc/<pid>/fd/<link-fd> doesn't seem to work
either, because backing inode is not BPF FS inode. I'm not sure, but
maybe we can support the latter eventually. But either way, I think
given this is to be used for manual troubleshooting, going through few
extra hoops to force-detach bpf_link is actually a good thing.


>
> Extending on the concept of RO/RW bpf_link attachments, maybe it should
> even be possible for an application to choose which mode it wants to pin
> its fd in? With the same capability being able to override it of
> course...

Isn't that what patch #2 is doing?... There are few bugs in the
implementation currently, but it will work in the final version.

>
> > We need to consider, though, if CAP_DAC_OVERRIDE is something that can
> > be disabled for majority of real-life applications to prevent them
> > from doing this. If every realistic application has/needs
> > CAP_DAC_OVERRIDE, then that's essentially just saying that anyone can
> > get writable bpf_link and do anything with it.
>
> I poked around a bit, and looking at the sandboxing configurations
> shipped with various daemons in their systemd unit files, it appears
> that the main case where daemons are granted CAP_DAC_OVERRIDE is if they
> have to be able to read /etc/shadow (which is installed as chmod 0). If
> this is really the case, that would indicate it's not a widely needed
> capability; but I wouldn't exactly say that I've done a comprehensive
> survey, so probably a good idea for you to check your users as well :)

Right, it might not be possible to drop it for all applications right
away, but at least CAP_DAC_OVERRIDE is not CAP_SYS_ADMIN, which is
absolutely necessary to work with BPF.

>
> -Toke
>
Toke Høiland-Jørgensen April 8, 2020, 9:21 p.m. UTC | #5
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Wed, Apr 8, 2020 at 8:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>
>> > On Mon, Apr 6, 2020 at 4:34 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> Andrii Nakryiko <andriin@fb.com> writes:
>> >>
>> >> > Add support to look up bpf_link by ID and iterate over all existing bpf_links
>> >> > in the system. GET_FD_BY_ID code handles not-yet-ready bpf_link by checking
>> >> > that its ID hasn't been set to non-zero value yet. Setting bpf_link's ID is
>> >> > done as the very last step in finalizing bpf_link, together with installing
>> >> > FD. This approach allows users of bpf_link in kernel code to not worry about
>> >> > races between user-space and kernel code that hasn't finished attaching and
>> >> > initializing bpf_link.
>> >> >
>> >> > Further, it's critical that BPF_LINK_GET_FD_BY_ID only ever allows to create
>> >> > bpf_link FD that's O_RDONLY. This is to protect processes owning bpf_link and
>> >> > thus allowed to perform modifications on them (like LINK_UPDATE), from other
>> >> > processes that got bpf_link ID from GET_NEXT_ID API. In the latter case, only
>> >> > querying bpf_link information (implemented later in the series) will be
>> >> > allowed.
>> >>
>> >> I must admit I remain sceptical about this model of restricting access
>> >> without any of the regular override mechanisms (for instance, enforcing
>> >> read-only mode regardless of CAP_DAC_OVERRIDE in this series). Since you
>> >> keep saying there would be 'some' override mechanism, I think it would
>> >> be helpful if you could just include that so we can see the full
>> >> mechanism in context.
>> >
>> > I wasn't aware of CAP_DAC_OVERRIDE, thanks for bringing this up.
>> >
>> > One way to go about this is to allow creating writable bpf_link for
>> > GET_FD_BY_ID if CAP_DAC_OVERRIDE is set. Then we can allow LINK_DETACH
>> > operation on writable links, same as we do with LINK_UPDATE here.
>> > LINK_DETACH will do the same as cgroup bpf_link auto-detachment on
>> > cgroup dying: it will detach bpf_link, but will leave it alive until
>> > last FD is closed.
>>
>> Yup, I think this would be a reasonable way to implement the override
>> mechanism - it would ensure 'full root' users (like a root shell) can
>> remove attachments, while still preventing applications from doing so by
>> limiting their capabilities.
>
> So I did some experiments and I think I want to keep GET_FD_BY_ID for
> bpf_link to return only read-only bpf_links.

Why, exactly? (also, see below)

> After that, one can pin bpf_link temporarily and re-open it as
> writable one, provided CAP_DAC_OVERRIDE capability is present. All
> that works already, because pinned bpf_link is just a file, so one can
> do fchmod on it and all that will go through normal file access
> permission check code path.

Ah, I did not know that was possible - I was assuming that bpffs was
doing something special to prevent that. But if not, great!

> Unfortunately, just re-opening same FD as writable (which would
> be possible if fcntl(fd, F_SETFL, S_IRUSR
>  S_IWUSR) was supported on Linux) without pinning is not possible.
> Opening link from /proc/<pid>/fd/<link-fd> doesn't seem to work
> either, because backing inode is not BPF FS inode. I'm not sure, but
> maybe we can support the latter eventually. But either way, I think
> given this is to be used for manual troubleshooting, going through few
> extra hoops to force-detach bpf_link is actually a good thing.

Hmm, I disagree that deliberately making users jump through hoops is a
good thing. Smells an awful lot like security through obscurity to me;
and we all know how well that works anyway...

>> Extending on the concept of RO/RW bpf_link attachments, maybe it should
>> even be possible for an application to choose which mode it wants to pin
>> its fd in? With the same capability being able to override it of
>> course...
>
> Isn't that what patch #2 is doing?...

Ah yes, so it is! I guess I skipped over that a bit too fast ;)

> There are few bugs in the implementation currently, but it will work
> in the final version.

Cool.

>> > We need to consider, though, if CAP_DAC_OVERRIDE is something that can
>> > be disabled for majority of real-life applications to prevent them
>> > from doing this. If every realistic application has/needs
>> > CAP_DAC_OVERRIDE, then that's essentially just saying that anyone can
>> > get writable bpf_link and do anything with it.
>>
>> I poked around a bit, and looking at the sandboxing configurations
>> shipped with various daemons in their systemd unit files, it appears
>> that the main case where daemons are granted CAP_DAC_OVERRIDE is if they
>> have to be able to read /etc/shadow (which is installed as chmod 0). If
>> this is really the case, that would indicate it's not a widely needed
>> capability; but I wouldn't exactly say that I've done a comprehensive
>> survey, so probably a good idea for you to check your users as well :)
>
> Right, it might not be possible to drop it for all applications right
> away, but at least CAP_DAC_OVERRIDE is not CAP_SYS_ADMIN, which is
> absolutely necessary to work with BPF.

Yeah, I do hope that we'll eventually get CAP_BPF...

-Toke
Andrii Nakryiko April 9, 2020, 6:49 p.m. UTC | #6
On Wed, Apr 8, 2020 at 2:21 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Wed, Apr 8, 2020 at 8:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> >>
> >> > On Mon, Apr 6, 2020 at 4:34 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> >>
> >> >> Andrii Nakryiko <andriin@fb.com> writes:
> >> >>
> >> >> > Add support to look up bpf_link by ID and iterate over all existing bpf_links
> >> >> > in the system. GET_FD_BY_ID code handles not-yet-ready bpf_link by checking
> >> >> > that its ID hasn't been set to non-zero value yet. Setting bpf_link's ID is
> >> >> > done as the very last step in finalizing bpf_link, together with installing
> >> >> > FD. This approach allows users of bpf_link in kernel code to not worry about
> >> >> > races between user-space and kernel code that hasn't finished attaching and
> >> >> > initializing bpf_link.
> >> >> >
> >> >> > Further, it's critical that BPF_LINK_GET_FD_BY_ID only ever allows to create
> >> >> > bpf_link FD that's O_RDONLY. This is to protect processes owning bpf_link and
> >> >> > thus allowed to perform modifications on them (like LINK_UPDATE), from other
> >> >> > processes that got bpf_link ID from GET_NEXT_ID API. In the latter case, only
> >> >> > querying bpf_link information (implemented later in the series) will be
> >> >> > allowed.
> >> >>
> >> >> I must admit I remain sceptical about this model of restricting access
> >> >> without any of the regular override mechanisms (for instance, enforcing
> >> >> read-only mode regardless of CAP_DAC_OVERRIDE in this series). Since you
> >> >> keep saying there would be 'some' override mechanism, I think it would
> >> >> be helpful if you could just include that so we can see the full
> >> >> mechanism in context.
> >> >
> >> > I wasn't aware of CAP_DAC_OVERRIDE, thanks for bringing this up.
> >> >
> >> > One way to go about this is to allow creating writable bpf_link for
> >> > GET_FD_BY_ID if CAP_DAC_OVERRIDE is set. Then we can allow LINK_DETACH
> >> > operation on writable links, same as we do with LINK_UPDATE here.
> >> > LINK_DETACH will do the same as cgroup bpf_link auto-detachment on
> >> > cgroup dying: it will detach bpf_link, but will leave it alive until
> >> > last FD is closed.
> >>
> >> Yup, I think this would be a reasonable way to implement the override
> >> mechanism - it would ensure 'full root' users (like a root shell) can
> >> remove attachments, while still preventing applications from doing so by
> >> limiting their capabilities.
> >
> > So I did some experiments and I think I want to keep GET_FD_BY_ID for
> > bpf_link to return only read-only bpf_links.
>
> Why, exactly? (also, see below)

For the reasons I explained below: because you can turn read-only
bpf_link into writable one through pinning + chmod, if you have
CAP_DAC_OVERRIDE.

>
> > After that, one can pin bpf_link temporarily and re-open it as
> > writable one, provided CAP_DAC_OVERRIDE capability is present. All
> > that works already, because pinned bpf_link is just a file, so one can
> > do fchmod on it and all that will go through normal file access
> > permission check code path.
>
> Ah, I did not know that was possible - I was assuming that bpffs was
> doing something special to prevent that. But if not, great!
>
> > Unfortunately, just re-opening same FD as writable (which would
> > be possible if fcntl(fd, F_SETFL, S_IRUSR
> >  S_IWUSR) was supported on Linux) without pinning is not possible.
> > Opening link from /proc/<pid>/fd/<link-fd> doesn't seem to work
> > either, because backing inode is not BPF FS inode. I'm not sure, but
> > maybe we can support the latter eventually. But either way, I think
> > given this is to be used for manual troubleshooting, going through few
> > extra hoops to force-detach bpf_link is actually a good thing.
>
> Hmm, I disagree that deliberately making users jump through hoops is a
> good thing. Smells an awful lot like security through obscurity to me;
> and we all know how well that works anyway...

Depends on who users are? bpftool can implement this as one of
`bpftool link` sub-commands and allow human operators to force-detach
bpf_link, if necessary. I think applications shouldn't do this
(programmatically) at all, which is why I think it's actually good
that it's harder and not obvious, this will make developer think again
before implementing this, hopefully. For me it's about discouraging
bad practice.

>
> >> Extending on the concept of RO/RW bpf_link attachments, maybe it should
> >> even be possible for an application to choose which mode it wants to pin
> >> its fd in? With the same capability being able to override it of
> >> course...
> >
> > Isn't that what patch #2 is doing?...
>
> Ah yes, so it is! I guess I skipped over that a bit too fast ;)
>
> > There are few bugs in the implementation currently, but it will work
> > in the final version.
>
> Cool.
>
> >> > We need to consider, though, if CAP_DAC_OVERRIDE is something that can
> >> > be disabled for majority of real-life applications to prevent them
> >> > from doing this. If every realistic application has/needs
> >> > CAP_DAC_OVERRIDE, then that's essentially just saying that anyone can
> >> > get writable bpf_link and do anything with it.
> >>
> >> I poked around a bit, and looking at the sandboxing configurations
> >> shipped with various daemons in their systemd unit files, it appears
> >> that the main case where daemons are granted CAP_DAC_OVERRIDE is if they
> >> have to be able to read /etc/shadow (which is installed as chmod 0). If
> >> this is really the case, that would indicate it's not a widely needed
> >> capability; but I wouldn't exactly say that I've done a comprehensive
> >> survey, so probably a good idea for you to check your users as well :)
> >
> > Right, it might not be possible to drop it for all applications right
> > away, but at least CAP_DAC_OVERRIDE is not CAP_SYS_ADMIN, which is
> > absolutely necessary to work with BPF.
>
> Yeah, I do hope that we'll eventually get CAP_BPF...
>
> -Toke
>
Toke Høiland-Jørgensen April 14, 2020, 10:32 a.m. UTC | #7
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

>> > After that, one can pin bpf_link temporarily and re-open it as
>> > writable one, provided CAP_DAC_OVERRIDE capability is present. All
>> > that works already, because pinned bpf_link is just a file, so one can
>> > do fchmod on it and all that will go through normal file access
>> > permission check code path.
>>
>> Ah, I did not know that was possible - I was assuming that bpffs was
>> doing something special to prevent that. But if not, great!
>>
>> > Unfortunately, just re-opening same FD as writable (which would
>> > be possible if fcntl(fd, F_SETFL, S_IRUSR
>> >  S_IWUSR) was supported on Linux) without pinning is not possible.
>> > Opening link from /proc/<pid>/fd/<link-fd> doesn't seem to work
>> > either, because backing inode is not BPF FS inode. I'm not sure, but
>> > maybe we can support the latter eventually. But either way, I think
>> > given this is to be used for manual troubleshooting, going through few
>> > extra hoops to force-detach bpf_link is actually a good thing.
>>
>> Hmm, I disagree that deliberately making users jump through hoops is a
>> good thing. Smells an awful lot like security through obscurity to me;
>> and we all know how well that works anyway...
>
> Depends on who users are? bpftool can implement this as one of
> `bpftool link` sub-commands and allow human operators to force-detach
> bpf_link, if necessary.

Yeah, I would expect this to be the common way this would be used: built
into tools.

> I think applications shouldn't do this (programmatically) at all,
> which is why I think it's actually good that it's harder and not
> obvious, this will make developer think again before implementing
> this, hopefully. For me it's about discouraging bad practice.

I guess I just don't share your optimism that making people jump through
hoops will actually discourage them :)

If people know what they are doing it should be enough to document it as
discouraged. And if they don't, they are perfectly capable of finding
and copy-pasting the sequence of hoop-jumps required to achieve what
they want, probably with more bugs added along the way.

So in the end I think that all you're really achieving is annoying
people who do have a legitimate reason to override the behaviour (which
includes yourself as a bpftool developer :)). That's what I meant by the
'security through obscurity' comment.

-Toke
Andrii Nakryiko April 14, 2020, 6:47 p.m. UTC | #8
On Tue, Apr 14, 2020 at 3:32 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> >> > After that, one can pin bpf_link temporarily and re-open it as
> >> > writable one, provided CAP_DAC_OVERRIDE capability is present. All
> >> > that works already, because pinned bpf_link is just a file, so one can
> >> > do fchmod on it and all that will go through normal file access
> >> > permission check code path.
> >>
> >> Ah, I did not know that was possible - I was assuming that bpffs was
> >> doing something special to prevent that. But if not, great!
> >>
> >> > Unfortunately, just re-opening same FD as writable (which would
> >> > be possible if fcntl(fd, F_SETFL, S_IRUSR
> >> >  S_IWUSR) was supported on Linux) without pinning is not possible.
> >> > Opening link from /proc/<pid>/fd/<link-fd> doesn't seem to work
> >> > either, because backing inode is not BPF FS inode. I'm not sure, but
> >> > maybe we can support the latter eventually. But either way, I think
> >> > given this is to be used for manual troubleshooting, going through few
> >> > extra hoops to force-detach bpf_link is actually a good thing.
> >>
> >> Hmm, I disagree that deliberately making users jump through hoops is a
> >> good thing. Smells an awful lot like security through obscurity to me;
> >> and we all know how well that works anyway...
> >
> > Depends on who users are? bpftool can implement this as one of
> > `bpftool link` sub-commands and allow human operators to force-detach
> > bpf_link, if necessary.
>
> Yeah, I would expect this to be the common way this would be used: built
> into tools.
>
> > I think applications shouldn't do this (programmatically) at all,
> > which is why I think it's actually good that it's harder and not
> > obvious, this will make developer think again before implementing
> > this, hopefully. For me it's about discouraging bad practice.
>
> I guess I just don't share your optimism that making people jump through
> hoops will actually discourage them :)

I understand. I just don't see why would anyone have to implement this
at all and especially would think it's a good idea to begin with?

>
> If people know what they are doing it should be enough to document it as
> discouraged. And if they don't, they are perfectly capable of finding
> and copy-pasting the sequence of hoop-jumps required to achieve what
> they want, probably with more bugs added along the way.
>
> So in the end I think that all you're really achieving is annoying
> people who do have a legitimate reason to override the behaviour (which
> includes yourself as a bpftool developer :)). That's what I meant by the
> 'security through obscurity' comment.

Can I please get a list of real examples of legitimate reasons to
override this behavior?

>
> -Toke
>
Toke Høiland-Jørgensen April 15, 2020, 9:26 a.m. UTC | #9
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Tue, Apr 14, 2020 at 3:32 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>
>> >> > After that, one can pin bpf_link temporarily and re-open it as
>> >> > writable one, provided CAP_DAC_OVERRIDE capability is present. All
>> >> > that works already, because pinned bpf_link is just a file, so one can
>> >> > do fchmod on it and all that will go through normal file access
>> >> > permission check code path.
>> >>
>> >> Ah, I did not know that was possible - I was assuming that bpffs was
>> >> doing something special to prevent that. But if not, great!
>> >>
>> >> > Unfortunately, just re-opening same FD as writable (which would
>> >> > be possible if fcntl(fd, F_SETFL, S_IRUSR
>> >> >  S_IWUSR) was supported on Linux) without pinning is not possible.
>> >> > Opening link from /proc/<pid>/fd/<link-fd> doesn't seem to work
>> >> > either, because backing inode is not BPF FS inode. I'm not sure, but
>> >> > maybe we can support the latter eventually. But either way, I think
>> >> > given this is to be used for manual troubleshooting, going through few
>> >> > extra hoops to force-detach bpf_link is actually a good thing.
>> >>
>> >> Hmm, I disagree that deliberately making users jump through hoops is a
>> >> good thing. Smells an awful lot like security through obscurity to me;
>> >> and we all know how well that works anyway...
>> >
>> > Depends on who users are? bpftool can implement this as one of
>> > `bpftool link` sub-commands and allow human operators to force-detach
>> > bpf_link, if necessary.
>>
>> Yeah, I would expect this to be the common way this would be used: built
>> into tools.
>>
>> > I think applications shouldn't do this (programmatically) at all,
>> > which is why I think it's actually good that it's harder and not
>> > obvious, this will make developer think again before implementing
>> > this, hopefully. For me it's about discouraging bad practice.
>>
>> I guess I just don't share your optimism that making people jump through
>> hoops will actually discourage them :)
>
> I understand. I just don't see why would anyone have to implement this
> at all and especially would think it's a good idea to begin with?
>
>>
>> If people know what they are doing it should be enough to document it as
>> discouraged. And if they don't, they are perfectly capable of finding
>> and copy-pasting the sequence of hoop-jumps required to achieve what
>> they want, probably with more bugs added along the way.
>>
>> So in the end I think that all you're really achieving is annoying
>> people who do have a legitimate reason to override the behaviour (which
>> includes yourself as a bpftool developer :)). That's what I meant by the
>> 'security through obscurity' comment.
>
> Can I please get a list of real examples of legitimate reasons to
> override this behavior?

Primarily, I expect that this would be built into admin tools (like
bpftool as you suggested). I just don't see why such tools should be
made to do the whole pin/reopen dance (which, BTW, adds an implicit
dependency on having a mounted bpffs) when we could just add a
capability check directly in bpf_link_get_fd_by_id()?

-Toke
diff mbox series

Patch

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index eccfd1dea951..407c086bc9e4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -113,6 +113,8 @@  enum bpf_cmd {
 	BPF_MAP_DELETE_BATCH,
 	BPF_LINK_CREATE,
 	BPF_LINK_UPDATE,
+	BPF_LINK_GET_FD_BY_ID,
+	BPF_LINK_GET_NEXT_ID,
 };
 
 enum bpf_map_type {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8b3a7d5814ae..527ec16702be 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -3719,6 +3719,55 @@  static int link_update(union bpf_attr *attr)
 	return ret;
 }
 
+static int bpf_link_inc_not_zero(struct bpf_link *link)
+{
+	return atomic64_fetch_add_unless(&link->refcnt, 1, 0) ? 0 : -ENOENT;
+}
+
+#define BPF_LINK_GET_FD_BY_ID_LAST_FIELD open_flags
+
+static int bpf_link_get_fd_by_id(const union bpf_attr *attr)
+{
+	struct bpf_link *link;
+	u32 id = attr->link_id;
+	int f_flags;
+	int fd, err;
+
+	if (CHECK_ATTR(BPF_LINK_GET_FD_BY_ID) ||
+	    /* links are not allowed to be open by ID as writable */
+	    attr->open_flags & ~BPF_F_RDONLY)
+		return -EINVAL;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	f_flags = bpf_get_file_flag(attr->open_flags);
+	if (f_flags < 0)
+		return f_flags;
+
+	spin_lock_bh(&link_idr_lock);
+	link = idr_find(&link_idr, id);
+	/* before link is "settled", ID is 0, pretend it doesn't exist yet */
+	if (link) {
+		if (link->id)
+			err = bpf_link_inc_not_zero(link);
+		else
+			err = -EAGAIN;
+	} else {
+		err = -ENOENT;
+	}
+	spin_unlock_bh(&link_idr_lock);
+
+	if (err)
+		return err;
+
+	fd = bpf_link_new_fd(link, f_flags);
+	if (fd < 0)
+		bpf_link_put(link);
+
+	return fd;
+}
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
 {
 	union bpf_attr attr;
@@ -3836,6 +3885,13 @@  SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_LINK_UPDATE:
 		err = link_update(&attr);
 		break;
+	case BPF_LINK_GET_FD_BY_ID:
+		err = bpf_link_get_fd_by_id(&attr);
+		break;
+	case BPF_LINK_GET_NEXT_ID:
+		err = bpf_obj_get_next_id(&attr, uattr,
+					  &link_idr, &link_idr_lock);
+		break;
 	default:
 		err = -EINVAL;
 		break;