Message ID | 20200404000948.3980903-5-andriin@fb.com |
---|---|
State | RFC |
Delegated to: | BPF Maintainers |
Headers | show |
Series | bpf_link observability APIs | expand |
Andrii Nakryiko <andriin@fb.com> writes: > Add support to look up bpf_link by ID and iterate over all existing bpf_links > in the system. GET_FD_BY_ID code handles not-yet-ready bpf_link by checking > that its ID hasn't been set to non-zero value yet. Setting bpf_link's ID is > done as the very last step in finalizing bpf_link, together with installing > FD. This approach allows users of bpf_link in kernel code to not worry about > races between user-space and kernel code that hasn't finished attaching and > initializing bpf_link. > > Further, it's critical that BPF_LINK_GET_FD_BY_ID only ever allows to create > bpf_link FD that's O_RDONLY. This is to protect processes owning bpf_link and > thus allowed to perform modifications on them (like LINK_UPDATE), from other > processes that got bpf_link ID from GET_NEXT_ID API. In the latter case, only > querying bpf_link information (implemented later in the series) will be > allowed. I must admit I remain sceptical about this model of restricting access without any of the regular override mechanisms (for instance, enforcing read-only mode regardless of CAP_DAC_OVERRIDE in this series). Since you keep saying there would be 'some' override mechanism, I think it would be helpful if you could just include that so we can see the full mechanism in context. -Toke
On Mon, Apr 6, 2020 at 4:34 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Andrii Nakryiko <andriin@fb.com> writes: > > > Add support to look up bpf_link by ID and iterate over all existing bpf_links > > in the system. GET_FD_BY_ID code handles not-yet-ready bpf_link by checking > > that its ID hasn't been set to non-zero value yet. Setting bpf_link's ID is > > done as the very last step in finalizing bpf_link, together with installing > > FD. This approach allows users of bpf_link in kernel code to not worry about > > races between user-space and kernel code that hasn't finished attaching and > > initializing bpf_link. > > > > Further, it's critical that BPF_LINK_GET_FD_BY_ID only ever allows to create > > bpf_link FD that's O_RDONLY. This is to protect processes owning bpf_link and > > thus allowed to perform modifications on them (like LINK_UPDATE), from other > > processes that got bpf_link ID from GET_NEXT_ID API. In the latter case, only > > querying bpf_link information (implemented later in the series) will be > > allowed. > > I must admit I remain sceptical about this model of restricting access > without any of the regular override mechanisms (for instance, enforcing > read-only mode regardless of CAP_DAC_OVERRIDE in this series). Since you > keep saying there would be 'some' override mechanism, I think it would > be helpful if you could just include that so we can see the full > mechanism in context. I wasn't aware of CAP_DAC_OVERRIDE, thanks for bringing this up. One way to go about this is to allow creating writable bpf_link for GET_FD_BY_ID if CAP_DAC_OVERRIDE is set. Then we can allow LINK_DETACH operation on writable links, same as we do with LINK_UPDATE here. LINK_DETACH will do the same as cgroup bpf_link auto-detachment on cgroup dying: it will detach bpf_link, but will leave it alive until last FD is closed. We need to consider, though, if CAP_DAC_OVERRIDE is something that can be disabled for majority of real-life applications to prevent them from doing this. If every realistic application has/needs CAP_DAC_OVERRIDE, then that's essentially just saying that anyone can get writable bpf_link and do anything with it. > > -Toke >
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > On Mon, Apr 6, 2020 at 4:34 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> >> Andrii Nakryiko <andriin@fb.com> writes: >> >> > Add support to look up bpf_link by ID and iterate over all existing bpf_links >> > in the system. GET_FD_BY_ID code handles not-yet-ready bpf_link by checking >> > that its ID hasn't been set to non-zero value yet. Setting bpf_link's ID is >> > done as the very last step in finalizing bpf_link, together with installing >> > FD. This approach allows users of bpf_link in kernel code to not worry about >> > races between user-space and kernel code that hasn't finished attaching and >> > initializing bpf_link. >> > >> > Further, it's critical that BPF_LINK_GET_FD_BY_ID only ever allows to create >> > bpf_link FD that's O_RDONLY. This is to protect processes owning bpf_link and >> > thus allowed to perform modifications on them (like LINK_UPDATE), from other >> > processes that got bpf_link ID from GET_NEXT_ID API. In the latter case, only >> > querying bpf_link information (implemented later in the series) will be >> > allowed. >> >> I must admit I remain sceptical about this model of restricting access >> without any of the regular override mechanisms (for instance, enforcing >> read-only mode regardless of CAP_DAC_OVERRIDE in this series). Since you >> keep saying there would be 'some' override mechanism, I think it would >> be helpful if you could just include that so we can see the full >> mechanism in context. > > I wasn't aware of CAP_DAC_OVERRIDE, thanks for bringing this up. > > One way to go about this is to allow creating writable bpf_link for > GET_FD_BY_ID if CAP_DAC_OVERRIDE is set. Then we can allow LINK_DETACH > operation on writable links, same as we do with LINK_UPDATE here. > LINK_DETACH will do the same as cgroup bpf_link auto-detachment on > cgroup dying: it will detach bpf_link, but will leave it alive until > last FD is closed. Yup, I think this would be a reasonable way to implement the override mechanism - it would ensure 'full root' users (like a root shell) can remove attachments, while still preventing applications from doing so by limiting their capabilities. Extending on the concept of RO/RW bpf_link attachments, maybe it should even be possible for an application to choose which mode it wants to pin its fd in? With the same capability being able to override it of course... > We need to consider, though, if CAP_DAC_OVERRIDE is something that can > be disabled for majority of real-life applications to prevent them > from doing this. If every realistic application has/needs > CAP_DAC_OVERRIDE, then that's essentially just saying that anyone can > get writable bpf_link and do anything with it. I poked around a bit, and looking at the sandboxing configurations shipped with various daemons in their systemd unit files, it appears that the main case where daemons are granted CAP_DAC_OVERRIDE is if they have to be able to read /etc/shadow (which is installed as chmod 0). If this is really the case, that would indicate it's not a widely needed capability; but I wouldn't exactly say that I've done a comprehensive survey, so probably a good idea for you to check your users as well :) -Toke
On Wed, Apr 8, 2020 at 8:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > > > On Mon, Apr 6, 2020 at 4:34 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > >> > >> Andrii Nakryiko <andriin@fb.com> writes: > >> > >> > Add support to look up bpf_link by ID and iterate over all existing bpf_links > >> > in the system. GET_FD_BY_ID code handles not-yet-ready bpf_link by checking > >> > that its ID hasn't been set to non-zero value yet. Setting bpf_link's ID is > >> > done as the very last step in finalizing bpf_link, together with installing > >> > FD. This approach allows users of bpf_link in kernel code to not worry about > >> > races between user-space and kernel code that hasn't finished attaching and > >> > initializing bpf_link. > >> > > >> > Further, it's critical that BPF_LINK_GET_FD_BY_ID only ever allows to create > >> > bpf_link FD that's O_RDONLY. This is to protect processes owning bpf_link and > >> > thus allowed to perform modifications on them (like LINK_UPDATE), from other > >> > processes that got bpf_link ID from GET_NEXT_ID API. In the latter case, only > >> > querying bpf_link information (implemented later in the series) will be > >> > allowed. > >> > >> I must admit I remain sceptical about this model of restricting access > >> without any of the regular override mechanisms (for instance, enforcing > >> read-only mode regardless of CAP_DAC_OVERRIDE in this series). Since you > >> keep saying there would be 'some' override mechanism, I think it would > >> be helpful if you could just include that so we can see the full > >> mechanism in context. > > > > I wasn't aware of CAP_DAC_OVERRIDE, thanks for bringing this up. > > > > One way to go about this is to allow creating writable bpf_link for > > GET_FD_BY_ID if CAP_DAC_OVERRIDE is set. Then we can allow LINK_DETACH > > operation on writable links, same as we do with LINK_UPDATE here. > > LINK_DETACH will do the same as cgroup bpf_link auto-detachment on > > cgroup dying: it will detach bpf_link, but will leave it alive until > > last FD is closed. > > Yup, I think this would be a reasonable way to implement the override > mechanism - it would ensure 'full root' users (like a root shell) can > remove attachments, while still preventing applications from doing so by > limiting their capabilities. So I did some experiments and I think I want to keep GET_FD_BY_ID for bpf_link to return only read-only bpf_links. After that, one can pin bpf_link temporarily and re-open it as writable one, provided CAP_DAC_OVERRIDE capability is present. All that works already, because pinned bpf_link is just a file, so one can do fchmod on it and all that will go through normal file access permission check code path. Unfortunately, just re-opening same FD as writable (which would be possible if fcntl(fd, F_SETFL, S_IRUSR S_IWUSR) was supported on Linux) without pinning is not possible. Opening link from /proc/<pid>/fd/<link-fd> doesn't seem to work either, because backing inode is not BPF FS inode. I'm not sure, but maybe we can support the latter eventually. But either way, I think given this is to be used for manual troubleshooting, going through few extra hoops to force-detach bpf_link is actually a good thing. > > Extending on the concept of RO/RW bpf_link attachments, maybe it should > even be possible for an application to choose which mode it wants to pin > its fd in? With the same capability being able to override it of > course... Isn't that what patch #2 is doing?... There are few bugs in the implementation currently, but it will work in the final version. > > > We need to consider, though, if CAP_DAC_OVERRIDE is something that can > > be disabled for majority of real-life applications to prevent them > > from doing this. If every realistic application has/needs > > CAP_DAC_OVERRIDE, then that's essentially just saying that anyone can > > get writable bpf_link and do anything with it. > > I poked around a bit, and looking at the sandboxing configurations > shipped with various daemons in their systemd unit files, it appears > that the main case where daemons are granted CAP_DAC_OVERRIDE is if they > have to be able to read /etc/shadow (which is installed as chmod 0). If > this is really the case, that would indicate it's not a widely needed > capability; but I wouldn't exactly say that I've done a comprehensive > survey, so probably a good idea for you to check your users as well :) Right, it might not be possible to drop it for all applications right away, but at least CAP_DAC_OVERRIDE is not CAP_SYS_ADMIN, which is absolutely necessary to work with BPF. > > -Toke >
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > On Wed, Apr 8, 2020 at 8:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: >> >> > On Mon, Apr 6, 2020 at 4:34 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> >> >> >> Andrii Nakryiko <andriin@fb.com> writes: >> >> >> >> > Add support to look up bpf_link by ID and iterate over all existing bpf_links >> >> > in the system. GET_FD_BY_ID code handles not-yet-ready bpf_link by checking >> >> > that its ID hasn't been set to non-zero value yet. Setting bpf_link's ID is >> >> > done as the very last step in finalizing bpf_link, together with installing >> >> > FD. This approach allows users of bpf_link in kernel code to not worry about >> >> > races between user-space and kernel code that hasn't finished attaching and >> >> > initializing bpf_link. >> >> > >> >> > Further, it's critical that BPF_LINK_GET_FD_BY_ID only ever allows to create >> >> > bpf_link FD that's O_RDONLY. This is to protect processes owning bpf_link and >> >> > thus allowed to perform modifications on them (like LINK_UPDATE), from other >> >> > processes that got bpf_link ID from GET_NEXT_ID API. In the latter case, only >> >> > querying bpf_link information (implemented later in the series) will be >> >> > allowed. >> >> >> >> I must admit I remain sceptical about this model of restricting access >> >> without any of the regular override mechanisms (for instance, enforcing >> >> read-only mode regardless of CAP_DAC_OVERRIDE in this series). Since you >> >> keep saying there would be 'some' override mechanism, I think it would >> >> be helpful if you could just include that so we can see the full >> >> mechanism in context. >> > >> > I wasn't aware of CAP_DAC_OVERRIDE, thanks for bringing this up. >> > >> > One way to go about this is to allow creating writable bpf_link for >> > GET_FD_BY_ID if CAP_DAC_OVERRIDE is set. Then we can allow LINK_DETACH >> > operation on writable links, same as we do with LINK_UPDATE here. >> > LINK_DETACH will do the same as cgroup bpf_link auto-detachment on >> > cgroup dying: it will detach bpf_link, but will leave it alive until >> > last FD is closed. >> >> Yup, I think this would be a reasonable way to implement the override >> mechanism - it would ensure 'full root' users (like a root shell) can >> remove attachments, while still preventing applications from doing so by >> limiting their capabilities. > > So I did some experiments and I think I want to keep GET_FD_BY_ID for > bpf_link to return only read-only bpf_links. Why, exactly? (also, see below) > After that, one can pin bpf_link temporarily and re-open it as > writable one, provided CAP_DAC_OVERRIDE capability is present. All > that works already, because pinned bpf_link is just a file, so one can > do fchmod on it and all that will go through normal file access > permission check code path. Ah, I did not know that was possible - I was assuming that bpffs was doing something special to prevent that. But if not, great! > Unfortunately, just re-opening same FD as writable (which would > be possible if fcntl(fd, F_SETFL, S_IRUSR > S_IWUSR) was supported on Linux) without pinning is not possible. > Opening link from /proc/<pid>/fd/<link-fd> doesn't seem to work > either, because backing inode is not BPF FS inode. I'm not sure, but > maybe we can support the latter eventually. But either way, I think > given this is to be used for manual troubleshooting, going through few > extra hoops to force-detach bpf_link is actually a good thing. Hmm, I disagree that deliberately making users jump through hoops is a good thing. Smells an awful lot like security through obscurity to me; and we all know how well that works anyway... >> Extending on the concept of RO/RW bpf_link attachments, maybe it should >> even be possible for an application to choose which mode it wants to pin >> its fd in? With the same capability being able to override it of >> course... > > Isn't that what patch #2 is doing?... Ah yes, so it is! I guess I skipped over that a bit too fast ;) > There are few bugs in the implementation currently, but it will work > in the final version. Cool. >> > We need to consider, though, if CAP_DAC_OVERRIDE is something that can >> > be disabled for majority of real-life applications to prevent them >> > from doing this. If every realistic application has/needs >> > CAP_DAC_OVERRIDE, then that's essentially just saying that anyone can >> > get writable bpf_link and do anything with it. >> >> I poked around a bit, and looking at the sandboxing configurations >> shipped with various daemons in their systemd unit files, it appears >> that the main case where daemons are granted CAP_DAC_OVERRIDE is if they >> have to be able to read /etc/shadow (which is installed as chmod 0). If >> this is really the case, that would indicate it's not a widely needed >> capability; but I wouldn't exactly say that I've done a comprehensive >> survey, so probably a good idea for you to check your users as well :) > > Right, it might not be possible to drop it for all applications right > away, but at least CAP_DAC_OVERRIDE is not CAP_SYS_ADMIN, which is > absolutely necessary to work with BPF. Yeah, I do hope that we'll eventually get CAP_BPF... -Toke
On Wed, Apr 8, 2020 at 2:21 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > > > On Wed, Apr 8, 2020 at 8:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > >> > >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > >> > >> > On Mon, Apr 6, 2020 at 4:34 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > >> >> > >> >> Andrii Nakryiko <andriin@fb.com> writes: > >> >> > >> >> > Add support to look up bpf_link by ID and iterate over all existing bpf_links > >> >> > in the system. GET_FD_BY_ID code handles not-yet-ready bpf_link by checking > >> >> > that its ID hasn't been set to non-zero value yet. Setting bpf_link's ID is > >> >> > done as the very last step in finalizing bpf_link, together with installing > >> >> > FD. This approach allows users of bpf_link in kernel code to not worry about > >> >> > races between user-space and kernel code that hasn't finished attaching and > >> >> > initializing bpf_link. > >> >> > > >> >> > Further, it's critical that BPF_LINK_GET_FD_BY_ID only ever allows to create > >> >> > bpf_link FD that's O_RDONLY. This is to protect processes owning bpf_link and > >> >> > thus allowed to perform modifications on them (like LINK_UPDATE), from other > >> >> > processes that got bpf_link ID from GET_NEXT_ID API. In the latter case, only > >> >> > querying bpf_link information (implemented later in the series) will be > >> >> > allowed. > >> >> > >> >> I must admit I remain sceptical about this model of restricting access > >> >> without any of the regular override mechanisms (for instance, enforcing > >> >> read-only mode regardless of CAP_DAC_OVERRIDE in this series). Since you > >> >> keep saying there would be 'some' override mechanism, I think it would > >> >> be helpful if you could just include that so we can see the full > >> >> mechanism in context. > >> > > >> > I wasn't aware of CAP_DAC_OVERRIDE, thanks for bringing this up. > >> > > >> > One way to go about this is to allow creating writable bpf_link for > >> > GET_FD_BY_ID if CAP_DAC_OVERRIDE is set. Then we can allow LINK_DETACH > >> > operation on writable links, same as we do with LINK_UPDATE here. > >> > LINK_DETACH will do the same as cgroup bpf_link auto-detachment on > >> > cgroup dying: it will detach bpf_link, but will leave it alive until > >> > last FD is closed. > >> > >> Yup, I think this would be a reasonable way to implement the override > >> mechanism - it would ensure 'full root' users (like a root shell) can > >> remove attachments, while still preventing applications from doing so by > >> limiting their capabilities. > > > > So I did some experiments and I think I want to keep GET_FD_BY_ID for > > bpf_link to return only read-only bpf_links. > > Why, exactly? (also, see below) For the reasons I explained below: because you can turn read-only bpf_link into writable one through pinning + chmod, if you have CAP_DAC_OVERRIDE. > > > After that, one can pin bpf_link temporarily and re-open it as > > writable one, provided CAP_DAC_OVERRIDE capability is present. All > > that works already, because pinned bpf_link is just a file, so one can > > do fchmod on it and all that will go through normal file access > > permission check code path. > > Ah, I did not know that was possible - I was assuming that bpffs was > doing something special to prevent that. But if not, great! > > > Unfortunately, just re-opening same FD as writable (which would > > be possible if fcntl(fd, F_SETFL, S_IRUSR > > S_IWUSR) was supported on Linux) without pinning is not possible. > > Opening link from /proc/<pid>/fd/<link-fd> doesn't seem to work > > either, because backing inode is not BPF FS inode. I'm not sure, but > > maybe we can support the latter eventually. But either way, I think > > given this is to be used for manual troubleshooting, going through few > > extra hoops to force-detach bpf_link is actually a good thing. > > Hmm, I disagree that deliberately making users jump through hoops is a > good thing. Smells an awful lot like security through obscurity to me; > and we all know how well that works anyway... Depends on who users are? bpftool can implement this as one of `bpftool link` sub-commands and allow human operators to force-detach bpf_link, if necessary. I think applications shouldn't do this (programmatically) at all, which is why I think it's actually good that it's harder and not obvious, this will make developer think again before implementing this, hopefully. For me it's about discouraging bad practice. > > >> Extending on the concept of RO/RW bpf_link attachments, maybe it should > >> even be possible for an application to choose which mode it wants to pin > >> its fd in? With the same capability being able to override it of > >> course... > > > > Isn't that what patch #2 is doing?... > > Ah yes, so it is! I guess I skipped over that a bit too fast ;) > > > There are few bugs in the implementation currently, but it will work > > in the final version. > > Cool. > > >> > We need to consider, though, if CAP_DAC_OVERRIDE is something that can > >> > be disabled for majority of real-life applications to prevent them > >> > from doing this. If every realistic application has/needs > >> > CAP_DAC_OVERRIDE, then that's essentially just saying that anyone can > >> > get writable bpf_link and do anything with it. > >> > >> I poked around a bit, and looking at the sandboxing configurations > >> shipped with various daemons in their systemd unit files, it appears > >> that the main case where daemons are granted CAP_DAC_OVERRIDE is if they > >> have to be able to read /etc/shadow (which is installed as chmod 0). If > >> this is really the case, that would indicate it's not a widely needed > >> capability; but I wouldn't exactly say that I've done a comprehensive > >> survey, so probably a good idea for you to check your users as well :) > > > > Right, it might not be possible to drop it for all applications right > > away, but at least CAP_DAC_OVERRIDE is not CAP_SYS_ADMIN, which is > > absolutely necessary to work with BPF. > > Yeah, I do hope that we'll eventually get CAP_BPF... > > -Toke >
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: >> > After that, one can pin bpf_link temporarily and re-open it as >> > writable one, provided CAP_DAC_OVERRIDE capability is present. All >> > that works already, because pinned bpf_link is just a file, so one can >> > do fchmod on it and all that will go through normal file access >> > permission check code path. >> >> Ah, I did not know that was possible - I was assuming that bpffs was >> doing something special to prevent that. But if not, great! >> >> > Unfortunately, just re-opening same FD as writable (which would >> > be possible if fcntl(fd, F_SETFL, S_IRUSR >> > S_IWUSR) was supported on Linux) without pinning is not possible. >> > Opening link from /proc/<pid>/fd/<link-fd> doesn't seem to work >> > either, because backing inode is not BPF FS inode. I'm not sure, but >> > maybe we can support the latter eventually. But either way, I think >> > given this is to be used for manual troubleshooting, going through few >> > extra hoops to force-detach bpf_link is actually a good thing. >> >> Hmm, I disagree that deliberately making users jump through hoops is a >> good thing. Smells an awful lot like security through obscurity to me; >> and we all know how well that works anyway... > > Depends on who users are? bpftool can implement this as one of > `bpftool link` sub-commands and allow human operators to force-detach > bpf_link, if necessary. Yeah, I would expect this to be the common way this would be used: built into tools. > I think applications shouldn't do this (programmatically) at all, > which is why I think it's actually good that it's harder and not > obvious, this will make developer think again before implementing > this, hopefully. For me it's about discouraging bad practice. I guess I just don't share your optimism that making people jump through hoops will actually discourage them :) If people know what they are doing it should be enough to document it as discouraged. And if they don't, they are perfectly capable of finding and copy-pasting the sequence of hoop-jumps required to achieve what they want, probably with more bugs added along the way. So in the end I think that all you're really achieving is annoying people who do have a legitimate reason to override the behaviour (which includes yourself as a bpftool developer :)). That's what I meant by the 'security through obscurity' comment. -Toke
On Tue, Apr 14, 2020 at 3:32 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > > >> > After that, one can pin bpf_link temporarily and re-open it as > >> > writable one, provided CAP_DAC_OVERRIDE capability is present. All > >> > that works already, because pinned bpf_link is just a file, so one can > >> > do fchmod on it and all that will go through normal file access > >> > permission check code path. > >> > >> Ah, I did not know that was possible - I was assuming that bpffs was > >> doing something special to prevent that. But if not, great! > >> > >> > Unfortunately, just re-opening same FD as writable (which would > >> > be possible if fcntl(fd, F_SETFL, S_IRUSR > >> > S_IWUSR) was supported on Linux) without pinning is not possible. > >> > Opening link from /proc/<pid>/fd/<link-fd> doesn't seem to work > >> > either, because backing inode is not BPF FS inode. I'm not sure, but > >> > maybe we can support the latter eventually. But either way, I think > >> > given this is to be used for manual troubleshooting, going through few > >> > extra hoops to force-detach bpf_link is actually a good thing. > >> > >> Hmm, I disagree that deliberately making users jump through hoops is a > >> good thing. Smells an awful lot like security through obscurity to me; > >> and we all know how well that works anyway... > > > > Depends on who users are? bpftool can implement this as one of > > `bpftool link` sub-commands and allow human operators to force-detach > > bpf_link, if necessary. > > Yeah, I would expect this to be the common way this would be used: built > into tools. > > > I think applications shouldn't do this (programmatically) at all, > > which is why I think it's actually good that it's harder and not > > obvious, this will make developer think again before implementing > > this, hopefully. For me it's about discouraging bad practice. > > I guess I just don't share your optimism that making people jump through > hoops will actually discourage them :) I understand. I just don't see why would anyone have to implement this at all and especially would think it's a good idea to begin with? > > If people know what they are doing it should be enough to document it as > discouraged. And if they don't, they are perfectly capable of finding > and copy-pasting the sequence of hoop-jumps required to achieve what > they want, probably with more bugs added along the way. > > So in the end I think that all you're really achieving is annoying > people who do have a legitimate reason to override the behaviour (which > includes yourself as a bpftool developer :)). That's what I meant by the > 'security through obscurity' comment. Can I please get a list of real examples of legitimate reasons to override this behavior? > > -Toke >
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > On Tue, Apr 14, 2020 at 3:32 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: >> >> >> > After that, one can pin bpf_link temporarily and re-open it as >> >> > writable one, provided CAP_DAC_OVERRIDE capability is present. All >> >> > that works already, because pinned bpf_link is just a file, so one can >> >> > do fchmod on it and all that will go through normal file access >> >> > permission check code path. >> >> >> >> Ah, I did not know that was possible - I was assuming that bpffs was >> >> doing something special to prevent that. But if not, great! >> >> >> >> > Unfortunately, just re-opening same FD as writable (which would >> >> > be possible if fcntl(fd, F_SETFL, S_IRUSR >> >> > S_IWUSR) was supported on Linux) without pinning is not possible. >> >> > Opening link from /proc/<pid>/fd/<link-fd> doesn't seem to work >> >> > either, because backing inode is not BPF FS inode. I'm not sure, but >> >> > maybe we can support the latter eventually. But either way, I think >> >> > given this is to be used for manual troubleshooting, going through few >> >> > extra hoops to force-detach bpf_link is actually a good thing. >> >> >> >> Hmm, I disagree that deliberately making users jump through hoops is a >> >> good thing. Smells an awful lot like security through obscurity to me; >> >> and we all know how well that works anyway... >> > >> > Depends on who users are? bpftool can implement this as one of >> > `bpftool link` sub-commands and allow human operators to force-detach >> > bpf_link, if necessary. >> >> Yeah, I would expect this to be the common way this would be used: built >> into tools. >> >> > I think applications shouldn't do this (programmatically) at all, >> > which is why I think it's actually good that it's harder and not >> > obvious, this will make developer think again before implementing >> > this, hopefully. For me it's about discouraging bad practice. >> >> I guess I just don't share your optimism that making people jump through >> hoops will actually discourage them :) > > I understand. I just don't see why would anyone have to implement this > at all and especially would think it's a good idea to begin with? > >> >> If people know what they are doing it should be enough to document it as >> discouraged. And if they don't, they are perfectly capable of finding >> and copy-pasting the sequence of hoop-jumps required to achieve what >> they want, probably with more bugs added along the way. >> >> So in the end I think that all you're really achieving is annoying >> people who do have a legitimate reason to override the behaviour (which >> includes yourself as a bpftool developer :)). That's what I meant by the >> 'security through obscurity' comment. > > Can I please get a list of real examples of legitimate reasons to > override this behavior? Primarily, I expect that this would be built into admin tools (like bpftool as you suggested). I just don't see why such tools should be made to do the whole pin/reopen dance (which, BTW, adds an implicit dependency on having a mounted bpffs) when we could just add a capability check directly in bpf_link_get_fd_by_id()? -Toke
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index eccfd1dea951..407c086bc9e4 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -113,6 +113,8 @@ enum bpf_cmd { BPF_MAP_DELETE_BATCH, BPF_LINK_CREATE, BPF_LINK_UPDATE, + BPF_LINK_GET_FD_BY_ID, + BPF_LINK_GET_NEXT_ID, }; enum bpf_map_type { diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 8b3a7d5814ae..527ec16702be 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -3719,6 +3719,55 @@ static int link_update(union bpf_attr *attr) return ret; } +static int bpf_link_inc_not_zero(struct bpf_link *link) +{ + return atomic64_fetch_add_unless(&link->refcnt, 1, 0) ? 0 : -ENOENT; +} + +#define BPF_LINK_GET_FD_BY_ID_LAST_FIELD open_flags + +static int bpf_link_get_fd_by_id(const union bpf_attr *attr) +{ + struct bpf_link *link; + u32 id = attr->link_id; + int f_flags; + int fd, err; + + if (CHECK_ATTR(BPF_LINK_GET_FD_BY_ID) || + /* links are not allowed to be open by ID as writable */ + attr->open_flags & ~BPF_F_RDONLY) + return -EINVAL; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + f_flags = bpf_get_file_flag(attr->open_flags); + if (f_flags < 0) + return f_flags; + + spin_lock_bh(&link_idr_lock); + link = idr_find(&link_idr, id); + /* before link is "settled", ID is 0, pretend it doesn't exist yet */ + if (link) { + if (link->id) + err = bpf_link_inc_not_zero(link); + else + err = -EAGAIN; + } else { + err = -ENOENT; + } + spin_unlock_bh(&link_idr_lock); + + if (err) + return err; + + fd = bpf_link_new_fd(link, f_flags); + if (fd < 0) + bpf_link_put(link); + + return fd; +} + SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size) { union bpf_attr attr; @@ -3836,6 +3885,13 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz case BPF_LINK_UPDATE: err = link_update(&attr); break; + case BPF_LINK_GET_FD_BY_ID: + err = bpf_link_get_fd_by_id(&attr); + break; + case BPF_LINK_GET_NEXT_ID: + err = bpf_obj_get_next_id(&attr, uattr, + &link_idr, &link_idr_lock); + break; default: err = -EINVAL; break;
Add support to look up bpf_link by ID and iterate over all existing bpf_links in the system. GET_FD_BY_ID code handles not-yet-ready bpf_link by checking that its ID hasn't been set to non-zero value yet. Setting bpf_link's ID is done as the very last step in finalizing bpf_link, together with installing FD. This approach allows users of bpf_link in kernel code to not worry about races between user-space and kernel code that hasn't finished attaching and initializing bpf_link. Further, it's critical that BPF_LINK_GET_FD_BY_ID only ever allows to create bpf_link FD that's O_RDONLY. This is to protect processes owning bpf_link and thus allowed to perform modifications on them (like LINK_UPDATE), from other processes that got bpf_link ID from GET_NEXT_ID API. In the latter case, only querying bpf_link information (implemented later in the series) will be allowed. Signed-off-by: Andrii Nakryiko <andriin@fb.com> --- include/uapi/linux/bpf.h | 2 ++ kernel/bpf/syscall.c | 56 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 58 insertions(+)