| Message ID | 1305169376-2363-1-git-send-email-wad@chromium.org (mailing list archive) |
| --- | --- |
| State | Not Applicable |
Ok, i like the direction here, but i think the ABI should be done differently. In this patch the ftrace event filter mechanism is used:

* Will Drewry <wad@chromium.org> wrote:

> +static struct seccomp_filter *alloc_seccomp_filter(int syscall_nr,
> +						   const char *filter_string)
> +{
> +	int err = -ENOMEM;
> +	struct seccomp_filter *filter = kzalloc(sizeof(struct seccomp_filter),
> +						GFP_KERNEL);
> +	if (!filter)
> +		goto fail;
> +
> +	INIT_HLIST_NODE(&filter->node);
> +	filter->syscall_nr = syscall_nr;
> +	filter->data = syscall_nr_to_meta(syscall_nr);
> +
> +	/* Treat a filter of SECCOMP_WILDCARD_FILTER as a wildcard and skip
> +	 * using a predicate at all.
> +	 */
> +	if (!strcmp(SECCOMP_WILDCARD_FILTER, filter_string))
> +		goto out;
> +
> +	/* Argument-based filtering only works on ftrace-hooked syscalls. */
> +	if (!filter->data) {
> +		err = -ENOSYS;
> +		goto fail;
> +	}
> +
> +#ifdef CONFIG_FTRACE_SYSCALLS
> +	err = ftrace_parse_filter(&filter->event_filter,
> +				  filter->data->enter_event->event.type,
> +				  filter_string);
> +	if (err)
> +		goto fail;
> +#endif
> +
> +out:
> +	return filter;
> +
> +fail:
> +	kfree(filter);
> +	return ERR_PTR(err);
> +}

Via a prctl() ABI:

> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1698,12 +1698,23 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> 	case PR_SET_ENDIAN:
> 		error = SET_ENDIAN(me, arg2);
> 		break;
> -
> 	case PR_GET_SECCOMP:
> 		error = prctl_get_seccomp();
> 		break;
> 	case PR_SET_SECCOMP:
> -		error = prctl_set_seccomp(arg2);
> +		error = prctl_set_seccomp(arg2, arg3);
> +		break;
> +	case PR_SET_SECCOMP_FILTER:
> +		error = prctl_set_seccomp_filter(arg2,
> +						 (char __user *) arg3);
> +		break;
> +	case PR_CLEAR_SECCOMP_FILTER:
> +		error = prctl_clear_seccomp_filter(arg2);
> +		break;
> +	case PR_GET_SECCOMP_FILTER:
> +		error = prctl_get_seccomp_filter(arg2,
> +						 (char __user *) arg3,
> +						 arg4);

To restrict execution to system calls.
Two observations:

1) We already have a specific ABI for this: you can set filters for events via an event fd.

   Why not extend that mechanism instead and improve *both* your sandboxing bits and the events code? This new seccomp code has a lot more to do with trace event filters than the minimal old seccomp code ...

   kernel/trace/trace_event_filter.c is 2000 lines of tricky code that interprets the ASCII filter expressions. kernel/seccomp.c is 86 lines of mostly trivial code.

2) Why should this concept not be made available wider, to allow the restriction of not just system calls but other security relevant components of the kernel as well?

   This too, if you approach the problem via the events code, will be a natural end result, while if you approach it from the seccomp prctl angle it will be a limited hack only.

Note, the end result will be the same - just using a different ABI.

So i really think the ABI itself should be closer related to the event code. What this "seccomp" code does is that it uses specific syscall events to restrict execution of certain event generating codepaths, such as system calls.

Thanks,

	Ingo
Hi,

On Thu, May 12, 2011 at 09:48:50AM +0200, Ingo Molnar wrote:

> 1) We already have a specific ABI for this: you can set filters for events via an event fd.
>
> Why not extend that mechanism instead and improve *both* your sandboxing bits and the events code? This new seccomp code has a lot more to do with trace event filters than the minimal old seccomp code ...

Would this require privileges to get the event fd to start with? If so, I would prefer to avoid that, since using prctl() as shown in the patch set won't require any privs.

-Kees
* Kees Cook <kees.cook@canonical.com> wrote:

> Hi,
>
> On Thu, May 12, 2011 at 09:48:50AM +0200, Ingo Molnar wrote:
> > 1) We already have a specific ABI for this: you can set filters for events via an event fd.
> >
> > Why not extend that mechanism instead and improve *both* your sandboxing bits and the events code? This new seccomp code has a lot more to do with trace event filters than the minimal old seccomp code ...
>
> Would this require privileges to get the event fd to start with? [...]

No special privileges with the default perf_events_paranoid value.

> [...] If so, I would prefer to avoid that, since using prctl() as shown in the patch set won't require any privs.

and we could also explicitly allow syscall events without any privileges, regardless of the setting of 'perf_events_paranoid' config value. Obviously a sandboxing host process wants to run with as low privileges as it can.

Thanks,

	Ingo
On Wed, 11 May 2011, Will Drewry wrote:

> +void seccomp_filter_log_failure(int syscall)
> +{
> +	printk(KERN_INFO
> +		"%s[%d]: system call %d (%s) blocked at ip:%lx\n",
> +		current->comm, task_pid_nr(current), syscall,
> +		syscall_nr_to_name(syscall), KSTK_EIP(current));
> +}

I think it'd be a good idea to utilize the audit facility here.

- James
On Thu, 12 May 2011, Ingo Molnar wrote:

> 2) Why should this concept not be made available wider, to allow the restriction of not just system calls but other security relevant components of the kernel as well?

Because the aim of this is to reduce the attack surface of the syscall interface. LSM is the correct level of abstraction for general security mediation, because it allows you to take into account all relevant security information in a race-free context.

> This too, if you approach the problem via the events code, will be a natural end result, while if you approach it from the seccomp prctl angle it will be a limited hack only.

I'd say it's a well-defined and readily understandable feature.

- James
On Thu, May 12, 2011 at 09:48:50AM +0200, Ingo Molnar wrote:

> To restrict execution to system calls.
>
> Two observations:
>
> 1) We already have a specific ABI for this: you can set filters for events via an event fd.
>
>    Why not extend that mechanism instead and improve *both* your sandboxing bits and the events code? This new seccomp code has a lot more to do with trace event filters than the minimal old seccomp code ...
>
>    kernel/trace/trace_event_filter.c is 2000 lines of tricky code that interprets the ASCII filter expressions. kernel/seccomp.c is 86 lines of mostly trivial code.
>
> 2) Why should this concept not be made available wider, to allow the restriction of not just system calls but other security relevant components of the kernel as well?
>
>    This too, if you approach the problem via the events code, will be a natural end result, while if you approach it from the seccomp prctl angle it will be a limited hack only.
>
> Note, the end result will be the same - just using a different ABI.
>
> So i really think the ABI itself should be closer related to the event code. What this "seccomp" code does is that it uses specific syscall events to restrict execution of certain event generating codepaths, such as system calls.
>
> Thanks,
>
> 	Ingo

What's positive with that approach is that the code is all there already. Create a perf event for a given trace event, attach a filter to it. What needs to be added is an override of the effect of the filter. By default it's dropping the event, but there may be different flavours, including sending a signal. All in one, extending the current code to allow that looks trivial.

The negative points are that:

* trace events are supposed to stay passive and not act on the system, except doing some endpoint things like writing to a buffer. We can't call do_exit() from a tracepoint for example, preemption is disabled there.

* Also, is it actually relevant to extend that seccomp filtering to other events than syscalls? Exposing kernel events to filtering sounds actually to me bringing a new potential security issue. But with fine restrictions this can probably be dealt with. Especially if by default only syscalls can be filtered.

* I think Peter did not want to give such "active" role to perf in the system.
* James Morris <jmorris@namei.org> wrote:

> On Thu, 12 May 2011, Ingo Molnar wrote:
>
> > 2) Why should this concept not be made available wider, to allow the restriction of not just system calls but other security relevant components of the kernel as well?
>
> Because the aim of this is to reduce the attack surface of the syscall interface.

What i suggest achieves the same, my argument is that we could aim it to be even more flexible and even more useful.

> LSM is the correct level of abstraction for general security mediation, because it allows you to take into account all relevant security information in a race-free context.

I don't care about LSM though, i find it poorly designed.

The approach implemented here, the ability for *unprivileged code* to define (the seeds of ...) flexible security policies, in a proper Linuxish way, which is inherited along the task parent/child hierarchy and which allows nesting etc. is a *lot* more flexible.

What Will implemented here is pretty huge in my opinion: it turns security from a root-only kind of weird hack into an essential component of its APIs, available to *any* app not just the select security policy/mechanism chosen by the distributor ...

If implemented properly this could replace LSM in the long run.

As a prctl() hack bound to seccomp (which, by all means, is a natural extension to the current seccomp ABI, so perfectly fine if we only want that scope), that is much less likely to happen.

And if we merge the seccomp interface prematurely then interest towards a more flexible approach will disappear, so either we do it properly now or it will take some time for someone to come around and do it ...

Also note that i do not consider the perf events ABI itself cast into stone - and we could very well add a new system call for this, independent of perf events. I just think that the seccomp scope itself is exciting but looks limited to what the real potential of this could be.

> > This too, if you approach the problem via the events code, will be a natural end result, while if you approach it from the seccomp prctl angle it will be a limited hack only.
>
> I'd say it's a well-defined and readily understandable feature.

Note, it was me who suggested this very event-filter-engine design a year ago, when the first submission still used a crude bitmap of allowed seccomp syscalls:

  http://lwn.net/Articles/332974/

Funnily enough, back then you wrote this:

" I'm concerned that we're seeing yet another security scheme being designed on the fly, without a well-formed threat model, and without taking into account lessons learned from the seemingly endless parade of similar, failed schemes. "

so when and how did your opinion of this scheme turn from it being an "endless parade of failed schemes" to it being a "well-defined and readily understandable feature"? :-)

The idea itself has not changed since last year, what happened is that the filter engine got a couple of new features and Will has separated it out and has implemented a working prototype for sandboxing.

What i do here is to suggest *further* steps down the same road, now that we see that this scheme can indeed be used to implement sandboxing ... I think it's a valid line of inquiry.

Thanks,

	Ingo
[Thanks to everyone for the continued feedback and insights - I appreciate it!]

On Thu, May 12, 2011 at 8:01 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * James Morris <jmorris@namei.org> wrote:
>
> > On Thu, 12 May 2011, Ingo Molnar wrote:
> >
> > > 2) Why should this concept not be made available wider, to allow the restriction of not just system calls but other security relevant components of the kernel as well?
> >
> > Because the aim of this is to reduce the attack surface of the syscall interface.
>
> What i suggest achieves the same, my argument is that we could aim it to be even more flexible and even more useful.
>
> > LSM is the correct level of abstraction for general security mediation, because it allows you to take into account all relevant security information in a race-free context.
>
> I don't care about LSM though, i find it poorly designed.
>
> The approach implemented here, the ability for *unprivileged code* to define (the seeds of ...) flexible security policies, in a proper Linuxish way, which is inherited along the task parent/child hierarchy and which allows nesting etc. is a *lot* more flexible.
>
> What Will implemented here is pretty huge in my opinion: it turns security from a root-only kind of weird hack into an essential component of its APIs, available to *any* app not just the select security policy/mechanism chosen by the distributor ...
>
> If implemented properly this could replace LSM in the long run.
>
> As a prctl() hack bound to seccomp (which, by all means, is a natural extension to the current seccomp ABI, so perfectly fine if we only want that scope), that is much less likely to happen.
>
> And if we merge the seccomp interface prematurely then interest towards a more flexible approach will disappear, so either we do it properly now or it will take some time for someone to come around and do it ...
> Also note that i do not consider the perf events ABI itself cast into stone - and we could very well add a new system call for this, independent of perf events. I just think that the seccomp scope itself is exciting but looks limited to what the real potential of this could be.

I agree with you on many of these points! However, I don't think that the views around LSMs, perf/ftrace infrastructure, or the current seccomp filtering implementation are necessarily in conflict. Here is my understanding of how the different worlds fit together and where I see this patchset living, along with where I could see future work going. Perhaps I'm being a trifle naive, but here goes anyway:

1. LSMs provide a global mechanism for hooking "security relevant" events at a point where all the incoming user-sourced data has been preprocessed and moved in from userspace. The hooks are called every time one of those boundaries is crossed.

2. Perf and the ftrace infrastructure provide global function tracing and system call hooks with direct access to the caller's registers (and memory).

3. seccomp (as it exists today) provides a global system call entry hook point with a binary per-process decision about whether to provide "secure computing" behavior.

When I boil that down to abstractions, I see:

A. Globally scoped: LSMs, ftrace/perf
B. Locally/process scoped: seccomp

The result of that logical equivalence is that I see room for:

I. A per-process, locally scoped security event hooking interface (the proposed changes in this patchset)
II. A globally scoped security event hooking interface _prior_ to argument processing
III. A globally scoped security event hooking interface _post_ argument processing

II and III could be reduced further if I assume that ftrace/perf provides (II) and a simple intermediary layer (hook entry/exit) provides the argument processing steps that then call out to a global security policy system.
The driving motivation for this patchset is kernel attack surface reduction, but that need arises because we lack a process-scoped mechanism for making security decisions -- everything is global: creds/DAC, containers, LSM, etc. Adding ftrace filtering to agl's original bitmask-seccomp proposal opens up the process-local security world. At present, it can limit the attack surface with simple binary filters or apply limited security policy through the use of filter strings.

Based on your mails, I see two main deficiencies in my proposed patchset:

a. Deep argument analysis: Any arguments that live in user memory need to be copied into the kernel, then checked, and substituted for the actual system call, then have the original pointers restored (when applicable) on system call exit. There is a large overhead here and the LSM hooks provide much of this support on a global level.

b. Lack of support for non-system call events.

For (a), if the long term view of ftrace/perf & LSMs is that LSM-like functionality will live on top of the ftrace/perf infrastructure, then adding support for the intermediary layer to analyze arguments will come with time. It's also likely that for process-local stuff (e.g.,) a new predicate could be added to call back to a userspace supervisor, or even a more generic ability for modules to register new predicates/functions in the filtering engine itself -- like "fd == 1 && check_path(path) == '/etc/safe.conf'" or "check_xattr(path, expected)". Of course, I'm just making stuff up right now :)

For (b), we could just add a field we don't use right now in the prctl interface:

  prctl(PR_SET_SECCOMP_FILTER, int event_type, int event_or_syscall_nr, char *filter)

[or something similar]

Then we can add process-local/scoped supported event types somewhere down the road without an ABI change.
Tying it all together, it'd look like:

* Now -- add process-scoped security support: seccomp filter with support for "future" event types
* Soon -- expand ftrace syscall hooks to hook more system calls
* Later -- expand ftrace filter language to support either deep argument analysis and/or custom registered predicates
* Later, later -- implement a LSM-like hooking layer for "interesting" event types on top of the ftrace hooks

That would yield process-scoped security controls and global security controls and the ability to continue to create new and interesting security modules.

All that said, I'm in over my head. I've focused primarily on the process-scoped security. I think James, some of the LSM authors, and out-of-tree security system maintainers would be good to help guide direction toward the security view you have in mind to ensure the flexibility desired exists. And that's even assuming this sketch is even vaguely interesting...

[snip]

> What i do here is to suggest *further* steps down the same road, now that we see that this scheme can indeed be used to implement sandboxing ... I think it's a valid line of inquiry.

I certainly agree that it's a valid line of inquiry, but I worry about the massive scope expansion. I know it hurts my head, but I'm hoping the brain-dump above frames up how I think about this patch and your line of inquiry. ftrace hooking and the perf code certainly look a lot like LSMs if I squint hard :) But there is a substantial amount of work to merge the worlds, and (thankfully) I don't think that future directly impacts process-scoped security mechanisms even if they can interact nicely.

thanks!
will
On Thu, 12 May 2011, Ingo Molnar wrote:

> Funnily enough, back then you wrote this:
>
> " I'm concerned that we're seeing yet another security scheme being designed on the fly, without a well-formed threat model, and without taking into account lessons learned from the seemingly endless parade of similar, failed schemes. "
>
> so when and how did your opinion of this scheme turn from it being an "endless parade of failed schemes" to it being a "well-defined and readily understandable feature"? :-)

When it was defined in a way which limited its purpose to reducing the attack surface of the syscall interface.

- James
* James Morris <jmorris@namei.org> wrote:

> On Thu, 12 May 2011, Ingo Molnar wrote:
> > Funnily enough, back then you wrote this:
> >
> > " I'm concerned that we're seeing yet another security scheme being designed on the fly, without a well-formed threat model, and without taking into account lessons learned from the seemingly endless parade of similar, failed schemes. "
> >
> > so when and how did your opinion of this scheme turn from it being an "endless parade of failed schemes" to it being a "well-defined and readily understandable feature"? :-)
>
> When it was defined in a way which limited its purpose to reducing the attack surface of the syscall interface.

Let me outline a simple example of a new filter expression based security feature that could be implemented outside the narrow system call boundary you find acceptable, and please tell what is bad about it.

Say i'm a user-space sandbox developer who wants to enforce that sandboxed code should only be allowed to open files in /home/sandbox/, /lib/ and /usr/lib/. It is a simple and sensible security feature, agreed? It allows most code to run well and link to countless libraries - but no access to other files is allowed.

I would also like my sandbox app to be able to install this policy without having to be root. I do not want the sandbox app to have permission to create labels on /lib and /usr/lib and what not.

Firstly, using the filter code i deny the various link creation syscalls so that sandboxed code cannot escape for example by creating a symlink to outside the permitted VFS namespace. (Note: we opt-in to syscalls, that way new syscalls added by new kernels are denied by default. The current symlink creation syscalls are not opted in to.)
But the next step, actually checking filenames, poses a big hurdle: i cannot implement the filename checking at the sys_open() syscall level in a secure way: because the pathname is passed to sys_open() by pointer, and if i check it at the generic sys_open() syscall level, another thread in the sandbox might modify the underlying filename *after* i've checked it.

But if i had a VFS event at the fs/namei.c::getname() level, i would have access to a central point where the VFS string becomes stable to the kernel and can be checked (and denied if necessary).

A sidenote, and not surprisingly, the audit subsystem already has an event callback there:

	audit_getname(result);

Unfortunately this audit callback cannot be used for my purposes, because the event is single-purpose for auditd and because it allows no feedback (no deny/accept discretion for the security policy).

But if i had this simple event there:

	err = event_vfs_getname(result);

I could implement this new filename based sandboxing policy, using a filter like this installed on the vfs::getname event and inherited by all sandboxed tasks (which cannot uninstall the filter, obviously):

	" if (strstr(name, "..")) return -EACCES;
	  if (strncmp(name, "/home/sandbox/", 14) &&
	      strncmp(name, "/lib/", 5) &&
	      strncmp(name, "/usr/lib/", 9)) return -EACCES; "

#
# Note1: Obviously the filter engine would be extended to allow such simple string match functions.
#
# Note2: ".." is disallowed so that sandboxed code cannot escape the restrictions using "/..".
#

This kind of flexible and dynamic sandboxing would allow a wide range of file ops within the sandbox, while still isolating it from files not included in the specified VFS namespace.

( Note that there are tons of other examples as well, for useful security features that are best done using events outside the syscall boundary. )

The security event filters code tied to seccomp and syscalls at the moment is useful, but limited in its future potential.
So i argue that it should go slightly further and should become:

 - unprivileged: application-definable, allowing the embedding of security policy in *apps* as well, not just the system

 - flexible: can be added/removed runtime unprivileged, and cheaply so

 - transparent: does not impact executing code that meets the policy

 - nestable: it is inherited by child tasks and is fundamentally stackable, multiple policies will have the combined effect and they are transparent to each other. So if a child task within a sandbox adds *more* checks then those add to the already existing set of checks. We only narrow permissions, never extend them.

 - generic: allowing observation and (safe) control of security relevant parameters not just at the system call boundary but at other relevant places of kernel execution as well: which points/callbacks could also be used for other types of event extraction such as perf. It could even be shared with audit ...

I argue that this is the LSM and audit subsystems designed right: in the long run it could allow everything that LSM does at the moment - and so much more ...

And you argue that allowing this would be bad, if it was extended like that then you'd consider it a failed scheme? Why?

Thanks,

	Ingo
On Fri, 2011-05-13 at 14:10 +0200, Ingo Molnar wrote:
> err = event_vfs_getname(result);
I really think we should not do this. Events like we have them should be
inactive, totally passive entities, only observe but not affect
execution (other than the bare minimal time delay introduced by
observance).
If you want another entity that is more active, please invent a new name
for it and create a new subsystem for them, now you could have these
active entities also have an (automatic) passive event side, but that's
some detail.
* Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, 2011-05-13 at 14:10 +0200, Ingo Molnar wrote:
> > err = event_vfs_getname(result);
>
> I really think we should not do this. Events like we have them should be inactive, totally passive entities, only observe but not affect execution (other than the bare minimal time delay introduced by observance).

Well, this patchset already demonstrates that we can use a single event callback for a rather useful purpose. Either it makes sense to do, in which case we should share facilities as much as possible, or it makes no sense, in which case we should not merge it at all.

> If you want another entity that is more active, please invent a new name for it and create a new subsystem for them, now you could have these active entities also have an (automatic) passive event side, but that's some detail.

Why should we have two callbacks next to each other:

	event_vfs_getname(result);
	result = check_event_vfs_getname(result);

if one could do it all?

Thanks,

	Ingo
On Fri, 2011-05-13 at 14:26 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <peterz@infradead.org> wrote:
>
> > On Fri, 2011-05-13 at 14:10 +0200, Ingo Molnar wrote:
> > > err = event_vfs_getname(result);
> >
> > I really think we should not do this. Events like we have them should be inactive, totally passive entities, only observe but not affect execution (other than the bare minimal time delay introduced by observance).
>
> Well, this patchset already demonstrates that we can use a single event callback for a rather useful purpose.

Can and should are two distinct things.

> Either it makes sense to do, in which case we should share facilities as much as possible, or it makes no sense, in which case we should not merge it at all.

And I'm arguing we should _not_. Observing is radically different from Affecting, at the very least the two things should have different permission schemes. We should not confuse these two matters.

> > If you want another entity that is more active, please invent a new name for it and create a new subsystem for them, now you could have these active entities also have an (automatic) passive event side, but that's some detail.
>
> Why should we have two callbacks next to each other:
>
> 	event_vfs_getname(result);
> 	result = check_event_vfs_getname(result);
>
> if one could do it all?

Did you actually read the bit where I said that check_event_* (although I still think that name sucks) could imply a matching event_*?
On Fri, 2011-05-13 at 14:39 +0200, Peter Zijlstra wrote:

> > > event_vfs_getname(result);
> > > result = check_event_vfs_getname(result);

Another fundamental difference is how to treat the callback chains for these two.

Observers won't have a return value and are assumed to never fail, therefore we can always call every entry on the callback list.

Active things otoh do have a return value, and thus we need to have semantics that define what to do with that during callback iteration, when to continue and when to break. Thus for active elements it's impossible to guarantee all entries will indeed be called.
* Peter Zijlstra <peterz@infradead.org> wrote:

> > Why should we have two callbacks next to each other:
> >
> > 	event_vfs_getname(result);
> > 	result = check_event_vfs_getname(result);
> >
> > if one could do it all?
>
> Did you actually read the bit where I said that check_event_* (although I still think that name sucks) could imply a matching event_*?

No, did not notice that - and yes that solves this particular problem.

So given that by your own admission it makes sense to share the facilities at the low level, i also argue that it makes sense to share as high up as possible.

Are you perhaps arguing for a ->observe flag that would make 100% sure that the default behavior for events is observe-only? That would make sense indeed. Otherwise both cases really want to use all the same facilities for event discovery, setup, control and potential extraction of events.

Thanks,

	Ingo
* Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, 2011-05-13 at 14:39 +0200, Peter Zijlstra wrote:
>
> > > event_vfs_getname(result);
> > > result = check_event_vfs_getname(result);
>
> Another fundamental difference is how to treat the callback chains for these two.
>
> Observers won't have a return value and are assumed to never fail, therefore we can always call every entry on the callback list.
>
> Active things otoh do have a return value, and thus we need to have semantics that define what to do with that during callback iteration, when to continue and when to break. Thus for active elements it's impossible to guarantee all entries will indeed be called.

I think the sanest semantics is to run all active callbacks as well.

For example if this is used for three stacked security policies - as if 3 LSM modules were stacked at once. We'd call all three, and we'd determine that at least one failed - and we'd return a failure.

Even if the first one failed already we'd still want to trigger *all* the failures, because security policies like to know when they have triggered a failure (regardless of other active policies) and want to see that failure event (if they are logging such events).

So to me this looks pretty similar to observer callbacks as well, it's the natural extension to an observer callback chain. Observer callbacks are simply constant functions (to the caller), those which never return failure and which never modify any of the parameters.

It's as if you argued that there should be separate syscalls/facilities for handling readonly files versus handling read/write files.

Thanks,

	Ingo
On Fri, 2011-05-13 at 14:54 +0200, Ingo Molnar wrote:
> I think the sanest semantics is to run all active callbacks as well.
>
> For example if this is used for three stacked security policies - as if 3 LSM modules were stacked at once. We'd call all three, and we'd determine that at least one failed - and we'd return a failure.

But that only works for boolean functions where you can return the multi-bit-or of the result. What if you need to return the specific error code?

Also, there's bound to be other cases where people will want to employ this, look at all the various notifier chain muck we've got, it already deals with much of this -- simply because users need it.

Then there's the whole indirection argument, if you don't need indirection, it's often better to not use it, I myself much prefer code to look like:

	foo1(bar);
	foo2(bar);
	foo3(bar);

Than:

	foo_notifier(bar);

Simply because it's much clearer who all are involved without me having to grep around to see who registers for foo_notifier and wth they do with it. It also makes it much harder to sneak in another user, whereas it's nearly impossible to find new notifier users.

It's also much faster, no extra memory accesses, no indirect function calls, no other muck.
* Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, 2011-05-13 at 14:54 +0200, Ingo Molnar wrote:
> > I think the sanest semantics is to run all active callbacks as well.
> >
> > For example if this is used for three stacked security policies - as if 3 LSM modules were stacked at once. We'd call all three, and we'd determine that at least one failed - and we'd return a failure.
>
> But that only works for boolean functions where you can return the multi-bit-or of the result. What if you need to return the specific error code.

Do you mean that one filter returns -EINVAL while the other -EACCES?

Seems like a non-problem to me, we'd return the first nonzero value.

> Also, there's bound to be other cases where people will want to employ this, look at all the various notifier chain muck we've got, it already deals with much of this -- simply because users need it.

Do you mean it would be easy to abuse it? What kind of abuse are you most worried about?

> Then there's the whole indirection argument, if you don't need indirection, it's often better to not use it, I myself much prefer code to look like:
>
> 	foo1(bar);
> 	foo2(bar);
> 	foo3(bar);
>
> Than:
>
> 	foo_notifier(bar);
>
> Simply because it's much clearer who all are involved without me having to grep around to see who registers for foo_notifier and wth they do with it. It also makes it much harder to sneak in another user, whereas it's nearly impossible to find new notifier users.
>
> It's also much faster, no extra memory accesses, no indirect function calls, no other muck.

But i suspect this question has been settled, given the fact that even pure observer events need and already process a chain of events? Am i missing something about your argument?

Thanks,

	Ingo
Cut the microblaze list since it's bouncy.

On Fri, 2011-05-13 at 15:18 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, 2011-05-13 at 14:54 +0200, Ingo Molnar wrote:
> > > I think the sanest semantics is to run all active callbacks as well.
> > >
> > > For example if this is used for three stacked security policies - as if 3 LSM modules were stacked at once. We'd call all three, and we'd determine that at least one failed - and we'd return a failure.
> >
> > But that only works for boolean functions where you can return the multi-bit-or of the result. What if you need to return the specific error code.
>
> Do you mean that one filter returns -EINVAL while the other -EACCES?
>
> Seems like a non-problem to me, we'd return the first nonzero value.

Assuming the first is -EINVAL, what then is the value in computing the -EACCES? Sounds like a massive waste of time to me.

> > Also, there's bound to be other cases where people will want to employ this, look at all the various notifier chain muck we've got, it already deals with much of this -- simply because users need it.
>
> Do you mean it would be easy to abuse it? What kind of abuse are you most worried about?

I'm not worried about abuse; I'm saying that, going by the existing notifier pattern, always visiting all entries on the callback list is undesired.

> > Then there's the whole indirection argument, if you don't need indirection, its often better to not use it, I myself much prefer code to look like:
> >
> >   foo1(bar);
> >   foo2(bar);
> >   foo3(bar);
> >
> > Than:
> >
> >   foo_notifier(bar);
> >
> > Simply because its much clearer who all are involved without me having to grep around to see who registers for foo_notifier and wth they do with it. It also makes it much harder to sneak in another user, whereas its nearly impossible to find new notifier users.
> >
> > Its also much faster, no extra memory accesses, no indirect function calls, no other muck.
>
> But i suspect this question has been settled, given the fact that even pure observer events need and already process a chain of events? Am i missing something about your argument?

I'm saying that there are reasons not to use notifiers, passive or active.

Mostly the whole notifier/indirection muck comes up once you want modules to make use of the thing, because then you need dynamic management of the callback list.

(Then again, I'm fairly glad we don't have explicit callbacks in kernel/cpu.c for all the cpu-hotplug callbacks :-)

Anyway, I oppose the existing events gaining an active role.
On Fri, 2011-05-13 at 14:49 +0200, Ingo Molnar wrote:
>
> So given that by your own admission it makes sense to share the facilities at the low level, i also argue that it makes sense to share as high up as possible.

I'm not saying any such thing; I'm saying that it might make sense to observe active objects and auto-create these observation points. That doesn't make them similar or make them share anything.
* Peter Zijlstra <peterz@infradead.org> wrote:
> Cut the microblaze list since its bouncy.
>
> On Fri, 2011-05-13 at 15:18 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Fri, 2011-05-13 at 14:54 +0200, Ingo Molnar wrote:
> > > > I think the sanest semantics is to run all active callbacks as well.
> > > >
> > > > For example if this is used for three stacked security policies - as if 3 LSM modules were stacked at once. We'd call all three, and we'd determine that at least one failed - and we'd return a failure.
> > >
> > > But that only works for boolean functions where you can return the multi-bit-or of the result. What if you need to return the specific error code.
> >
> > Do you mean that one filter returns -EINVAL while the other -EACCES?
> >
> > Seems like a non-problem to me, we'd return the first nonzero value.
>
> Assuming the first is -EINVAL, what then is the value in computing the -EACCES? Sounds like a massive waste of time to me.

No, because the common case is no rejection - this is a security mechanism. So in the normal case we would execute all 3 anyway, just to determine that all return 0.

Are you really worried about the abnormal case of one of them returning an error and us calculating all 3 return values?

> > > Also, there's bound to be other cases where people will want to employ this, look at all the various notifier chain muck we've got, it already deals with much of this -- simply because users need it.
> >
> > Do you mean it would be easy to abuse it? What kind of abuse are you most worried about?
>
> I'm not worried about abuse, I'm saying that going by the existing notifier pattern always visiting all entries on the callback list is undesired.

That is because many notifier chains are used in an 'event consuming' manner - they are responding to things like hardware events and are called in an interrupt-handler-alike fashion most of the time.
> > > Then there's the whole indirection argument, if you don't need indirection, its often better to not use it, I myself much prefer code to look like:
> > >
> > >   foo1(bar);
> > >   foo2(bar);
> > >   foo3(bar);
> > >
> > > Than:
> > >
> > >   foo_notifier(bar);
> > >
> > > Simply because its much clearer who all are involved without me having to grep around to see who registers for foo_notifier and wth they do with it. It also makes it much harder to sneak in another user, whereas its nearly impossible to find new notifier users.
> > >
> > > Its also much faster, no extra memory accesses, no indirect function calls, no other muck.
> >
> > But i suspect this question has been settled, given the fact that even pure observer events need and already process a chain of events? Am i missing something about your argument?
>
> I'm saying that there's reasons to not use notifiers passive or active.
>
> Mostly the whole notifier/indirection muck comes up once you want modules to make use of the thing, because then you need dynamic management of the callback list.

But your argument assumes that we'd have a chain of functions to call, like regular notifiers. Whereas the natural model here would be to have a list of registered event structs for that point, with different filters but basically the same callback mechanism (a call into the filter engine, in essence).

Also note that the common case would be no event registered - and we'd automatically optimize that case via the existing jump-labels optimization.

> (Then again, I'm fairly glad we don't have explicit callbacks in kernel/cpu.c for all the cpu-hotplug callbacks :-)
>
> Anyway, I oppose for the existing events to gain an active role.

Why, if 'being active' is optional and useful?

Thanks,

	Ingo
* Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2011-05-13 at 14:49 +0200, Ingo Molnar wrote:
> >
> > So given that by your own admission it makes sense to share the facilities at the low level, i also argue that it makes sense to share as high up as possible.
>
> I'm not saying any such thing, I'm saying that it might make sense to observe active objects and auto-create these observation points. That doesn't make them similar or make them share anything.

Well, they would share the lowest level call site:

  result = check_event_vfs_getname(result);

You call it an 'auto-generated call site', i call it a shared (single line) call site. The same thing as far as the lowest level goes.

Now (the way i understood it) you'd want to stop the sharing right after that. I argue that it should go all the way up.

Note: i fully agree that there should be events where filters can have no effect whatsoever. For example if this was written as:

  check_event_vfs_getname(result);

then it would have no effect. This is decided by the subsystem developers, obviously. So whether an event is 'active' or 'passive' can be enforced at the subsystem level as well.

As far as the event facilities go, 'no effect observation' is a special case of 'active observation' - just like read-only files are a special case of read-write files.

Thanks,

	Ingo
[dropping microblaze and roland]

On Fri, 2011-05-13 at 14:10 +0200, Ingo Molnar wrote:
> * James Morris <jmorris@namei.org> wrote:
>
> It is a simple and sensible security feature, agreed? It allows most code to run well and link to countless libraries - but no access to other files is allowed.

It's simple enough and sounds reasonable, but you can read all the discussion about AppArmour to see why many people don't really think it's the best. Still, I'll agree it's a lot better than nothing.

> But if i had a VFS event at the fs/namei.c::getname() level, i would have access to a central point where the VFS string becomes stable to the kernel and can be checked (and denied if necessary).
>
> A sidenote, and not surprisingly, the audit subsystem already has an event callback there:
>
>   audit_getname(result);
>
> Unfortunately this audit callback cannot be used for my purposes, because the event is single-purpose for auditd and because it allows no feedback (no deny/accept discretion for the security policy).
>
> But if had this simple event there:
>
>   err = event_vfs_getname(result);

Wow, it sounds so easy. Now let's keep extending your train of thought until we can actually provide the security provided by SELinux. What do we end up with? We end up with an event hook right next to every LSM hook. You know, the LSM hooks were placed where they are for a reason: those were the locations inside the kernel where you actually have information about the task doing an operation and the objects (files, sockets, directories, other tasks, etc.) it is doing an operation on.

Honestly, all you are talking about is remaking the LSM with 2 sets of hooks instead of 1. Why? It seems much easier that, if you want the language of the filter engine, you would just make a new LSM that uses the filter engine for its policy language rather than the language created by SELinux or SMACK or name your LSM implementation.
> - unprivileged: application-definable, allowing the embedding of security policy in *apps* as well, not just the system
>
> - flexible: can be added/removed runtime unprivileged, and cheaply so
>
> - transparent: does not impact executing code that meets the policy
>
> - nestable: it is inherited by child tasks and is fundamentally stackable, multiple policies will have the combined effect and they are transparent to each other. So if a child task within a sandbox adds *more* checks then those add to the already existing set of checks. We only narrow permissions, never extend them.
>
> - generic: allowing observation and (safe) control of security relevant parameters not just at the system call boundary but at other relevant places of kernel execution as well: which points/callbacks could also be used for other types of event extraction such as perf. It could even be shared with audit ...

I'm not arguing that any of these things are bad things. What you describe is a new LSM that uses a discretionary access control model, but with the granularity and flexibility that has traditionally only existed in the mandatory access control security modules previously implemented in the kernel.

I won't argue that's a bad idea; there's no reason in my mind that a process shouldn't be allowed to control its own access decisions in a more flexible way than rwx bits. Then again, I certainly don't see a reason that this syscall hardening patch should be held up while a whole new concept in computer security is contemplated...

-Eric
[dropping microblaze and roland]

On Fri, 2011-05-13 at 15:18 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, 2011-05-13 at 14:54 +0200, Ingo Molnar wrote:
> > > I think the sanest semantics is to run all active callbacks as well.
> > >
> > > For example if this is used for three stacked security policies - as if 3 LSM modules were stacked at once. We'd call all three, and we'd determine that at least one failed - and we'd return a failure.
> >
> > But that only works for boolean functions where you can return the multi-bit-or of the result. What if you need to return the specific error code.
>
> Do you mean that one filter returns -EINVAL while the other -EACCES?
>
> Seems like a non-problem to me, we'd return the first nonzero value.

Sounds so easy! Why haven't LSMs stacked already? Because what happens if one of these hooks does something stateful? Let's say on open, hook #1 returns EPERM while hook #2 allocates memory. The open is going to fail, and hook #2 is never going to get the close() which should have freed the allocation.

If you can be completely stateless it's easier, but there's a reason that stacking security modules is hard. Serge has tried in the past, and both dhowells and casey schaufler are working on it right now. Stacking is never as easy as it sounds :)

-Eric
On Fri, 2011-05-13 at 11:10 -0400, Eric Paris wrote:
> Then again, I certainly don't see a reason that this syscall hardening patch should be held up while a whole new concept in computer security is contemplated...

Which makes me wonder why this syscall hardening stuff is done outside of LSM. Why isn't it part of the LSM, so that, say, SELinux can have a syscall bitmask per security context?

Making it part of the LSM also avoids having to add this prctl().
On Fri, 2011-05-13 at 16:57 +0200, Ingo Molnar wrote:
> this is a security mechanism
Who says? And why would you want to unify two separate concepts only to then limit them to security? That just doesn't make sense.
Either you provide a full-on replacement for notifier-chain-like things or you don't; only extending trace events in this fashion for security is just way weird.
Plus, see the arguments Eric made about stacking stuff: security schemes are not the only things that will have those problems.
> ... If you can be completely stateless its easier, but there's a reason that stacking security modules is hard. Serge has tried in the past and both dhowells and casey schaufler are working on it right now. Stacking is never as easy as it sounds :)

For a bad example of trying to allow alternate security models, look at NetBSD's kauth code :-)

NetBSD also had issues where some 'system call trace' code was being used to (try to) apply security - unfortunately it worked by looking at the user-space buffers on system call entry, and a multithreaded program can easily arrange to update them after the initial check!

For trace/event type activities this wouldn't really matter; for security policy it does.

(I've not looked directly at these event points in Linux.)

	David
On Fri, 2011-05-13 at 17:23 +0200, Peter Zijlstra wrote:
> On Fri, 2011-05-13 at 11:10 -0400, Eric Paris wrote:
> > Then again, I certainly don't see a reason that this syscall hardening patch should be held up while a whole new concept in computer security is contemplated...
>
> Which makes me wonder why this syscall hardening stuff is done outside of LSM? Why isn't is part of the LSM so that say SELinux can have a syscall bitmask per security context?

I could do that, but I like Will's approach better. From the PoV of meeting security goals of information flow, data confidentiality, integrity, least priv, etc., limiting at the syscall boundary doesn't make a lot of sense. You just don't know enough there to enforce these things. These are the types of goals that SELinux and other LSMs have previously tried to enforce. From the PoV of making the kernel more resistant to attacks and making a process more resistant to misbehavior, I think the syscall boundary is appropriate. Although I could do it in SELinux, I don't really want to do it there.

In case people are interested or confused, let me give my definition of two words I've used a bit in these conversations: discretionary and mandatory. Any time I talk about a 'discretionary' security decision, it is a security decision that a process imposed upon itself. Aka the choice to use seccomp is discretionary; the choice to mark our own file u-wx is discretionary. This isn't the best definition, but it's one that works well in this discussion. Mandatory security is enforced by a global policy. It's what SELinux is all about. SELinux doesn't give a hoot what a process wants to do; it enforces a global policy from the top down. You take over a process, well, too bad, you still have no choice but to follow the mandatory policy.

The LSM does NOT enforce a mandatory access control model; that's just how it's been used in the past.
Ingo appears to me (please correct me if I'm wrong) to really be a fan of exposing the flexibility of the LSM to a discretionary access control model. That doesn't seem like a bad idea. And maybe using the filter engine to define the language to do this isn't a bad idea either. But I think that's a 'down the road' project, not something to hold up a better seccomp.

> Making it part of the LSM also avoids having to add this prctl().

Well, it would mean exposing some new language construct to every LSM (instead of a single prctl construct), and it would mean anyone wanting to use the interface would have to rely on the LSM implementing those hooks the way they need. Honestly, chrome can already get all of the benefits of this patch (given a perfectly coded kernel) and a whole lot more using SELinux, but (surprise surprise) not everyone uses SELinux. I think it's a good idea to expose a simple interface which will be widely enough adopted that many userspace applications can rely on it for hardening.

The existence of the LSM, and the fact that there exist multiple security modules which may or may not be enabled, really leaves application developers unable to rely on LSM for security. If Linux had a single security model which everyone could rely on, we wouldn't really have as big of an issue, but that's not possible. So I'm advocating for this series, which will provide a single useful change which applications can rely upon across distros and platforms to enhance the properties and abilities of the Linux kernel.

-Eric
On Fri, May 13, 2011 at 10:55 AM, Eric Paris <eparis@redhat.com> wrote:
> On Fri, 2011-05-13 at 17:23 +0200, Peter Zijlstra wrote:
> > On Fri, 2011-05-13 at 11:10 -0400, Eric Paris wrote:
> > > Then again, I certainly don't see a reason that this syscall hardening patch should be held up while a whole new concept in computer security is contemplated...
> >
> > Which makes me wonder why this syscall hardening stuff is done outside of LSM? Why isn't is part of the LSM so that say SELinux can have a syscall bitmask per security context?
>
> I could do that, but I like Will's approach better. From the PoV of meeting security goals of information flow, data confidentiality, integrity, least priv, etc limiting on the syscall boundary doesn't make a lot of sense. You just don't know enough there to enforce these things. These are the types of goals that SELinux and other LSMs have previously tried to enforce. From the PoV of making the kernel more resistant to attacks and making a process more resistant to misbehavior I think that the syscall boundary is appropriate. Although I could do it in SELinux it don't really want to do it there.

There's also the problem that there are no per-system-call hooks for LSMs, only logical hooks that sometimes mirror system call names and are called after user data has been parsed. If system call entry hooks, like seccomp's, were added for LSMs, it would allow the LSM bitmask approach, but it still wouldn't satisfy the issues you raise below (and I wholeheartedly agree with them).

> In case people are interested or confused let me give my definition of two words I've used a bit in these conversations: discretionary and mandatory. Any time I talk about a 'discretionary' security decision it is a security decision that a process imposed upon itself. Aka the choice to use seccomp is discretionary. The choice to mark our own file u-wx is discretionary.
> This isn't the best definition but it's one that works well in this discussion. Mandatory security is one enforced by a global policy. It's what selinux is all about. SELinux doesn't give a hoot what a process wants to do, it enforces a global policy from the top down. You take over a process, well, too bad, you still have no choice but to follow the mandatory policy.
>
> The LSM does NOT enforce a mandatory access control model, it's just how it's been used in the past. Ingo appears to me (please correct me if I'm wrong) to really be a fan of exposing the flexibility of the LSM to a discretionary access control model. That doesn't seem like a bad idea. And maybe using the filter engine to define the language to do this isn't a bad idea either. But I think that's a 'down the road' project, not something to hold up a better seccomp.
>
> > Making it part of the LSM also avoids having to add this prctl().
>
> Well, it would mean exposing some new language construct to every LSM (instead of a single prctl construct) and it would mean anyone wanting to use the interface would have to rely on the LSM implementing those hooks the way they need it. Honestly chrome can already get all of the benefits of this patch (given a perfectly coded kernel) and a whole lot more using SELinux, but (surprise surprise) not everyone uses SELinux. I think it's a good idea to expose a simple interface which will be widely enough adopted that many userspace applications can rely on it for hardening.
>
> The existence of the LSM and the fact that there exists multiple security modules that may or may not be enabled really leads application developers to be unable to rely on LSM for security. If linux had a single security model which everyone could rely on we wouldn't really have as big of an issue but that's not possible.
> So I'm advocating for this series which will provide a single useful change which applications can rely upon across distros and platforms to enhance the properties and abilities of the linux kernel.
>
> -Eric
On Thursday 12 May 2011, Will Drewry wrote:
> This change adds a new seccomp mode based on the work by agl@chromium.org in [1]. This new mode, "filter mode", provides a hash table of seccomp_filter objects. When in the new mode (2), all system calls are checked against the filters - first by system call number, then by a filter string. If an entry exists for a given system call and all filter predicates evaluate to true, then the task may proceed. Otherwise, the task is killed (as per seccomp_mode == 1).

I've got a question about this: do you expect the typical usage to disallow ioctl()?

Given that ioctl alone is responsible for a huge number of exploits in various drivers, while certain ioctls are immensely useful (FIONREAD, FIOASYNC, ...), do you expect to extend the mechanism to filter specific ioctl commands in the future?

	Arnd
* Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2011-05-13 at 16:57 +0200, Ingo Molnar wrote:
> > this is a security mechanism
>
> Who says? [...]

Kernel developers/maintainers of the affected code. We have security hooks all around the kernel, which can deny/accept execution at various key points, but we do not have 'execute arbitrary user-space defined (safe) scripts' callbacks in general.

But yes, if a particular callback point is defined widely enough to allow much bigger intervention into the flow of execution, then more is possible as well.

> [...] and why would you want to unify two separate concepts only to them limit it to security that just doesn't make sense.

I don't limit them to security - the callbacks themselves are either for passive observation or, at most, for security accept/deny callbacks. It's decided by the subsystem maintainers what kind of user-space control power (or observation power) they want to allow, not me.

I would just like to not stop the facility itself at the 'observe only' level, like you suggest.

Thanks,

	Ingo
* Eric Paris <eparis@redhat.com> wrote:
> [dropping microblaze and roland]
>
> On Fri, 2011-05-13 at 14:10 +0200, Ingo Molnar wrote:
> > * James Morris <jmorris@namei.org> wrote:
> >
> > It is a simple and sensible security feature, agreed? It allows most code to run well and link to countless libraries - but no access to other files is allowed.
>
> It's simple enough and sounds reasonable, but you can read all the discussion about AppArmour why many people don't really think it's the best. [...]

I have to say most of the participants of the AppArmour flamefests were dead wrong, and it wasn't the AppArmour folks who were wrong ...

The straight ASCII VFS namespace *makes sense*, and yes, the raw physical objects space that SELinux uses makes sense as well.

And no, i do not subscribe to the dogma that it is not possible to secure the ASCII VFS namespace: it evidently is possible, if you know and handle the ambiguities. It is also obviously true that the ASCII VFS namespaces we use every day are a *lot* more intuitive than the labeled physical objects space ...

What all the security flamewars missed is the simple fact that being intuitive matters a *lot*, not just to avoid annoying users but also to broaden the effective number of security-conscious developers ...

> > Unfortunately this audit callback cannot be used for my purposes, because the event is single-purpose for auditd and because it allows no feedback (no deny/accept discretion for the security policy).
> >
> > But if had this simple event there:
> >
> >   err = event_vfs_getname(result);
>
> Wow it sounds so easy. Now lets keep extending your train of thought until we can actually provide the security provided by SELinux. What do we end up with? We end up with an event hook right next to every LSM hook. You know, the LSM hooks were placed where they are for a reason.
> Because those were the locations inside the kernel where you actually have information about the task doing an operation and the objects (files, sockets, directories, other tasks, etc) they are doing an operation on.
>
> Honestly all you are talking about it remaking the LSM with 2 sets of hooks instead if 1. Why? [...]

Not at all. I am talking about using *one* set of events, to keep the intrusion at the lowest possible level.

LSM could make use of them as well.

Obviously, for pragmatic reasons, that might not be feasible initially.

> [...] It seems much easier that if you want the language of the filter engine you would just make a new LSM that uses the filter engine for it's policy language rather than the language created by SELinux or SMACK or name your LSM implementation.

Correct, that is what i suggested.

Note that performance is a primary concern, so if certain filters are very popular then in practice this would come with support for a couple of 'built in' (pre-optimized) filters that the kernel can accelerate directly, so that we do not incur the cost of executing the filter predicates for really common-sense security policies that almost everyone is using.

I.e. in the end we'd *roughly* end up with the same performance and security as we have today (i mean, SELinux and the other LSMs did a nice job of collecting the things that apps should be careful about), but the crucial difference isn't just the advantages i mentioned, it's the fact that the *development model* of security modules would be a *lot* more extensible.

So security efforts could move to a whole different level: they could move into key apps and they could integrate with the general mind-set of developers.

At least Will, as an application framework developer, cares - so that hope is justified i think.
> > - unprivileged: application-definable, allowing the embedding of security policy in *apps* as well, not just the system
> >
> > - flexible: can be added/removed runtime unprivileged, and cheaply so
> >
> > - transparent: does not impact executing code that meets the policy
> >
> > - nestable: it is inherited by child tasks and is fundamentally stackable, multiple policies will have the combined effect and they are transparent to each other. So if a child task within a sandbox adds *more* checks then those add to the already existing set of checks. We only narrow permissions, never extend them.
> >
> > - generic: allowing observation and (safe) control of security relevant parameters not just at the system call boundary but at other relevant places of kernel execution as well: which points/callbacks could also be used for other types of event extraction such as perf. It could even be shared with audit ...
>
> I'm not arguing that any of these things are bad things. What you describe is a new LSM that uses a discretionary access control model but with the granularity and flexibility that has traditionally only existed in the mandatory access control security modules previously implemented in the kernel.
>
> I won't argue that's a bad idea, there's no reason in my mind that a process shouldn't be allowed to control it's own access decisions in a more flexible way than rwx bits. Then again, I certainly don't see a reason that this syscall hardening patch should be held up while a whole new concept in computer security is contemplated...

Note, i'm not actually asking for the moon, a pony and more.

I fully submit that we are yet far away from being able to do a full LSM via this mechanism.
What i'm asking for is this: the syscall-point steps taken by Will look very promising, so it would be nice to do *that* in a slightly more flexible scheme which does not condemn it to being limited to the syscall boundary and such ...

Also, to answer you: do you say that by my argument someone should have stood up and said 'no' to the LSM mess that was introduced a couple of years ago and which caused so many problems:

 - kernel inefficiencies and user-space overhead
 - stalled security efforts
 - infighting
 - friction, fragmentation, overmodularization
 - non-stackability
 - security annoyances on the Linux desktop
 - probably *less* Linux security

and should have asked them to do something better designed instead?

Thanks,

	Ingo
On Sat, May 14, 2011 at 2:30 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Eric Paris <eparis@redhat.com> wrote:
>
>> [dropping microblaze and roland]
>>
>> On Fri, 2011-05-13 at 14:10 +0200, Ingo Molnar wrote:
>> > * James Morris <jmorris@namei.org> wrote:
>>
>> > It is a simple and sensible security feature, agreed? It allows most code to
>> > run well and link to countless libraries - but no access to other files is
>> > allowed.
>>
>> It's simple enough and sounds reasonable, but you can read all the discussion
>> about AppArmour why many people don't really think it's the best. [...]
>
> I have to say most of the participants of the AppArmour flamefests were dead
> wrong, and it wasn't the AppArmour folks who were wrong ...
>
> The straight ASCII VFS namespace *makes sense*, and yes, the raw physical
> objects space that SELinux uses makes sense as well.
>
> And no, i do not subscribe to the dogma that it is not possible to secure the
> ASCII VFS namespace: it evidently is possible, if you know and handle the
> ambiguities. It is also obviously true that the ASCII VFS namespaces we use
> every day are a *lot* more intuitive than the labeled physical objects space
> ...
>
> What all the security flamewars missed is the simple fact that being intuitive
> matters a *lot*, not just to avoid annoying users, but also to broaden the
> effective number of security-conscious developers ...
>
>> > Unfortunately this audit callback cannot be used for my purposes, because
>> > the event is single-purpose for auditd and because it allows no feedback
>> > (no deny/accept discretion for the security policy).
>> >
>> > But if i had this simple event there:
>> >
>> >     err = event_vfs_getname(result);
>>
>> Wow, it sounds so easy. Now let's keep extending your train of thought
>> until we can actually provide the security provided by SELinux. What do
>> we end up with? We end up with an event hook right next to every LSM
>> hook. You know, the LSM hooks were placed where they are for a reason.
>> Because those were the locations inside the kernel where you actually
>> have information about the task doing an operation and the objects
>> (files, sockets, directories, other tasks, etc) they are doing an
>> operation on.
>>
>> Honestly, all you are talking about is remaking the LSM with 2 sets of
>> hooks instead of 1. Why? [...]
>
> Not at all. I am talking about using *one* set of events, to keep the
> intrusion at the lowest possible level.
>
> LSM could make use of them as well.
>
> Obviously for pragmatic reasons that might not be feasible initially.
>
>> [...] It seems much easier that if you want the language of the filter
>> engine you would just make a new LSM that uses the filter engine for its
>> policy language rather than the language created by SELinux or SMACK or
>> name your LSM implementation.
>
> Correct, that is what i suggested.
>
> Note that performance is a primary concern, so if certain filters are very
> popular then in practice this would come with support for a couple of
> 'built-in' (pre-optimized) filters that the kernel can accelerate directly,
> so that we do not incur the cost of executing the filter preds for really
> common-sense security policies that almost everyone is using.
>
> I.e. in the end we'd *roughly* end up with the same performance and security
> as we have today (i mean, SELinux and the other LSMs did a nice job of
> collecting the things that apps should be careful about), but the crucial
> difference isn't just the advantages i mentioned, but the fact that the
> *development model* of security modules would be a *lot* more extensible.
>
> So security efforts could move to a whole different level: they could move
> into key apps and they could integrate with the general mind-set of
> developers.
>
> At least Will as an application framework developer cares, so that hope is
> justified i think.
>
>> > - unprivileged: application-definable, allowing the embedding of security
>> >   policy in *apps* as well, not just the system
>> >
>> > - flexible: can be added/removed runtime unprivileged, and cheaply so
>> >
>> > - transparent: does not impact executing code that meets the policy
>> >
>> > - nestable: it is inherited by child tasks and is fundamentally stackable;
>> >   multiple policies will have the combined effect and they are transparent
>> >   to each other. So if a child task within a sandbox adds *more* checks,
>> >   those add to the already existing set of checks. We only narrow
>> >   permissions, never extend them.
>> >
>> > - generic: allowing observation and (safe) control of security-relevant
>> >   parameters not just at the system call boundary but at other relevant
>> >   places of kernel execution as well - points/callbacks which could also
>> >   be used for other types of event extraction such as perf. It could even
>> >   be shared with audit ...
>>
>> I'm not arguing that any of these things are bad things. What you describe
>> is a new LSM that uses a discretionary access control model but with the
>> granularity and flexibility that has traditionally only existed in the
>> mandatory access control security modules previously implemented in the
>> kernel.
>>
>> I won't argue that's a bad idea; there's no reason in my mind that a process
>> shouldn't be allowed to control its own access decisions in a more flexible
>> way than rwx bits. Then again, I certainly don't see a reason that this
>> syscall hardening patch should be held up while a whole new concept in
>> computer security is contemplated...
>
> Note, i'm not actually asking for the moon, a pony and more.
>
> I fully submit that we are yet far away from being able to do a full LSM via
> this mechanism.
> > What i'm asking for is that because the syscall point steps taken by Will look > very promising so it would be nice to do *that* in a slightly more flexible > scheme that does not condemn it to be limited to the syscall boundary and such > ... What do you suggest here? From my brief exploration of the ftrace/perf (and seccomp) code, I don't see a clean way of integrating over the existing interfaces to the ftrace framework (e.g., the global perf event pump seems to be a mismatch), but I may be missing something obvious. In my view, implementing this nestled between the seccomp/ftrace world provides a stepping stone forward without being too restrictive. No matter how we change security events in the future, system calls will always be the first line of attack surface reduction. It could just mean that, in the long term, accessing the "security event filtering" framework is done through another new interface with seccomp providing only a targeted syscall filtering featureset that may one day be deprecated (if that day ever comes). If there's a clear way to cleanly expand this interface that I'm missing, I'd love to know - thanks! will > Also, to answer you, do you say that by my argument someone should have stood > up and said 'no' to the LSM mess that was introduced a couple of years ago and > which caused so many problems: > > - kernel inefficiencies and user-space overhead > > - stalled security efforts > > - infighting > > - friction, fragmentation, overmodularization > > - non-stackability > > - security annoyances on the Linux desktop > > - probably *less* Linux security > > and should have asked them to do something better designed instead? > > Thanks, > > Ingo >
On Fri, May 13, 2011 at 2:35 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Thursday 12 May 2011, Will Drewry wrote:
>> This change adds a new seccomp mode based on the work by
>> agl@chromium.org in [1]. This new mode, "filter mode", provides a hash
>> table of seccomp_filter objects. When in the new mode (2), all system
>> calls are checked against the filters - first by system call number,
>> then by a filter string. If an entry exists for a given system call and
>> all filter predicates evaluate to true, then the task may proceed.
>> Otherwise, the task is killed (as per seccomp_mode == 1).
>
> I've got a question about this: Do you expect the typical usage to disallow
> ioctl()? Given that ioctl alone is responsible for a huge number of exploits
> in various drivers, while certain ioctls are immensely useful (FIONREAD,
> FIOASYNC, ...), do you expect to extend the mechanism to filter specific
> ioctl commands in the future?

In many cases, I do expect ioctls to be dropped, but it's totally up to whoever is setting the filters. As is, it can already help out [even though an LSM, if available, would be appropriate for defining a fine-grained policy]. ioctl() is hooked by the ftrace syscalls infrastructure (via SYSCALL_DEFINE3):

  SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)

This means you can do:

  sprintf(filter, "cmd == %u || cmd == %u", FIOASYNC, FIONREAD);
  prctl(PR_SET_SECCOMP_FILTER, __NR_ioctl, filter);
  ...
  prctl(PR_SET_SECCOMP, 2, 0);

and then you'll be able to call ioctl on any fd with any argument, but limited to only the FIOASYNC and FIONREAD commands.

Depending on integration, it could even be limited to ioctl commands that are appropriate to a known fd if the fd is opened prior to entering seccomp mode 2. Alternatively, __NR_ioctl could be allowed with a filter of "1", then narrowed through a later addition of something like "(fd == %u && (cmd == %u || cmd == %u))" or something along those lines.

Does that make sense?
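For concreteness, the sequence above can be sketched as a small compilable C fragment. The PR_SET_SECCOMP_FILTER value below is a placeholder (the real number would come from the patched <linux/prctl.h>), and the prctl() calls themselves are shown only in a comment since they require a kernel with the patch applied - only the filter-string construction runs here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>	/* FIOASYNC, FIONREAD */

/* Hypothetical constant - the real value would come from the patched
 * <linux/prctl.h>, so do not rely on this number. */
#define PR_SET_SECCOMP_FILTER 36

/* Build the filter expression allowing only the FIOASYNC and FIONREAD
 * ioctl commands, exactly as in the sprintf() example above. */
static int build_ioctl_filter(char *buf, size_t len)
{
	return snprintf(buf, len, "cmd == %u || cmd == %u",
			(unsigned int) FIOASYNC, (unsigned int) FIONREAD);
}

/* On a kernel with the patch applied, installation would then look like:
 *
 *	char filter[128];
 *	build_ioctl_filter(filter, sizeof(filter));
 *	prctl(PR_SET_SECCOMP_FILTER, __NR_ioctl, filter);
 *	... filters for the other syscalls the process needs ...
 *	prctl(PR_SET_SECCOMP, 2, 0);	// enter filter mode; no way back
 */
```

Note that the expression is built while the process still has full privileges; once mode 2 is entered, only the allowed syscall/filter combinations remain available.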
In general, this interface won't need specific extensions for most system-call-oriented filtering. ftrace events may be expanded (to include more system calls), but that's behind the scenes. Only arguments subject to time-of-check-to-time-of-use attacks (data living in userspace, passed in by pointer) are not safe to use via this interface. In theory, that limitation could also be lifted in the implementation without changing the ABI.

Thanks!
will
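The time-of-check-to-time-of-use limitation mentioned above can be illustrated with a deterministic user-space sketch. The "racing thread" is simulated sequentially so the outcome is reproducible, and every name here is made up for illustration - nothing below is from the patch.

```c
#include <assert.h>
#include <string.h>

/* Sketch of the TOCTOU hazard: if a filter dereferenced a userspace
 * pointer argument at syscall entry, another thread could rewrite the
 * buffer between that check and the kernel's own copy_from_user().
 * The rewrite is simulated sequentially here for determinism. */

static char shared_path[64] = "/home/sandbox/data.txt";

/* Stand-in for an entry-time string check - precisely the kind of check
 * the proposed interface deliberately does not offer on pointer args. */
static int path_allowed(const char *p)
{
	return strncmp(p, "/home/sandbox/", 14) == 0;
}

/* Returns 1 when the check passed but the value actually used differs -
 * i.e. the "filter" was bypassed. */
int check_then_use(void)
{
	int ok = path_allowed(shared_path);	/* time of check */
	strcpy(shared_path, "/etc/shadow");	/* racing thread rewrites buffer */
	return ok && !path_allowed(shared_path);	/* time of use */
}
```

This is why the proposed filters only inspect register-passed scalar arguments: those are snapshotted at syscall entry and cannot be changed from userspace afterwards.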
On Saturday 14 May 2011, Will Drewry wrote: > Depending on integration, it could even be limited to ioctl commands > that are appropriate to a known fd if the fd is opened prior to > entering seccomp mode 2. Alternatively, __NR__ioctl could be allowed > with a filter of "1" then narrowed through a later addition of > something like "(fd == %u && (cmd == %u || cmd == %u))" or something > along those lines. > > Does that make sense? Thanks for the explanation. This sounds like it's already doing all we need. Arnd
On Fri, 13 May 2011, Ingo Molnar wrote: > Say i'm a user-space sandbox developer who wants to enforce that sandboxed code > should only be allowed to open files in /home/sandbox/, /lib/ and /usr/lib/. > > It is a simple and sensible security feature, agreed? It allows most code to > run well and link to countless libraries - but no access to other files is > allowed. Not really. Firstly, what is the security goal of these restrictions? Then, are the restrictions complete and unbypassable? How do you reason about the behavior of the system as a whole? > I argue that this is the LSM and audit subsystems designed right: in the long > run it could allow everything that LSM does at the moment - and so much more > ... Now you're proposing a redesign of the security subsystem. That's a significant undertaking. In the meantime, we have a simple, well-defined enhancement to seccomp which will be very useful to current users in reducing their kernel attack surface. We should merge that, and the security subsystem discussion can carry on separately. - James
* Arnd Bergmann <arnd@arndb.de> wrote: > On Saturday 14 May 2011, Will Drewry wrote: > > Depending on integration, it could even be limited to ioctl commands > > that are appropriate to a known fd if the fd is opened prior to > > entering seccomp mode 2. Alternatively, __NR__ioctl could be allowed > > with a filter of "1" then narrowed through a later addition of > > something like "(fd == %u && (cmd == %u || cmd == %u))" or something > > along those lines. > > > > Does that make sense? > > Thanks for the explanation. This sounds like it's already doing all > we need. One thing we could do more clearly here is to help keep the filter expressions symbolic - i.e. help resolve the various ioctl variants as well, not just the raw syscall parameter numbers. But yes, access to the raw syscall parameters and the ability to filter them already gives us the ability to exclude/include specific ioctls in a rather flexible way. Thanks, Ingo
* David Laight <David.Laight@ACULAB.COM> wrote:

> [...] unfortunately it worked by looking at the user-space buffers on system
> call entry - and a multithreaded program can easily arrange to update them
> after the initial check! [...]

Such problems with the reliability/persistency of security checks are exactly one of my arguments for why this should not be limited to the syscall boundary - see the example i have provided in this discussion.

Thanks,

	Ingo
* Will Drewry <wad@chromium.org> wrote:

> > Note, i'm not actually asking for the moon, a pony and more.
> >
> > I fully submit that we are yet far away from being able to do a full LSM
> > via this mechanism.
> >
> > What i'm asking for is that because the syscall point steps taken by Will
> > look very promising so it would be nice to do *that* in a slightly more
> > flexible scheme that does not condemn it to be limited to the syscall
> > boundary and such ...
>
> What do you suggest here?
>
> From my brief exploration of the ftrace/perf (and seccomp) code, I don't see
> a clean way of integrating over the existing interfaces to the ftrace
> framework (e.g., the global perf event pump seems to be a mismatch), but I
> may be missing something obvious. [...]

Well, there's no global perf event pump. Here we'd obviously want to use buffer-less events that do no output (pumping) whatsoever - i.e. just counting events with filters attached.

What i suggest is to:

 - share syscall events	# you are fine with that as your patch makes use of them

 - share the scripting engine	# you are fine with that as your patch makes use of it

 - share *any* other event to do_exit() a process at syscall exit time

 - share any other active event that kernel developers specifically enable for
   active use to impact security-relevant execution even sooner than syscall
   exit time - not just system calls

 - share the actual facility that manages (sets/gets) filters

So right now you have this structure for your new feature:

 Documentation/trace/seccomp_filter.txt |   75 +++++
 arch/x86/kernel/ldt.c                  |    5
 arch/x86/kernel/tls.c                  |    7
 fs/proc/array.c                        |   21 +
 fs/proc/base.c                         |   25 +
 include/linux/ftrace_event.h           |    9
 include/linux/seccomp.h                |   98 +++++++
 include/linux/syscalls.h               |   54 ++--
 include/trace/syscall.h                |    6
 kernel/Makefile                        |    1
 kernel/fork.c                          |    8
 kernel/perf_event.c                    |    7
 kernel/seccomp.c                       |  144 ++++++++++-
 kernel/seccomp_filter.c                |  428 +++++++++++++++++++++++++++++++++
 kernel/sys.c                           |   11
 kernel/trace/Kconfig                   |   10
 kernel/trace/trace_events_filter.c     |   60 ++--
 kernel/trace/trace_syscalls.c          |   96 ++++++-
 18 files changed, 986 insertions(+), 79 deletions(-)

Firstly, one specific problem i can see is that kernel/seccomp_filter.c hardcodes to the system call boundary. Which is fine for a prototype implementation (and obviously fine for something that builds upon seccomp) but not fine in terms of a flexible Linux security feature :-)

You have hardcoded these syscall assumptions via:

  struct seccomp_filter {
	struct list_head list;
	struct rcu_head rcu;
	int syscall_nr;
	struct syscall_metadata *data;
	struct event_filter *event_filter;
  };

Which comes just because you chose to enumerate only syscall events - instead of enumerating all events. Please bind to all events instead: syscalls are just one of the many events we have. Type 'perf list' and see how many event types we have - quite a few could be enabled for 'active feedback' sandboxing as well.

Secondly, and this is really a variant of the first problem, the way you process event filter 'failures' is pretty limited. You utilize the regular seccomp method, which works by calling into __secure_computing() and silently accepting syscalls or generating a hard do_exit() on even the slightest of filter failures.

Instead of that, what we'd want is to have regular syscall events registered *and* enabled for such active filtering purposes. When the filter of such an 'active' event hits, it would set the TIF_NOTIFY_RESUME flag and some other attribute in the task, and the kernel would do a do_exit() at the earliest of opportunities - before calling the syscall, or at the return-from-syscall point at the latest.

Note that no seccomp-specific code would have to execute here: we can already generate events both at syscall entry and at syscall exit points; the only new bit we'd need is the 'kill the task' [or abort the syscall] policy action.

This is *far* more generic and still yields the same short-term end result as far as your sandboxing is concerned.

What we'd need for this is a way to mark existing TRACE_EVENT()s as 'active'. We'd mark all syscall events as 'active' straight away.

[ Detail: these trace events return a return code, which the calling code can use - that way event return values could be used sooner than syscall exit points. IRQ code could make use of it as well, so for example this way we could filter based on early packet inspection, still in the IRQ code. ]

Then what your feature would do is to simply open up the events you are interested in (just as counting events, without any buffers), and set 'active' filters in them to 'deny/allow' specific combinations of parameters only.

Thirdly, you have added a separate facility to set/query filters via /proc; this lives mostly in /proc/<pid>/seccomp_filter. This seccomp specificity would go away altogether: it should be possible to query existing events and filters even outside of the seccomp facility. So if you want something explicit here beyond the current event ioctl() to set filters, please improve the generic facility - which right now is poorer than what you have implemented, so it's in good need of improvement! :-)

All in all, such a solution would enable us to reuse code in a much deeper fashion while still providing all the functionality you are interested in. These are all short-term concerns, not pie-in-the-sky future ideal LSM concerns.

Thanks,

	Ingo
* Will Drewry <wad@chromium.org> wrote:

> I agree with you on many of these points! However, I don't think that the
> views around LSMs, perf/ftrace infrastructure, or the current seccomp
> filtering implementation are necessarily in conflict. Here is my
> understanding of how the different worlds fit together and where I see this
> patchset living, along with where I could see future work going. Perhaps I'm
> being a trifle naive, but here goes anyway:
>
> 1. LSMs provide a global mechanism for hooking "security relevant"
>    events at a point where all the incoming user-sourced data has been
>    preprocessed and moved into userspace. The hooks are called every
>    time one of those boundaries are crossed.
>
> 2. Perf and the ftrace infrastructure provide global function tracing
>    and system call hooks with direct access to the caller's registers
>    (and memory).

No, perf events are not just global but per task as well. Nor are events limited to 'tracing' (generating a flow of events into a trace buffer) - they can just be themselves as well, and count and generate callbacks.

The generic NMI watchdog uses that kind of event model for example: see kernel/watchdog.c and how it makes use of struct perf_event abstractions to do per-CPU events (with no buffers), or how kernel/hw_breakpoint.c uses it for per-task events and integrates it with the ptrace hw-breakpoints code.

Ideally, Peter's one particular suggestion is right IMO, and we'd want a perf_event to be able to just be a list of callbacks, attached to a task and barely more than a discoverable, named notifier chain in its slimmest form.

In practice it's fatter than that right now, but we should definitely factor out that aspect of it more clearly, both code-wise and API-wise. kernel/watchdog.c and kernel/hw_breakpoint.c show that such factoring out is possible and desirable.

> 3. seccomp (as it exists today) provides a global system call entry
>    hook point with a binary per-process decision about whether to provide
>    "secure computing" behavior.
>
> When I boil that down to abstractions, I see:
>    A. Globally scoped: LSMs, ftrace/perf
>    B. Locally/process scoped: seccomp

Ok, i see where you got the idea that you needed to cut your surface of abstraction at the filter engine / syscall enumeration level - i think you were thinking of it in the ftrace model of tracepoints, not in the perf model of events.

No, events are generic and as such per task as well, not just global.

I've replied to your other mail with more specific suggestions of how we could provide your feature using abstractions that share code more widely. Talking specifics will i hope help move the discussion forward! :-)

Thanks,

	Ingo
On Mon, May 16, 2011 at 7:55 AM, Ingo Molnar <mingo@elte.hu> wrote: > > * Will Drewry <wad@chromium.org> wrote: > >> I agree with you on many of these points! However, I don't think that the >> views around LSMs, perf/ftrace infrastructure, or the current seccomp >> filtering implementation are necessarily in conflict. Here is my >> understanding of how the different worlds fit together and where I see this >> patchset living, along with where I could see future work going. Perhaps I'm >> being a trifle naive, but here goes anyway: >> >> 1. LSMs provide a global mechanism for hooking "security relevant" >> events at a point where all the incoming user-sourced data has been >> preprocessed and moved into userspace. The hooks are called every >> time one of those boundaries are crossed. > >> 2. Perf and the ftrace infrastructure provide global function tracing >> and system call hooks with direct access to the caller's registers >> (and memory). > > No, perf events are not just global but per task as well. Nor are events > limited to 'tracing' (generating a flow of events into a trace buffer) - they > can just be themselves as well and count and generate callbacks. I was looking at the perf_sysenter_enable() call, but clearly there is more going on :) > The generic NMI watchdog uses that kind of event model for example, see > kernel/watchdog.c and how it makes use of struct perf_event abstractions to do > per CPU events (with no buffrs), or how kernel/hw_breakpoint.c uses it for per > task events and integrates it with the ptrace hw-breakpoints code. > > Ideally Peter's one particular suggestion is right IMO and we'd want to be able > for a perf_event to just be a list of callbacks, attached to a task and barely > more than a discoverable, named notifier chain in its slimmest form. > > In practice it's fatter than that right now, but we should definitely factor > out that aspect of it more clearly, both code-wise and API-wise. 
> kernel/watchdog.c and kernel/hw_breakpoint.c shows that such factoring out is > possible and desirable. > >> 3. seccomp (as it exists today) provides a global system call entry >> hook point with a binary per-process decision about whether to provide >> "secure computing" behavior. >> >> When I boil that down to abstractions, I see: >> A. Globally scoped: LSMs, ftrace/perf >> B. Locally/process scoped: seccomp > > Ok, i see where you got the idea that you needed to cut your surface of > abstraction at the filter engine / syscall enumeration level - i think you were > thinking of it in the ftrace model of tracepoints, not in the perf model of > events. > > No, events are generic and as such per task as well, not just global. > > I've replied to your other mail with more specific suggestions of how we could > provide your feature using abstractions that share code more widely. Talking > specifics will i hope help move the discussion forward! :-) Agreed. I'll digest both the watchdog code as well as your other comments and follow up when I have a fuller picture in my head. (I have a few initial comments I'll post in response to your other mail.) Thanks! will
* James Morris <jmorris@namei.org> wrote:

> On Fri, 13 May 2011, Ingo Molnar wrote:
>
> > Say i'm a user-space sandbox developer who wants to enforce that sandboxed
> > code should only be allowed to open files in /home/sandbox/, /lib/ and
> > /usr/lib/.
> >
> > It is a simple and sensible security feature, agreed? It allows most code
> > to run well and link to countless libraries - but no access to other files
> > is allowed.
>
> Not really.
>
> Firstly, what is the security goal of these restrictions? [...]

To do what i described above? Namely:

  " Sandboxed code should only be allowed to open files in /home/sandbox/,
    /lib/ and /usr/lib/ "

> [...] Then, are the restrictions complete and unbypassable?

If only the system calls i mentioned are allowed, and if the sandboxed VFS namespace itself is isolated from the rest of the system (no bind mounts, no hard links outside the sandbox, etc.), then its goal is to not be bypassable - what use is a sandbox if the sandbox can be bypassed by the sandboxed code?

There are a few ways to alter (and thus bypass) VFS namespace lookups - symlinks, chdir, chroot, rename, etc. - which (as i mentioned) have to be excluded by default or filtered as well.

> How do you reason about the behavior of the system as a whole?

For some usecases i mainly want to reason about what the sandboxed code can and cannot do, within a fairly static and limited VFS namespace environment. I might not want to have a full-blown 'physical barrier' for all objects labeled as inaccessible to sandboxed code (or labeled as accessible to sandboxed code). Especially as manipulating file labels is not only slow (it affects all files) but is also often an exclusively privileged operation, even for owned files, for no good reason. For things like /lib/ and /usr/lib/ it also *has* to be a privileged operation.
> > I argue that this is the LSM and audit subsystems designed right: in the > > long run it could allow everything that LSM does at the moment - and so > > much more ... > > Now you're proposing a redesign of the security subsystem. That's a > significant undertaking. It certainly is. > In the meantime, we have a simple, well-defined enhancement to seccomp which > will be very useful to current users in reducing their kernel attack surface. > > We should merge that, and the security subsystem discussion can carry on > separately. Is that the development and merge process along which the LSM subsystem got into its current state? Thanks, Ingo
Sorry to be absent from this thread so far, I just got back from my travels and I'm now catching up on email. On Wed, 2011-05-11 at 22:02 -0500, Will Drewry wrote: > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig > index 377a7a5..22e1668 100644 > --- a/arch/arm/Kconfig > +++ b/arch/arm/Kconfig > @@ -1664,6 +1664,16 @@ config SECCOMP > and the task is only allowed to execute a few safe syscalls > defined by each seccomp mode. > > +config SECCOMP_FILTER > + bool "Enable seccomp-based system call filtering" > + depends on SECCOMP && EXPERIMENTAL > + help > + Per-process, inherited system call filtering using shared code > + across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS > + is not available, enhanced filters will not be available. > + > + See Documentation/prctl/seccomp_filter.txt for more detail. > + > config CC_STACKPROTECTOR > bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)" > depends on EXPERIMENTAL > diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig > index eccdefe..7641ee9 100644 > --- a/arch/microblaze/Kconfig > +++ b/arch/microblaze/Kconfig > @@ -129,6 +129,16 @@ config SECCOMP > > If unsure, say Y. Only embedded should say N here. > > +config SECCOMP_FILTER > + bool "Enable seccomp-based system call filtering" > + depends on SECCOMP && EXPERIMENTAL > + help > + Per-process, inherited system call filtering using shared code > + across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS > + is not available, enhanced filters will not be available. > + > + See Documentation/prctl/seccomp_filter.txt for more detail. > + > endmenu > > menu "Advanced setup" > diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig > index 8e256cc..fe4cbda 100644 > --- a/arch/mips/Kconfig > +++ b/arch/mips/Kconfig > @@ -2245,6 +2245,16 @@ config SECCOMP > > If unsure, say Y. Only embedded should say N here. 
> > +config SECCOMP_FILTER > + bool "Enable seccomp-based system call filtering" > + depends on SECCOMP && EXPERIMENTAL > + help > + Per-process, inherited system call filtering using shared code > + across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS > + is not available, enhanced filters will not be available. > + > + See Documentation/prctl/seccomp_filter.txt for more detail. > + > config USE_OF > bool "Flattened Device Tree support" > select OF > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig > index 8f4d50b..83499e4 100644 > --- a/arch/powerpc/Kconfig > +++ b/arch/powerpc/Kconfig > @@ -605,6 +605,16 @@ config SECCOMP > > If unsure, say Y. Only embedded should say N here. > > +config SECCOMP_FILTER > + bool "Enable seccomp-based system call filtering" > + depends on SECCOMP && EXPERIMENTAL > + help > + Per-process, inherited system call filtering using shared code > + across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS > + is not available, enhanced filters will not be available. > + > + See Documentation/prctl/seccomp_filter.txt for more detail. > + > endmenu > > config ISA_DMA_API > diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig > index 2508a6f..2777515 100644 > --- a/arch/s390/Kconfig > +++ b/arch/s390/Kconfig > @@ -614,6 +614,16 @@ config SECCOMP > > If unsure, say Y. > > +config SECCOMP_FILTER > + bool "Enable seccomp-based system call filtering" > + depends on SECCOMP && EXPERIMENTAL > + help > + Per-process, inherited system call filtering using shared code > + across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS > + is not available, enhanced filters will not be available. > + > + See Documentation/prctl/seccomp_filter.txt for more detail. > + > endmenu > > menu "Power Management" > diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig > index 4b89da2..00c1521 100644 > --- a/arch/sh/Kconfig > +++ b/arch/sh/Kconfig > @@ -676,6 +676,16 @@ config SECCOMP > > If unsure, say N. 
> > +config SECCOMP_FILTER > + bool "Enable seccomp-based system call filtering" > + depends on SECCOMP && EXPERIMENTAL > + help > + Per-process, inherited system call filtering using shared code > + across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS > + is not available, enhanced filters will not be available. > + > + See Documentation/prctl/seccomp_filter.txt for more detail. > + > config SMP > bool "Symmetric multi-processing support" > depends on SYS_SUPPORTS_SMP > diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig > index e560d10..5b42255 100644 > --- a/arch/sparc/Kconfig > +++ b/arch/sparc/Kconfig > @@ -270,6 +270,16 @@ config SECCOMP > > If unsure, say Y. Only embedded should say N here. > > +config SECCOMP_FILTER > + bool "Enable seccomp-based system call filtering" > + depends on SECCOMP && EXPERIMENTAL > + help > + Per-process, inherited system call filtering using shared code > + across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS > + is not available, enhanced filters will not be available. > + > + See Documentation/prctl/seccomp_filter.txt for more detail. > + > config HOTPLUG_CPU > bool "Support for hot-pluggable CPUs" > depends on SPARC64 && SMP > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index cc6c53a..d6d44d9 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -1485,6 +1485,16 @@ config SECCOMP > > If unsure, say Y. Only embedded should say N here. > > +config SECCOMP_FILTER > + bool "Enable seccomp-based system call filtering" > + depends on SECCOMP && EXPERIMENTAL > + help > + Per-process, inherited system call filtering using shared code > + across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS > + is not available, enhanced filters will not be available. > + > + See Documentation/prctl/seccomp_filter.txt for more detail. > + > config CC_STACKPROTECTOR > bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)" > ---help--- You just cut-and-pasted 8 copies of a config selection. 
The proper way to do that is to add the Kconfig selection in a core kernel Kconfig, have it depend on "HAVE_SECCOMP_FILTER", and then in each arch Kconfig simply add:

  config <ARCH>
	[...]
	select HAVE_SECCOMP_FILTER

That way you don't need to duplicate the config option all over the place.

-- Steve
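Concretely, the refactoring Steve suggests might look like the following sketch. HAVE_SECCOMP_FILTER is the symbol he names; placing the central definition in arch/Kconfig (and x86 as the example arch) is an assumption for illustration.

```kconfig
# In a core Kconfig (e.g. arch/Kconfig) - defined once:

config HAVE_SECCOMP_FILTER
	bool

config SECCOMP_FILTER
	bool "Enable seccomp-based system call filtering"
	depends on SECCOMP && HAVE_SECCOMP_FILTER && EXPERIMENTAL
	help
	  Per-process, inherited system call filtering using shared code
	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
	  is not available, enhanced filters will not be available.

	  See Documentation/prctl/seccomp_filter.txt for more detail.

# In each supported arch Kconfig (x86 shown) - one line per architecture:

config X86
	...
	select HAVE_SECCOMP_FILTER
```

With this layout, adding filter support to a new architecture is a single `select` line rather than a copy of the whole option.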
On Mon, May 16, 2011 at 10:26 AM, Steven Rostedt <rostedt@goodmis.org> wrote: > Sorry to be absent from this thread so far, I just got back from my > travels and I'm now catching up on email. > > > On Wed, 2011-05-11 at 22:02 -0500, Will Drewry wrote: > >> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig >> index 377a7a5..22e1668 100644 >> --- a/arch/arm/Kconfig >> +++ b/arch/arm/Kconfig >> @@ -1664,6 +1664,16 @@ config SECCOMP >> and the task is only allowed to execute a few safe syscalls >> defined by each seccomp mode. >> >> +config SECCOMP_FILTER >> + bool "Enable seccomp-based system call filtering" >> + depends on SECCOMP && EXPERIMENTAL >> + help >> + Per-process, inherited system call filtering using shared code >> + across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS >> + is not available, enhanced filters will not be available. >> + >> + See Documentation/prctl/seccomp_filter.txt for more detail. >> + >> config CC_STACKPROTECTOR >> bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)" >> depends on EXPERIMENTAL >> diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig >> index eccdefe..7641ee9 100644 >> --- a/arch/microblaze/Kconfig >> +++ b/arch/microblaze/Kconfig >> @@ -129,6 +129,16 @@ config SECCOMP >> >> If unsure, say Y. Only embedded should say N here. >> >> +config SECCOMP_FILTER >> + bool "Enable seccomp-based system call filtering" >> + depends on SECCOMP && EXPERIMENTAL >> + help >> + Per-process, inherited system call filtering using shared code >> + across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS >> + is not available, enhanced filters will not be available. >> + >> + See Documentation/prctl/seccomp_filter.txt for more detail. >> + >> endmenu >> >> menu "Advanced setup" >> diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig >> index 8e256cc..fe4cbda 100644 >> --- a/arch/mips/Kconfig >> +++ b/arch/mips/Kconfig >> @@ -2245,6 +2245,16 @@ config SECCOMP >> >> If unsure, say Y. 
Only embedded should say N here. >> >> +config SECCOMP_FILTER >> + bool "Enable seccomp-based system call filtering" >> + depends on SECCOMP && EXPERIMENTAL >> + help >> + Per-process, inherited system call filtering using shared code >> + across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS >> + is not available, enhanced filters will not be available. >> + >> + See Documentation/prctl/seccomp_filter.txt for more detail. >> + >> config USE_OF >> bool "Flattened Device Tree support" >> select OF >> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig >> index 8f4d50b..83499e4 100644 >> --- a/arch/powerpc/Kconfig >> +++ b/arch/powerpc/Kconfig >> @@ -605,6 +605,16 @@ config SECCOMP >> >> If unsure, say Y. Only embedded should say N here. >> >> +config SECCOMP_FILTER >> + bool "Enable seccomp-based system call filtering" >> + depends on SECCOMP && EXPERIMENTAL >> + help >> + Per-process, inherited system call filtering using shared code >> + across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS >> + is not available, enhanced filters will not be available. >> + >> + See Documentation/prctl/seccomp_filter.txt for more detail. >> + >> endmenu >> >> config ISA_DMA_API >> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig >> index 2508a6f..2777515 100644 >> --- a/arch/s390/Kconfig >> +++ b/arch/s390/Kconfig >> @@ -614,6 +614,16 @@ config SECCOMP >> >> If unsure, say Y. >> >> +config SECCOMP_FILTER >> + bool "Enable seccomp-based system call filtering" >> + depends on SECCOMP && EXPERIMENTAL >> + help >> + Per-process, inherited system call filtering using shared code >> + across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS >> + is not available, enhanced filters will not be available. >> + >> + See Documentation/prctl/seccomp_filter.txt for more detail. 
>> + >> endmenu >> >> menu "Power Management" >> diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig >> index 4b89da2..00c1521 100644 >> --- a/arch/sh/Kconfig >> +++ b/arch/sh/Kconfig >> @@ -676,6 +676,16 @@ config SECCOMP >> >> If unsure, say N. >> >> +config SECCOMP_FILTER >> + bool "Enable seccomp-based system call filtering" >> + depends on SECCOMP && EXPERIMENTAL >> + help >> + Per-process, inherited system call filtering using shared code >> + across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS >> + is not available, enhanced filters will not be available. >> + >> + See Documentation/prctl/seccomp_filter.txt for more detail. >> + >> config SMP >> bool "Symmetric multi-processing support" >> depends on SYS_SUPPORTS_SMP >> diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig >> index e560d10..5b42255 100644 >> --- a/arch/sparc/Kconfig >> +++ b/arch/sparc/Kconfig >> @@ -270,6 +270,16 @@ config SECCOMP >> >> If unsure, say Y. Only embedded should say N here. >> >> +config SECCOMP_FILTER >> + bool "Enable seccomp-based system call filtering" >> + depends on SECCOMP && EXPERIMENTAL >> + help >> + Per-process, inherited system call filtering using shared code >> + across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS >> + is not available, enhanced filters will not be available. >> + >> + See Documentation/prctl/seccomp_filter.txt for more detail. >> + >> config HOTPLUG_CPU >> bool "Support for hot-pluggable CPUs" >> depends on SPARC64 && SMP >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >> index cc6c53a..d6d44d9 100644 >> --- a/arch/x86/Kconfig >> +++ b/arch/x86/Kconfig >> @@ -1485,6 +1485,16 @@ config SECCOMP >> >> If unsure, say Y. Only embedded should say N here. >> >> +config SECCOMP_FILTER >> + bool "Enable seccomp-based system call filtering" >> + depends on SECCOMP && EXPERIMENTAL >> + help >> + Per-process, inherited system call filtering using shared code >> + across seccomp and ftrace_syscalls. 
If CONFIG_FTRACE_SYSCALLS
>> +	  is not available, enhanced filters will not be available.
>> +
>> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
>> +
>>  config CC_STACKPROTECTOR
>>  	bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
>>  	---help---
>
> You just cut-and-pasted 8 copies of a config selection. The proper way
> to do that is to add the Kconfig selection in a core kernel Kconfig,
> have it depend on "HAVE_SECCOMP_FILTER" and then in each of these
> configs, simply add in the arch Kconfig:
>
> config <ARCH>
> 	[...]
> 	select HAVE_SECCOMP_FILTER
>
> That way you don't need to duplicate the config option all over the
> place.

Thanks! I had seen the HAVE_* format but failed to acknowledge why! I'll
fix it in the next rev (assuming it still fits there)!

will
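[Editorial note: the suggested layout amounts to one shared definition plus a per-arch select. A sketch follows; the exact file placement (arch/Kconfig vs. a core kernel Kconfig) and the abbreviated help text are assumptions, not part of the posted patch.]

```kconfig
# Written once, e.g. in arch/Kconfig (placement is a guess):
config HAVE_SECCOMP_FILTER
	bool

config SECCOMP_FILTER
	bool "Enable seccomp-based system call filtering"
	depends on SECCOMP && HAVE_SECCOMP_FILTER && EXPERIMENTAL
	help
	  Per-process, inherited system call filtering using shared code
	  across seccomp and ftrace_syscalls.

# Then each supporting architecture only opts in:
config X86
	def_bool y
	select HAVE_SECCOMP_FILTER
```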
On Mon, May 16, 2011 at 7:43 AM, Ingo Molnar <mingo@elte.hu> wrote: > > * Will Drewry <wad@chromium.org> wrote: > >> > Note, i'm not actually asking for the moon, a pony and more. >> > >> > I fully submit that we are yet far away from being able to do a full LSM >> > via this mechanism. >> > >> > What i'm asking for is that because the syscall point steps taken by Will >> > look very promising so it would be nice to do *that* in a slightly more >> > flexible scheme that does not condemn it to be limited to the syscall >> > boundary and such ... >> >> What do you suggest here? >> >> From my brief exploration of the ftrace/perf (and seccomp) code, I don't see >> a clean way of integrating over the existing interfaces to the ftrace >> framework (e.g., the global perf event pump seems to be a mismatch), but I >> may be missing something obvious. [...] > > Well, there's no global perf event pump. Here we'd obviously want to use > buffer-less events that do no output (pumping) whatsoever - i.e. just counting > events with filters attached. Cool - I missed these entirely. I'll get reading :) > What i suggest is to: > > - share syscall events # you are fine with that as your patch makes use of them > - share the scripting engine # you are fine with that as your patch makes use of it > > - share *any* other event to do_exit() a process at syscall exit time > > - share any other active event that kernel developers specifically enable > for active use to impact security-relevant execution even sooner than > syscall exit time - not just system calls > > - share the actual facility that manages (sets/gets) filters These make sense to me at a high level. I'll throw in a few initial comments, but I'll be back for a round-two once I catch up on the rest of the perf code. 
> So right now you have this structure for your new feature:
>
>  Documentation/trace/seccomp_filter.txt |   75 +++++
>  arch/x86/kernel/ldt.c                  |    5
>  arch/x86/kernel/tls.c                  |    7
>  fs/proc/array.c                        |   21 +
>  fs/proc/base.c                         |   25 +
>  include/linux/ftrace_event.h           |    9
>  include/linux/seccomp.h                |   98 +++++++
>  include/linux/syscalls.h               |   54 ++--
>  include/trace/syscall.h                |    6
>  kernel/Makefile                        |    1
>  kernel/fork.c                          |    8
>  kernel/perf_event.c                    |    7
>  kernel/seccomp.c                       |  144 ++++++++++-
>  kernel/seccomp_filter.c                |  428 +++++++++++++++++++++++++++++++++
>  kernel/sys.c                           |   11
>  kernel/trace/Kconfig                   |   10
>  kernel/trace/trace_events_filter.c     |   60 ++--
>  kernel/trace/trace_syscalls.c          |   96 ++++++-
>  18 files changed, 986 insertions(+), 79 deletions(-)
>
> Firstly, one specific problem i can see is that kernel/seccomp_filter.c
> hardcodes to the system call boundary. Which is fine to a prototype
> implementation (and obviously fine for something that builds upon seccomp)
> but not fine in terms of a flexible Linux security feature :-)
>
> You have hardcoded these syscall assumptions via:
>
>  struct seccomp_filter {
>  	struct list_head list;
>  	struct rcu_head rcu;
>  	int syscall_nr;
>  	struct syscall_metadata *data;
>  	struct event_filter *event_filter;
>  };

(This structure is a bit different in the new rev of the patch, but nothing
relevant to this specific part of the discussion :)

> Which comes just because you chose to enumerate only syscall events -
> instead of enumerating all events.

While I'm willing to live with the tradeoff, using ftrace event numbers
from FTRACE_SYSCALLS means that the filter will be unable to hook a fair
number of syscalls: execve, clone, etc (all the ptregs fixup syscalls on
x86) and anything that returns int instead of long (on x86).
Though the last two patches in the initial patch series provided a proposed clean up for the latter :/ The current revision of the seccomp filter patch can function in a bitmask-like state when CONFIG_FTRACE is unset or CONFIG_FTRACE_SYSCALLS is unset. This also means any platform with CONFIG_SECCOMP support can start using this right away, but I realize that is more of a short-term gain rather than a long-term one. > Instead of that please bind to all events instead - syscalls are just one of > the many events we have. Type 'perf list' and see how many event types we have, > and quite a few could be enabled for 'active feedback' sandboxing as well. > > Secondly, and this is really a variant of the first problem you have, the way > you process event filter 'failures' is pretty limited. You utilize the regular > seccomp method which works by calling into __secure_computing() and silently > accepting syscalls or generating a hard do_exit() on even the slightest of > filter failures. > > Instead of that what we'd want to have is to have regular syscall events > registered, *and* enabled for such active filtering purposes. The moment the > filter hits such an 'active' event would set the TIF_NOTIFY_RESUME flag and > some other attribute in the task and the kernel would do a do_exit() at the > earliest of opportunities before calling the syscall or at the > return-from-syscall point latest. > > Note that no seccomp specific code would have to execute here, we can already > generate events both at syscall entry and at syscall exit points, the only new > bit we'd need is for the 'kill the task' [or abort the syscall] policy action. > > This is *far* more generic still yields the same short-term end result as far > as your sandboxing is concerned. Almost :/ I still need to review the code you've pointed out, but, at present, the ftrace hooks occur after the seccomp and syscall auditing hooks. This means that that code is exposed no matter what in this model. 
To trim the exposed surface to userspace, we really need those early hooks.
While I can see both hacky and less hacky approaches around this, it still
strikes me that the seccomp thread flag and early interception are good to
reuse. One option might be to allow seccomp to be a secure-syscall event
source, but I suspect that lands more on the hack-y side of the fence :)

Anyway, I'll read up on the other filters and try to understand exactly how
per-task, counting perf events are being handled.

> What we'd need for this is a way to mark existing TRACE_EVENT()s as
> 'active'. We'd mark all syscall events as 'active' straight away.

It seems like we'll still end up special-casing system calls no matter what
as they are the userland vector into the kernel. Marking syscalls active
right away sounds roughly equivalent to calling prctl(PR_SET_SECCOMP, 2).
This could easily be done following the model of syscall_nr_to_meta() since
that enumerates every system call metadata entry, which yields both entry
and exit event numbers. Of course, I'd need to understand how that ties in
with the per-task, counting code.

> [ Detail: these trace events return a return code, which the calling code
>   can use, that way event return values could be used sooner than syscall
>   exit points. IRQ code could make use of it as well, so for example this
>   way we could filter based on early packet inspection, still in the IRQ
>   code. ]
>
> Then what your feature would do is to simply open up the events you are
> interested in (just as counting events, without any buffers), and set
> 'active' filters in them to 'deny/allow' specific combinations of
> parameters only.
>
> Thirdly, you have added a separate facility to set/query filters via
> /proc, this lives mostly in /proc/<pid>/seccomp_filter.
> This seccomp specificity would go away altogether: it should be possible
> to query existing events and filters even outside of the seccomp facility,
> so if you want something explicit here beyond the current event ioctl() to
> set filters please improve the generic facility - which right now is
> poorer than what you have implemented so it's in good need of improvement!
> :-)

I haven't looked at the other interface, so I will. I know about the
existence of the perf_event_open syscall, but that's about it. Would it
make sense to use the same syscall in a unified-interface world? Even if
it's the right way, semantically it seems awkward.

> All in all, such a solution would enable us to reuse code in a much
> deeper fashion while still providing all the functionality you are
> interested in. These are all short-term concerns, not pie-in-the-sky
> future ideal LSM concerns.

Thanks for the concrete feedback! I'll follow up again once I better
understand the code you're citing as well as the existing userland
interface to it.

cheers!
will
On Fri, 2011-05-13 at 16:57 +0200, Ingo Molnar wrote: > > > > Then there's the whole indirection argument, if you don't need > > > > indirection, its often better to not use it, I myself much prefer code > > > > to look like: > > > > > > > > foo1(bar); > > > > foo2(bar); > > > > foo3(bar); > > > > > > > > Than: > > > > > > > > foo_notifier(bar); > > > > > > > > Simply because its much clearer who all are involved without me having > > > > to grep around to see who registers for foo_notifier and wth they do > > > > with it. It also makes it much harder to sneak in another user, whereas > > > > its nearly impossible to find new notifier users. > > > > > > > > Its also much faster, no extra memory accesses, no indirect function > > > > calls, no other muck. > > > > > > But i suspect this question has been settled, given the fact that even pure > > > observer events need and already process a chain of events? Am i missing > > > something about your argument? > > > > I'm saying that there's reasons to not use notifiers passive or active. > > > > Mostly the whole notifier/indirection muck comes up once you want > > modules to make use of the thing, because then you need dynamic > > management of the callback list. > > But your argument assumes that we'd have a chain of functions to call, like > regular notifiers. > > While the natural model here would be to have a list of registered event > structs for that point, with different filters but basically the same callback > mechanism (a call into the filter engine in essence). > > Also note that the common case would be no event registered - and we'd > automatically optimize that case via the existing jump labels optimization. I agree that I prefer the "notifier" type over having direct function calls. Yes, it's easier to read and figure out what functions are called, but notifiers can be optimized for the default case where nothing is called (jump-label nop). 
> > (Then again, I'm fairly glad we don't have explicit callbacks in
> > kernel/cpu.c for all the cpu-hotplug callbacks :-)
> >
> > Anyway, I oppose for the existing events to gain an active role.
>
> Why if 'being active' is optional and useful?

I'm a bit nervous about the 'active' role of (trace_)events, because of
the way multiple callbacks can be registered. How would:

	err = event_x();
	if (err == -EACCESS) {

be handled? Would we need a way to prioritize which callback gets the
return value? One way I guess would be to add a check_event option, where
you pass in an ENUM of the event you want:

	event_x();
	err = check_event_x(MYEVENT);

If something registered itself as "MYEVENT" to event_x, then you get the
return code of MYEVENT. If MYEVENT was not registered, a -ENODEV or
something could be returned. I'm sure we could even optimize it in such a
way that if no active events have been registered to event_x,
check_event_x() will return -ENODEV without any branches.

-- Steve
* Steven Rostedt <rostedt@goodmis.org> wrote: > I'm a bit nervous about the 'active' role of (trace_)events, because of the > way multiple callbacks can be registered. How would: > > err = event_x(); > if (err == -EACCESS) { > > be handled? [...] The default behavior would be something obvious: to trigger all callbacks and use the first non-zero return value. > [...] Would we need a way to prioritize which call back gets the return > value? One way I guess would be to add a check_event option, where you pass > in an ENUM of the event you want: > > event_x(); > err = check_event_x(MYEVENT); > > If something registered itself as "MYEVENT" to event_x, then you get the > return code of MYEVENT. If the MYEVENT was not registered, a -ENODEV or > something could be returned. I'm sure we could even optimize it such a way if > no active events have been registered to event_x, that check_event_x() will > return -ENODEV without any branches. I would keep it simple and extensible - that way we can complicate it when the need arises! :) Thanks, Ingo
On Mon, 2011-05-16 at 18:52 +0200, Ingo Molnar wrote: > * Steven Rostedt <rostedt@goodmis.org> wrote: > > > I'm a bit nervous about the 'active' role of (trace_)events, because of the > > way multiple callbacks can be registered. How would: > > > > err = event_x(); > > if (err == -EACCESS) { > > > > be handled? [...] > > The default behavior would be something obvious: to trigger all callbacks and > use the first non-zero return value. But how do we know which callback that was from? There's no ordering of what callbacks are called first. > > > [...] Would we need a way to prioritize which call back gets the return > > value? One way I guess would be to add a check_event option, where you pass > > in an ENUM of the event you want: > > > > event_x(); > > err = check_event_x(MYEVENT); > > > > If something registered itself as "MYEVENT" to event_x, then you get the > > return code of MYEVENT. If the MYEVENT was not registered, a -ENODEV or > > something could be returned. I'm sure we could even optimize it such a way if > > no active events have been registered to event_x, that check_event_x() will > > return -ENODEV without any branches. > > I would keep it simple and extensible - that way we can complicate it when the > need arises! :) The above is rather trivial to implement. I don't think it complicates anything. -- Steve
On Mon, 16 May 2011, Ingo Molnar wrote: > > Not really. > > > > Firstly, what is the security goal of these restrictions? [...] > > To do what i described above? Namely: > > " Sandboxed code should only be allowed to open files in /home/sandbox/, /lib/ > and /usr/lib/ " These are access rules, they don't really describe a high-level security goal. How do you know it's ok to open everything in these directories? - James
* Steven Rostedt <rostedt@goodmis.org> wrote: > On Mon, 2011-05-16 at 18:52 +0200, Ingo Molnar wrote: > > * Steven Rostedt <rostedt@goodmis.org> wrote: > > > > > I'm a bit nervous about the 'active' role of (trace_)events, because of the > > > way multiple callbacks can be registered. How would: > > > > > > err = event_x(); > > > if (err == -EACCESS) { > > > > > > be handled? [...] > > > > The default behavior would be something obvious: to trigger all callbacks and > > use the first non-zero return value. > > But how do we know which callback that was from? There's no ordering of what > callbacks are called first. We do not have to know that - nor do the calling sites care in general. Do you have some specific usecase in mind where the identity of the callback that generates a match matters? Thanks, Ingo
* Will Drewry <wad@chromium.org> wrote: > > This is *far* more generic still yields the same short-term end result as > > far as your sandboxing is concerned. > > Almost :/ [...] Hey that's a pretty good result from a subsystem that was not written with your usecase in mind *at all* ;-) > [...] I still need to review the code you've pointed out, but, at present, > the ftrace hooks occur after the seccomp and syscall auditing hooks. This > means that that code is exposed no matter what in this model. To trim the > exposed surface to userspace, we really need those early hooks. While I can > see both hacky and less hacky approaches around this, it stills strikes me > that the seccomp thread flag and early interception are good to reuse. One > option might be to allow seccomp to be a secure-syscall event source, but I > suspect that lands more on the hack-y side of the fence :) Agreed, there should be no security compromise imposed on your usecase, at all. You could move the event callback sooner into the syscall-entry sequence to make sure it's the highest priority thing to process? There's no semantic dependency on its current location so this can be changed AFAICS. Thanks, Ingo
On Tue, 2011-05-17 at 14:42 +0200, Ingo Molnar wrote:
> * Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > On Mon, 2011-05-16 at 18:52 +0200, Ingo Molnar wrote:
> > > * Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > > I'm a bit nervous about the 'active' role of (trace_)events,
> > > > because of the way multiple callbacks can be registered. How would:
> > > >
> > > > 	err = event_x();
> > > > 	if (err == -EACCESS) {
> > > >
> > > > be handled? [...]
> > >
> > > The default behavior would be something obvious: to trigger all
> > > callbacks and use the first non-zero return value.
> >
> > But how do we know which callback that was from? There's no ordering
> > of what callbacks are called first.
>
> We do not have to know that - nor do the calling sites care in general.
> Do you have some specific usecase in mind where the identity of the
> callback that generates a match matters?

Maybe I'm confused. I was thinking that these event_*() are what we
currently call trace_*(), but the event_*(), I assume, can return a value
if a callback returns one.

Thus, we now have the ability to dynamically attach function calls to
arbitrary points in the kernel that can have an effect on the code that
called them. Right now, we only have the ability to attach function calls
to these locations that have passive effects (tracing/profiling).

But you say, "nor do the calling sites care in general". Then what do
these calling sites do with the return code? Are we limiting these actions
to security only? Or can we have some other feature. I can envision that
we can make the Linux kernel quite dynamic here with "self modifying
code". That is, anywhere we have "hooks", perhaps we could replace them
with dynamic switches (jump labels). Maybe events would not be the best
use, but they could be a generic one.

Knowing what callback returned the result would be beneficial. Right now,
you are saying if the callback returns anything, just abort the call, not
knowing what callback was called.

-- Steve
* James Morris <jmorris@namei.org> wrote:

> On Mon, 16 May 2011, Ingo Molnar wrote:
>
> > > > Not really.
> > >
> > > Firstly, what is the security goal of these restrictions? [...]
> >
> > To do what i described above? Namely:
> >
> > " Sandboxed code should only be allowed to open files in
> >   /home/sandbox/, /lib/ and /usr/lib/ "
>
> These are access rules, they don't really describe a high-level security
> goal. [...]

Restricting sandboxed code to only open files within a given VFS namespace
boundary sure sounds like a high-level security goal to me.

If implemented and set up correctly then it restricts sandboxed code to
only be able to open files reachable via that VFS sub-namespace. That is a
rather meaningful high-level concept. What higher level concept do you
want to argue?

> [...] How do you know it's ok to open everything in these directories?

How do you know it's ok to open /etc/hosts? The sysadmin has configured
the system that way.

How do you know that it's ok for sandboxed code to open files in
/home/sandbox/? The sandbox developer has configured the system that way.

I'm not sure i get your point.

Thanks,

	Ingo
* Steven Rostedt <rostedt@goodmis.org> wrote: > On Tue, 2011-05-17 at 14:42 +0200, Ingo Molnar wrote: > > * Steven Rostedt <rostedt@goodmis.org> wrote: > > > > > On Mon, 2011-05-16 at 18:52 +0200, Ingo Molnar wrote: > > > > * Steven Rostedt <rostedt@goodmis.org> wrote: > > > > > > > > > I'm a bit nervous about the 'active' role of (trace_)events, because of the > > > > > way multiple callbacks can be registered. How would: > > > > > > > > > > err = event_x(); > > > > > if (err == -EACCESS) { > > > > > > > > > > be handled? [...] > > > > > > > > The default behavior would be something obvious: to trigger all callbacks and > > > > use the first non-zero return value. > > > > > > But how do we know which callback that was from? There's no ordering of what > > > callbacks are called first. > > > > We do not have to know that - nor do the calling sites care in general. Do you > > have some specific usecase in mind where the identity of the callback that > > generates a match matters? > > Maybe I'm confused. I was thinking that these event_*() are what we > currently call trace_*(), but the event_*(), I assume, can return a > value if a call back returns one. Yeah - and the call site can treat it as: - Ugh, if i get an error i need to abort whatever i was about to do or (more advanced future use): - If i get a positive value i need to re-evaluate the parameters that were passed in, they were changed > Thus, we now have the ability to dynamically attach function calls to > arbitrary points in the kernel that can have an affect on the code that > called it. Right now, we only have the ability to attach function calls to > these locations that have passive affects (tracing/profiling). Well, they can only have the effect that the calling site accepts and handles. So the 'effect' is not arbitrary and not defined by the callbacks, it is controlled and handled by the calling code. We do not want invisible side-effects, opaque hooks, etc. 
Instead of that we want (this is the getname() example i cited in the
thread) explicit effects, like:

	if (event_vfs_getname(result))
		return ERR_PTR(-EPERM);

> But you say, "nor do the calling sites care in general". Then what do
> these calling sites do with the return code? Are we limiting these
> actions to security only? Or can we have some other feature. [...]

Yeah, not just security. One other example that came up recently is
whether to panic the box on certain (bad) events such as NMI errors. This
too could be made flexible via the event filter code: we already capture
many events, so places that might conceivably do some policy could do so
based on a filter condition.

> [...] I can envision that we can make the Linux kernel quite dynamic
> here with "self modifying code". That is, anywhere we have "hooks",
> perhaps we could replace them with dynamic switches (jump labels). Maybe
> events would not be the best use, but they could be a generic one.

Events and explicit function calls and explicit side-effects are pretty
much the only things that are acceptable. We do not want opaque hooks and
arbitrary side-effects.

> Knowing what callback returned the result would be beneficial. Right
> now, you are saying if the callback returns anything, just abort the
> call, not knowing what callback was called.

Yeah, and that's a feature: that way a number of conditions can be
attached. Multiple security frameworks may have effect on a task or
multiple tools might set policy action on a given event.

Thanks,

	Ingo
On Tue, 17 May 2011, Ingo Molnar wrote:
> I'm not sure i get your point.
Your example was not complete as described. After an apparently simple
specification, you've since added several qualifiers and assumptions, and
I still doubt that it's complete.
A higher level goal would look like
"Allow a sandbox app access only to approved resources, to contain the
effects of flaws in the app", or similar.
Note that this includes a threat model (remote attacker taking control of
the app) and a general and fully stated strategy for dealing with it.
From there, you can start to analyze how to implement the goal, at which
point you'd start thinking about configuration, assumptions, filesystem
access, namespaces, indirect access (e.g. via sockets, rpc, ipc, shared
memory, invocation).
Anyway, this is getting off track from the main discussion, but you
asked...
- James
* James Morris <jmorris@namei.org> wrote:

> On Tue, 17 May 2011, Ingo Molnar wrote:
>
> > I'm not sure i get your point.
>
> Your example was not complete as described. After an apparently simple
> specification, you've since added several qualifiers and assumptions,
> [...]

I haven't added any qualifiers really (i added examples/description), the
opt-in method i mentioned in my first mail should be pretty robust:

 | Firstly, using the filter code i deny the various link creation
 | syscalls so that sandboxed code cannot escape for example by creating a
 | symlink to outside the permitted VFS namespace. (Note: we opt-in to
 | syscalls, that way new syscalls added by new kernels are denied by
 | default. The current symlink creation syscalls are not opted in to.)

> [...] and I still doubt that it's complete.

I could too claim that i doubt that the SELinux kernel implementation is
secure! So how about we both come up with specific examples of how it's
not secure, instead of going down the fear-uncertainty-and-doubt road? ;-)

> A higher level goal would look like
>
> "Allow a sandbox app access only to approved resources, to contain the
> effects of flaws in the app", or similar.

I see what you mean. I really think that "restricting sandboxed code to
only open files within a given VFS namespace boundary" is the most useful
high-level description here - which is really a subset of an "allow a
sandbox app access only to an easily approved set of files" high-level
concept.

There's no "to contain ..." bit here: *all* of the sandboxed app code is
untrusted, so there's no 'remote attacker' and we do not limit our threat
model to flaws in the app. We want to contain apps to within a small
subset of Linux functionality, and we want to do that within regular apps
(without having to be superuser), full stop.

> Note that this includes a threat model (remote attacker taking control
> of the app) and a general and fully stated strategy for dealing with it.
Attacker does not have to be remote - most sandboxing concepts protect against locally installed plugins/apps/applets. In sandboxing the whole app is considered untrusted - not just some flaw in it, abused remotely. > From there, you can start to analyze how to implement the goal, at which > point you'd start thinking about configuration, assumptions, filesystem > access, namespaces, indirect access (e.g. via sockets, rpc, ipc, shared > memory, invocation). Sandboxed code generally does not have access to anything fancy like that - if it is added then all possible side effects have to be examined. Thanks, Ingo
On Tue, May 17, 2011 at 6:19 AM, Ingo Molnar <mingo@elte.hu> wrote: > > * Steven Rostedt <rostedt@goodmis.org> wrote: > >> On Tue, 2011-05-17 at 14:42 +0200, Ingo Molnar wrote: >> > * Steven Rostedt <rostedt@goodmis.org> wrote: >> > >> > > On Mon, 2011-05-16 at 18:52 +0200, Ingo Molnar wrote: >> > > > * Steven Rostedt <rostedt@goodmis.org> wrote: >> > > > >> > > > > I'm a bit nervous about the 'active' role of (trace_)events, because of the >> > > > > way multiple callbacks can be registered. How would: >> > > > > >> > > > > err = event_x(); >> > > > > if (err == -EACCESS) { >> > > > > >> > > > > be handled? [...] >> > > > >> > > > The default behavior would be something obvious: to trigger all callbacks and >> > > > use the first non-zero return value. >> > > >> > > But how do we know which callback that was from? There's no ordering of what >> > > callbacks are called first. >> > >> > We do not have to know that - nor do the calling sites care in general. Do you >> > have some specific usecase in mind where the identity of the callback that >> > generates a match matters? >> >> Maybe I'm confused. I was thinking that these event_*() are what we >> currently call trace_*(), but the event_*(), I assume, can return a >> value if a call back returns one. > > Yeah - and the call site can treat it as: > > - Ugh, if i get an error i need to abort whatever i was about to do > > or (more advanced future use): > > - If i get a positive value i need to re-evaluate the parameters that were > passed in, they were changed Do event_* that return non-void exist in the tree at all now? I've looked at the various tracepoint macros as well as some of the other handlers (trace_function, perf_tp_event, etc) and I'm not seeing any places where a return value is honored nor could be. At best, the perf_tp_event can be short-circuited it in the hlist_for_each, but it'd still need a way to bubble up a failure and result in not calling the trace/event that the hook precedes. 
Am I missing something really obvious? I don't feel I've gotten a good handle on exactly how all the tracing code gets triggered, so perhaps I'm still a level (or three) too shallow. (I can see the asm hooks for trace functions and I can see where that translates to registered calls - like trace_function - but I don't see how the hooked calls can be trivially aborted). As is, I'm not sure how the perf and ftrace infrastructure could be reused cleanly without a fair number of hacks to the interface and a good bit of reworking. I can already see a number of challenges around reusing the sys_perf_event_open interface and the fact that reimplementing something even as simple as seccomp mode=1 seems to require a fair amount of tweaking to keep it from being leaky. (E.g., enabling all TRACE_EVENT()s for syscalls will miss unhooked syscalls, so either acceptance matching needs to be propagated up the stack along with some seccomp-like task modality, or seccomp-on-perf would have to depend on sys_enter events with syscall number predicate matching and fail when a filter discard applies to all active events.) At present, I'm leaning back towards the v2 series (plus the requested minor changes) for the benefit of code clarity and its fail-secure behavior. Even just considering the reduced case of seccomp mode 1 being implemented on the shared infrastructure, I feel like I'm missing something that makes it viable. Any clues? If not, I don't think a seccomp mode 2 interface via prctl would be intractable if the long term movement is to a ftrace/perf backend - it just means that the in-kernel code would change to wrap whatever the final design ended up being. Thanks and sorry if I'm being dense! >> Thus, we now have the ability to dynamically attach function calls to >> arbitrary points in the kernel that can have an affect on the code that >> called it. Right now, we only have the ability to attach function calls to >> these locations that have passive affects (tracing/profiling). 
>
> Well, they can only have the effect that the calling site accepts and handles.
> So the 'effect' is not arbitrary and not defined by the callbacks, it is
> controlled and handled by the calling code.
>
> We do not want invisible side-effects, opaque hooks, etc.
>
> Instead of that we want (this is the getname() example i cited in the thread)
> explicit effects, like:
>
>     if (event_vfs_getname(result))
>         return ERR_PTR(-EPERM);
>
>> But you say, "nor do the calling sites care in general". Then what do
>> these calling sites do with the return code? Are we limiting these
>> actions to security only? Or can we have some other feature. [...]
>
> Yeah, not just security. One other example that came up recently is whether to
> panic the box on certain (bad) events such as NMI errors. This too could be
> made flexible via the event filter code: we already capture many events, so
> places that might conceivably do some policy could do so based on a filter
> condition.

This sounds great - I just wish I could figure out how it'd work :)

>> [...] I can envision that we can make the Linux kernel quite dynamic here
>> with "self modifying code". That is, anywhere we have "hooks", perhaps we
>> could replace them with dynamic switches (jump labels). Maybe events would
>> not be the best use, but they could be a generic one.
>
> events and explicit function calls and explicit side-effects are pretty much
> the only thing that are acceptable. We do not want opaque hooks and arbitrary
> side-effects.
>
>> Knowing what callback returned the result would be beneficial. Right now, you
>> are saying if the call back return anything, just abort the call, not knowing
>> what callback was called.
>
> Yeah, and that's a feature: that way a number of conditions can be attached.
> Multiple security frameworks may have effect on a task or multiple tools might
> set policy action on a given event.
>
> Thanks,
>
> Ingo
>
On Wed, 2011-05-18 at 21:07 -0700, Will Drewry wrote:
> Do event_* that return non-void exist in the tree at all now? I've
> looked at the various tracepoint macros as well as some of the other
> handlers (trace_function, perf_tp_event, etc) and I'm not seeing any
> places where a return value is honored nor could be. At best, the
> perf_tp_event can be short-circuited in the hlist_for_each, but
> it'd still need a way to bubble up a failure and result in not calling
> the trace/event that the hook precedes.

No, none of the current trace hooks have return values. That's what I was talking about implementing in my previous emails.

-- Steve
On Thu, May 19, 2011 at 7:22 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Wed, 2011-05-18 at 21:07 -0700, Will Drewry wrote:
>
>> Do event_* that return non-void exist in the tree at all now? I've
>> looked at the various tracepoint macros as well as some of the other
>> handlers (trace_function, perf_tp_event, etc) and I'm not seeing any
>> places where a return value is honored nor could be. At best, the
>> perf_tp_event can be short-circuited in the hlist_for_each, but
>> it'd still need a way to bubble up a failure and result in not calling
>> the trace/event that the hook precedes.
>
> No, none of the current trace hooks have return values. That was what I
> was talking about how to implement in my previous emails.

Led on by complete ignorance, I think I'm finally beginning to untwist the different pieces of the tracing infrastructure. Unfortunately, that means I took a few wrong turns along the way...

I think function tracing looks something like this: ftrace_call has been injected at a specific callsite. Upon hit:
1. ftrace_call triggers
2. does some checks then calls ftrace_trace_function (via mcount instrs)
3. ftrace_trace_function may be a single func or a list; for a list it will be ftrace_list_func
4. ftrace_list_func calls each registered hook for that function in a while loop, ignoring return values
5. registered hook funcs may then track the call, farm it out to specific sub-handlers, etc.

This seems to be a red herring for my use case :/ though this helped me understand your back and forth (Ingo & Steve) regarding dynamic versus explicit events.

System call tracing is done via kernel/tracepoint.c events fed in via arch/[arch]/kernel/ptrace.c where it calls trace_sys_enter. This yields direct sys_enter and sys_exit event sources (and an event class to hook up per-system call events). 
This means that ftrace_syscall_enter() does the event prep prior to doing a filter check comparing the ftrace_event object for the given syscall_nr to the event data. perf_syscall_enter() is similar but it pushes the info over to perf_tp_event to be matched not against the global syscall event entry, but against any sub-event in the linked list on that syscall's event.

Is that roughly an accurate representation of the two? I wish I hadn't gotten distracted along the function path, but at least I learned something (and it is relevant to the larger scope of this thread's discussion).

After doing that digging, it looks like providing hook call pre-emption and return value propagation will be a unique and fun task for each tracer and event subsystem. If I look solely at tracepoints, a generic change would be to make the trace_##subsys function return an int (which I think was the event_vfs_getname proposal?). The other option is to change the trace_sys_enter proto to include an 'int *retcode'.

That change would allow the propagation of some sort of policy information. To put it to use, seccomp mode 1 could be implemented on top of trace_syscalls. The following changes would need to happen:
1. dummy metadata should be inserted for all unhooked system calls
2. perf_trace_buf_submit would need to return an int, or a new TRACE_REG_SECCOMP_REGISTER handler would need to be set up in syscall_enter_register
3. if perf is abused, a "kill/fail_on_discard" bit would be added to event->attrs
4. perf_trace_buf_submit/perf_tp_event would return 0 for no matches, 'n' for the number of event matches, and -EACCES/? if a 'fail_on_discard' event is seen
5. perf_syscall_enter would set *retcode = perf_trace_buf_submit()'s retcode
6. trace_sys_enter() would need to be moved to be the first entry in arch/../kernel/ptrace.c for incoming syscalls
7. if trace_sys_enter() yields a negative return code, then do_exit(SIGKILL) the process and return
Entering into seccomp mode 1 would require adding a "0" filter for every system call number (which is why we need a dummy event call for them, since failing to check the bitmask can't be flagged fail_on_discard) with the fail_on_discard bit. For the three calls that are allowed, a '1' filter would be set. That would roughly implement seccomp mode 1. It's pretty ugly, and the fact that every system call that's disallowed has to be blacklisted is not ideal. An alternate model would be to just use the seccomp mode as we do today and let secure_computing() handle the return code of "# of matches". If the # of matches is 0, it terminates. A 'fail_on_discard' bit then would only be good to stop further tracepoint callback evaluation. This approach would also *almost* nix the need to provide dummy syscall hooks. (Since sigreturn isn't hooked on x86 because it uses ptregs fixup, a dummy would still be needed to apply a "1" filter to.) Even with that tweak to move to a whitelist model, the perf event evaluation and tracepoint callback ordering is still not guaranteed. Without changing tracepoint itself, all other TPs will still execute. And for perf events, it'll be first-come-first-serve until a fail_on_discard is hit. After using seccomp mode=1 as the sample case to reduce scope, it's possible to ignore it for now :) and look at the seccomp-filter/mode=2 case. The same mechanism could be used to inject more expressive filters. This would be done over the sys_perf_event_open() interface, assuming the new attr is added to stop perf event list enumeration. Assuming a whitelist model, a call to prctl(PR_SET_SECCOMP, 2) would be needed (maybe with the ON_EXEC flag option too to mirror the event->attr on-exec bit). That would yield the ability to register perf events for system calls then use ioctl() to SET_FILTER on them. Reducing the privileges of the filters after installation could be done with another attribute bit like 'set_filter_ands'. 
If that is also set on the event, and a filter is installed to ensure that sys_perf_event_open is blocked, then it should be reasonably sane. I'd need to add a PERF_EVENT_IOC_GET_FILTER handler to allow extracting the settings. Clearly, I haven't written the code for that yet, though making the change for a single platform shouldn't be too much code.

So that leaves me with some questions:
- Is this the type of reuse that was being encouraged?
- Does it really make sense to cram this through the perf interface and events? While the changed attributes are innocuous and potentially reusable, it seems that a large amount of the perf facilities are exposed that could have weird side effects, but I'm sure I still don't fully grok the perf infrastructure.
- Does extending one tracepoint to provide return values via a pointer make sense? I'd hesitate to make all tracepoint hooks return an int by default.
- If all tracepoints returned an int, what would the standard value look like - or would it need to be per tracepoint impl?
- How is ambiguity resolved if multiple perf_events are registered for a syscall with different filters? Maybe a 'stop_on_match'? though ordering is still a problem then.
- Is there a way to effect a task-wide change without a seccomp flag (or new task_struct entry) via the existing sys_perf_event_open syscall? I considered suggesting an attr bit called 'secure_computing' that, when an event with the bit is enabled, automagically forces the task into seccomp enforcement mode, but that, again, seemed hackish.

While I like the idea of sharing the tracepoints infrastructure and the trace_syscalls hooks, as well as using a pre-existing interface with very minor changes, I'm worried that the complexity of the interface and of the infrastructure might undermine the ability to continue meeting the desired security goals. 
I had originally stayed very close to the seccomp world because of how trivial it is to review the code and verify its accuracy/effectiveness. This approach leaves a lot of gaps for kernel code to seep through and a fair amount of ambiguity in what locked down syscall filters might look like. To me, the best part of the above is that it shows that even if we go with a prctl SET_SECCOMP_FILTER-style interface, it is completely certain that if a perf/ftrace-based security infrastructure is in our future, it will be entirely compatible -- even if the prctl() interface is just the "simpler" interface at that point somewhere down the road. Regardless, I'll hack up a proof of concept based on the outline above. Perhaps something more elegant will emerge once I start tweaking the source, but I'm still seeing too many gaps to be comfortable so far. [[There is a whole other approach to this too. We could continue with the prctl() interface and mirror the trace_sys_enter model for secure_computing(). Instead of keeping a seccomp_t-based hlist of events, they could be stored in a new hlist for seccomp_events in struct ftrace_event_call. The ftrace filters would be installed there, and the seccomp_syscall_enter() function could do the checks and pass up some state data on the task's seccomp_t that indicates it needs to do_exit(). That would likely reduce the amount of code in seccomp_filter.c pretty seriously (though not entirely beneficially).]] Thanks again for all the feedback and insights! I really hope we can come to an agreeable approach for implementing kernel attack surface reduction. will
On Thu, May 19, 2011 at 4:05 PM, Will Drewry <wad@chromium.org> wrote:
> [...]
Just a quick follow up. I have a PoC of the perf-based system call filtering code in place which uses seccomp.mode as a task_context flag, adds require_secure and err_on_discard perf_event_attrs, and enforces that these new events terminate the process in perf_syscall_enter via a return value on perf_trace_buf_submit. This lets a call like:

LD_LIBRARY_PATH=/host/home/wad/kernel.uml/tests/ /host/home/wad/kernel.uml/tests/perf record \
  -S \
  -e 'syscalls:sys_enter_access' \
  -e 'syscalls:sys_enter_brk' \
  -e 'syscalls:sys_enter_close' \
  -e 'syscalls:sys_enter_exit_group' \
  -e 'syscalls:sys_enter_fcntl64' \
  -e 'syscalls:sys_enter_fstat64' \
  -e 'syscalls:sys_enter_getdents64' \
  -e 'syscalls:sys_enter_getpid' \
  -e 'syscalls:sys_enter_getuid' \
  -e 'syscalls:sys_enter_ioctl' \
  -e 'syscalls:sys_enter_lstat64' \
  -e 'syscalls:sys_enter_mmap_pgoff' \
  -e 'syscalls:sys_enter_mprotect' \
  -e 'syscalls:sys_enter_munmap' \
  -e 'syscalls:sys_enter_open' \
  -e 'syscalls:sys_enter_read' \
  -e 'syscalls:sys_enter_stat64' \
  -e 'syscalls:sys_enter_time' \
  -e 'syscalls:sys_enter_newuname' \
  -e 'syscalls:sys_enter_write' --filter "fd == 1" \
  /bin/ls

run successfully, while omitting a syscall or passing in --filter "fd == 0" properly terminates it (ignoring the need for execve and set_thread_area for PoC purposes).

The change avoids defining a new trace call type or a huge number of internal changes and hides seccomp.mode=2 from ABI-exposure in prctl, but the attack surface is non-trivial to verify, and I'm not sure if this ABI change makes sense. It amounts to:

 include/linux/ftrace_event.h  |    4 +-
 include/linux/perf_event.h    |   10 +++++---
 kernel/perf_event.c           |   49 +++++++++++++++++++++++++++++++++++++---
 kernel/seccomp.c              |    8 ++++++
 kernel/trace/trace_syscalls.c |   27 +++++++++++++++++-----
 5 files changed, 82 insertions(+), 16 deletions(-)

And it can be found here: http://static.dataspill.org/perf_secure/v1/

If there is any interest at all, I can post it properly to this giant CC list. 
However, it's unclear to me where this thread stands. I also see a completely independent path for implementing system call filtering now, but I'll hold off posting that patch until I see where this goes. Any guidance will be appreciated - thanks again! will
On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
>  include/linux/ftrace_event.h  |    4 +-
>  include/linux/perf_event.h    |   10 +++++---
>  kernel/perf_event.c           |   49 +++++++++++++++++++++++++++++++++++++---
>  kernel/seccomp.c              |    8 ++++++
>  kernel/trace/trace_syscalls.c |   27 +++++++++++++++++-----
>  5 files changed, 82 insertions(+), 16 deletions(-)

I strongly oppose to the perf core being mixed with any sekurity voodoo (or any other active role for that matter).
On Tue, 24 May 2011, Peter Zijlstra wrote:
> On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote:
> >  include/linux/ftrace_event.h  |    4 +-
> >  include/linux/perf_event.h    |   10 +++++---
> >  kernel/perf_event.c           |   49 +++++++++++++++++++++++++++++++++++++---
> >  kernel/seccomp.c              |    8 ++++++
> >  kernel/trace/trace_syscalls.c |   27 +++++++++++++++++-----
> >  5 files changed, 82 insertions(+), 16 deletions(-)
>
> I strongly oppose to the perf core being mixed with any sekurity voodoo
> (or any other active role for that matter).

Amen. We have enough crap to cleanup in perf/ftrace already, so we really do not need security magic added to it.

Thanks,

	tglx
On Tue, May 24, 2011 at 11:25 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > On Tue, 24 May 2011, Peter Zijlstra wrote: > >> On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote: >> > include/linux/ftrace_event.h | 4 +- >> > include/linux/perf_event.h | 10 +++++--- >> > kernel/perf_event.c | 49 +++++++++++++++++++++++++++++++++++++--- >> > kernel/seccomp.c | 8 ++++++ >> > kernel/trace/trace_syscalls.c | 27 +++++++++++++++++----- >> > 5 files changed, 82 insertions(+), 16 deletions(-) >> >> I strongly oppose to the perf core being mixed with any sekurity voodoo >> (or any other active role for that matter). > > Amen. We have enough crap to cleanup in perf/ftrace already, so we > really do not need security magic added to it. Thanks for the quick responses! I agree, but I'm left a little bit lost now w.r.t. the comments around reusing the ABI. If perf doesn't make sense (which certainly seems wrong from a security interface perspective), then the preexisting ABIs I know of for this case are as follows: - /sys/kernel/debug/tracing/* - prctl(PR_SET_SECCOMP* (or /proc/...) Both would require expansion. The latter was reused by the original patch series. The former doesn't expose much in the way of per-task event filtering -- ftrace_pids doesn't translate well to ftrace_syscall_enter-based enforcement. I'd expect we'd need ftrace_event_call->task_events (like ->perf_events), and either explore them in ftrace_syscall_enter or add a new tracepoint handler, ftrace_task_syscall_enter, via something like TRACE_REG_TASK_REGISTER. It could then do whatever it wanted with the successful or unsuccessful matching against predicates, stacking or not, which could be used for a seccomp-like mechanism. However, bubbling that change up to the non-existent interfaces in debug/tracing could be a challenge too (Registration would require an alternate flow like perf to call TRACE_REG_*? Do they become tracing/events/subsystem/event/task/<tid>/<filter_string_N>? ...?). 
This is all just a matter of programming... but at this point, I'm not seeing the clear shared path forward. Even with per-task ftrace access in debug/tracing, that would introduce a reasonably large change to the system and add a new ABI, albeit in debug/tracing. If the above (or whatever the right approach is) comes into existence, then any prctl(PR_SET_SECCOMP) ABI could have the backend implementation to modify the same data. I'm not putting it like this to say that I'm designing to be obsolete, but to show that the defined interface wouldn't conflict if ftrace does overlap more in the future. Given the importance of a clearly defined interface for security functionality, I'd be surprised to see all the pieces come together in the near future in such a way that a transition would be immediately possible -- I'm not even sure what the ftrace roadmap really is! Would it be more desirable to put a system call filtering interface on a miscdev (like /dev/syscall_filter) instead of in /proc or prctl (and not reuse seccomp at all)? I'm not clear what the onus is to justify a change in the different ABI areas, but I see system call filtering as an important piece of system security and would like to determine if there is a viable path forward, or if this will need to be revisited in another 2 years. thanks again! will
* Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote: > > include/linux/ftrace_event.h | 4 +- > > include/linux/perf_event.h | 10 +++++--- > > kernel/perf_event.c | 49 +++++++++++++++++++++++++++++++++++++--- > > kernel/seccomp.c | 8 ++++++ > > kernel/trace/trace_syscalls.c | 27 +++++++++++++++++----- > > 5 files changed, 82 insertions(+), 16 deletions(-) > > I strongly oppose to the perf core being mixed with any sekurity voodoo > (or any other active role for that matter). I'd object to invisible side-effects as well, and vehemently so. But note how intelligently it's used here: it's explicit in the code, it's used explicitly in kernel/seccomp.c and the event generation place in kernel/trace/trace_syscalls.c. So this is a really flexible solution IMO and does not extend events with some invisible 'active' role. It extends the *call site* with an open-coded active role - which active role btw. already pre-existed. Thanks, Ingo
* Will Drewry <wad@chromium.org> wrote: > The change avoids defining a new trace call type or a huge number of internal > changes and hides seccomp.mode=2 from ABI-exposure in prctl, but the attack > surface is non-trivial to verify, and I'm not sure if this ABI change makes > sense. It amounts to: > > include/linux/ftrace_event.h | 4 +- > include/linux/perf_event.h | 10 +++++--- > kernel/perf_event.c | 49 +++++++++++++++++++++++++++++++++++++--- > kernel/seccomp.c | 8 ++++++ > kernel/trace/trace_syscalls.c | 27 +++++++++++++++++----- > 5 files changed, 82 insertions(+), 16 deletions(-) > > And can be found here: http://static.dataspill.org/perf_secure/v1/ Wow, i'm very impressed how few changes you needed to do to support this! So, firstly, i don't think we should change perf_tp_event() at all - the 'observer' codepaths should be unaffected. But there could be a perf_tp_event_ret() or perf_tp_event_check() entry that code like seccomp which wants to use event results can use. Also, i'm not sure about the seccomp details and assumptions that were moved into the perf core. How about passing in a helper function to perf_tp_event_check(), where seccomp would define its seccomp specific helper function? That looks sufficiently flexible. That helper function could be an 'extra filter' kind of thing, right? Also, regarding the ABI and the attr.err_on_discard and attr.require_secure bits, they look a bit too specific as well. attr.err_on_discard: with the filter helper function passed in this is probably not needed anymore, right? attr.require_secure: this is basically used to *force* the creation of security-controlling filters, right? It seems to me that this could be done via a seccomp ABI extension as well, without adding this to the perf ABI. That seccomp call could check whether the right events are created and move the task to mode 2 only if that prereq is met - or something like that. 
> If there is any interest at all, I can post it properly to this giant > CC list. [...] I'd suggest to trim the Cc: list aggressively - anyone interested in the discussion can pick it up on lkml - and i strongly suspect that most of the Cc: participants would want to be off the Cc: :-) Thanks, Ingo
* Ingo Molnar <mingo@elte.hu> wrote: > > * Peter Zijlstra <peterz@infradead.org> wrote: > > > On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote: > > > include/linux/ftrace_event.h | 4 +- > > > include/linux/perf_event.h | 10 +++++--- > > > kernel/perf_event.c | 49 +++++++++++++++++++++++++++++++++++++--- > > > kernel/seccomp.c | 8 ++++++ > > > kernel/trace/trace_syscalls.c | 27 +++++++++++++++++----- > > > 5 files changed, 82 insertions(+), 16 deletions(-) > > > > I strongly oppose to the perf core being mixed with any sekurity voodoo > > (or any other active role for that matter). > > I'd object to invisible side-effects as well, and vehemently so. But note how > intelligently it's used here: it's explicit in the code, it's used explicitly > in kernel/seccomp.c and the event generation place in > kernel/trace/trace_syscalls.c. > > So this is a really flexible solution IMO and does not extend events with > some invisible 'active' role. It extends the *call site* with an open-coded > active role - which active role btw. already pre-existed. Also see my other mail - i think this seccomp code is too tied in to the perf core and ABI - but this is fixable IMO. The fundamental notion that a generator subsystem of events can use filter results as well (such as kernel/trace/trace_syscalls.c.) for its own purposes is pretty robust though. Thanks, Ingo
On Tue, 2011-05-24 at 22:08 +0200, Ingo Molnar wrote: > * Will Drewry <wad@chromium.org> wrote: > But there could be a perf_tp_event_ret() or perf_tp_event_check() entry that > code like seccomp which wants to use event results can use. We should name it something else. The "perf_tp.." is a misnomer as it has nothing to do with performance monitoring. "dynamic_event_.." maybe, as it is dynamic to the effect that we can use jump labels to enable or disable it. -- Steve
On Tue, 24 May 2011, Ingo Molnar wrote: > * Peter Zijlstra <peterz@infradead.org> wrote: > > > On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote: > > > include/linux/ftrace_event.h | 4 +- > > > include/linux/perf_event.h | 10 +++++--- > > > kernel/perf_event.c | 49 +++++++++++++++++++++++++++++++++++++--- > > > kernel/seccomp.c | 8 ++++++ > > > kernel/trace/trace_syscalls.c | 27 +++++++++++++++++----- > > > 5 files changed, 82 insertions(+), 16 deletions(-) > > > > I strongly oppose to the perf core being mixed with any sekurity voodoo > > (or any other active role for that matter). > > I'd object to invisible side-effects as well, and vehemently so. But note how > intelligently it's used here: it's explicit in the code, it's used explicitly > in kernel/seccomp.c and the event generation place in > kernel/trace/trace_syscalls.c. > > So this is a really flexible solution IMO and does not extend events with some > invisible 'active' role. It extends the *call site* with an open-coded active > role - which active role btw. already pre-existed. We do _NOT_ make any decision based on the trace point so what's the "pre-existing" active role in the syscall entry code? I'm all for code reuse and reuse of interfaces, but this is completely wrong. Instrumentation and security decisions are two fundamentally different things and we want them kept separate. Instrumentation is not meant to make decisions. Just because we can does not mean that it is a good idea. So what the current approach does is: - abuse the existing ftrace syscall hook by adding a return value to the tracepoint. So we need to propagate that for every tracepoint just because we have a single user. - abuse the perf per task mechanism Just because we have per task context in perf does not mean that we pull everything and the world which requires per task context into perf. The security folks have per task context already so security related stuff wants to go there. 
- abuse the perf/ftrace interfaces One of the arguments was that perf and ftrace have permissions which are not available from the existing security interfaces. That's not at all a good reason to abuse these interfaces. Let the security folks sort out the problem on their end and do not impose any expectations on perf/ftrace which we have to carry around forever. Yes, it can be made to work with a relatively small patch, but it has a very nasty side effect: You add another user space visible ABI to the existing perf/ftrace mess which needs to be supported forever. Brilliant, we have already two ABIs (perf/ftrace) to support and at the same time we urgently need to solve the problem of better integration of those two. So adding a third completely unrelated component with a guaranteed ABI is just making this even more complex. We can factor out the filtering code and let the security dudes reuse it for their own purposes. That lets them have their own interfaces and does not impose any restrictions upon the tracing/perf ones. And really, security stuff wants to be integrated into the existing security frameworks and not duct taped into perf/trace just because it's a convenient hack around limitations of the existing security stuff. You really should stop seeing everything as a nail just because the only tool you have handy is the perf hammer. perf is about instrumentation and we don't want to violate the oldest principle of Unix: to have simple tools which do one thing and do it well. Even Swiss army knives have the restriction that you can use only one tool at a time unless you want to stick the corkscrew through your palm when you try to cut bread. Thanks, tglx
* Thomas Gleixner <tglx@linutronix.de> wrote: > On Tue, 24 May 2011, Ingo Molnar wrote: > > * Peter Zijlstra <peterz@infradead.org> wrote: > > > > > On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote: > > > > include/linux/ftrace_event.h | 4 +- > > > > include/linux/perf_event.h | 10 +++++--- > > > > kernel/perf_event.c | 49 +++++++++++++++++++++++++++++++++++++--- > > > > kernel/seccomp.c | 8 ++++++ > > > > kernel/trace/trace_syscalls.c | 27 +++++++++++++++++----- > > > > 5 files changed, 82 insertions(+), 16 deletions(-) > > > > > > I strongly oppose to the perf core being mixed with any sekurity voodoo > > > (or any other active role for that matter). > > > > I'd object to invisible side-effects as well, and vehemently so. But note how > > intelligently it's used here: it's explicit in the code, it's used explicitly > > in kernel/seccomp.c and the event generation place in > > kernel/trace/trace_syscalls.c. > > > > So this is a really flexible solution IMO and does not extend events with some > > invisible 'active' role. It extends the *call site* with an open-coded active > > role - which active role btw. already pre-existed. > > We do _NOT_ make any decision based on the trace point so what's the > "pre-existing" active role in the syscall entry code? The seccomp code we are discussing in this thread. > I'm all for code reuse and reuse of interfaces, but this is completely > wrong. Instrumentation and security decisions are two fundamentally > different things and we want them kept separate. Instrumentation is > not meant to make decisions. Just because we can does not mean that it > is a good idea. Instrumentation does not 'make decisions': the calling site, which is already emitting the event, wants to make decisions based on the data that also generates the event. Those decisions *will be made* and you cannot prevent that; the only open question is whether it can reuse code intelligently, which code it is btw.
already calling for observation reasons? ( Note that pure observers won't be affected and note that pure observation call sites are not affected either. ) > So what the current approach does is: > > - abuse the existing ftrace syscall hook by adding a return value to > the tracepoint. > > So we need to propagate that for every tracepoint just because we > have a single user. This is a technical criticism i share with you and i think it can be fixed - i outlined it to Will yesterday. And no, if done cleanly it's not 'abuse' to reuse code. Could we wait for the first clean iteration of this patch instead of rushing judgement prematurely? > - abuse the perf per task mechanism > > Just because we have per task context in perf does not mean that we > pull everything and the world which requires per task context into > perf. The security folks have per task context already so security > related stuff wants to go there. We do not pull 'everything and the world' in, but code that wants to process events in places that already emit events surely sounds related to me :-) > - abuse the perf/ftrace interfaces > > One of the arguments was that perf and ftrace have permissions which > are not available from the existing security interfaces. That's not > at all a good reason to abuse these interfaces. Let the security > folks sort out the problem on their end and do not impose any > expectations on perf/ftrace which we have to carry around forever. > > Yes, it can be made to work with a relatively small patch, but it has > a very nasty side effect: > > You add another user space visible ABI to the existing perf/ftrace > mess which needs to be supported forever. What mess? I'm not aware of a mess - other than the ftrace API which is not used by this patch. > Brilliant, we have already two ABIs (perf/ftrace) to support and at > the same time we urgently need to solve the problem of better > integration of those two.
So adding a third completely unrelated > component with a guaranteed ABI is just making this even more complex. So your solution is to add yet another ABI for seccomp and to keep seccomp a limited hack forever, just because you are not interested in security? I think we want fewer ABIs and more flexible/reusable facilities. > We can factor out the filtering code and let the security dudes > reuse it for their own purposes. That lets them have their own > interfaces and does not impose any restrictions upon the > tracing/perf ones. And really security stuff wants to be integrated > into the existing security frameworks and not duct taped into > perf/trace just because it's a convenient hack around limitations > of the existing security stuff. You are missing what i tried to point out in earlier discussions: from a security design POV this isn't just about the system call boundary. If this seccomp variant is based on events then it could grow proper security checks in other places as well, in places where we have some sort of object observation event anyway. So this opens up possibilities to reuse and unify code on a very broad basis. > You really should stop seeing everything as a nail just because the > only tool you have handy is the perf hammer. perf is about > instrumentation and we don't want to violate the oldest principle > of unix to have simple tools which do one thing and do it well. That is one of the most misunderstood principles of Unix. The original Unix tool landscape was highly *integrated* and *unified*, into a very tight codebase that was maintained together. Yes, there were different, atomic, simple commands but it was all done together and it was a coherent whole and pleasant to use! People misunderstood this as a license to fragment the heck out of functionality and build 'simple and stupid' tools that fit nowhere really and use different, incompatible principles not synced with each other. That is wrong and harmful.
So yes, over-integration is obviously wrong - but so is needless fragmentation. Anyway, i still reserve judgement on this seccomp bit but the patches so far looked very promising with a surprisingly small diffstat. If anything, that should tell you that events and seccomp are not just casually related ... > Even Swiss army knives have the restriction that you can use only > one tool at a time unless you want to stick the corkscrew through > your palm when you try to cut bread. I'm not sure what you are arguing here - do you pine for the DOS days where you could only use one command at a time? Thanks, Ingo
On Wed, 2011-05-25 at 17:01 +0200, Ingo Molnar wrote: > > We do _NOT_ make any decision based on the trace point so what's the > > "pre-existing" active role in the syscall entry code? > > The seccomp code we are discussing in this thread. That isn't pre-existing, that's proposed. But face it, you can argue until you're blue in the face, but both tglx and I will NAK any and all patches that extend perf/ftrace beyond the passive observing role. Your arguments appear to be as non-persuasive to us as ours are to you, so please drop this endeavor and let the security folks sort it on their own and let's get back to doing useful work.
On Wed, 25 May 2011, Ingo Molnar wrote: > * Thomas Gleixner <tglx@linutronix.de> wrote: > > On Tue, 24 May 2011, Ingo Molnar wrote: > > > * Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > > On Tue, 2011-05-24 at 10:59 -0500, Will Drewry wrote: > > > > > include/linux/ftrace_event.h | 4 +- > > > > > include/linux/perf_event.h | 10 +++++--- > > > > > kernel/perf_event.c | 49 +++++++++++++++++++++++++++++++++++++--- > > > > > kernel/seccomp.c | 8 ++++++ > > > > > kernel/trace/trace_syscalls.c | 27 +++++++++++++++++----- > > > > > 5 files changed, 82 insertions(+), 16 deletions(-) > > > > > > > > I strongly oppose to the perf core being mixed with any sekurity voodoo > > > > (or any other active role for that matter). > > > > > > I'd object to invisible side-effects as well, and vehemently so. But note how > > > intelligently it's used here: it's explicit in the code, it's used explicitly > > > in kernel/seccomp.c and the event generation place in > > > kernel/trace/trace_syscalls.c. > > > > > > So this is a really flexible solution IMO and does not extend events with some > > > invisible 'active' role. It extends the *call site* with an open-coded active > > > role - which active role btw. already pre-existed. > > > > We do _NOT_ make any decision based on the trace point so what's the > > "pre-existing" active role in the syscall entry code? > > The seccomp code we are discussing in this thread. That's proposed code and has absolutely nothing to do with the existing trace point semantics. > > I'm all for code reuse and reuse of interfaces, but this is completely > > wrong. Instrumentation and security decisions are two fundamentally > > different things and we want them kept separate. Instrumentation is > > not meant to make decisions. Just because we can does not mean that it > > is a good idea. 
> > Instrumentation does not 'make decisions': the calling site, which is > already emitting the event, wants to make decisions based on > the data that also generates the event. You can repeat that as often as you want, it does not make it more true. The fact is that the decision is made in the middle of the perf code. > + /* Transition the task if required. */ > + if (ctx->type == task_context && event->attr.require_secure) { > +#ifdef CONFIG_SECCOMP > + /* Don't allow perf events to escape mode = 1. */ > + if (!current->seccomp.mode) > + current->seccomp.mode = 2; > +#endif > + } and further down > + if (event->attr.err_on_discard) > + ok = -EACCES; > Those decisions *will be made* and you cannot prevent that; the only > open question is whether it can reuse code intelligently, which code it is > btw. already calling for observation reasons? The tracepoint is called for observation reasons and now you make it a decision function. That's what I call abuse. > ( Note that pure observers won't be affected and note that pure > observation call sites are not affected either. ) Hahaha, they still have to run through the additional code when seccomp is enabled and we still have to propagate the return value down to the point where the tracepoint itself is. You call that not affected? > > So what the current approach does is: > > > > - abuse the existing ftrace syscall hook by adding a return value to > > the tracepoint. > > > > So we need to propagate that for every tracepoint just because we > > have a single user. > > This is a technical criticism i share with you and i think it can be > fixed - i outlined it to Will yesterday. > > And no, if done cleanly it's not 'abuse' to reuse code. Could we wait > for the first clean iteration of this patch instead of rushing > judgement prematurely? There is no way to do it cleanly. It always comes at the price that you add additional code into the tracing code path.
And there are other people who try hard to remove stuff to reduce the overhead which is caused by instrumentation. > > - abuse the perf per task mechanism > > > > Just because we have per task context in perf does not mean that we > > pull everything and the world which requires per task context into > > perf. The security folks have per task context already so security > > related stuff wants to go there. > > We do not pull 'everything and the world' in, but code that wants to > process events in places that already emit events surely sounds > related to me :-) We have enough places where different independent parts of the kernel want to hook into for obvious reasons. We have notifiers for those where performance does not matter much and we have separate calls into the independent functions where it matters or where we need to evaluate the results in specific ways. So now you turn instrumentation into a security mechanism, which "works" nicely for a particular purpose, i.e. a decision on a particular syscall number. Now, how do you make that work when a decision has to be made on more than a simple match, e.g. syscall number + arguments? Not at all, unless you add more complexity and arbitrary callbacks into the instrumentation code. > > Brilliant, we have already two ABIs (perf/ftrace) to support and at > > the same time we urgently need to solve the problem of better > > integration of those two. So adding a third completely unrelated > > component with a guaranteed ABI is just making this even more complex. > > So your solution is to add yet another ABI for seccomp and to keep > seccomp a limited hack forever, just because you are not interested > in security? Well, I'm interested in security, but I'm neither interested in security decisions inside instrumentation code nor in security models which are limited hacks by definition unless you want to add callback complexities to the instrumentation code.
It all looks nice and charming with this minimalistic use case, but add real features to it and it gets messy in a split second and you can't hold that off once you started to allow A. And you clearly stated that you want to have more trace point based security features than the simple syscall number filtering. > I think we want fewer ABIs and more flexible/reusable facilities. I'm all for that, but security and instrumentation are different beasts. > > We can factor out the filtering code and let the security dudes > > reuse it for their own purposes. That lets them have their own > > interfaces and does not impose any restrictions upon the > > tracing/perf ones. And really security stuff wants to be integrated > > into the existing security frameworks and not duct taped into > > perf/trace just because it's a convenient hack around limitations > > of the existing security stuff. > > You are missing what i tried to point out in earlier discussions: > from a security design POV this isn't just about the system call > boundary. If this seccomp variant is based on events then it could > grow proper security checks in other places as well, in places where > we have some sort of object observation event anyway. Right, that requires callbacks and more stuff to do object based observation and ties a trace point into a place where it might not make sense after a while, but the security decision wants to stay at that place. The syscall example is so tempting because it's simplistic and easy to solve, but every extension to that model is going to create a nightmare. You are duct taping stuff together which has totally different semantics and requirements. And your only argument is reuse of existing code. That's a good argument in principle, but there is a fundamental difference between intelligent reuse and enforcing it just for reuse's sake. > So this opens up possibilities to reuse and unify code on a very > broad basis. By making a total mess out of it?
> So yes, over-integration is obviously wrong - but so is needless > fragmentation. Right. And this falls into the category of over-integration. We have people working on security and they are working on stacked security models, so where is the justification to start another security framework inside the instrumentation code which is completely non-interoperable with the existing ones? > If anything, that should tell you that events and > seccomp are not just casually related ... They happen to have the hook at the same point in the source and by pure coincidence it works because the problem to solve is extremely simplistic. And that's why the diffstat is minimalistic, but that does not prove anything. Thanks, tglx
On Mon 2011-05-16 10:36:05, James Morris wrote: > On Fri, 13 May 2011, Ingo Molnar wrote: > How do you reason about the behavior of the system as a whole? > > > > I argue that this is the LSM and audit subsystems designed right: in the long > > run it could allow everything that LSM does at the moment - and so much more > > ... > > Now you're proposing a redesign of the security subsystem. That's a > significant undertaking. > > In the meantime, we have a simple, well-defined enhancement to seccomp > which will be very useful to current users in reducing their kernel attack > surface. Well, you can do the same with subterfugue, even without kernel changes. But that's ptrace -- slow. (And it already shows that syscall based filters are extremely tricky to configure.) If you want speed, seccomp+server for non-permitted operations seems like a reasonable way.
* Pavel Machek <pavel@ucw.cz> wrote: > On Mon 2011-05-16 10:36:05, James Morris wrote: > > On Fri, 13 May 2011, Ingo Molnar wrote: > > How do you reason about the behavior of the system as a whole? > > > > > > > I argue that this is the LSM and audit subsystems designed right: in the long > > > run it could allow everything that LSM does at the moment - and so much more > > > ... > > > > Now you're proposing a redesign of the security subsystem. That's a > > significant undertaking. > > > > In the meantime, we have a simple, well-defined enhancement to seccomp > > which will be very useful to current users in reducing their kernel attack > > surface. > > Well, you can do the same with subterfugue, even without kernel > changes. But that's ptrace -- slow. (And it already shows that > syscall based filters are extremely tricky to configure). Yes, if you use syscall based filters to implement access to underlying objects where the access methods do not capture essential lifetime events properly (such as files) then you'll quickly run into trouble achieving a secure solution. But you can robustly use syscall filters to control the underlying primary *resource*: various pieces of kernel code with *negative* utility to the current app - which have no use to the app but pose risks in terms of potential exploits in them. But you can use event filters to implement arbitrary security policies robustly. For example file objects: if you generate the right events for a class of objects then you can control access to them very robustly. It's not a surprise that this is what SELinux does primarily: it has lifetime event hooks at the inode object (and socket, packet, etc.) level and captures those access attempts and validates them against the permissions of that object, in light of the accessing task's credentials. Exactly that can be done with Will's patch as well, if its potential scope of event-checking points is not stupidly limited to the syscall boundary alone ...
Thanks, Ingo
* Thomas Gleixner <tglx@linutronix.de> wrote: > > > We do _NOT_ make any decision based on the trace point so > > > what's the "pre-existing" active role in the syscall entry > > > code? > > > > The seccomp code we are discussing in this thread. > > That's proposed code and has absolutely nothing to do with the > existing trace point semantics. So because it's proposed code it does not exist? If the feature is accepted (and given Linus's opinion it's not clear at all it's accepted in any form) then it's obviously a very legitimate technical concern whether we do: ret = seccomp_check_syscall_event(p1, p2, p3, p4, p5); if (ret) return -EACCES; ... random code ... trace_syscall_event(p1, p2, p3, p4, p5); Where seccomp_check_syscall_event() duplicates much of the machinery that is behind trace_syscall_event(). Or we do the more intelligent: ret = check_syscall_event(p1, p2, p3, p4, p5); if (ret) return -EACCES; Where we have the happy side effects of: - less code at the call site - (a lot of!) shared infrastructure between the proposed seccomp code and event filters. - we'd also be able to trace at security check boundaries - which has obvious bug analysis advantages. In fact i do not see *any* advantages in keeping this needlessly bloaty and needlessly inconsistently sampled form of instrumentation: ret = seccomp_check_syscall_event(p1, p2, p3, p4, p5); if (ret) return -EACCES; ... random code ... trace_syscall_event(p1, p2, p3, p4, p5); Do you? Thanks, Ingo
* Thomas Gleixner <tglx@linutronix.de> wrote: > > If anything, that should tell you that events and > > seccomp are not just casually related ... > > They happen to have the hook at the same point in the source and > by pure coincidence it works because the problem to solve is > extremely simplistic. And that's why the diffstat is minimalistic, > but that does not prove anything. Here are the diffstats of the various versions of this proposed security feature: bitmask (2009): 6 files changed, 194 insertions(+), 22 deletions(-) filter engine (2010): 18 files changed, 1100 insertions(+), 21 deletions(-) event filters (2011): 5 files changed, 82 insertions(+), 16 deletions(-) The third variant, 'event filters', is actually the most sophisticated one of all and it is not simplistic at all. The main reason why the diffstat is small is that it reuses over ten thousand lines of pre-existing kernel code intelligently. Are you interpreting that as some sort of failure of the patch? I think it's a very good thing. To demonstrate the non-simplicity of the feature: - These security rules/filters can be sophisticated like: sys_close() rule protecting against the closing of stdin/stdout/stderr: "fd == 0 || fd == 1 || fd == 2" sys_ioperm() rule allowing port 0x80 access but nothing else: "from != 128 || num != 1" sys_listen() rule limiting the max accept() backlog to 16 entries: "backlog > 16" sys_mprotect(), sys_mmap[2](), sys_munmap() and sys_mremap() rule protecting the first 1 MB NULL pointer guard range: "addr < 0x00100000" sys_sched_setscheduler() rule protecting against the switch to non-SCHED_OTHER scheduler policies: "policy != 0" Most of these examples are fine-grained access restrictions that AFAIK are not possible with any of the LSM based security measures that Linux offers today. - These security rules/filters can be safely used and installed by unprivileged userspace, allowing arbitrary end user apps to define their own, flexible security policies.
- These security rules/filters get automatically inherited into child tasks and child tasks cannot mess with them - they cannot even query/observe that these filters *exist*. - These security rules/filters nest on each other in basically arbitrary depth, giving us a working, implemented, stackable LSM concept. - These security rules/filters can be extended to arbitrary more object lifetime events in the future, without changing the ABI. - These security rules/filters, unlike most LSM rules, can execute not just within hardirqs but also within deeply atomic contexts such as NMI contexts, putting far less restrictions on what can be security/access checked. - Access permission violations can be set up to generate events of the violations into a scalable ring-buffer, providing unprivileged security-auditing functionality to the managing task(s). I'd call that anything but 'simplistic'. Thanks, Ingo
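The rule strings quoted in the list above use the ftrace event-filter expression syntax. As a toy userspace model of the matching semantics only - the real patch hands each string to the kernel's event-filter parser at syscall entry, and whether a match means "allow" or "deny" depends on how the filter is installed, so the hard-coded predicates below and their polarity are assumptions for illustration:

```c
#include <stdbool.h>

/* Toy model of the quoted sys_close() rule "fd == 0 || fd == 1 || fd == 2".
 * The kernel parses the rule string with the ftrace event-filter engine;
 * this hard-coded predicate only mirrors what the parsed filter would
 * match against the syscall's arguments. */
bool close_rule_matches(long fd)
{
	return fd == 0 || fd == 1 || fd == 2;
}

/* Toy model of the NULL-pointer-guard rule "addr < 0x00100000" quoted
 * for the sys_mprotect()/sys_mmap() family. */
bool mmap_rule_matches(unsigned long addr)
{
	return addr < 0x00100000UL;
}
```

For instance, a close(2) on fd 42 would not match the first predicate, while an mmap at address 0x1000 would match the NULL-guard predicate; what the kernel then does with the match is the installed filter's policy.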
* Peter Zijlstra <peterz@infradead.org> wrote: > But face it, you can argue until you're blue in the face, That is not a technical argument though - and i considered and answered every valid technical argument made by you and Thomas. You were either not able to or not willing to counter them. > [...] but both tglx and I will NAK any and all patches that extend > perf/ftrace beyond the passive observing role. The thing is, perf is *already* well beyond the 'passive observer' role: we already generate lots of 'action' in response to events. We generate notification signals, we write events - all of which can (and does) modify program behavior. So what's your point? There's no "passive observer" role really - it's apparently just that you dislike this use of instrumentation while you approve of other uses. Thanks, Ingo
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 377a7a5..22e1668 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1664,6 +1664,16 @@ config SECCOMP
 	  and the task is only allowed to execute a few safe syscalls
 	  defined by each seccomp mode.
 
+config SECCOMP_FILTER
+	bool "Enable seccomp-based system call filtering"
+	depends on SECCOMP && EXPERIMENTAL
+	help
+	  Per-process, inherited system call filtering using shared code
+	  across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+	  is not available, enhanced filters will not be available.
+
+	  See Documentation/prctl/seccomp_filter.txt for more detail.
+
 config CC_STACKPROTECTOR
 	bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
 	depends on EXPERIMENTAL
diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
index eccdefe..7641ee9 100644
--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -129,6 +129,16 @@ config SECCOMP
 	  If unsure, say Y. Only embedded should say N here.
 
+config SECCOMP_FILTER
+	bool "Enable seccomp-based system call filtering"
+	depends on SECCOMP && EXPERIMENTAL
+	help
+	  Per-process, inherited system call filtering using shared code
+	  across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+	  is not available, enhanced filters will not be available.
+
+	  See Documentation/prctl/seccomp_filter.txt for more detail.
+
 endmenu
 
 menu "Advanced setup"
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 8e256cc..fe4cbda 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2245,6 +2245,16 @@ config SECCOMP
 	  If unsure, say Y. Only embedded should say N here.
 
+config SECCOMP_FILTER
+	bool "Enable seccomp-based system call filtering"
+	depends on SECCOMP && EXPERIMENTAL
+	help
+	  Per-process, inherited system call filtering using shared code
+	  across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+	  is not available, enhanced filters will not be available.
+
+	  See Documentation/prctl/seccomp_filter.txt for more detail.
+
 config USE_OF
 	bool "Flattened Device Tree support"
 	select OF
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 8f4d50b..83499e4 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -605,6 +605,16 @@ config SECCOMP
 	  If unsure, say Y. Only embedded should say N here.
 
+config SECCOMP_FILTER
+	bool "Enable seccomp-based system call filtering"
+	depends on SECCOMP && EXPERIMENTAL
+	help
+	  Per-process, inherited system call filtering using shared code
+	  across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+	  is not available, enhanced filters will not be available.
+
+	  See Documentation/prctl/seccomp_filter.txt for more detail.
+
 endmenu
 
 config ISA_DMA_API
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 2508a6f..2777515 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -614,6 +614,16 @@ config SECCOMP
 	  If unsure, say Y.
 
+config SECCOMP_FILTER
+	bool "Enable seccomp-based system call filtering"
+	depends on SECCOMP && EXPERIMENTAL
+	help
+	  Per-process, inherited system call filtering using shared code
+	  across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+	  is not available, enhanced filters will not be available.
+
+	  See Documentation/prctl/seccomp_filter.txt for more detail.
+
 endmenu
 
 menu "Power Management"
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 4b89da2..00c1521 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -676,6 +676,16 @@ config SECCOMP
 	  If unsure, say N.
 
+config SECCOMP_FILTER
+	bool "Enable seccomp-based system call filtering"
+	depends on SECCOMP && EXPERIMENTAL
+	help
+	  Per-process, inherited system call filtering using shared code
+	  across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+	  is not available, enhanced filters will not be available.
+
+	  See Documentation/prctl/seccomp_filter.txt for more detail.
+
 config SMP
 	bool "Symmetric multi-processing support"
 	depends on SYS_SUPPORTS_SMP
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index e560d10..5b42255 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -270,6 +270,16 @@ config SECCOMP
 	  If unsure, say Y. Only embedded should say N here.
 
+config SECCOMP_FILTER
+	bool "Enable seccomp-based system call filtering"
+	depends on SECCOMP && EXPERIMENTAL
+	help
+	  Per-process, inherited system call filtering using shared code
+	  across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+	  is not available, enhanced filters will not be available.
+
+	  See Documentation/prctl/seccomp_filter.txt for more detail.
+
 config HOTPLUG_CPU
 	bool "Support for hot-pluggable CPUs"
 	depends on SPARC64 && SMP
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc6c53a..d6d44d9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1485,6 +1485,16 @@ config SECCOMP
 	  If unsure, say Y. Only embedded should say N here.
 
+config SECCOMP_FILTER
+	bool "Enable seccomp-based system call filtering"
+	depends on SECCOMP && EXPERIMENTAL
+	help
+	  Per-process, inherited system call filtering using shared code
+	  across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+	  is not available, enhanced filters will not be available.
+
+	  See Documentation/prctl/seccomp_filter.txt for more detail.
+
 config CC_STACKPROTECTOR
 	bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
 	---help---
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index a3baeb2..379b391 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -63,6 +63,15 @@
 /* Get/set process seccomp mode */
 #define PR_GET_SECCOMP	21
 #define PR_SET_SECCOMP	22
+# define PR_SECCOMP_MODE_NONE	0
+# define PR_SECCOMP_MODE_STRICT	1
+# define PR_SECCOMP_MODE_FILTER	2
+# define PR_SECCOMP_FLAG_FILTER_ON_EXEC	(1 << 1)
+
+/* Get/set process seccomp filters */
+#define PR_GET_SECCOMP_FILTER	35
+#define PR_SET_SECCOMP_FILTER	36
+#define PR_CLEAR_SECCOMP_FILTER	37
 
 /* Get/set the capability bounding set (as per security/commoncap.c) */
 #define PR_CAPBSET_READ	23
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 18d63ce..27eacf9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -77,7 +77,6 @@ struct sched_param {
 #include <linux/percpu.h>
 #include <linux/topology.h>
 #include <linux/proportions.h>
-#include <linux/seccomp.h>
 #include <linux/rcupdate.h>
 #include <linux/rculist.h>
 #include <linux/rtmutex.h>
@@ -1190,6 +1189,8 @@ enum perf_event_task_context {
 	perf_nr_task_contexts,
 };
 
+struct seccomp_state;
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
@@ -1374,7 +1375,7 @@ struct task_struct {
 	uid_t loginuid;
 	unsigned int sessionid;
 #endif
-	seccomp_t seccomp;
+	struct seccomp_state *seccomp;
 
 /* Thread group tracking */
 	u32 parent_exec_id;
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 167c333..289c836 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -2,12 +2,34 @@
 #define _LINUX_SECCOMP_H
 
+/* Forward declare for proc interface */
+struct seq_file;
+
 #ifdef CONFIG_SECCOMP
 
+#include <linux/errno.h>
+#include <linux/list.h>
 #include <linux/thread_info.h>
+#include <linux/types.h>
 #include <asm/seccomp.h>
 
-typedef struct { int mode; } seccomp_t;
+/**
+ *
struct seccomp_state - the state of a seccomp'ed process + * + * @mode: + * if this is 1, the process is under standard seccomp rules + * is 2, the process is only allowed to make system calls where + * associated filters evaluate successfully. + * @usage: number of references to the current instance. + * @flags: a bitmask of behavior altering flags. + * @filters: Hash table of filters if using CONFIG_SECCOMP_FILTER. + */ +struct seccomp_state { + uint16_t mode; + atomic_t usage; + long flags; + struct seccomp_filter_table *filters; +}; extern void __secure_computing(int); static inline void secure_computing(int this_syscall) @@ -16,27 +38,113 @@ static inline void secure_computing(int this_syscall) __secure_computing(this_syscall); } +extern struct seccomp_state *seccomp_state_new(void); +extern struct seccomp_state *seccomp_state_dup(const struct seccomp_state *); +extern struct seccomp_state *get_seccomp_state(struct seccomp_state *); +extern void put_seccomp_state(struct seccomp_state *); + +extern long prctl_set_seccomp(unsigned long, unsigned long); extern long prctl_get_seccomp(void); -extern long prctl_set_seccomp(unsigned long); + +extern long prctl_set_seccomp_filter(unsigned long, char __user *); +extern long prctl_get_seccomp_filter(unsigned long, char __user *, + unsigned long); +extern long prctl_clear_seccomp_filter(unsigned long); + +#define inherit_tsk_seccomp_state(_child, _orig) \ + _child->seccomp = get_seccomp_state(_orig->seccomp); +#define put_tsk_seccomp_state(_tsk) put_seccomp_state(_tsk->seccomp) #else /* CONFIG_SECCOMP */ #include <linux/errno.h> -typedef struct { } seccomp_t; +struct seccomp_state { }; #define secure_computing(x) do { } while (0) +#define inherit_tsk_seccomp_state(_child, _orig) do { } while (0) +#define put_tsk_seccomp_state(_tsk) do { } while (0) static inline long prctl_get_seccomp(void) { return -EINVAL; } -static inline long prctl_set_seccomp(unsigned long arg2) +static inline long prctl_set_seccomp(unsigned long 
a2, unsigned long a3) { return -EINVAL; } +static inline long prctl_set_seccomp_filter(unsigned long a2, char __user *a3) +{ + return -ENOSYS; +} + +static inline long prctl_clear_seccomp_filter(unsigned long a2) +{ + return -ENOSYS; +} + +static inline long prctl_get_seccomp_filter(unsigned long a2, char __user *a3, + unsigned long a4) +{ + return -ENOSYS; +} + +static inline struct seccomp_state *seccomp_state_new(void) +{ + return NULL; +} + +static inline struct seccomp_state *seccomp_state_dup( + const struct seccomp_state *state) +{ + return NULL; +} + +static inline struct seccomp_state *get_seccomp_state( + struct seccomp_state *state) +{ + return NULL; +} + +static inline void put_seccomp_state(struct seccomp_state *state) +{ +} + #endif /* CONFIG_SECCOMP */ +#ifdef CONFIG_SECCOMP_FILTER + +extern int seccomp_show_filters(struct seccomp_state *, struct seq_file *); +extern long seccomp_set_filter(int, char *); +extern long seccomp_clear_filter(int); +extern long seccomp_get_filter(int, char *, unsigned long); + +#else /* CONFIG_SECCOMP_FILTER */ + +static inline int seccomp_show_filters(struct seccomp_state *state, + struct seq_file *m) +{ + return -ENOSYS; +} + +static inline long seccomp_set_filter(int syscall_nr, char *filter) +{ + return -ENOSYS; +} + +static inline long seccomp_clear_filter(int syscall_nr) +{ + return -ENOSYS; +} + +static inline long seccomp_get_filter(int syscall_nr, + char *buf, unsigned long available) +{ + return -ENOSYS; +} + +#endif /* CONFIG_SECCOMP_FILTER */ + #endif /* _LINUX_SECCOMP_H */ diff --git a/include/trace/syscall.h b/include/trace/syscall.h index 242ae04..e061ad0 100644 --- a/include/trace/syscall.h +++ b/include/trace/syscall.h @@ -35,6 +35,8 @@ struct syscall_metadata { extern unsigned long arch_syscall_addr(int nr); extern int init_syscall_trace(struct ftrace_event_call *call); +extern struct syscall_metadata *syscall_nr_to_meta(int); + extern int reg_event_syscall_enter(struct ftrace_event_call *call); extern 
void unreg_event_syscall_enter(struct ftrace_event_call *call); extern int reg_event_syscall_exit(struct ftrace_event_call *call); @@ -49,6 +51,11 @@ enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags, struct trace_event *event); enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags, struct trace_event *event); +#else +static inline struct syscall_metadata *syscall_nr_to_meta(int nr) +{ + return NULL; +} #endif #ifdef CONFIG_PERF_EVENTS diff --git a/kernel/Makefile b/kernel/Makefile index 85cbfb3..84e7dfb 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -81,6 +81,9 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ obj-$(CONFIG_SECCOMP) += seccomp.o +ifeq ($(CONFIG_SECCOMP_FILTER),y) +obj-$(CONFIG_SECCOMP) += seccomp_filter.o +endif obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o obj-$(CONFIG_TREE_RCU) += rcutree.o obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o diff --git a/kernel/fork.c b/kernel/fork.c index e7548de..46987d4 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -34,6 +34,7 @@ #include <linux/cgroup.h> #include <linux/security.h> #include <linux/hugetlb.h> +#include <linux/seccomp.h> #include <linux/swap.h> #include <linux/syscalls.h> #include <linux/jiffies.h> @@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk) free_thread_info(tsk->stack); rt_mutex_debug_task_free(tsk); ftrace_graph_exit_task(tsk); + put_tsk_seccomp_state(tsk); free_task_struct(tsk); } EXPORT_SYMBOL(free_task); @@ -280,6 +282,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig) if (err) goto out; + inherit_tsk_seccomp_state(tsk, orig); setup_thread_stack(tsk, orig); clear_user_return_notifier(tsk); clear_tsk_need_resched(tsk); diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 57d4b13..502ba04 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -6,12 +6,15 @@ * This defines a simple but solid secure-computing mode. 
*/ +#include <linux/err.h> +#include <linux/prctl.h> #include <linux/seccomp.h> #include <linux/sched.h> +#include <linux/slab.h> #include <linux/compat.h> +#include <linux/unistd.h> -/* #define SECCOMP_DEBUG 1 */ -#define NR_SECCOMP_MODES 1 +#include "seccomp.h" /* * Secure computing mode 1 allows only read/write/exit/sigreturn. @@ -32,11 +35,13 @@ static int mode1_syscalls_32[] = { void __secure_computing(int this_syscall) { - int mode = current->seccomp.mode; + int mode = -1; + long ret = 0; int * syscall; - + if (current->seccomp) + mode = current->seccomp->mode; switch (mode) { - case 1: + case PR_SECCOMP_MODE_STRICT: syscall = mode1_syscalls; #ifdef CONFIG_COMPAT if (is_compat_task()) @@ -47,6 +52,20 @@ void __secure_computing(int this_syscall) return; } while (*++syscall); break; + case PR_SECCOMP_MODE_FILTER: + if (this_syscall >= NR_syscalls || this_syscall < 0) + break; + ret = seccomp_test_filters(current->seccomp, this_syscall); + if (!ret) + return; + /* Only check for an override if an access failure occurred. */ + if (ret != -EACCES) + break; + ret = seccomp_maybe_apply_filters(current, this_syscall); + if (!ret) + return; + seccomp_filter_log_failure(this_syscall); + break; default: BUG(); } @@ -57,30 +76,213 @@ void __secure_computing(int this_syscall) do_exit(SIGKILL); } +/* seccomp_state_new - allocate a new state object. */ +struct seccomp_state *seccomp_state_new() +{ + struct seccomp_state *new = kzalloc(sizeof(struct seccomp_state), + GFP_KERNEL); + if (!new) + return NULL; + + new->flags = 0; +#ifdef CONFIG_COMPAT + /* Annotate if this filterset is being created by a compat task. */ + if (is_compat_task()) + new->flags |= SECCOMP_FLAG_COMPAT; +#endif + + atomic_set(&new->usage, 1); + new->filters = seccomp_filter_table_new(); + /* Not supported errors are fine, others are a problem. 
*/ + if (IS_ERR(new->filters) && PTR_ERR(new->filters) != -ENOSYS) { + kfree(new); + new = NULL; + } + return new; +} + +/* seccomp_state_dup - copies an existing state object. */ +struct seccomp_state *seccomp_state_dup(const struct seccomp_state *orig) +{ + int err; + struct seccomp_state *new_state = seccomp_state_new(); + + err = -ENOMEM; + if (!new_state) + goto fail; + new_state->mode = orig->mode; + /* Flag copying will hide if the new process is a compat task. However, + * if the rule was compat/non-compat and the process is the opposite, + * enforcement will terminate it. + */ + new_state->flags = orig->flags; + err = seccomp_copy_all_filters(new_state->filters, + orig->filters); + if (err) + goto fail; + + return new_state; +fail: + put_seccomp_state(new_state); + return NULL; +} + +/* get_seccomp_state - increments the reference count of @orig */ +struct seccomp_state *get_seccomp_state(struct seccomp_state *orig) +{ + if (!orig) + return NULL; + atomic_inc(&orig->usage); + return orig; +} + +static void __put_seccomp_state(struct seccomp_state *orig) +{ + WARN_ON(atomic_read(&orig->usage)); + seccomp_drop_all_filters(orig); + kfree(orig); +} + +/* put_seccomp_state - decrements the reference count of @orig and may free. 
*/ +void put_seccomp_state(struct seccomp_state *orig) +{ + if (!orig) + return; + + if (atomic_dec_and_test(&orig->usage)) + __put_seccomp_state(orig); +} + long prctl_get_seccomp(void) { - return current->seccomp.mode; + if (!current->seccomp) + return 0; + return current->seccomp->mode; } -long prctl_set_seccomp(unsigned long seccomp_mode) +long prctl_set_seccomp(unsigned long seccomp_mode, unsigned long flags) { long ret; + struct seccomp_state *state, *cur_state; + cur_state = get_seccomp_state(current->seccomp); /* can set it only once to be even more secure */ ret = -EPERM; - if (unlikely(current->seccomp.mode)) + if (cur_state && unlikely(cur_state->mode)) goto out; ret = -EINVAL; - if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) { - current->seccomp.mode = seccomp_mode; - set_thread_flag(TIF_SECCOMP); + if (seccomp_mode <= 0 || seccomp_mode > NR_SECCOMP_MODES) + goto out; + + ret = -ENOMEM; + state = (cur_state ? seccomp_state_dup(cur_state) : + seccomp_state_new()); + if (!state) + goto out; + + if (seccomp_mode == PR_SECCOMP_MODE_STRICT) { #ifdef TIF_NOTSC disable_TSC(); #endif - ret = 0; } - out: + rcu_assign_pointer(current->seccomp, state); + synchronize_rcu(); + put_seccomp_state(cur_state); /* For the task */ + + /* Convert the ABI flag to the internal flag value. */ + if (seccomp_mode == PR_SECCOMP_MODE_FILTER && + (flags & PR_SECCOMP_FLAG_FILTER_ON_EXEC)) + state->flags |= SECCOMP_FLAG_ON_EXEC; + /* Encourage flag values to stay synchronized explicitly. */ + BUILD_BUG_ON(PR_SECCOMP_FLAG_FILTER_ON_EXEC != SECCOMP_FLAG_ON_EXEC); + + /* Only set the thread flag once after the new state is in place. 
*/ + state->mode = seccomp_mode; + set_thread_flag(TIF_SECCOMP); + ret = 0; + +out: + put_seccomp_state(cur_state); /* for the get */ + return ret; +} + +long prctl_set_seccomp_filter(unsigned long syscall_nr, + char __user *user_filter) +{ + int nr; + long ret; + char filter[SECCOMP_MAX_FILTER_LENGTH]; + + ret = -EINVAL; + if (syscall_nr >= NR_syscalls || syscall_nr < 0) + goto out; + + ret = -EFAULT; + if (!user_filter || + strncpy_from_user(filter, user_filter, + sizeof(filter) - 1) < 0) + goto out; + + nr = (int) syscall_nr; + ret = seccomp_set_filter(nr, filter); + +out: + return ret; +} + +long prctl_clear_seccomp_filter(unsigned long syscall_nr) +{ + int nr = -1; + long ret; + + ret = -EINVAL; + if (syscall_nr >= NR_syscalls || syscall_nr < 0) + goto out; + + nr = (int) syscall_nr; + ret = seccomp_clear_filter(nr); + +out: + return ret; +} + +long prctl_get_seccomp_filter(unsigned long syscall_nr, char __user *dst, + unsigned long available) +{ + int ret, nr; + unsigned long copied; + char *buf = NULL; + ret = -EINVAL; + if (!available) + goto out; + /* Ignore extra buffer space. */ + if (available > SECCOMP_MAX_FILTER_LENGTH) + available = SECCOMP_MAX_FILTER_LENGTH; + + ret = -EINVAL; + if (syscall_nr >= NR_syscalls || syscall_nr < 0) + goto out; + nr = (int) syscall_nr; + + ret = -ENOMEM; + buf = kmalloc(available, GFP_KERNEL); + if (!buf) + goto out; + + ret = seccomp_get_filter(nr, buf, available); + if (ret < 0) + goto out; + + /* Include the NUL byte in the copy. */ + copied = copy_to_user(dst, buf, ret + 1); + ret = -ENOSPC; + if (copied) + goto out; + + ret = 0; +out: + kfree(buf); return ret; } diff --git a/kernel/seccomp.h b/kernel/seccomp.h new file mode 100644 index 0000000..5abd219 --- /dev/null +++ b/kernel/seccomp.h @@ -0,0 +1,74 @@ +/* + * seccomp/seccomp_filter shared internal prototypes and state. + * + * Copyright (C) 2011 Chromium OS Authors. 
+ */ + +#ifndef __KERNEL_SECCOMP_H +#define __KERNEL_SECCOMP_H + +#include <linux/ftrace_event.h> +#include <linux/seccomp.h> + +/* #define SECCOMP_DEBUG 1 */ +#define NR_SECCOMP_MODES 2 + +/* Inherit the max filter length from the filtering engine. */ +#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL + +/* Presently, flags only affect SECCOMP_FILTER. */ +#define _SECCOMP_FLAG_COMPAT 0 +#define _SECCOMP_FLAG_ON_EXEC 1 + +#define SECCOMP_FLAG_COMPAT (1 << (_SECCOMP_FLAG_COMPAT)) +#define SECCOMP_FLAG_ON_EXEC (1 << (_SECCOMP_FLAG_ON_EXEC)) + + +#ifdef CONFIG_SECCOMP_FILTER + +#define SECCOMP_FILTER_HASH_BITS 4 +#define SECCOMP_FILTER_HASH_SIZE (1 << SECCOMP_FILTER_HASH_BITS) + +struct seccomp_filter_table; +extern struct seccomp_filter_table *seccomp_filter_table_new(void); +extern int seccomp_copy_all_filters(struct seccomp_filter_table *, + const struct seccomp_filter_table *); +extern void seccomp_drop_all_filters(struct seccomp_state *); + +extern int seccomp_test_filters(struct seccomp_state *, int); +extern int seccomp_maybe_apply_filters(struct task_struct *, int); +extern void seccomp_filter_log_failure(int); + +#else /* CONFIG_SECCOMP_FILTER */ + +static inline void seccomp_filter_log_failure(int syscall) +{ +} + +static inline int seccomp_maybe_apply_filters(struct task_struct *tsk, + int syscall_nr) +{ + return -ENOSYS; +} + +static inline struct seccomp_filter_table *seccomp_filter_table_new(void) +{ + return ERR_PTR(-ENOSYS); +} + +static inline int seccomp_test_filters(struct seccomp_state *state, int nr) +{ + return -ENOSYS; +} + +extern inline int seccomp_copy_all_filters(struct seccomp_filter_table *dst, + const struct seccomp_filter_table *src) +{ + return 0; +} + +static inline void seccomp_drop_all_filters(struct seccomp_state *state) { } + +#endif /* CONFIG_SECCOMP_FILTER */ + +#endif /* __KERNEL_SECCOMP_H */ diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c new file mode 100644 index 0000000..ff4e055 --- /dev/null +++ 
b/kernel/seccomp_filter.c @@ -0,0 +1,581 @@ +/* filter engine-based seccomp system call filtering. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev@chromium.org> + */ + +#include <linux/compat.h> +#include <linux/errno.h> +#include <linux/hash.h> +#include <linux/prctl.h> +#include <linux/seccomp.h> +#include <linux/seq_file.h> +#include <linux/sched.h> +#include <linux/slab.h> +#include <linux/uaccess.h> + +#include <asm/syscall.h> +#include <trace/syscall.h> + +#include "seccomp.h" + +#define SECCOMP_MAX_FILTER_COUNT 512 +#define SECCOMP_WILDCARD_FILTER "1" + +struct seccomp_filter { + struct hlist_node node; + int syscall_nr; + struct syscall_metadata *data; + struct event_filter *event_filter; +}; + +struct seccomp_filter_table { + struct hlist_head heads[SECCOMP_FILTER_HASH_SIZE]; + int count; +}; + +struct seccomp_filter_table *seccomp_filter_table_new(void) +{ + struct seccomp_filter_table *t = + kzalloc(sizeof(struct seccomp_filter_table), GFP_KERNEL); + if (!t) + return ERR_PTR(-ENOMEM); + return t; +} + +static inline u32 syscall_hash(int syscall_nr) +{ + return hash_32(syscall_nr, SECCOMP_FILTER_HASH_BITS); +} + +static const char *get_filter_string(struct seccomp_filter *f) +{ + const char *str = 
SECCOMP_WILDCARD_FILTER; + if (!f) + return NULL; + + /* Missing event filters qualify as wildcard matches. */ + if (!f->event_filter) + return str; + +#ifdef CONFIG_FTRACE_SYSCALLS + str = ftrace_get_filter_string(f->event_filter); +#endif + return str; +} + +static struct seccomp_filter *alloc_seccomp_filter(int syscall_nr, + const char *filter_string) +{ + int err = -ENOMEM; + struct seccomp_filter *filter = kzalloc(sizeof(struct seccomp_filter), + GFP_KERNEL); + if (!filter) + goto fail; + + INIT_HLIST_NODE(&filter->node); + filter->syscall_nr = syscall_nr; + filter->data = syscall_nr_to_meta(syscall_nr); + + /* Treat a filter of SECCOMP_WILDCARD_FILTER as a wildcard and skip + * using a predicate at all. + */ + if (!strcmp(SECCOMP_WILDCARD_FILTER, filter_string)) + goto out; + + /* Argument-based filtering only works on ftrace-hooked syscalls. */ + if (!filter->data) { + err = -ENOSYS; + goto fail; + } + +#ifdef CONFIG_FTRACE_SYSCALLS + err = ftrace_parse_filter(&filter->event_filter, + filter->data->enter_event->event.type, + filter_string); + if (err) + goto fail; +#endif + +out: + return filter; + +fail: + kfree(filter); + return ERR_PTR(err); +} + +static void free_seccomp_filter(struct seccomp_filter *filter) +{ +#ifdef CONFIG_FTRACE_SYSCALLS + ftrace_free_filter(filter->event_filter); +#endif + kfree(filter); +} + +static struct seccomp_filter *copy_seccomp_filter(struct seccomp_filter *orig) +{ + return alloc_seccomp_filter(orig->syscall_nr, get_filter_string(orig)); +} + +/* Returns the matching filter or NULL */ +static struct seccomp_filter *find_filter(struct seccomp_state *state, + int syscall) +{ + struct hlist_node *this, *pos; + struct seccomp_filter *filter = NULL; + + u32 head = syscall_hash(syscall); + if (head >= SECCOMP_FILTER_HASH_SIZE) + goto out; + + hlist_for_each_safe(this, pos, &state->filters->heads[head]) { + filter = hlist_entry(this, struct seccomp_filter, node); + if (filter->syscall_nr == syscall) + goto out; + } + + filter = 
NULL; + +out: + return filter; +} + +/* Safely drops all filters for a given syscall. This should only be called + * on unattached seccomp_state objects. + */ +static void drop_filter(struct seccomp_state *state, int syscall_nr) +{ + struct seccomp_filter *filter = find_filter(state, syscall_nr); + if (!filter) + return; + + WARN_ON(state->filters->count == 0); + state->filters->count--; + hlist_del(&filter->node); + free_seccomp_filter(filter); +} + +/* This should only be called on unattached seccomp_state objects. */ +static int add_filter(struct seccomp_state *state, int syscall_nr, + char *filter_string) +{ + struct seccomp_filter *filter; + struct hlist_head *head; + char merged[SECCOMP_MAX_FILTER_LENGTH]; + int ret; + u32 hash = syscall_hash(syscall_nr); + + ret = -EINVAL; + if (state->filters->count == SECCOMP_MAX_FILTER_COUNT - 1) + goto out; + + filter_string = strstrip(filter_string); + + /* Disallow empty strings. */ + if (filter_string[0] == 0) + goto out; + + /* Get the right list head. */ + head = &state->filters->heads[hash]; + + /* Find out if there is an existing entry to append to and + * build the resultant filter string. The original filter can be + * destroyed here since the caller should be operating on a copy. + */ + filter = find_filter(state, syscall_nr); + if (filter) { + int expected = snprintf(merged, sizeof(merged), "(%s) && %s", + get_filter_string(filter), + filter_string); + ret = -E2BIG; + if (expected >= sizeof(merged) || expected < 0) + goto out; + filter_string = merged; + hlist_del(&filter->node); + free_seccomp_filter(filter); + } + + /* When in seccomp filtering mode, only allow additions. 
*/ + ret = -EACCES; + if (filter == NULL && state->mode == PR_SECCOMP_MODE_FILTER) + goto out; + + ret = 0; + filter = alloc_seccomp_filter(syscall_nr, filter_string); + if (IS_ERR(filter)) { + ret = PTR_ERR(filter); + goto out; + } + + state->filters->count++; + hlist_add_head(&filter->node, head); +out: + return ret; +} + +/* Wrap optional ftrace syscall support. Returns 1 on match or if ftrace is not + * supported. + */ +static int do_ftrace_syscall_match(struct event_filter *event_filter) +{ + int err = 1; +#ifdef CONFIG_FTRACE_SYSCALLS + uint8_t syscall_state[64]; + + memset(syscall_state, 0, sizeof(syscall_state)); + + /* The generic tracing entry can remain zeroed. */ + err = ftrace_syscall_enter_state(syscall_state, sizeof(syscall_state), + NULL); + if (err) + return 0; + + err = filter_match_preds(event_filter, syscall_state); +#endif + return err; +} + +/* 1 on match, 0 otherwise. */ +static int filter_match_current(struct seccomp_filter *filter) +{ + /* If no event filter exists, we assume a wildcard match. */ + if (!filter->event_filter) + return 1; + + return do_ftrace_syscall_match(filter->event_filter); +} + +#ifndef KSTK_EIP +#define KSTK_EIP(x) 0L +#endif + +static const char *syscall_nr_to_name(int syscall) +{ + const char *syscall_name = "unknown"; + struct syscall_metadata *data = syscall_nr_to_meta(syscall); + if (data) + syscall_name = data->name; + return syscall_name; +} + +void seccomp_filter_log_failure(int syscall) +{ + printk(KERN_INFO + "%s[%d]: system call %d (%s) blocked at ip:%lx\n", + current->comm, task_pid_nr(current), syscall, + syscall_nr_to_name(syscall), KSTK_EIP(current)); +} + +/** + * seccomp_drop_all_filters - cleans up the filter list and frees the table + * @state: the seccomp_state to destroy the filters in. 
+ */ +void seccomp_drop_all_filters(struct seccomp_state *state) +{ + struct hlist_node *this, *pos; + int head; + if (!state->filters) + return; + for (head = 0; head < SECCOMP_FILTER_HASH_SIZE; ++head) { + hlist_for_each_safe(this, pos, &state->filters->heads[head]) { + struct seccomp_filter *f = hlist_entry(this, + struct seccomp_filter, node); + WARN_ON(state->filters->count == 0); + hlist_del(this); + free_seccomp_filter(f); + state->filters->count--; + } + } + kfree(state->filters); +} + +/** + * seccomp_copy_all_filters - copies all filters from src to dst. + * + * @dst: seccomp_filter_table to populate. + * @src: table to read from. + * Returns non-zero on failure. + * Both the source and the destination should have no simultaneous + * writers, and dst should be exclusive to the caller. + */ +int seccomp_copy_all_filters(struct seccomp_filter_table *dst, + const struct seccomp_filter_table *src) +{ + struct seccomp_filter *filter; + int head, ret = 0; + BUG_ON(!dst || !src); + for (head = 0; head < SECCOMP_FILTER_HASH_SIZE; ++head) { + struct hlist_node *pos; + hlist_for_each_entry(filter, pos, &src->heads[head], node) { + struct seccomp_filter *new_filter = + copy_seccomp_filter(filter); + if (IS_ERR(new_filter)) { + ret = PTR_ERR(new_filter); + goto done; + } + hlist_add_head(&new_filter->node, + &dst->heads[head]); + dst->count++; + } + } + +done: + return ret; +} + +/** + * seccomp_show_filters - prints the filter state to a seq_file + * @state: the seccomp_state to enumerate the filter and bitmask of + * @m: the prepared seq_file to receive the data + * + * Returns 0 on a successful write. 
+ */ +int seccomp_show_filters(struct seccomp_state *state, struct seq_file *m) +{ + int head; + struct hlist_node *pos; + struct seccomp_filter *filter; + int filtering = 0; + if (!state) + return 0; + if (!state->filters) + return 0; + + filtering = (state->mode == 2); + filtering &= !(state->flags & SECCOMP_FLAG_ON_EXEC); + seq_printf(m, "Filtering: %d\n", filtering); + seq_printf(m, "FilterCount: %d\n", state->filters->count); + for (head = 0; head < SECCOMP_FILTER_HASH_SIZE; ++head) { + hlist_for_each_entry(filter, pos, &state->filters->heads[head], + node) { + seq_printf(m, "SystemCall: %d (%s)\n", + filter->syscall_nr, + syscall_nr_to_name(filter->syscall_nr)); + seq_printf(m, "Filter: %s\n", + get_filter_string(filter)); + } + } + return 0; +} +EXPORT_SYMBOL_GPL(seccomp_show_filters); + +/** + * seccomp_maybe_apply_filters - conditionally applies seccomp filters + * @tsk: task to update + * @syscall_nr: current system call in progress + * tsk must already be in seccomp filter mode. + * + * Returns 0 if the call should be allowed or state has been updated. + * This call is only reach if no filters matched the current system call. + * In some cases, such as when the ON_EXEC flag is set, failure should + * not be terminal. + */ +int seccomp_maybe_apply_filters(struct task_struct *tsk, int syscall_nr) +{ + struct seccomp_state *state, *new_state = NULL; + int ret = -EACCES; + + /* There's no question of application if ON_EXEC is not set. 
*/ + state = get_seccomp_state(tsk->seccomp); + if ((state->flags & SECCOMP_FLAG_ON_EXEC) == 0) + goto out; + + ret = 0; + if (syscall_nr != __NR_execve) + goto out; + + new_state = seccomp_state_dup(state); + ret = -ENOMEM; + if (!new_state) + goto out; + + ret = 0; + new_state->flags &= ~(SECCOMP_FLAG_ON_EXEC); + rcu_assign_pointer(tsk->seccomp, new_state); + synchronize_rcu(); + put_seccomp_state(state); /* for the task */ + +out: + put_seccomp_state(state); /* for the get */ + return ret; +} + +/** + * seccomp_test_filters - tests 'current' against the given syscall + * @state: seccomp_state of current to use. + * @syscall: number of the system call to test + * + * Returns 0 on ok and non-zero on error/failure. + */ +int seccomp_test_filters(struct seccomp_state *state, int syscall) +{ + struct seccomp_filter *filter = NULL; + int ret; + +#ifdef CONFIG_COMPAT + ret = -EPERM; + if (is_compat_task() == !!(state->flags & SECCOMP_FLAG_COMPAT)) { + printk(KERN_INFO "%s[%d]: seccomp filter compat() mismatch.\n", + current->comm, task_pid_nr(current)); + goto out; + } +#endif + + ret = 0; + filter = find_filter(state, syscall); + if (filter && filter_match_current(filter)) + goto out; + + ret = -EACCES; +out: + return ret; +} + +/** + * seccomp_get_filter - copies the filter_string into "buf" + * @syscall_nr: system call number to look up + * @buf: destination buffer + * @bufsize: available space in the buffer. + * + * Looks up the filter for the given system call number on current. If found, + * the string length of the NUL-terminated buffer is returned and < 0 is + * returned on error. The NUL byte is not included in the length. 
+ */ +long seccomp_get_filter(int syscall_nr, char *buf, unsigned long bufsize) +{ + struct seccomp_state *state; + struct seccomp_filter *filter; + long ret = -ENOENT; + + if (bufsize > SECCOMP_MAX_FILTER_LENGTH) + bufsize = SECCOMP_MAX_FILTER_LENGTH; + + state = get_seccomp_state(current->seccomp); + if (!state) + goto out; + + filter = find_filter(state, syscall_nr); + if (!filter) + goto out; + + ret = strlcpy(buf, get_filter_string(filter), bufsize); + if (ret >= bufsize) { + ret = -ENOSPC; + goto out; + } + /* Zero out any remaining buffer, just in case. */ + memset(buf + ret, 0, bufsize - ret); +out: + put_seccomp_state(state); + return ret; +} +EXPORT_SYMBOL_GPL(seccomp_get_filter); + +/** + * seccomp_clear_filter: clears the seccomp filter for a syscall. + * @syscall_nr: the system call number to clear filters for. + * + * (acts as a frontend for seccomp_set_filter. All restrictions + * apply) + * + * Returns 0 on success. + */ +long seccomp_clear_filter(int syscall_nr) +{ + return seccomp_set_filter(syscall_nr, NULL); +} +EXPORT_SYMBOL_GPL(seccomp_clear_filter); + +/** + * seccomp_set_filter: - Adds/extends a seccomp filter for a syscall. + * @syscall_nr: system call number to apply the filter to. + * @filter: ftrace filter string to apply. + * + * Context: User context only. This function may sleep on allocation and + * operates on current. current must be attempting a system call + * when this is called. + * + * New filters may be added for system calls when the current task is + * not in a secure computing mode (seccomp). Otherwise, filters may only + * be added to already filtered system call entries. Any additions will + * be &&'d with the existing filter string to ensure no expansion of privileges + * will be possible. + * + * Returns 0 on success or an errno on failure. 
+ */
+long seccomp_set_filter(int syscall_nr, char *filter)
+{
+	struct seccomp_state *state, *orig_state;
+	long ret = -EINVAL;
+
+	orig_state = get_seccomp_state(current->seccomp);
+
+	/* Prior to mutating the state, create a duplicate to avoid modifying
+	 * the behavior of other instances sharing the state and ensure
+	 * consistency.
+	 */
+	state = (orig_state ? seccomp_state_dup(orig_state) :
+			      seccomp_state_new());
+	if (!state) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/* A NULL filter doubles as a drop value, but the exposed prctl
+	 * interface requires a trip through seccomp_clear_filter().
+	 * Filter dropping is allowed across the is_compat_task() barrier.
+	 */
+	ret = 0;
+	if (filter == NULL) {
+		drop_filter(state, syscall_nr);
+		goto assign;
+	}
+
+	/* Avoid ambiguous filters which may have been inherited from a parent
+	 * with different syscall numbers for the logically same calls.
+	 */
+#ifdef CONFIG_COMPAT
+	ret = -EACCES;
+	if (is_compat_task() != !!(state->flags & SECCOMP_FLAG_COMPAT)) {
+		if (state->filters->count)
+			goto free_state;
+		/* It's safe to add if there are no existing ambiguous rules. */
+		if (is_compat_task())
+			state->flags |= SECCOMP_FLAG_COMPAT;
+		else
+			state->flags &= ~(SECCOMP_FLAG_COMPAT);
+	}
+#endif
+
+	ret = add_filter(state, syscall_nr, filter);
+	if (ret)
+		goto free_state;
+
+assign:
+	rcu_assign_pointer(current->seccomp, state);
+	synchronize_rcu();
+	put_seccomp_state(orig_state);	/* for the task */
+out:
+	put_seccomp_state(orig_state);	/* for the get */
+	return ret;
+
+free_state:
+	put_seccomp_state(orig_state);	/* for the get */
+	put_seccomp_state(state);	/* drop the dup/new */
+	return ret;
+}
+EXPORT_SYMBOL_GPL(seccomp_set_filter);
diff --git a/kernel/sys.c b/kernel/sys.c
index af468ed..d29003a 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1698,12 +1698,23 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_SET_ENDIAN:
 		error = SET_ENDIAN(me, arg2);
 		break;
-
 	case PR_GET_SECCOMP:
 		error = prctl_get_seccomp();
 		break;
 	case PR_SET_SECCOMP:
-		error = prctl_set_seccomp(arg2);
+		error = prctl_set_seccomp(arg2, arg3);
+		break;
+	case PR_SET_SECCOMP_FILTER:
+		error = prctl_set_seccomp_filter(arg2,
+						 (char __user *) arg3);
+		break;
+	case PR_CLEAR_SECCOMP_FILTER:
+		error = prctl_clear_seccomp_filter(arg2);
+		break;
+	case PR_GET_SECCOMP_FILTER:
+		error = prctl_get_seccomp_filter(arg2,
+						 (char __user *) arg3,
+						 arg4);
 		break;
 	case PR_GET_TSC:
 		error = GET_TSC_CTL(arg2);
This change adds a new seccomp mode based on the work by agl@chromium.org
in [1].  This new mode, "filter mode", provides a hash table of
seccomp_filter objects.  When in the new mode (2), all system calls are
checked against the filters - first by system call number, then by a
filter string.  If an entry exists for a given system call and all filter
predicates evaluate to true, then the task may proceed.  Otherwise, the
task is killed (as per seccomp_mode == 1).

Filter string parsing and evaluation are handled by the ftrace filter
engine.  Related patches tweak the ftrace filter parse and free calls to
allow them to be shared.  Filters inherit their understanding of types
and arguments for each system call from the CONFIG_FTRACE_SYSCALLS
subsystem, which already predefines this information in the
syscall_metadata associated enter_event (and exit_event) structures.  If
CONFIG_FTRACE and CONFIG_FTRACE_SYSCALLS are not compiled in, only "1"
filter strings will be allowed.

The net result is that a process may have its system calls filtered
using the ftrace filter engine's inherent understanding of system calls.
A logical ruleset for a process that only needs stdin/stdout may be:

  sys_read: fd == 0
  sys_write: fd == 1 || fd == 2
  sys_exit: 1

The set of filters is specified through the PR_SET_SECCOMP path in
prctl().  For example:

  prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 0");
  prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd == 1 || fd == 2");
  prctl(PR_SET_SECCOMP_FILTER, __NR_exit, "1");
  prctl(PR_SET_SECCOMP, 2, 0);

v2: - changed to use the existing syscall number ABI.
    - prctl changes to minimize parsing in the kernel:
        prctl(PR_SET_SECCOMP, {0 | 1 | 2 }, { 0 | ON_EXEC });
        prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 5");
        prctl(PR_CLEAR_SECCOMP_FILTER, __NR_read);
        prctl(PR_GET_SECCOMP_FILTER, __NR_read, buf, bufsize);
    - defined PR_SECCOMP_MODE_STRICT and ..._FILTER
    - added flags
    - provide a default fail syscall_nr_to_meta in ftrace
      - provides fallback for unhooked system calls
    - use -ENOSYS and ERR_PTR(-ENOSYS) for stubbed functionality
    - added kernel/seccomp.h to share seccomp.c/seccomp_filter.c
    - moved to a hlist and 4 bit hash of linked lists
    - added support to operate without CONFIG_FTRACE_SYSCALLS
    - moved Kconfig support next to SECCOMP (should this be done in
      per-platform patches?)
    - made Kconfig entries dependent on EXPERIMENTAL
    - added macros to avoid ifdefs from kernel/fork.c
    - added compat task/filter matching
    - drop seccomp.h inclusion in sched.h and drop seccomp_t
    - added Filtering to "show" output
    - added on_exec state dup'ing when enabling after a fast-path accept.

Signed-off-by: Will Drewry <wad@chromium.org>
---
 arch/arm/Kconfig        |   10 +
 arch/microblaze/Kconfig |   10 +
 arch/mips/Kconfig       |   10 +
 arch/powerpc/Kconfig    |   10 +
 arch/s390/Kconfig       |   10 +
 arch/sh/Kconfig         |   10 +
 arch/sparc/Kconfig      |   10 +
 arch/x86/Kconfig        |   10 +
 include/linux/prctl.h   |    9 +
 include/linux/sched.h   |    5 +-
 include/linux/seccomp.h |  116 +++++++++-
 include/trace/syscall.h |    7 +
 kernel/Makefile         |    3 +
 kernel/fork.c           |    3 +
 kernel/seccomp.c        |  228 +++++++++++++++++-
 kernel/seccomp.h        |   74 ++++++
 kernel/seccomp_filter.c |  581 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c            |   15 +-
 18 files changed, 1100 insertions(+), 21 deletions(-)
 create mode 100644 kernel/seccomp.h
 create mode 100644 kernel/seccomp_filter.c