Message ID | 87eebkf8ph.fsf@oldenburg.str.redhat.com |
---|---|
State | New |
Headers | show |
Series | RFC: Disable clone3 for glibc 2.34 | expand |
* Florian Weimer via Libc-alpha: > Reportedly, the docker package in Ubuntu as used by Github Actions and > others does not provide a way to enable the clone3 system call. It > always fails with EPERM. > > Should we apply a patch like this for the release? > > diff --git a/sysdeps/unix/sysv/linux/clone-internal.c b/sysdeps/unix/sysv/linux/clone-internal.c > index 1e7a8f6b35..4046c81180 100644 > --- a/sysdeps/unix/sysv/linux/clone-internal.c > +++ b/sysdeps/unix/sysv/linux/clone-internal.c > @@ -48,17 +48,6 @@ __clone_internal (struct clone_args *cl_args, > int (*func) (void *arg), void *arg) > { > int ret; > -#ifdef HAVE_CLONE3_WAPPER > - /* Try clone3 first. */ > - int saved_errno = errno; > - ret = __clone3 (cl_args, sizeof (*cl_args), func, arg); > - if (ret != -1 || errno != ENOSYS) > - return ret; > - > - /* NB: Restore errno since errno may be checked against non-zero > - return value. */ > - __set_errno (saved_errno); > -#endif > > /* Map clone3 arguments to clone arguments. NB: No need to check > invalid clone3 specific bits in flags nor exit_signal since this > > My concern with this is that we don't know yet where the CET kernel API > will land exactly and if CET will require clone3. So clone3 might have > to come back once we turn on CET, which is hopefully soon. Ubuntu 20.04 LTS may have already been fixed, I cannot reproduce the issue with its docker.io/containerd/runc packages. I could trivially fix a previously failing Github Action with: diff --git a/.github/workflows/fedora.yml b/.github/workflows/fedora.yml index d2381ec..7b10286 100644 --- a/.github/workflows/fedora.yml +++ b/.github/workflows/fedora.yml @@ -22,6 +22,7 @@ jobs: runs-on: ubuntu-latest container: image: fedora:${{matrix.release}} + options: --security-opt seccomp=unconfined steps: - name: Checkout repository So I think we need to figure out what people are actually complaining about. Thanks, Florian
On Tue, Jul 27, 2021 at 11:11:17AM +0200, Florian Weimer via Libc-alpha wrote: > * Florian Weimer via Libc-alpha: > > > Reportedly, the docker package in Ubuntu as used by Github Actions and > > others does not provide a way to enable the clone3 system call. It > > always fails with EPERM. > > > > Should we apply a patch like this for the release? > > > > diff --git a/sysdeps/unix/sysv/linux/clone-internal.c b/sysdeps/unix/sysv/linux/clone-internal.c > > index 1e7a8f6b35..4046c81180 100644 > > --- a/sysdeps/unix/sysv/linux/clone-internal.c > > +++ b/sysdeps/unix/sysv/linux/clone-internal.c > > @@ -48,17 +48,6 @@ __clone_internal (struct clone_args *cl_args, > > int (*func) (void *arg), void *arg) > > { > > int ret; > > -#ifdef HAVE_CLONE3_WAPPER > > - /* Try clone3 first. */ > > - int saved_errno = errno; > > - ret = __clone3 (cl_args, sizeof (*cl_args), func, arg); > > - if (ret != -1 || errno != ENOSYS) > > - return ret; > > - > > - /* NB: Restore errno since errno may be checked against non-zero > > - return value. */ > > - __set_errno (saved_errno); > > -#endif > > > > /* Map clone3 arguments to clone arguments. NB: No need to check > > invalid clone3 specific bits in flags nor exit_signal since this > > > > My concern with this is that we don't know yet where the CET kernel API > > will land exactly and if CET will require clone3. So clone3 might have > > to come back once we turn on CET, which is hopefully soon. > > Ubuntu 20.04 LTS may have already been fixed, I cannot reproduce the > issue with its docker.io/containerd/runc packages. > > I could trivially fix a previously failing Github Action with: > > diff --git a/.github/workflows/fedora.yml b/.github/workflows/fedora.yml > index d2381ec..7b10286 100644 > --- a/.github/workflows/fedora.yml > +++ b/.github/workflows/fedora.yml > @@ -22,6 +22,7 @@ jobs: > runs-on: ubuntu-latest > container: > image: fedora:${{matrix.release}} > + options: --security-opt seccomp=unconfined > > steps: > - name: Checkout repository > > So I think we need to figure out what people are actually complaining > about. This relates to the discussion what errno value should be used in a seccomp filter to indicate that a syscall is blocked. So there are two problems I see with seccomp and clone3(): 1. the profile doesn't include clone3() at all and therefore the syscall is blocked and the default action is EPERM 2. the profile does include clone3() and decided to block it but the runtime has decided to make seccomp return EPERM and not ENOSYS when clone3() is attempted The correct fix in both scenarios is to add clone3() to the seccomp profile and either allow it or return ENOSYS. Note that this ENOSYS/EPERM problem is a general problem. Not just glibc doesn't know when to fallback gracefully other tools don't know either. Application container usually just get lucky because their applications don't need to issue the syscalls that are blocked. On a generic system container with systemd inside this is always an issue and not using ENOSYS is guaranteed to fail across the board. Christian
On Tue, Jul 27, 2021 at 11:24:16AM +0200, Christian Brauner wrote: > On Tue, Jul 27, 2021 at 11:11:17AM +0200, Florian Weimer via Libc-alpha wrote: > > * Florian Weimer via Libc-alpha: > > > > > Reportedly, the docker package in Ubuntu as used by Github Actions and > > > others does not provide a way to enable the clone3 system call. It > > > always fails with EPERM. > > > > > > Should we apply a patch like this for the release? > > > > > > diff --git a/sysdeps/unix/sysv/linux/clone-internal.c b/sysdeps/unix/sysv/linux/clone-internal.c > > > index 1e7a8f6b35..4046c81180 100644 > > > --- a/sysdeps/unix/sysv/linux/clone-internal.c > > > +++ b/sysdeps/unix/sysv/linux/clone-internal.c > > > @@ -48,17 +48,6 @@ __clone_internal (struct clone_args *cl_args, > > > int (*func) (void *arg), void *arg) > > > { > > > int ret; > > > -#ifdef HAVE_CLONE3_WAPPER > > > - /* Try clone3 first. */ > > > - int saved_errno = errno; > > > - ret = __clone3 (cl_args, sizeof (*cl_args), func, arg); > > > - if (ret != -1 || errno != ENOSYS) > > > - return ret; > > > - > > > - /* NB: Restore errno since errno may be checked against non-zero > > > - return value. */ > > > - __set_errno (saved_errno); > > > -#endif > > > > > > /* Map clone3 arguments to clone arguments. NB: No need to check > > > invalid clone3 specific bits in flags nor exit_signal since this > > > > > > My concern with this is that we don't know yet where the CET kernel API > > > will land exactly and if CET will require clone3. So clone3 might have > > > to come back once we turn on CET, which is hopefully soon. > > > > Ubuntu 20.04 LTS may have already been fixed, I cannot reproduce the > > issue with its docker.io/containerd/runc packages. > > > > I could trivially fix a previously failing Github Action with: > > > > diff --git a/.github/workflows/fedora.yml b/.github/workflows/fedora.yml > > index d2381ec..7b10286 100644 > > --- a/.github/workflows/fedora.yml > > +++ b/.github/workflows/fedora.yml > > @@ -22,6 +22,7 @@ jobs: > > runs-on: ubuntu-latest > > container: > > image: fedora:${{matrix.release}} > > + options: --security-opt seccomp=unconfined > > > > steps: > > - name: Checkout repository > > > > So I think we need to figure out what people are actually complaining > > about. > > This relates to the discussion what errno value should be used in a > seccomp filter to indicate that a syscall is blocked. > > So there are two problems I see with seccomp and clone3(): > 1. the profile doesn't include clone3() at all and therefore the syscall > is blocked and the default action is EPERM > 2. the profile does include clone3() and decided to block it but the > runtime has decided to make seccomp return EPERM and not ENOSYS when > clone3() is attempted > > The correct fix in both scenarios is to add clone3() to the seccomp > profile and either allow it or return ENOSYS. > > Note that this ENOSYS/EPERM problem is a general problem. Not just glibc > doesn't know when to fallback gracefully other tools don't know either. > Application container usually just get lucky because their applications > don't need to issue the syscalls that are blocked. On a generic system > container with systemd inside this is always an issue and not using > ENOSYS is guaranteed to fail across the board. Aleksa, this is fixed in runC, right? Christian
On 2021-07-27, Christian Brauner <christian.brauner@ubuntu.com> wrote: > On Tue, Jul 27, 2021 at 11:24:16AM +0200, Christian Brauner wrote: > > On Tue, Jul 27, 2021 at 11:11:17AM +0200, Florian Weimer via Libc-alpha wrote: > > > * Florian Weimer via Libc-alpha: > > > > > > > Reportedly, the docker package in Ubuntu as used by Github Actions and > > > > others does not provide a way to enable the clone3 system call. It > > > > always fails with EPERM. > > > > > > > > Should we apply a patch like this for the release? > > > > > > > > diff --git a/sysdeps/unix/sysv/linux/clone-internal.c b/sysdeps/unix/sysv/linux/clone-internal.c > > > > index 1e7a8f6b35..4046c81180 100644 > > > > --- a/sysdeps/unix/sysv/linux/clone-internal.c > > > > +++ b/sysdeps/unix/sysv/linux/clone-internal.c > > > > @@ -48,17 +48,6 @@ __clone_internal (struct clone_args *cl_args, > > > > int (*func) (void *arg), void *arg) > > > > { > > > > int ret; > > > > -#ifdef HAVE_CLONE3_WAPPER > > > > - /* Try clone3 first. */ > > > > - int saved_errno = errno; > > > > - ret = __clone3 (cl_args, sizeof (*cl_args), func, arg); > > > > - if (ret != -1 || errno != ENOSYS) > > > > - return ret; > > > > - > > > > - /* NB: Restore errno since errno may be checked against non-zero > > > > - return value. */ > > > > - __set_errno (saved_errno); > > > > -#endif > > > > > > > > /* Map clone3 arguments to clone arguments. NB: No need to check > > > > invalid clone3 specific bits in flags nor exit_signal since this > > > > > > > > My concern with this is that we don't know yet where the CET kernel API > > > > will land exactly and if CET will require clone3. So clone3 might have > > > > to come back once we turn on CET, which is hopefully soon. > > > > > > Ubuntu 20.04 LTS may have already been fixed, I cannot reproduce the > > > issue with its docker.io/containerd/runc packages. > > > > > > I could trivially fix a previously failing Github Action with: > > > > > > diff --git a/.github/workflows/fedora.yml b/.github/workflows/fedora.yml > > > index d2381ec..7b10286 100644 > > > --- a/.github/workflows/fedora.yml > > > +++ b/.github/workflows/fedora.yml > > > @@ -22,6 +22,7 @@ jobs: > > > runs-on: ubuntu-latest > > > container: > > > image: fedora:${{matrix.release}} > > > + options: --security-opt seccomp=unconfined > > > > > > steps: > > > - name: Checkout repository > > > > > > So I think we need to figure out what people are actually complaining > > > about. > > > > This relates to the discussion what errno value should be used in a > > seccomp filter to indicate that a syscall is blocked. > > > > So there are two problems I see with seccomp and clone3(): > > 1. the profile doesn't include clone3() at all and therefore the syscall > > is blocked and the default action is EPERM > > 2. the profile does include clone3() and decided to block it but the > > runtime has decided to make seccomp return EPERM and not ENOSYS when > > clone3() is attempted > > > > The correct fix in both scenarios is to add clone3() to the seccomp > > profile and either allow it or return ENOSYS. > > > > Note that this ENOSYS/EPERM problem is a general problem. Not just glibc > > doesn't know when to fallback gracefully other tools don't know either. > > Application container usually just get lucky because their applications > > don't need to issue the syscalls that are blocked. On a generic system > > container with systemd inside this is always an issue and not using > > ENOSYS is guaranteed to fail across the board. > > Aleksa, this is fixed in runC, right? Yes, runc has had the -ENOSYS fallback behaviour for a few releases now. The way it works is that any syscall which has a larger syscall number than any syscall specified in the filter will get -ENOSYS (this works even if libseccomp is outdated). The only way you could get the -EPERM behaviour with modern runc is if you write a seccomp profile that had rules for newer syscalls (openat2 for instance) but not clone3 -- but Docker doesn't do that. (The reason for this slightly convoluted behaviour was to make sure that intentional omissions actually give you -EPERM.) However this requires the container host to have an updated version of runc which is up to GitHub. (Though we fixed a security issue in runc recently, so I would expect that they've updated their versions of runc by now.)
The 07/27/2021 20:22, Aleksa Sarai wrote: > Yes, runc has had the -ENOSYS fallback behaviour for a few releases now. > > The way it works is that any syscall which has a larger syscall number > than any syscall specified in the filter will get -ENOSYS (this works > even if libseccomp is outdated). The only way you could get the -EPERM > behaviour with modern runc is if you write a seccomp profile that had > rules for newer syscalls (openat2 for instance) but not clone3 -- but > Docker doesn't do that. (The reason for this slightly convoluted > behaviour was to make sure that intentional omissions actually give you > -EPERM.) this sounds broken. it really should return ENOSYS unless a user specifically asked for a different errno value for a syscall. EPERM is just wrong. we will see random breakage in the future depending on what unrelated but newer syscalls users added to their whitelist. who thought this was a good idea? i can't believe this is still broken, this seccomp filter bug was reported ages ago.
Semi-related... > int ret; > -#ifdef HAVE_CLONE3_WAPPER > - /* Try clone3 first. */ > - int saved_errno = errno; shouldn't this be WRAPPER ? (It seems to be consistent in the code, so no harm done so far...)
* Andreas K. Huettel via Libc-alpha: > Semi-related... > >> int ret; >> -#ifdef HAVE_CLONE3_WAPPER >> - /* Try clone3 first. */ >> - int saved_errno = errno; > > shouldn't this be WRAPPER ? > > (It seems to be consistent in the code, so no harm done so far...) Yes, it should be. And yes, it's currently consistent. But it's confusing if you search for it and don't make the typo. Thanks, Florian
* Aleksa Sarai: > Yes, runc has had the -ENOSYS fallback behaviour for a few releases now. > > The way it works is that any syscall which has a larger syscall number > than any syscall specified in the filter will get -ENOSYS (this works > even if libseccomp is outdated). The only way you could get the -EPERM > behaviour with modern runc is if you write a seccomp profile that had > rules for newer syscalls (openat2 for instance) but not clone3 -- but > Docker doesn't do that. (The reason for this slightly convoluted > behaviour was to make sure that intentional omissions actually give you > -EPERM.) > > However this requires the container host to have an updated version of > runc which is up to GitHub. (Though we fixed a security issue in runc > recently, so I would expect that they've updated their versions of runc > by now.) Indeed I wasn't able to reproduce this locally. Ubuntu's docker.io package behaves as expected, even for “docker build” as far as I can see. So far, the reported breakage has been focused on Github Actions and Azure Devops. They use a custom Docker-Moby build, and I don't know what's in it. The net effect is that clone3 does not work in containers by default. “docker build” still does not allow “--security-opt seccomp=unconfined” for unknown reasons, but that workaround still applies to “docker create”. Daniel P. Berrangé reported that Moby mentions a system call in its policy whose number is larger than clone3, effectively turning ENOSYS into ENOPERM for clone3. Looking at the recent change, it could be the addition of close_range and epoll_pwait2 in this commit: commit 54eff4354b17a9c460b851300f28aed1408a8615 Author: Aleksa Sarai <asarai@suse.de> Date: Sun Jan 17 23:39:31 2021 +1100 profiles: seccomp: update to Linux 5.11 syscall list These syscalls (some of which have been in Linux for a while but were missing from the profile) fall into a few buckets: * close_range(2), epoll_pwait2(2) are just extensions of existing "safe for everyone" syscalls. * The mountv2 API syscalls (fs*(2), move_mount(2), open_tree(2)) are all equivalent to aspects of mount(2) and thus go into the CAP_SYS_ADMIN category. * process_madvise(2) is similar to the other process_*(2) syscalls and thus goes in the CAP_SYS_PTRACE category. Signed-off-by: Aleksa Sarai <asarai@suse.de> Maybe we don't see this everywhere because these higher system call numbers become available only if the system libseccomp version is recent enough to know about them. Once that is the case, the ENOSYS/EPERM line shifts and clone3 is on the wrong side of it. If that's indeed the explanation, then maybe we can simply fix moby and ask Microsoft to respin their images? Thanks, Florian
On Wed, Jul 28, 2021 at 07:44:03PM +0200, Florian Weimer wrote: > * Aleksa Sarai: > > > Yes, runc has had the -ENOSYS fallback behaviour for a few releases now. > > > > The way it works is that any syscall which has a larger syscall number > > than any syscall specified in the filter will get -ENOSYS (this works > > even if libseccomp is outdated). The only way you could get the -EPERM > > behaviour with modern runc is if you write a seccomp profile that had > > rules for newer syscalls (openat2 for instance) but not clone3 -- but > > Docker doesn't do that. (The reason for this slightly convoluted > > behaviour was to make sure that intentional omissions actually give you > > -EPERM.) > > > > However this requires the container host to have an updated version of > > runc which is up to GitHub. (Though we fixed a security issue in runc > > recently, so I would expect that they've updated their versions of runc > > by now.) > > Indeed I wasn't able to reproduce this locally. Ubuntu's docker.io > package behaves as expected, even for “docker build” as far as I can > see. > > So far, the reported breakage has been focused on Github Actions and > Azure Devops. They use a custom Docker-Moby build, and I don't know > what's in it. The net effect is that clone3 does not work in containers > by default. “docker build” still does not allow “--security-opt > seccomp=unconfined” for unknown reasons, but that workaround still > applies to “docker create”. FYI I found this issue describing the docker build + seccomp feature gap. Seems it was intentional to not allow it to be used: https://github.com/moby/moby/issues/34454 To workaround this gap in "docker build", you can use the --seccomp-profile option to dockerd daemon when starting it up, to pass a new profile that applies by default to everything. Doesn't help if you're just using a dockerd instance started/managed by someone/something else though. > Daniel P. Berrangé reported that Moby mentions a system call in its > policy whose number is larger than clone3, effectively turning ENOSYS > into ENOPERM for clone3. Looking at the recent change, it could be the > addition of close_range and epoll_pwait2 in this commit: I'm not 100% convinced my understanding is right, as there are quite a few moving parts involved. Wierdly I managed to get things working with existing docker 20.10.7 simply by using the seccomp profile from docker git master, which is counter to my understanding described above. The heuristics involved in runc for EPERM/ENOSYS are very hard to understand and rationalize behaviour for :-( To try to make it simpler I send a pull request to explicitly list clone3 with ENOSYS, so that its not subject to the wierd heuristics in runc https://github.com/moby/moby/pull/42681 > commit 54eff4354b17a9c460b851300f28aed1408a8615 > Author: Aleksa Sarai <asarai@suse.de> > Date: Sun Jan 17 23:39:31 2021 +1100 > > profiles: seccomp: update to Linux 5.11 syscall list > > These syscalls (some of which have been in Linux for a while but were > missing from the profile) fall into a few buckets: > > * close_range(2), epoll_pwait2(2) are just extensions of existing "safe > for everyone" syscalls. > > * The mountv2 API syscalls (fs*(2), move_mount(2), open_tree(2)) are > all equivalent to aspects of mount(2) and thus go into the > CAP_SYS_ADMIN category. > > * process_madvise(2) is similar to the other process_*(2) syscalls and > thus goes in the CAP_SYS_PTRACE category. > > Signed-off-by: Aleksa Sarai <asarai@suse.de> > > Maybe we don't see this everywhere because these higher system call > numbers become available only if the system libseccomp version is recent > enough to know about them. Once that is the case, the ENOSYS/EPERM line > shifts and clone3 is on the wrong side of it. > > If that's indeed the explanation, then maybe we can simply fix moby and > ask Microsoft to respin their images? Regards, Daniel
On 2021-07-27, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote: > The 07/27/2021 20:22, Aleksa Sarai wrote: > > Yes, runc has had the -ENOSYS fallback behaviour for a few releases now. > > > > The way it works is that any syscall which has a larger syscall number > > than any syscall specified in the filter will get -ENOSYS (this works > > even if libseccomp is outdated). The only way you could get the -EPERM > > behaviour with modern runc is if you write a seccomp profile that had > > rules for newer syscalls (openat2 for instance) but not clone3 -- but > > Docker doesn't do that. (The reason for this slightly convoluted > > behaviour was to make sure that intentional omissions actually give you > > -EPERM.) > > this sounds broken. it really should return ENOSYS unless > a user specifically asked for a different errno value for > a syscall. EPERM is just wrong. Yes, if I was designing it from scratch, that's what I would've done. But there are already existing filters that are written assuming the default errno is EPERM. Returning ENOSYS from clone(2) or unshare(2) for existing profiles is not a workable solution. Should we fix all existing profiles and then change the behaviour again? Sure, but given we solved this problem in a period of time when people were screaming about glibc being broken in containers, I hope you'll excuse the fact that we didn't really have time to co-ordinate updating every downstream runc user. > we will see random breakage in the future depending on > what unrelated but newer syscalls users added to their > whitelist. who thought this was a good idea? If you update your syscall profile without knowing what you're doing, things will break. That will always be the case. The plan is/was to eventually implement this by explicitly stating a minimum kernel version (so that all syscalls missing in the profile that were available in that kernel version get ENOSYS) but libseccomp doesn't provide that information at the moment, and given that such a filter would be more complicated than the one we have at the moment, that behaviour probably belongs in libseccomp (there are several issues open in the libseccomp repo describing this issue and possible solutions).
* Aleksa Sarai: > If you update your syscall profile without knowing what you're doing, > things will break. That will always be the case. But with the current syscall number dependency, this is jusy *way* too hard. Who would think that adding close_range (#436) to the policy would switch clone3 (#435) from ENOSYS to ENOPERM? I realized that Github actually provides a way to report image bugs, so I filed: Docker seccomp policy incompatible with glibc 2.34 <https://github.com/actions/virtual-environments/issues/3812> Thanks, Florian
The 07/29/2021 18:56, Aleksa Sarai wrote: > On 2021-07-27, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote: > > The 07/27/2021 20:22, Aleksa Sarai wrote: > > > Yes, runc has had the -ENOSYS fallback behaviour for a few releases now. > > > > > > The way it works is that any syscall which has a larger syscall number > > > than any syscall specified in the filter will get -ENOSYS (this works > > > even if libseccomp is outdated). The only way you could get the -EPERM > > > behaviour with modern runc is if you write a seccomp profile that had > > > rules for newer syscalls (openat2 for instance) but not clone3 -- but > > > Docker doesn't do that. (The reason for this slightly convoluted > > > behaviour was to make sure that intentional omissions actually give you > > > -EPERM.) > > > > this sounds broken. it really should return ENOSYS unless > > a user specifically asked for a different errno value for > > a syscall. EPERM is just wrong. > > Yes, if I was designing it from scratch, that's what I would've done. > > But there are already existing filters that are written assuming the > default errno is EPERM. Returning ENOSYS from clone(2) or unshare(2) for > existing profiles is not a workable solution. > > Should we fix all existing profiles and then change the behaviour again? > Sure, but given we solved this problem in a period of time when people > were screaming about glibc being broken in containers, I hope you'll > excuse the fact that we didn't really have time to co-ordinate updating > every downstream runc user. i think this can be fixed backward compatibly by returning EPERM for old syscalls. > > we will see random breakage in the future depending on > > what unrelated but newer syscalls users added to their > > whitelist. who thought this was a good idea? > > If you update your syscall profile without knowing what you're doing, > things will break. That will always be the case. > > The plan is/was to eventually implement this by explicitly stating a > minimum kernel version (so that all syscalls missing in the profile that > were available in that kernel version get ENOSYS) but libseccomp doesn't > provide that information at the moment, and given that such a filter > would be more complicated than the one we have at the moment, that > behaviour probably belongs in libseccomp (there are several issues open > in the libseccomp repo describing this issue and possible solutions). i dont think you need to do anything complicated with a fixed cut off, e.g. return nr < 403 ? EPERM : ENOSYS or you can give an explicit list of syscalls that should return EPERM for bw compat reasons and the rest is ENOSYS. (and there should be an easy way to opt-out of the bw compat behaviour and always do ENOSYS)
On 2021-07-29, Florian Weimer <fweimer@redhat.com> wrote: > * Aleksa Sarai: > > > If you update your syscall profile without knowing what you're doing, > > things will break. That will always be the case. > > But with the current syscall number dependency, this is jusy *way* too > hard. Who would think that adding close_range (#436) to the policy > would switch clone3 (#435) from ENOSYS to ENOPERM? Yeah, I expected that the Docker folks would've been aware of this when updating the profile (the maintainers were aware of the runc change at the time) so it does seem this is a bit too complicated... I think changing this to one of the older versions of the feature I had (only EPERM for syscalls that were present in Linux 3.0) is probably less likely to cause confusion, until we have the whole minimum kernel version infrastructure I mentioned.
On 2021-07-29, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote: > The 07/29/2021 18:56, Aleksa Sarai wrote: > > On 2021-07-27, Szabolcs Nagy <szabolcs.nagy@arm.com> wrote: > > > The 07/27/2021 20:22, Aleksa Sarai wrote: > > > > Yes, runc has had the -ENOSYS fallback behaviour for a few releases now. > > > > > > > > The way it works is that any syscall which has a larger syscall number > > > > than any syscall specified in the filter will get -ENOSYS (this works > > > > even if libseccomp is outdated). The only way you could get the -EPERM > > > > behaviour with modern runc is if you write a seccomp profile that had > > > > rules for newer syscalls (openat2 for instance) but not clone3 -- but > > > > Docker doesn't do that. (The reason for this slightly convoluted > > > > behaviour was to make sure that intentional omissions actually give you > > > > -EPERM.) > > > > > > this sounds broken. it really should return ENOSYS unless > > > a user specifically asked for a different errno value for > > > a syscall. EPERM is just wrong. > > > > Yes, if I was designing it from scratch, that's what I would've done. > > > > But there are already existing filters that are written assuming the > > default errno is EPERM. Returning ENOSYS from clone(2) or unshare(2) for > > existing profiles is not a workable solution. > > > > Should we fix all existing profiles and then change the behaviour again? > > Sure, but given we solved this problem in a period of time when people > > were screaming about glibc being broken in containers, I hope you'll > > excuse the fact that we didn't really have time to co-ordinate updating > > every downstream runc user. > > i think this can be fixed backward compatibly by > returning EPERM for old syscalls. I just remembered why this was a problem (I tried to implement exactly this behaviour when I first worked on the patch) -- in runc we use libseccomp to generate profiles, but libseccomp has several limitations that mean we cannot implement *any* syscall-number-based fallback behaviour (I tried every way I could think of for at least a week or two). The end result is that in runc we use libseccomp to generate the filter, then output the BPF program, then patch it (add a program to the start which does the syscall check and returns ENOSYS in the correct case) and then we run the program. The only other option would be to basically rewrite libseccomp for runc (which I did consider, but then thought better of). Now, having the fallback being a fixed syscall number seems like it would be trivial -- but because we have to use libseccomp's filter generatiion and patch it, patching the return value of the program to be syscall-specific would require patching every return statement in the generated BPF (and then possibly rewriting every jump depending on where the returns are). I think in practice there's only two returns, but hardcoding that is going to cause issues if libseccomp ever changes that behaviour. But I think you're right that this is probably less likely to cause confusion. Unfortunately I'm not really sure that there's a straightforward way to implement it, outside of implementing the behaviour in libseccomp (or at the very least expanding libseccomp to let us work around it). > > > we will see random breakage in the future depending on > > > what unrelated but newer syscalls users added to their > > > whitelist. who thought this was a good idea? > > > > If you update your syscall profile without knowing what you're doing, > > things will break. That will always be the case. > > > > The plan is/was to eventually implement this by explicitly stating a > > minimum kernel version (so that all syscalls missing in the profile that > > were available in that kernel version get ENOSYS) but libseccomp doesn't > > provide that information at the moment, and given that such a filter > > would be more complicated than the one we have at the moment, that > > behaviour probably belongs in libseccomp (there are several issues open > > in the libseccomp repo describing this issue and possible solutions). > > i dont think you need to do anything complicated > with a fixed cut off, e.g. > > return nr < 403 ? EPERM : ENOSYS See my above point about how this is non-trivial. > or you can give an explicit list of syscalls that > should return EPERM for bw compat reasons and the > rest is ENOSYS. The issue with requiring explicit EPERM rules is that libseccomp doesn't support certain sets of argument rule combinations, meaning that you cannot generate nor require users to specify a set of inverse rules to return EPERM for syscalls like clone() where only certain flags are being blocked. (Another one of my attempts was to generate the set of inverse rules, and use ENOSYS as the default.) (Also some of the inverse rules you can technically generate -- most notably the inverse of SCMP_MASKED_EQ requires 2^n rules to be generated (where n is the number of bits in the mask).) > (and there should be an easy way to opt-out of > the bw compat behaviour and always do ENOSYS) You can opt of out of it already, by setting the defaultErrnoRet to ENOSYS. And someone has submitted a patch to Docker to do exactly that[1]. [1]: https://github.com/moby/moby/pull/42649
diff --git a/sysdeps/unix/sysv/linux/clone-internal.c b/sysdeps/unix/sysv/linux/clone-internal.c index 1e7a8f6b35..4046c81180 100644 --- a/sysdeps/unix/sysv/linux/clone-internal.c +++ b/sysdeps/unix/sysv/linux/clone-internal.c @@ -48,17 +48,6 @@ __clone_internal (struct clone_args *cl_args, int (*func) (void *arg), void *arg) { int ret; -#ifdef HAVE_CLONE3_WAPPER - /* Try clone3 first. */ - int saved_errno = errno; - ret = __clone3 (cl_args, sizeof (*cl_args), func, arg); - if (ret != -1 || errno != ENOSYS) - return ret; - - /* NB: Restore errno since errno may be checked against non-zero - return value. */ - __set_errno (saved_errno); -#endif /* Map clone3 arguments to clone arguments. NB: No need to check invalid clone3 specific bits in flags nor exit_signal since this