mbox series

[v3,0/5] Add a new fchmodat4() syscall

Message ID cover.1689074739.git.legion@kernel.org
Headers show
Series Add a new fchmodat4() syscall | expand

Message

Alexey Gladkov July 11, 2023, 11:25 a.m. UTC
This patch set adds fchmodat4(), a new syscall. The actual
implementation is super simple: essentially it's just the same as
fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags.
I've attempted to make this match "man 2 fchmodat" as closely as
possible, which says EINVAL is returned for invalid flags (as opposed to
ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW).
I have a sketch of a glibc patch that I haven't even compiled yet, but
seems fairly straight-forward:

    diff --git a/sysdeps/unix/sysv/linux/fchmodat.c b/sysdeps/unix/sysv/linux/fchmodat.c
    index 6d9cbc1ce9e0..b1beab76d56c 100644
    --- a/sysdeps/unix/sysv/linux/fchmodat.c
    +++ b/sysdeps/unix/sysv/linux/fchmodat.c
    @@ -29,12 +29,36 @@
     int
     fchmodat (int fd, const char *file, mode_t mode, int flag)
     {
    -  if (flag & ~AT_SYMLINK_NOFOLLOW)
    -    return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL);
    -#ifndef __NR_lchmod		/* Linux so far has no lchmod syscall.  */
    +  /* There are four paths through this code:
    +      - The flags are zero.  In this case it's fine to call fchmodat.
    +      - The flags are non-zero and glibc doesn't have access to
    +	__NR_fchmodat4.  In this case all we can do is emulate the error codes
    +	defined by the glibc interface from userspace.
    +      - The flags are non-zero, glibc has __NR_fchmodat4, and the kernel has
    +	fchmodat4.  This is the simplest case, as the fchmodat4 syscall exactly
    +	matches glibc's library interface so it can be called directly.
    +      - The flags are non-zero, glibc has __NR_fchmodat4, but the kernel does
    +	not.  In this case we must respect the error codes defined by the glibc
    +	interface instead of returning ENOSYS.
    +    The intent here is to ensure that the kernel is called at most once per
    +    library call, and that the error types defined by glibc are always
    +    respected.  */
    +
    +#ifdef __NR_fchmodat4
    +  long result;
    +#endif
    +
    +  if (flag == 0)
    +    return INLINE_SYSCALL (fchmodat, 3, fd, file, mode);
    +
    +#ifdef __NR_fchmodat4
    +  result = INLINE_SYSCALL (fchmodat4, 4, fd, file, mode, flag);
    +  if (result == 0 || errno != ENOSYS)
    +    return result;
    +#endif
    +
       if (flag & AT_SYMLINK_NOFOLLOW)
         return INLINE_SYSCALL_ERROR_RETURN_VALUE (ENOTSUP);
    -#endif

    -  return INLINE_SYSCALL (fchmodat, 3, fd, file, mode);
    +  return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL);
     }

I've never added a new syscall before so I'm not really sure what the
proper procedure to follow is.  Based on the feedback from my v1 patch
set it seems this is somewhat uncontroversial.  At this point I don't
think there's anything I'm missing, though note that I haven't gotten
around to testing it this time because the diff from v1 is trivial for
any platform I could reasonably test on.  The v1 patches suggest a
simple test case, but I didn't re-run it because I don't want to reboot
my laptop.

Changes since v2 [20190717012719.5524-1-palmer@sifive.com]:

* Rebased to master.
* The lookup_flags passed to sys_fchmodat4 as suggested by Al Viro.
* Selftest added.

Changes since v1 [20190531191204.4044-1-palmer@sifive.com]:

* All architectures are now supported, which support squashed into a
  single patch.
* The do_fchmodat() helper function has been removed, in favor of directly
  calling do_fchmodat4().
* The patches are based on 5.2 instead of 5.1.

---

Alexey Gladkov (1):
  selftests: add fchmodat4(2) selftest

Palmer Dabbelt (4):
  Non-functional cleanup of a "__user * filename"
  fs: Add fchmodat4()
  arch: Register fchmodat4, usually as syscall 451
  tools headers UAPI: Sync files changed by new fchmodat4 syscall

 arch/alpha/kernel/syscalls/syscall.tbl        |   1 +
 arch/arm/tools/syscall.tbl                    |   1 +
 arch/arm64/include/asm/unistd32.h             |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl         |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl         |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl       |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |   1 +
 arch/s390/kernel/syscalls/syscall.tbl         |   1 +
 arch/sh/kernel/syscalls/syscall.tbl           |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |   1 +
 fs/open.c                                     |  18 ++-
 include/linux/syscalls.h                      |   4 +-
 include/uapi/asm-generic/unistd.h             |   5 +-
 tools/include/uapi/asm-generic/unistd.h       |   5 +-
 .../arch/mips/entry/syscalls/syscall_n64.tbl  |   1 +
 .../arch/powerpc/entry/syscalls/syscall.tbl   |   1 +
 .../perf/arch/s390/entry/syscalls/syscall.tbl |   1 +
 .../arch/x86/entry/syscalls/syscall_64.tbl    |   1 +
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/fchmodat4/.gitignore  |   2 +
 tools/testing/selftests/fchmodat4/Makefile    |   6 +
 .../selftests/fchmodat4/fchmodat4_test.c      | 151 ++++++++++++++++++
 29 files changed, 207 insertions(+), 7 deletions(-)
 create mode 100644 tools/testing/selftests/fchmodat4/.gitignore
 create mode 100644 tools/testing/selftests/fchmodat4/Makefile
 create mode 100644 tools/testing/selftests/fchmodat4/fchmodat4_test.c

Comments

Florian Weimer July 11, 2023, 12:24 p.m. UTC | #1
* Alexey Gladkov:

> This patch set adds fchmodat4(), a new syscall. The actual
> implementation is super simple: essentially it's just the same as
> fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags.
> I've attempted to make this match "man 2 fchmodat" as closely as
> possible, which says EINVAL is returned for invalid flags (as opposed to
> ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW).
> I have a sketch of a glibc patch that I haven't even compiled yet, but
> seems fairly straight-forward:
>
>     diff --git a/sysdeps/unix/sysv/linux/fchmodat.c b/sysdeps/unix/sysv/linux/fchmodat.c
>     index 6d9cbc1ce9e0..b1beab76d56c 100644
>     --- a/sysdeps/unix/sysv/linux/fchmodat.c
>     +++ b/sysdeps/unix/sysv/linux/fchmodat.c
>     @@ -29,12 +29,36 @@
>      int
>      fchmodat (int fd, const char *file, mode_t mode, int flag)
>      {
>     -  if (flag & ~AT_SYMLINK_NOFOLLOW)
>     -    return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL);
>     -#ifndef __NR_lchmod		/* Linux so far has no lchmod syscall.  */
>     +  /* There are four paths through this code:
>     +      - The flags are zero.  In this case it's fine to call fchmodat.
>     +      - The flags are non-zero and glibc doesn't have access to
>     +	__NR_fchmodat4.  In this case all we can do is emulate the error codes
>     +	defined by the glibc interface from userspace.
>     +      - The flags are non-zero, glibc has __NR_fchmodat4, and the kernel has
>     +	fchmodat4.  This is the simplest case, as the fchmodat4 syscall exactly
>     +	matches glibc's library interface so it can be called directly.
>     +      - The flags are non-zero, glibc has __NR_fchmodat4, but the kernel does

If you define __NR_fchmodat4 on all architectures, we can use these
constants directly in glibc.  We no longer depend on the UAPI
definitions of those constants, to cut down the number of code variants,
and to make glibc's system call profile independent of the kernel header
version at build time.

Your version is based on 2.31, more recent versions have some reasonable
emulation for fchmodat based on /proc/self/fd.  I even wrote a comment
describing the same buggy behavior that you witnessed:

+      /* Some Linux versions with some file systems can actually
+        change symbolic link permissions via /proc, but this is not
+        intentional, and it gives inconsistent results (e.g., error
+        return despite mode change).  The expected behavior is that
+        symbolic link modes cannot be changed at all, and this check
+        enforces that.  */
+      if (S_ISLNK (st.st_mode))
+       {
+         __close_nocancel (pathfd);
+         __set_errno (EOPNOTSUPP);
+         return -1;
+       }

I think there was some kernel discussion about that behavior before, but
apparently, it hasn't led to fixes.

I wonder if it makes sense to add a similar error return to the system
call implementation?

>     +	not.  In this case we must respect the error codes defined by the glibc
>     +	interface instead of returning ENOSYS.
>     +    The intent here is to ensure that the kernel is called at most once per
>     +    library call, and that the error types defined by glibc are always
>     +    respected.  */
>     +
>     +#ifdef __NR_fchmodat4
>     +  long result;
>     +#endif
>     +
>     +  if (flag == 0)
>     +    return INLINE_SYSCALL (fchmodat, 3, fd, file, mode);
>     +
>     +#ifdef __NR_fchmodat4
>     +  result = INLINE_SYSCALL (fchmodat4, 4, fd, file, mode, flag);
>     +  if (result == 0 || errno != ENOSYS)
>     +    return result;
>     +#endif

The last if condition is the recommended approach, but in the past, it
broke container host compatibility pretty badly due to seccomp filters
that return EPERM instead of ENOSYS.  I guess we'll learn soon enough if
that's been fixed by now. 8-P

Thanks,
Florian
Christian Brauner July 11, 2023, 3:14 p.m. UTC | #2
On Tue, Jul 11, 2023 at 02:24:51PM +0200, Florian Weimer wrote:
> * Alexey Gladkov:
> 
> > This patch set adds fchmodat4(), a new syscall. The actual
> > implementation is super simple: essentially it's just the same as
> > fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags.
> > I've attempted to make this match "man 2 fchmodat" as closely as
> > possible, which says EINVAL is returned for invalid flags (as opposed to
> > ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW).
> > I have a sketch of a glibc patch that I haven't even compiled yet, but
> > seems fairly straight-forward:
> >
> >     diff --git a/sysdeps/unix/sysv/linux/fchmodat.c b/sysdeps/unix/sysv/linux/fchmodat.c
> >     index 6d9cbc1ce9e0..b1beab76d56c 100644
> >     --- a/sysdeps/unix/sysv/linux/fchmodat.c
> >     +++ b/sysdeps/unix/sysv/linux/fchmodat.c
> >     @@ -29,12 +29,36 @@
> >      int
> >      fchmodat (int fd, const char *file, mode_t mode, int flag)
> >      {
> >     -  if (flag & ~AT_SYMLINK_NOFOLLOW)
> >     -    return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL);
> >     -#ifndef __NR_lchmod		/* Linux so far has no lchmod syscall.  */
> >     +  /* There are four paths through this code:
> >     +      - The flags are zero.  In this case it's fine to call fchmodat.
> >     +      - The flags are non-zero and glibc doesn't have access to
> >     +	__NR_fchmodat4.  In this case all we can do is emulate the error codes
> >     +	defined by the glibc interface from userspace.
> >     +      - The flags are non-zero, glibc has __NR_fchmodat4, and the kernel has
> >     +	fchmodat4.  This is the simplest case, as the fchmodat4 syscall exactly
> >     +	matches glibc's library interface so it can be called directly.
> >     +      - The flags are non-zero, glibc has __NR_fchmodat4, but the kernel does
> 
> If you define __NR_fchmodat4 on all architectures, we can use these
> constants directly in glibc.  We no longer depend on the UAPI
> definitions of those constants, to cut down the number of code variants,
> and to make glibc's system call profile independent of the kernel header
> version at build time.
> 
> Your version is based on 2.31, more recent versions have some reasonable
> emulation for fchmodat based on /proc/self/fd.  I even wrote a comment
> describing the same buggy behavior that you witnessed:
> 
> +      /* Some Linux versions with some file systems can actually
> +        change symbolic link permissions via /proc, but this is not
> +        intentional, and it gives inconsistent results (e.g., error
> +        return despite mode change).  The expected behavior is that
> +        symbolic link modes cannot be changed at all, and this check
> +        enforces that.  */
> +      if (S_ISLNK (st.st_mode))
> +       {
> +         __close_nocancel (pathfd);
> +         __set_errno (EOPNOTSUPP);
> +         return -1;
> +       }
> 
> I think there was some kernel discussion about that behavior before, but
> apparently, it hasn't led to fixes.

I think I've explained this somewhere else a couple of months ago but
just in case you weren't on that thread or don't remember and apologies
if you should already know.

A lot of filesystem will happily update the mode of a symlink. The VFS
doesn't do anything to prevent this from happening. This is filesystem
specific.

The EOPNOTSUPP you're seeing very likely comes from POSIX ACLs.
Specifically it comes from filesystems that call posix_acl_chmod(),
e.g., btrfs via

        if (!err && attr->ia_valid & ATTR_MODE)
                err = posix_acl_chmod(idmap, dentry, inode->i_mode);

Most filesystems don't implement i_op->set_acl() for POSIX ACLs.
So posix_acl_chmod() will report EOPNOTSUPP. By the time
posix_acl_chmod() is called, most filesystems will have finished
updating the inode. POSIX ACLs also often aren't integrated into
transactions so a rollback wouldn't even be possible on some
filesystems.

Any filesystem that doesn't implement POSIX ACLs at all will obviously
never fail unless it blocks mode changes on symlinks. Or filesystems
that do have a way to rollback failures from posix_acl_chmod(), or
filesystems that do return an error on chmod() on symlinks such as 9p,
ntfs, ocfs2.

> 
> I wonder if it makes sense to add a similar error return to the system
> call implementation?

Hm, blocking symlink mode changes is pretty regression prone. And just
blocking it through one interface seems weird and makes things even more
inconsistent.

So two options I see:
(1) minimally invasive:
    Filesystems that do call posix_acl_chmod() on symlinks need to be
    changed to stop doing that.
(2) might hit us on the head invasive:
    Try and block symlink mode changes in chmod_common().

Thoughts?
Alexey Gladkov July 11, 2023, 4:16 p.m. UTC | #3
In glibc, the fchmodat(3) function has a flags argument according to the
POSIX specification [1], but kernel syscalls has no such argument.
Therefore, libc implementations do workarounds using /proc. However,
this requires procfs to be mounted and accessible.

This patch set adds fchmodat2(), a new syscall. The syscall allows to
pass the AT_SYMLINK_NOFOLLOW flag to disable LOOKUP_FOLLOW. In all other
respects, this syscall is no different from fchmodat().

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/chmod.html

Changes since v3 [cover.1689074739.git.legion@kernel.org]:

* Rebased to master because a new syscall has appeared in master.
* Increased __NR_compat_syscalls as pointed out by Arnd Bergmann.
* Syscall renamed fchmodat4 -> fchmodat2 as suggested by Christian Brauner.
* Returned do_fchmodat4() the original name. We don't need to version
  internal functions.
* Fixed warnings found by checkpatch.pl.

Changes since v2 [20190717012719.5524-1-palmer@sifive.com]:

* Rebased to master.
* The lookup_flags passed to sys_fchmodat4 as suggested by Al Viro.
* Selftest added.

Changes since v1 [20190531191204.4044-1-palmer@sifive.com]:

* All architectures are now supported, which support squashed into a
  single patch.
* The do_fchmodat() helper function has been removed, in favor of directly
  calling do_fchmodat4().
* The patches are based on 5.2 instead of 5.1.

---

Alexey Gladkov (2):
  fs: Add fchmodat2()
  selftests: Add fchmodat2 selftest

Palmer Dabbelt (3):
  Non-functional cleanup of a "__user * filename"
  arch: Register fchmodat2, usually as syscall 452
  tools headers UAPI: Sync files changed by new fchmodat2 syscall

 arch/alpha/kernel/syscalls/syscall.tbl        |   1 +
 arch/arm/tools/syscall.tbl                    |   1 +
 arch/arm64/include/asm/unistd.h               |   2 +-
 arch/arm64/include/asm/unistd32.h             |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl         |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl         |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl       |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |   1 +
 arch/s390/kernel/syscalls/syscall.tbl         |   1 +
 arch/sh/kernel/syscalls/syscall.tbl           |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |   1 +
 fs/open.c                                     |  18 +-
 include/linux/syscalls.h                      |   4 +-
 include/uapi/asm-generic/unistd.h             |   5 +-
 tools/include/uapi/asm-generic/unistd.h       |   5 +-
 .../arch/mips/entry/syscalls/syscall_n64.tbl  |   2 +
 .../arch/powerpc/entry/syscalls/syscall.tbl   |   2 +
 .../perf/arch/s390/entry/syscalls/syscall.tbl |   2 +
 .../arch/x86/entry/syscalls/syscall_64.tbl    |   2 +
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/fchmodat2/.gitignore  |   2 +
 tools/testing/selftests/fchmodat2/Makefile    |   6 +
 .../selftests/fchmodat2/fchmodat2_test.c      | 162 ++++++++++++++++++
 30 files changed, 223 insertions(+), 8 deletions(-)
 create mode 100644 tools/testing/selftests/fchmodat2/.gitignore
 create mode 100644 tools/testing/selftests/fchmodat2/Makefile
 create mode 100644 tools/testing/selftests/fchmodat2/fchmodat2_test.c
Christian Brauner July 11, 2023, 5:36 p.m. UTC | #4
On Tue, 11 Jul 2023 18:16:02 +0200, Alexey Gladkov wrote:
> In glibc, the fchmodat(3) function has a flags argument according to the
> POSIX specification [1], but kernel syscalls has no such argument.
> Therefore, libc implementations do workarounds using /proc. However,
> this requires procfs to be mounted and accessible.
> 
> This patch set adds fchmodat2(), a new syscall. The syscall allows to
> pass the AT_SYMLINK_NOFOLLOW flag to disable LOOKUP_FOLLOW. In all other
> respects, this syscall is no different from fchmodat().
> 
> [...]

Tools updates usually go separately.
Flags argument ported to unsigned int; otherwise unchanged.

---

Applied to the master branch of the vfs/vfs.git tree.
Patches in the master branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: master

[1/5] Non-functional cleanup of a "__user * filename"
      https://git.kernel.org/vfs/vfs/c/0f05a6af6b7e
[2/5] fs: Add fchmodat2()
      https://git.kernel.org/vfs/vfs/c/8d593559ec09
[3/5] arch: Register fchmodat2, usually as syscall 452
      https://git.kernel.org/vfs/vfs/c/2ee63b04f206
[5/5] selftests: Add fchmodat2 selftest
      https://git.kernel.org/vfs/vfs/c/f175b92081ec
Rich Felker July 12, 2023, 2:42 a.m. UTC | #5
On Tue, Jul 11, 2023 at 06:16:02PM +0200, Alexey Gladkov wrote:
> In glibc, the fchmodat(3) function has a flags argument according to the
> POSIX specification [1], but kernel syscalls has no such argument.
> Therefore, libc implementations do workarounds using /proc. However,
> this requires procfs to be mounted and accessible.
> 
> This patch set adds fchmodat2(), a new syscall. The syscall allows to
> pass the AT_SYMLINK_NOFOLLOW flag to disable LOOKUP_FOLLOW. In all other
> respects, this syscall is no different from fchmodat().
> 
> [1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/chmod.html
> 
> Changes since v3 [cover.1689074739.git.legion@kernel.org]:
> 
> * Rebased to master because a new syscall has appeared in master.
> * Increased __NR_compat_syscalls as pointed out by Arnd Bergmann.
> * Syscall renamed fchmodat4 -> fchmodat2 as suggested by Christian Brauner.
> * Returned do_fchmodat4() the original name. We don't need to version
>   internal functions.
> * Fixed warnings found by checkpatch.pl.
> 
> Changes since v2 [20190717012719.5524-1-palmer@sifive.com]:
> 
> * Rebased to master.
> * The lookup_flags passed to sys_fchmodat4 as suggested by Al Viro.
> * Selftest added.
> 
> Changes since v1 [20190531191204.4044-1-palmer@sifive.com]:
> 
> * All architectures are now supported, which support squashed into a
>   single patch.
> * The do_fchmodat() helper function has been removed, in favor of directly
>   calling do_fchmodat4().
> * The patches are based on 5.2 instead of 5.1.

It's good to see this moving forward. I originally proposed this in a
patch submitted in 2020. I suspect implementation details have changed
since then, but it might make sense to look back at that discussion if
nobody has done so yet (apologies if this was already done and I
missed it) to make sure nothing is overlooked -- I remember there were
some subtleties with what fs backends might try to do with chmod on
symlinks. My proposed commit message also documented a lot of the
history of the issue that might be useful to have as context.

https://lore.kernel.org/all/20200910170256.GK3265@brightrain.aerifal.cx/T/

Rich
Alexey Gladkov July 25, 2023, 11:05 a.m. UTC | #6
On Tue, Jul 11, 2023 at 05:14:24PM +0200, Christian Brauner wrote:
> On Tue, Jul 11, 2023 at 02:24:51PM +0200, Florian Weimer wrote:
> > * Alexey Gladkov:
> > 
> > > This patch set adds fchmodat4(), a new syscall. The actual
> > > implementation is super simple: essentially it's just the same as
> > > fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags.
> > > I've attempted to make this match "man 2 fchmodat" as closely as
> > > possible, which says EINVAL is returned for invalid flags (as opposed to
> > > ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW).
> > > I have a sketch of a glibc patch that I haven't even compiled yet, but
> > > seems fairly straight-forward:
> > >
> > >     diff --git a/sysdeps/unix/sysv/linux/fchmodat.c b/sysdeps/unix/sysv/linux/fchmodat.c
> > >     index 6d9cbc1ce9e0..b1beab76d56c 100644
> > >     --- a/sysdeps/unix/sysv/linux/fchmodat.c
> > >     +++ b/sysdeps/unix/sysv/linux/fchmodat.c
> > >     @@ -29,12 +29,36 @@
> > >      int
> > >      fchmodat (int fd, const char *file, mode_t mode, int flag)
> > >      {
> > >     -  if (flag & ~AT_SYMLINK_NOFOLLOW)
> > >     -    return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL);
> > >     -#ifndef __NR_lchmod		/* Linux so far has no lchmod syscall.  */
> > >     +  /* There are four paths through this code:
> > >     +      - The flags are zero.  In this case it's fine to call fchmodat.
> > >     +      - The flags are non-zero and glibc doesn't have access to
> > >     +	__NR_fchmodat4.  In this case all we can do is emulate the error codes
> > >     +	defined by the glibc interface from userspace.
> > >     +      - The flags are non-zero, glibc has __NR_fchmodat4, and the kernel has
> > >     +	fchmodat4.  This is the simplest case, as the fchmodat4 syscall exactly
> > >     +	matches glibc's library interface so it can be called directly.
> > >     +      - The flags are non-zero, glibc has __NR_fchmodat4, but the kernel does
> > 
> > If you define __NR_fchmodat4 on all architectures, we can use these
> > constants directly in glibc.  We no longer depend on the UAPI
> > definitions of those constants, to cut down the number of code variants,
> > and to make glibc's system call profile independent of the kernel header
> > version at build time.
> > 
> > Your version is based on 2.31, more recent versions have some reasonable
> > emulation for fchmodat based on /proc/self/fd.  I even wrote a comment
> > describing the same buggy behavior that you witnessed:
> > 
> > +      /* Some Linux versions with some file systems can actually
> > +        change symbolic link permissions via /proc, but this is not
> > +        intentional, and it gives inconsistent results (e.g., error
> > +        return despite mode change).  The expected behavior is that
> > +        symbolic link modes cannot be changed at all, and this check
> > +        enforces that.  */
> > +      if (S_ISLNK (st.st_mode))
> > +       {
> > +         __close_nocancel (pathfd);
> > +         __set_errno (EOPNOTSUPP);
> > +         return -1;
> > +       }
> > 
> > I think there was some kernel discussion about that behavior before, but
> > apparently, it hasn't led to fixes.
> 
> I think I've explained this somewhere else a couple of months ago but
> just in case you weren't on that thread or don't remember and apologies
> if you should already know.
> 
> A lot of filesystem will happily update the mode of a symlink. The VFS
> doesn't do anything to prevent this from happening. This is filesystem
> specific.
> 
> The EOPNOTSUPP you're seeing very likely comes from POSIX ACLs.
> Specifically it comes from filesystems that call posix_acl_chmod(),
> e.g., btrfs via
> 
>         if (!err && attr->ia_valid & ATTR_MODE)
>                 err = posix_acl_chmod(idmap, dentry, inode->i_mode);
> 
> Most filesystems don't implement i_op->set_acl() for POSIX ACLs.
> So posix_acl_chmod() will report EOPNOTSUPP. By the time
> posix_acl_chmod() is called, most filesystems will have finished
> updating the inode. POSIX ACLs also often aren't integrated into
> transactions so a rollback wouldn't even be possible on some
> filesystems.
> 
> Any filesystem that doesn't implement POSIX ACLs at all will obviously
> never fail unless it blocks mode changes on symlinks. Or filesystems
> that do have a way to rollback failures from posix_acl_chmod(), or
> filesystems that do return an error on chmod() on symlinks such as 9p,
> ntfs, ocfs2.
> 
> > 
> > I wonder if it makes sense to add a similar error return to the system
> > call implementation?
> 
> Hm, blocking symlink mode changes is pretty regression prone. And just
> blocking it through one interface seems weird and makes things even more
> inconsistent.
> 
> So two options I see:
> (1) minimally invasive:
>     Filesystems that do call posix_acl_chmod() on symlinks need to be
>     changed to stop doing that.
> (2) might hit us on the head invasive:
>     Try and block symlink mode changes in chmod_common().
> 
> Thoughts?
> 

We have third option. We can choose not to call chmod_common and return an
error right away:

diff --git a/fs/open.c b/fs/open.c
index 39a7939f0d00..86a427a2a083 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -679,7 +679,9 @@ static int do_fchmodat(int dfd, const char __user *filename, umode_t mode, int l
 retry:
        error = user_path_at(dfd, filename, lookup_flags, &path);
        if (!error) {
-               error = chmod_common(&path, mode);
+               error = -EOPNOTSUPP;
+               if (!(flags & AT_SYMLINK_NOFOLLOW) || !S_ISLNK(path.dentry->d_inode->i_mode))
+                       error = chmod_common(&path, mode);
                path_put(&path);
                if (retry_estale(error, lookup_flags)) {
                        lookup_flags |= LOOKUP_REVAL;

It doesn't seem to be invasive.
Christian Brauner July 25, 2023, 12:05 p.m. UTC | #7
On Tue, Jul 25, 2023 at 01:05:40PM +0200, Alexey Gladkov wrote:
> On Tue, Jul 11, 2023 at 05:14:24PM +0200, Christian Brauner wrote:
> > On Tue, Jul 11, 2023 at 02:24:51PM +0200, Florian Weimer wrote:
> > > * Alexey Gladkov:
> > > 
> > > > This patch set adds fchmodat4(), a new syscall. The actual
> > > > implementation is super simple: essentially it's just the same as
> > > > fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags.
> > > > I've attempted to make this match "man 2 fchmodat" as closely as
> > > > possible, which says EINVAL is returned for invalid flags (as opposed to
> > > > ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW).
> > > > I have a sketch of a glibc patch that I haven't even compiled yet, but
> > > > seems fairly straight-forward:
> > > >
> > > >     diff --git a/sysdeps/unix/sysv/linux/fchmodat.c b/sysdeps/unix/sysv/linux/fchmodat.c
> > > >     index 6d9cbc1ce9e0..b1beab76d56c 100644
> > > >     --- a/sysdeps/unix/sysv/linux/fchmodat.c
> > > >     +++ b/sysdeps/unix/sysv/linux/fchmodat.c
> > > >     @@ -29,12 +29,36 @@
> > > >      int
> > > >      fchmodat (int fd, const char *file, mode_t mode, int flag)
> > > >      {
> > > >     -  if (flag & ~AT_SYMLINK_NOFOLLOW)
> > > >     -    return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL);
> > > >     -#ifndef __NR_lchmod		/* Linux so far has no lchmod syscall.  */
> > > >     +  /* There are four paths through this code:
> > > >     +      - The flags are zero.  In this case it's fine to call fchmodat.
> > > >     +      - The flags are non-zero and glibc doesn't have access to
> > > >     +	__NR_fchmodat4.  In this case all we can do is emulate the error codes
> > > >     +	defined by the glibc interface from userspace.
> > > >     +      - The flags are non-zero, glibc has __NR_fchmodat4, and the kernel has
> > > >     +	fchmodat4.  This is the simplest case, as the fchmodat4 syscall exactly
> > > >     +	matches glibc's library interface so it can be called directly.
> > > >     +      - The flags are non-zero, glibc has __NR_fchmodat4, but the kernel does
> > > 
> > > If you define __NR_fchmodat4 on all architectures, we can use these
> > > constants directly in glibc.  We no longer depend on the UAPI
> > > definitions of those constants, to cut down the number of code variants,
> > > and to make glibc's system call profile independent of the kernel header
> > > version at build time.
> > > 
> > > Your version is based on 2.31, more recent versions have some reasonable
> > > emulation for fchmodat based on /proc/self/fd.  I even wrote a comment
> > > describing the same buggy behavior that you witnessed:
> > > 
> > > +      /* Some Linux versions with some file systems can actually
> > > +        change symbolic link permissions via /proc, but this is not
> > > +        intentional, and it gives inconsistent results (e.g., error
> > > +        return despite mode change).  The expected behavior is that
> > > +        symbolic link modes cannot be changed at all, and this check
> > > +        enforces that.  */
> > > +      if (S_ISLNK (st.st_mode))
> > > +       {
> > > +         __close_nocancel (pathfd);
> > > +         __set_errno (EOPNOTSUPP);
> > > +         return -1;
> > > +       }
> > > 
> > > I think there was some kernel discussion about that behavior before, but
> > > apparently, it hasn't led to fixes.
> > 
> > I think I've explained this somewhere else a couple of months ago but
> > just in case you weren't on that thread or don't remember and apologies
> > if you should already know.
> > 
> > A lot of filesystem will happily update the mode of a symlink. The VFS
> > doesn't do anything to prevent this from happening. This is filesystem
> > specific.
> > 
> > The EOPNOTSUPP you're seeing very likely comes from POSIX ACLs.
> > Specifically it comes from filesystems that call posix_acl_chmod(),
> > e.g., btrfs via
> > 
> >         if (!err && attr->ia_valid & ATTR_MODE)
> >                 err = posix_acl_chmod(idmap, dentry, inode->i_mode);
> > 
> > Most filesystems don't implement i_op->set_acl() for POSIX ACLs.
> > So posix_acl_chmod() will report EOPNOTSUPP. By the time
> > posix_acl_chmod() is called, most filesystems will have finished
> > updating the inode. POSIX ACLs also often aren't integrated into
> > transactions so a rollback wouldn't even be possible on some
> > filesystems.
> > 
> > Any filesystem that doesn't implement POSIX ACLs at all will obviously
> > never fail unless it blocks mode changes on symlinks. Or filesystems
> > that do have a way to rollback failures from posix_acl_chmod(), or
> > filesystems that do return an error on chmod() on symlinks such as 9p,
> > ntfs, ocfs2.
> > 
> > > 
> > > I wonder if it makes sense to add a similar error return to the system
> > > call implementation?
> > 
> > Hm, blocking symlink mode changes is pretty regression prone. And just
> > blocking it through one interface seems weird and makes things even more
> > inconsistent.
> > 
> > So two options I see:
> > (1) minimally invasive:
> >     Filesystems that do call posix_acl_chmod() on symlinks need to be
> >     changed to stop doing that.
> > (2) might hit us on the head invasive:
> >     Try and block symlink mode changes in chmod_common().
> > 
> > Thoughts?
> > 
> 
> We have third option. We can choose not to call chmod_common and return an
> error right away:
> 
> diff --git a/fs/open.c b/fs/open.c
> index 39a7939f0d00..86a427a2a083 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -679,7 +679,9 @@ static int do_fchmodat(int dfd, const char __user *filename, umode_t mode, int l
>  retry:
>         error = user_path_at(dfd, filename, lookup_flags, &path);
>         if (!error) {
> -               error = chmod_common(&path, mode);
> +               error = -EOPNOTSUPP;
> +               if (!(flags & AT_SYMLINK_NOFOLLOW) || !S_ISLNK(path.dentry->d_inode->i_mode))
> +                       error = chmod_common(&path, mode);
>                 path_put(&path);
>                 if (retry_estale(error, lookup_flags)) {
>                         lookup_flags |= LOOKUP_REVAL;
> 
> It doesn't seem to be invasive.

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=77b652535528770217186589d97261847f15f862
David Howells July 25, 2023, 3:58 p.m. UTC | #8
Rather than adding a fchmodat2() syscall, should we add a "set_file_attrs()"
syscall that takes a mask and allows you to set a bunch of stuff all in one
go?  Basically, an interface to notify_change() in the kernel that would allow
several stats to be set atomically.  This might be of particular interest to
network filesystems.

David
Florian Weimer July 25, 2023, 4:10 p.m. UTC | #9
* David Howells:

> Rather than adding a fchmodat2() syscall, should we add a "set_file_attrs()"
> syscall that takes a mask and allows you to set a bunch of stuff all in one
> go?  Basically, an interface to notify_change() in the kernel that would allow
> several stats to be set atomically.  This might be of particular interest to
> network filesystems.

Do you mean atomically as in compare-and-swap (update only if old values
match), or just a way to update multiple file attributes with a single
system call?

Thanks,
Florian
Aleksa Sarai July 25, 2023, 4:50 p.m. UTC | #10
On 2023-07-25, David Howells <dhowells@redhat.com> wrote:
> Rather than adding a fchmodat2() syscall, should we add a "set_file_attrs()"
> syscall that takes a mask and allows you to set a bunch of stuff all in one
> go?  Basically, an interface to notify_change() in the kernel that would allow
> several stats to be set atomically.  This might be of particular interest to
> network filesystems.

Presumably looking something like statx(2) (except hopefully with
extensible structs this time :P)? I think that could also be useful, but
given this is a fairly straight-forward syscall addition (and it also
would resolve the AT_EMPTY_PATH issue for chmod as well as simplify the
glibc wrapper), I think it makes sense to take this and we can do
set_statx(2) separately?
David Howells July 25, 2023, 6:39 p.m. UTC | #11
Florian Weimer <fweimer@redhat.com> wrote:

> > Rather than adding a fchmodat2() syscall, should we add a
> > "set_file_attrs()" syscall that takes a mask and allows you to set a bunch
> > of stuff all in one go?  Basically, an interface to notify_change() in the
> > kernel that would allow several stats to be set atomically.  This might be
> > of particular interest to network filesystems.
> 
> Do you mean atomically as in compare-and-swap (update only if old values
> match), or just a way to update multiple file attributes with a single
> system call?

I was thinking more in terms of the latter.  AFAIK, there aren't any network
filesystems support a CAS interface on file attributes like that.  To be able
to do a CAS operation, we'd need to pass in the old values as well as the new.

Another thing we could look at is doing "create_and_set_attrs()", possibly
allowing it to take a list of xattrs also.

David
Rich Felker July 25, 2023, 6:44 p.m. UTC | #12
On Tue, Jul 25, 2023 at 07:39:51PM +0100, David Howells wrote:
> Florian Weimer <fweimer@redhat.com> wrote:
> 
> > > Rather than adding a fchmodat2() syscall, should we add a
> > > "set_file_attrs()" syscall that takes a mask and allows you to set a bunch
> > > of stuff all in one go?  Basically, an interface to notify_change() in the
> > > kernel that would allow several stats to be set atomically.  This might be
> > > of particular interest to network filesystems.
> > 
> > Do you mean atomically as in compare-and-swap (update only if old values
> > match), or just a way to update multiple file attributes with a single
> > system call?
> 
> I was thinking more in terms of the latter.  AFAIK, there aren't any network
> filesystems support a CAS interface on file attributes like that.  To be able
> to do a CAS operation, we'd need to pass in the old values as well as the new.
> 
> Another thing we could look at is doing "create_and_set_attrs()", possibly
> allowing it to take a list of xattrs also.

Can we please not let " hey let's invent a new interface to do
something that will be hard for underlying filesystems to even provide
and that nothing needs because there's no standard API to do it" be
the enemy of "fixing a known problem implementing an existing standard
API that just requires a simple, clearly-scoped syscall to do it"?

Rich
Christian Brauner July 26, 2023, 1:30 p.m. UTC | #13
On Tue, Jul 25, 2023 at 07:39:51PM +0100, David Howells wrote:
> Florian Weimer <fweimer@redhat.com> wrote:
> 
> > > Rather than adding a fchmodat2() syscall, should we add a
> > > "set_file_attrs()" syscall that takes a mask and allows you to set a bunch
> > > of stuff all in one go?  Basically, an interface to notify_change() in the

That system call would likely be blocked in seccomp sandboxes completely
as seccomp cannot filter structs. I don't consider this an argument to
block new good functionality in general as that would mean arbitrarily
limiting us but it is something to keep in mind. If there's additional
benefit other than just being able to set mutliple values at once then
yeah might be something to discuss.

> > > kernel that would allow several stats to be set atomically.  This might be
> > > of particular interest to network filesystems.
> > 
> > Do you mean atomically as in compare-and-swap (update only if old values
> > match), or just a way to update multiple file attributes with a single
> > system call?
> 
> I was thinking more in terms of the latter.  AFAIK, there aren't any network
> filesystems support a CAS interface on file attributes like that.  To be able
> to do a CAS operation, we'd need to pass in the old values as well as the new.
> 
> Another thing we could look at is doing "create_and_set_attrs()", possibly
> allowing it to take a list of xattrs also.

That would likely require variable sized pointers in a struct which is
something we really try to stay away from. I also think it's not a good
idea to lump xattrs toegether with generic file attributes. They should
remain a separate api imho.
Eric Biggers July 27, 2023, 3:57 a.m. UTC | #14
On Tue, Jul 25, 2023 at 04:58:34PM +0100, David Howells wrote:
> Rather than adding a fchmodat2() syscall, should we add a "set_file_attrs()"
> syscall that takes a mask and allows you to set a bunch of stuff all in one
> go?  Basically, an interface to notify_change() in the kernel that would allow
> several stats to be set atomically.  This might be of particular interest to
> network filesystems.
> 
> David
> 

fchmodat2() is a simple addition that fits well with the existing syscalls.
It fixes an oversight in fchmodat().

IMO we should just add fchmodat2(), and not get sidetracked by trying to add
some super-generalized syscall instead.  That can always be done later.

- Eric
Christian Brauner July 27, 2023, 10:27 a.m. UTC | #15
On Wed, Jul 26, 2023 at 08:57:10PM -0700, Eric Biggers wrote:
> On Tue, Jul 25, 2023 at 04:58:34PM +0100, David Howells wrote:
> > Rather than adding a fchmodat2() syscall, should we add a "set_file_attrs()"
> > syscall that takes a mask and allows you to set a bunch of stuff all in one
> > go?  Basically, an interface to notify_change() in the kernel that would allow
> > several stats to be set atomically.  This might be of particular interest to
> > network filesystems.
> > 
> > David
> > 
> 
> fchmodat2() is a simple addition that fits well with the existing syscalls.
> It fixes an oversight in fchmodat().
> 
> IMO we should just add fchmodat2(), and not get sidetracked by trying to add
> some super-generalized syscall instead.  That can always be done later.

Agreed.