diff mbox series

[v2] virtiofsd: prevent opening of special files (CVE-2020-35517)

Message ID 20210126103502.260758-1-stefanha@redhat.com
State New
Headers show
Series [v2] virtiofsd: prevent opening of special files (CVE-2020-35517) | expand

Commit Message

Stefan Hajnoczi Jan. 26, 2021, 10:35 a.m. UTC
A well-behaved FUSE client does not attempt to open special files with
FUSE_OPEN because they are handled on the client side (e.g. device nodes
are handled by client-side device drivers).

The check to prevent virtiofsd from opening special files is missing in
a few cases, most notably FUSE_OPEN. A malicious client can cause
virtiofsd to open a device node, potentially allowing the guest to
escape. This can be exploited by a modified guest device driver. It is
not exploitable from guest userspace since the guest kernel will handle
special files inside the guest instead of sending FUSE requests.

This patch adds the missing checks to virtiofsd. This is a short-term
solution because it does not prevent a compromised virtiofsd process
from opening device nodes on the host.

Reported-by: Alex Xu <alex@alxu.ca>
Fixes: CVE-2020-35517
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
v2:
 * Add doc comment clarifying that symlinks are traversed client-side
   [Daniel]

This issue was diagnosed on public IRC and is therefore already known
and not embargoed.

A stronger fix, and the long-term solution, is for users to mount the
shared directory and any sub-mounts with nodev, as well as nosuid and
noexec. Unfortunately virtiofsd cannot do this automatically because
bind mounts added by the user after virtiofsd has launched would not be
detected. I suggest the following:

1. Modify libvirt and Kata Containers to explicitly set these mount
   options.
2. Then modify virtiofsd to check that the shared directory has the
   necessary options at startup. Refuse to start if the options are
   missing so that the user is aware of the security requirements.

As a bonus this also increases the likelihood that other host processes
besides virtiofsd will be protected by nosuid/noexec/nodev so that a
malicious guest cannot drop these files in place and then arrange for a
host process to come across them.

Additionally, user namespaces have been discussed. They seem like a
worthwhile addition as an unprivileged or privilege-separated mode
although there are limitations with respect to security xattrs and the
actual uid/gid stored on the host file system not corresponding to the
guest uid/gid.
---
 tools/virtiofsd/passthrough_ll.c | 85 +++++++++++++++++++++-----------
 1 file changed, 57 insertions(+), 28 deletions(-)

Comments

Daniel P. Berrangé Jan. 26, 2021, 10:36 a.m. UTC | #1
On Tue, Jan 26, 2021 at 10:35:02AM +0000, Stefan Hajnoczi wrote:
> A well-behaved FUSE client does not attempt to open special files with
> FUSE_OPEN because they are handled on the client side (e.g. device nodes
> are handled by client-side device drivers).
> 
> The check to prevent virtiofsd from opening special files is missing in
> a few cases, most notably FUSE_OPEN. A malicious client can cause
> virtiofsd to open a device node, potentially allowing the guest to
> escape. This can be exploited by a modified guest device driver. It is
> not exploitable from guest userspace since the guest kernel will handle
> special files inside the guest instead of sending FUSE requests.
> 
> This patch adds the missing checks to virtiofsd. This is a short-term
> solution because it does not prevent a compromised virtiofsd process
> from opening device nodes on the host.
> 
> Reported-by: Alex Xu <alex@alxu.ca>
> Fixes: CVE-2020-35517
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>


Regards,
Daniel
Liam Merwick Jan. 26, 2021, 10:47 a.m. UTC | #2
On 26/01/2021 10:35, Stefan Hajnoczi wrote:
> A well-behaved FUSE client does not attempt to open special files with
> FUSE_OPEN because they are handled on the client side (e.g. device nodes
> are handled by client-side device drivers).
> 
> The check to prevent virtiofsd from opening special files is missing in
> a few cases, most notably FUSE_OPEN. A malicious client can cause
> virtiofsd to open a device node, potentially allowing the guest to
> escape. This can be exploited by a modified guest device driver. It is
> not exploitable from guest userspace since the guest kernel will handle
> special files inside the guest instead of sending FUSE requests.
> 
> This patch adds the missing checks to virtiofsd. This is a short-term
> solution because it does not prevent a compromised virtiofsd process
> from opening device nodes on the host.
> 
> Reported-by: Alex Xu <alex@alxu.ca>
> Fixes: CVE-2020-35517


Should this also explicitly

Cc: qemu-stable@nongnu.org

Or would only a long-term fix target that?

Reviewed-by: Liam Merwick <liam.merwick@oracle.com>

> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
> v2:
>   * Add doc comment clarifying that symlinks are traversed client-side
>     [Daniel]
> 
> This issue was diagnosed on public IRC and is therefore already known
> and not embargoed.
> 
> A stronger fix, and the long-term solution, is for users to mount the
> shared directory and any sub-mounts with nodev, as well as nosuid and
> noexec. Unfortunately virtiofsd cannot do this automatically because
> bind mounts added by the user after virtiofsd has launched would not be
> detected. I suggest the following:
> 
> 1. Modify libvirt and Kata Containers to explicitly set these mount
>     options.
> 2. Then modify virtiofsd to check that the shared directory has the
>     necessary options at startup. Refuse to start if the options are
>     missing so that the user is aware of the security requirements.
> 
> As a bonus this also increases the likelihood that other host processes
> besides virtiofsd will be protected by nosuid/noexec/nodev so that a
> malicious guest cannot drop these files in place and then arrange for a
> host process to come across them.
> 
> Additionally, user namespaces have been discussed. They seem like a
> worthwhile addition as an unprivileged or privilege-separated mode
> although there are limitations with respect to security xattrs and the
> actual uid/gid stored on the host file system not corresponding to the
> guest uid/gid.
> ---
>   tools/virtiofsd/passthrough_ll.c | 85 +++++++++++++++++++++-----------
>   1 file changed, 57 insertions(+), 28 deletions(-)
>
Greg Kurz Jan. 26, 2021, 5:16 p.m. UTC | #3
On Tue, 26 Jan 2021 10:35:02 +0000
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> A well-behaved FUSE client does not attempt to open special files with
> FUSE_OPEN because they are handled on the client side (e.g. device nodes
> are handled by client-side device drivers).
> 
> The check to prevent virtiofsd from opening special files is missing in
> a few cases, most notably FUSE_OPEN. A malicious client can cause
> virtiofsd to open a device node, potentially allowing the guest to
> escape. 

or pretty much anything nasty you can think of, e.g. DoS if the malicious
client repeatedly asks virtiofsd to open FIFOs the other side of which is
never opened.

> This can be exploited by a modified guest device driver. It is
> not exploitable from guest userspace since the guest kernel will handle
> special files inside the guest instead of sending FUSE requests.
> 
> This patch adds the missing checks to virtiofsd. This is a short-term
> solution because it does not prevent a compromised virtiofsd process
> from opening device nodes on the host.
> 
> Reported-by: Alex Xu <alex@alxu.ca>
> Fixes: CVE-2020-35517
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---

The patch looks pretty good to me. It just seems to be missing a change in
lo_create():

    fd = openat(parent_inode->fd, name, (fi->flags | O_CREAT) & ~O_NOFOLLOW,
                mode);

A malicious guest could have created anything called ${name} in this directory
before calling FUSE_CREATE and we'll open it blindly, or I'm missing something ?

> v2:
>  * Add doc comment clarifying that symlinks are traversed client-side
>    [Daniel]
> 
> This issue was diagnosed on public IRC and is therefore already known
> and not embargoed.
> 
> A stronger fix, and the long-term solution, is for users to mount the
> shared directory and any sub-mounts with nodev, as well as nosuid and
> noexec. Unfortunately virtiofsd cannot do this automatically because
> bind mounts added by the user after virtiofsd has launched would not be
> detected. I suggest the following:
> 
> 1. Modify libvirt and Kata Containers to explicitly set these mount
>    options.
> 2. Then modify virtiofsd to check that the shared directory has the
>    necessary options at startup. Refuse to start if the options are
>    missing so that the user is aware of the security requirements.
> 
> As a bonus this also increases the likelihood that other host processes
> besides virtiofsd will be protected by nosuid/noexec/nodev so that a
> malicious guest cannot drop these files in place and then arrange for a
> host process to come across them.
> 
> Additionally, user namespaces have been discussed. They seem like a
> worthwhile addition as an unprivileged or privilege-separated mode
> although there are limitations with respect to security xattrs and the
> actual uid/gid stored on the host file system not corresponding to the
> guest uid/gid.
> ---
>  tools/virtiofsd/passthrough_ll.c | 85 +++++++++++++++++++++-----------
>  1 file changed, 57 insertions(+), 28 deletions(-)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index 5fb36d9407..b722f43809 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -555,6 +555,30 @@ static int lo_fd(fuse_req_t req, fuse_ino_t ino)
>      return fd;
>  }
>  
> +/*
> + * Open a file descriptor for an inode. Returns -EBADF if the inode is not a
> + * regular file or a directory. Use this helper function instead of raw
> + * openat(2) to prevent security issues when a malicious client opens special
> + * files such as block device nodes. Symlink inodes are also rejected since
> + * symlinks must already have been traversed on the client side.
> + */
> +static int lo_inode_open(struct lo_data *lo, struct lo_inode *inode,
> +                         int open_flags)
> +{
> +    g_autofree char *fd_str = g_strdup_printf("%d", inode->fd);
> +    int fd;
> +
> +    if (!S_ISREG(inode->filetype) && !S_ISDIR(inode->filetype)) {
> +        return -EBADF;
> +    }
> +
> +    fd = openat(lo->proc_self_fd, fd_str, open_flags);
> +    if (fd < 0) {
> +        return -errno;
> +    }
> +    return fd;
> +}
> +
>  static void lo_init(void *userdata, struct fuse_conn_info *conn)
>  {
>      struct lo_data *lo = (struct lo_data *)userdata;
> @@ -684,8 +708,7 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
>          if (fi) {
>              truncfd = fd;
>          } else {
> -            sprintf(procname, "%i", ifd);
> -            truncfd = openat(lo->proc_self_fd, procname, O_RDWR);
> +            truncfd = lo_inode_open(lo, inode, O_RDWR);
>              if (truncfd < 0) {
>                  goto out_err;
>              }
> @@ -1725,7 +1748,6 @@ static struct lo_inode_plock *lookup_create_plock_ctx(struct lo_data *lo,
>                                                        pid_t pid, int *err)
>  {
>      struct lo_inode_plock *plock;
> -    char procname[64];
>      int fd;
>  
>      plock =
> @@ -1742,12 +1764,10 @@ static struct lo_inode_plock *lookup_create_plock_ctx(struct lo_data *lo,
>      }
>  
>      /* Open another instance of file which can be used for ofd locks. */
> -    sprintf(procname, "%i", inode->fd);
> -
>      /* TODO: What if file is not writable? */
> -    fd = openat(lo->proc_self_fd, procname, O_RDWR);
> -    if (fd == -1) {
> -        *err = errno;
> +    fd = lo_inode_open(lo, inode, O_RDWR);
> +    if (fd < 0) {
> +        *err = -fd;
>          free(plock);
>          return NULL;
>      }
> @@ -1894,18 +1914,24 @@ static void lo_open(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
>  {
>      int fd;
>      ssize_t fh;
> -    char buf[64];
>      struct lo_data *lo = lo_data(req);
> +    struct lo_inode *inode = lo_inode(req, ino);
>  
>      fuse_log(FUSE_LOG_DEBUG, "lo_open(ino=%" PRIu64 ", flags=%d)\n", ino,
>               fi->flags);
>  
> +    if (!inode) {
> +        fuse_reply_err(req, EBADF);
> +        return;
> +    }
> +
>      update_open_flags(lo->writeback, lo->allow_direct_io, fi);
>  
> -    sprintf(buf, "%i", lo_fd(req, ino));
> -    fd = openat(lo->proc_self_fd, buf, fi->flags & ~O_NOFOLLOW);
> -    if (fd == -1) {
> -        return (void)fuse_reply_err(req, errno);
> +    fd = lo_inode_open(lo, inode, fi->flags & ~O_NOFOLLOW);
> +    if (fd < 0) {
> +        lo_inode_put(lo, &inode);
> +        fuse_reply_err(req, -fd);
> +        return;
>      }
>  
>      pthread_mutex_lock(&lo->mutex);
> @@ -1913,6 +1939,7 @@ static void lo_open(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
>      pthread_mutex_unlock(&lo->mutex);
>      if (fh == -1) {
>          close(fd);
> +        lo_inode_put(lo, &inode);
>          fuse_reply_err(req, ENOMEM);
>          return;
>      }
> @@ -1923,6 +1950,7 @@ static void lo_open(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
>      } else if (lo->cache == CACHE_ALWAYS) {
>          fi->keep_cache = 1;
>      }
> +    lo_inode_put(lo, &inode);
>      fuse_reply_open(req, fi);
>  }
>  
> @@ -1982,39 +2010,40 @@ static void lo_flush(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
>  static void lo_fsync(fuse_req_t req, fuse_ino_t ino, int datasync,
>                       struct fuse_file_info *fi)
>  {
> +    struct lo_inode *inode = lo_inode(req, ino);
> +    struct lo_data *lo = lo_data(req);
>      int res;
>      int fd;
> -    char *buf;
>  
>      fuse_log(FUSE_LOG_DEBUG, "lo_fsync(ino=%" PRIu64 ", fi=0x%p)\n", ino,
>               (void *)fi);
>  
> +    if (!inode) {
> +        fuse_reply_err(req, EBADF);
> +        return;
> +    }
> +
>      if (!fi) {
> -        struct lo_data *lo = lo_data(req);
> -
> -        res = asprintf(&buf, "%i", lo_fd(req, ino));
> -        if (res == -1) {
> -            return (void)fuse_reply_err(req, errno);
> -        }
> -
> -        fd = openat(lo->proc_self_fd, buf, O_RDWR);
> -        free(buf);
> -        if (fd == -1) {
> -            return (void)fuse_reply_err(req, errno);
> +        fd = lo_inode_open(lo, inode, O_RDWR);
> +        if (fd < 0) {
> +            res = -fd;
> +            goto out;
>          }
>      } else {
>          fd = lo_fi_fd(req, fi);
>      }
>  
>      if (datasync) {
> -        res = fdatasync(fd);
> +        res = fdatasync(fd) == -1 ? errno : 0;
>      } else {
> -        res = fsync(fd);
> +        res = fsync(fd) == -1 ? errno : 0;
>      }
>      if (!fi) {
>          close(fd);
>      }
> -    fuse_reply_err(req, res == -1 ? errno : 0);
> +out:
> +    lo_inode_put(lo, &inode);
> +    fuse_reply_err(req, res);
>  }
>  
>  static void lo_read(fuse_req_t req, fuse_ino_t ino, size_t size, off_t offset,
Miklos Szeredi Jan. 27, 2021, 9:25 a.m. UTC | #4
On Tue, Jan 26, 2021 at 6:18 PM Greg Kurz <groug@kaod.org> wrote:
>
> On Tue, 26 Jan 2021 10:35:02 +0000
> Stefan Hajnoczi <stefanha@redhat.com> wrote:

> The patch looks pretty good to me. It just seems to be missing a change in
> lo_create():
>
>     fd = openat(parent_inode->fd, name, (fi->flags | O_CREAT) & ~O_NOFOLLOW,
>                 mode);
>
> A malicious guest could have created anything called ${name} in this directory
> before calling FUSE_CREATE and we'll open it blindly, or I'm missing something ?

Right, this seems like an omission.

Also the "& ~O_NOFOLLOW" looks like a copy-paste bug, since unlike
lo_open(), lo_create() is not opening a proc symlink.

So that should be replaced with "| O_NOFOLLOW"

Thanks,
Miklos
Stefan Hajnoczi Jan. 27, 2021, 10:18 a.m. UTC | #5
On Tue, Jan 26, 2021 at 06:16:04PM +0100, Greg Kurz wrote:
> On Tue, 26 Jan 2021 10:35:02 +0000
> Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> > A well-behaved FUSE client does not attempt to open special files with
> > FUSE_OPEN because they are handled on the client side (e.g. device nodes
> > are handled by client-side device drivers).
> > 
> > The check to prevent virtiofsd from opening special files is missing in
> > a few cases, most notably FUSE_OPEN. A malicious client can cause
> > virtiofsd to open a device node, potentially allowing the guest to
> > escape. 
> 
> or pretty much anything nasty you can think of, e.g. DoS if the malicious
> client repeatedly asks virtiofsd to open FIFOs the other side of which is
> never opened.
> 
> > This can be exploited by a modified guest device driver. It is
> > not exploitable from guest userspace since the guest kernel will handle
> > special files inside the guest instead of sending FUSE requests.
> > 
> > This patch adds the missing checks to virtiofsd. This is a short-term
> > solution because it does not prevent a compromised virtiofsd process
> > from opening device nodes on the host.
> > 
> > Reported-by: Alex Xu <alex@alxu.ca>
> > Fixes: CVE-2020-35517
> > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> 
> The patch looks pretty good to me. It just seems to be missing a change in
> lo_create():
> 
>     fd = openat(parent_inode->fd, name, (fi->flags | O_CREAT) & ~O_NOFOLLOW,
>                 mode);
> 
> A malicious guest could have created anything called ${name} in this directory
> before calling FUSE_CREATE and we'll open it blindly, or I'm missing something ?

Good point! I will send another revision.

Stefan
Greg Kurz Jan. 27, 2021, 10:20 a.m. UTC | #6
On Wed, 27 Jan 2021 10:25:28 +0100
Miklos Szeredi <mszeredi@redhat.com> wrote:

> On Tue, Jan 26, 2021 at 6:18 PM Greg Kurz <groug@kaod.org> wrote:
> >
> > On Tue, 26 Jan 2021 10:35:02 +0000
> > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> > The patch looks pretty good to me. It just seems to be missing a change in
> > lo_create():
> >
> >     fd = openat(parent_inode->fd, name, (fi->flags | O_CREAT) & ~O_NOFOLLOW,
> >                 mode);
> >
> > A malicious guest could have created anything called ${name} in this directory
> > before calling FUSE_CREATE and we'll open it blindly, or I'm missing something ?
> 
> Right, this seems like an omission.
> 
> Also the "& ~O_NOFOLLOW" looks like a copy-paste bug, since unlike
> lo_open(), lo_create() is not opening a proc symlink.
> 
> So that should be replaced with "| O_NOFOLLOW"
> 


Yes, I've realized that later on. We should definitely enforce O_NOFOLLOW
to avoid symlink escapes.

Then comes the case of special files... A well-known case is the FIFO that
causes openat() to block as described in my response. FWIW, we addressed
this one in 9P by adding O_NONBLOCK and fixing the flags to the client
expectation with fcntl(F_SETFL). But this is just a protection against
being blocked. Blindly opening a special file can lead to any kind of
troubles you can think of... so it really looks that the only sane way
to be safe from such an attack is to forbid openat() of special files at
the filesystem level.

> Thanks,
> Miklos
>
Miklos Szeredi Jan. 27, 2021, 10:34 a.m. UTC | #7
On Wed, Jan 27, 2021 at 11:20 AM Greg Kurz <groug@kaod.org> wrote:
>
> On Wed, 27 Jan 2021 10:25:28 +0100
> Miklos Szeredi <mszeredi@redhat.com> wrote:
>
> > On Tue, Jan 26, 2021 at 6:18 PM Greg Kurz <groug@kaod.org> wrote:
> > >
> > > On Tue, 26 Jan 2021 10:35:02 +0000
> > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > > The patch looks pretty good to me. It just seems to be missing a change in
> > > lo_create():
> > >
> > >     fd = openat(parent_inode->fd, name, (fi->flags | O_CREAT) & ~O_NOFOLLOW,
> > >                 mode);
> > >
> > > A malicious guest could have created anything called ${name} in this directory
> > > before calling FUSE_CREATE and we'll open it blindly, or I'm missing something ?
> >
> > Right, this seems like an omission.
> >
> > Also the "& ~O_NOFOLLOW" looks like a copy-paste bug, since unlike
> > lo_open(), lo_create() is not opening a proc symlink.
> >
> > So that should be replaced with "| O_NOFOLLOW"
> >
>
>
> Yes, I've realized that later on. We should definitely enforce O_NOFOLLOW
> to avoid symlink escapes.
>
> Then comes the case of special files... A well-known case is the FIFO that
> causes openat() to block as described in my response. FWIW, we addressed
> this one in 9P by adding O_NONBLOCK and fixing the flags to the client
> expectation with fcntl(F_SETFL). But this is just a protection against
> being blocked. Blindly opening a special file can lead to any kind of
> troubles you can think of... so it really looks that the only sane way
> to be safe from such an attack is to forbid openat() of special files at
> the filesystem level.

Another solution specifically for O_CREAT without O_EXCL would be to
turn it into an exclusive create.   If that fails with EEXIST then try
the normal open path (open with O_PATH, fstat, open proc symlink).  If
that fails with ENOENT, then retry the whole thing a certain number of
times.  If it still fails then somebody is definitely messing with us
and we can fail the request with EIO.

Rather ugly, but I can't think of anything better.

Thanks,
Miklos
Greg Kurz Jan. 27, 2021, 1:49 p.m. UTC | #8
On Wed, 27 Jan 2021 11:34:52 +0100
Miklos Szeredi <mszeredi@redhat.com> wrote:

> On Wed, Jan 27, 2021 at 11:20 AM Greg Kurz <groug@kaod.org> wrote:
> >
> > On Wed, 27 Jan 2021 10:25:28 +0100
> > Miklos Szeredi <mszeredi@redhat.com> wrote:
> >
> > > On Tue, Jan 26, 2021 at 6:18 PM Greg Kurz <groug@kaod.org> wrote:
> > > >
> > > > On Tue, 26 Jan 2021 10:35:02 +0000
> > > > Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > > The patch looks pretty good to me. It just seems to be missing a change in
> > > > lo_create():
> > > >
> > > >     fd = openat(parent_inode->fd, name, (fi->flags | O_CREAT) & ~O_NOFOLLOW,
> > > >                 mode);
> > > >
> > > > A malicious guest could have created anything called ${name} in this directory
> > > > before calling FUSE_CREATE and we'll open it blindly, or I'm missing something ?
> > >
> > > Right, this seems like an omission.
> > >
> > > Also the "& ~O_NOFOLLOW" looks like a copy-paste bug, since unlike
> > > lo_open(), lo_create() is not opening a proc symlink.
> > >
> > > So that should be replaced with "| O_NOFOLLOW"
> > >
> >
> >
> > Yes, I've realized that later on. We should definitely enforce O_NOFOLLOW
> > to avoid symlink escapes.
> >
> > Then comes the case of special files... A well-known case is the FIFO that
> > causes openat() to block as described in my response. FWIW, we addressed
> > this one in 9P by adding O_NONBLOCK and fixing the flags to the client
> > expectation with fcntl(F_SETFL). But this is just a protection against
> > being blocked. Blindly opening a special file can lead to any kind of
> > troubles you can think of... so it really looks that the only sane way
> > to be safe from such an attack is to forbid openat() of special files at
> > the filesystem level.
> 
> Another solution specifically for O_CREAT without O_EXCL would be to
> turn it into an exclusive create.  

Would this added O_EXCL then appear on the client side, e.g. to
guest userspace doing fcntl(F_GETFL) ?

> If that fails with EEXIST then try
> the normal open path (open with O_PATH, fstat, open proc symlink).  If

open(O_PATH | O_NOFOLLOW) + fstatat(AT_EMPTY_PATH|AT_SYMLINK_NOFOLLOW)
would indeed allow to filter out anything that isn't a directory and
to safely open the proc symlink.

> that fails with ENOENT, then retry the whole thing a certain number of

Indeed someone could have unlinked the file in the meantime, in which
case the open(O_PATH | O_NOFOLLOW) would fail, but if it succeeds then
we cannot hit ENOENT anymore AFAICT.

> times.  If it still fails then somebody is definitely messing with us
> and we can fail the request with EIO.
> 

Not sure what the retry+timeout is trying to mitigate here... why not
returning EIO right away ?


> Rather ugly, but I can't think of anything better.
> 
> Thanks,
> Miklos
>
Miklos Szeredi Jan. 27, 2021, 2:09 p.m. UTC | #9
On Wed, Jan 27, 2021 at 2:49 PM Greg Kurz <groug@kaod.org> wrote:
>
> On Wed, 27 Jan 2021 11:34:52 +0100
> Miklos Szeredi <mszeredi@redhat.com> wrote:

> > Another solution specifically for O_CREAT without O_EXCL would be to
> > turn it into an exclusive create.
>
> Would this added O_EXCL then appear on the client side, e.g. to
> guest userspace doing fcntl(F_GETFL) ?

No.  Guest kernel keeps track of open flags.

> > If that fails with EEXIST then try
> > the normal open path (open with O_PATH, fstat, open proc symlink).  If
>
> open(O_PATH | O_NOFOLLOW) + fstatat(AT_EMPTY_PATH|AT_SYMLINK_NOFOLLOW)
> would indeed allow to filter out anything that isn't a directory and
> to safely open the proc symlink.
>
> > that fails with ENOENT, then retry the whole thing a certain number of
>
> Indeed someone could have unlinked the file in the meantime, in which
> case the open(O_PATH | O_NOFOLLOW) would fail, but if it succeeds then
> we cannot hit ENOENT anymore AFAICT.

Right.

> > times.  If it still fails then somebody is definitely messing with us
> > and we can fail the request with EIO.
> >
>
> Not sure what the retry+timeout is trying to mitigate here... why not
> returning EIO right away ?

The semantics of O_CREATE are that it can fail neither because the
file exists nor because it doesn't.  This doesn't matter if the
exported tree is not modified outside of a single guest, because of
locking provided by the guest kernel.

However if we want to support shared access to a tree then O_CREAT
semantics should work even in the face of races due to external
modification of the tree.  I.e. following race:

virtiofsd: open(foo, O_CREAT | O_EXCL) -> EEXIST
other task: unlink(foo)
virtiofsd: open(foo, O_PATH | O_NOFOLLOW) -> ENOENT

To properly support the above the O_CREAT | O_EXCL open would need to
be retried.

Thanks,
Miklos
Greg Kurz Jan. 27, 2021, 3:09 p.m. UTC | #10
On Wed, 27 Jan 2021 15:09:50 +0100
Miklos Szeredi <mszeredi@redhat.com> wrote:

> On Wed, Jan 27, 2021 at 2:49 PM Greg Kurz <groug@kaod.org> wrote:
> >
> > On Wed, 27 Jan 2021 11:34:52 +0100
> > Miklos Szeredi <mszeredi@redhat.com> wrote:
> 
> > > Another solution specifically for O_CREAT without O_EXCL would be to
> > > turn it into an exclusive create.
> >
> > Would this added O_EXCL then appear on the client side, e.g. to
> > guest userspace doing fcntl(F_GETFL) ?
> 
> No.  Guest kernel keeps track of open flags.
> 

That was my impression as well as I didn't find a FUSE_GETFL request.
Thanks for confirming that !

> > > If that fails with EEXIST then try
> > > the normal open path (open with O_PATH, fstat, open proc symlink).  If
> >
> > open(O_PATH | O_NOFOLLOW) + fstatat(AT_EMPTY_PATH|AT_SYMLINK_NOFOLLOW)
> > would indeed allow to filter out anything that isn't a directory and
> > to safely open the proc symlink.
> >
> > > that fails with ENOENT, then retry the whole thing a certain number of
> >
> > Indeed someone could have unlinked the file in the meantime, in which
> > case the open(O_PATH | O_NOFOLLOW) would fail, but if it succeeds then
> > we cannot hit ENOENT anymore AFAICT.
> 
> Right.
> 
> > > times.  If it still fails then somebody is definitely messing with us
> > > and we can fail the request with EIO.
> > >
> >
> > Not sure what the retry+timeout is trying to mitigate here... why not
> > returning EIO right away ?
> 
> The semantics of O_CREATE are that it can fail neither because the
> file exists nor because it doesn't.  This doesn't matter if the
> exported tree is not modified outside of a single guest, because of
> locking provided by the guest kernel.
> 

Wrong. O_CREAT can legitimately fail with ENOENT if one element
of the pathname doesn't exist. And even if pathname only has
one element, you can still have O_CREAT to fail the same way
if the path of the parent directory is removed.

cat>enoent.c<<EOF
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	mkdir("foo", 0777);
	chdir("foo");
	rmdir("../foo");
	open("bar", O_CREAT);
}
EOF
make enoent
strace ./enoent

[...]

mkdir("foo", 0777)                      = 0
chdir("foo")                            = 0
rmdir("../foo")                         = 0
openat(AT_FDCWD, "bar", O_RDONLY|O_CREAT, 0117130) = -1 ENOENT (No such file or directory)

> However if we want to support shared access to a tree then O_CREAT
> semantics should work even in the face of races due to external
> modification of the tree.  I.e. following race:
> 

Yeah, handling shared access is where the fun starts :)

> virtiofsd: open(foo, O_CREAT | O_EXCL) -> EEXIST
> other task: unlink(foo)
> virtiofsd: open(foo, O_PATH | O_NOFOLLOW) -> ENOENT
> 
> To properly support the above the O_CREAT | O_EXCL open would need to
> be retried.
> 

But in this case, it seems fine to return ENOENT since
the guest userspace cannot really assume it never happens.

> Thanks,
> Miklos
>
Miklos Szeredi Jan. 27, 2021, 3:22 p.m. UTC | #11
On Wed, Jan 27, 2021 at 4:09 PM Greg Kurz <groug@kaod.org> wrote:
>
> On Wed, 27 Jan 2021 15:09:50 +0100
> Miklos Szeredi <mszeredi@redhat.com> wrote:
> > The semantics of O_CREATE are that it can fail neither because the
> > file exists nor because it doesn't.  This doesn't matter if the
> > exported tree is not modified outside of a single guest, because of
> > locking provided by the guest kernel.
> >
>
> Wrong. O_CREAT can legitimately fail with ENOENT if one element

Let me make my  statement more precise:

O_CREAT cannot fail with ENOENT if parent directory exists throughout
the operation.

I'm sure this property is used all over the place in userspace code,
and hence should be supported in strict coherence (cache=none) mode.

For relaxed coherence (cache=auto) I'm not quite sure.  NFS is usually
the reference, we'd need to look into what guarantees it provides wrt.
O_CREAT and remote racing unlink.

Thanks,
Miklos
Greg Kurz Jan. 27, 2021, 3:35 p.m. UTC | #12
On Wed, 27 Jan 2021 16:22:49 +0100
Miklos Szeredi <mszeredi@redhat.com> wrote:

> On Wed, Jan 27, 2021 at 4:09 PM Greg Kurz <groug@kaod.org> wrote:
> >
> > On Wed, 27 Jan 2021 15:09:50 +0100
> > Miklos Szeredi <mszeredi@redhat.com> wrote:
> > > The semantics of O_CREATE are that it can fail neither because the
> > > file exists nor because it doesn't.  This doesn't matter if the
> > > exported tree is not modified outside of a single guest, because of
> > > locking provided by the guest kernel.
> > >
> >
> > Wrong. O_CREAT can legitimately fail with ENOENT if one element
> 
> Let me make my  statement more precise:
> 
> O_CREAT cannot fail with ENOENT if parent directory exists throughout
> the operation.
> 

True, but I still don't see what guarantees guest userspace that the
parent directory doesn't go away... I must have missed something.
Please elaborate.

> I'm sure this property is used all over the place in userspace code,
> and hence should be supported in strict coherence (cache=none) mode.
> 
> For relaxed coherence (cache=auto) I'm not quite sure.  NFS is usually
> the reference, we'd need to look into what guarantees it provides wrt.
> O_CREAT and remote racing unlink.
> 
> Thanks,
> Miklos
>
Miklos Szeredi Jan. 27, 2021, 3:47 p.m. UTC | #13
On Wed, Jan 27, 2021 at 4:35 PM Greg Kurz <groug@kaod.org> wrote:
>
> On Wed, 27 Jan 2021 16:22:49 +0100
> Miklos Szeredi <mszeredi@redhat.com> wrote:
>
> > On Wed, Jan 27, 2021 at 4:09 PM Greg Kurz <groug@kaod.org> wrote:
> > >
> > > On Wed, 27 Jan 2021 15:09:50 +0100
> > > Miklos Szeredi <mszeredi@redhat.com> wrote:
> > > > The semantics of O_CREATE are that it can fail neither because the
> > > > file exists nor because it doesn't.  This doesn't matter if the
> > > > exported tree is not modified outside of a single guest, because of
> > > > locking provided by the guest kernel.
> > > >
> > >
> > > Wrong. O_CREAT can legitimately fail with ENOENT if one element
> >
> > Let me make my  statement more precise:
> >
> > O_CREAT cannot fail with ENOENT if parent directory exists throughout
> > the operation.
> >
>
> True, but I still don't see what guarantees guest userspace that the
> parent directory doesn't go away... I must have missed something.
> Please elaborate.

Generally there's no guarantee, however there can be certain
situations where the caller can indeed rely on the existence of the
parent (e.g. /tmp).

Thanks,
Miklos
Miklos Szeredi Jan. 27, 2021, 3:52 p.m. UTC | #14
On Wed, Jan 27, 2021 at 4:47 PM Miklos Szeredi <mszeredi@redhat.com> wrote:
>
> On Wed, Jan 27, 2021 at 4:35 PM Greg Kurz <groug@kaod.org> wrote:
> >
> > On Wed, 27 Jan 2021 16:22:49 +0100
> > Miklos Szeredi <mszeredi@redhat.com> wrote:
> >
> > > On Wed, Jan 27, 2021 at 4:09 PM Greg Kurz <groug@kaod.org> wrote:
> > > >
> > > > On Wed, 27 Jan 2021 15:09:50 +0100
> > > > Miklos Szeredi <mszeredi@redhat.com> wrote:
> > > > > The semantics of O_CREATE are that it can fail neither because the
> > > > > file exists nor because it doesn't.  This doesn't matter if the
> > > > > exported tree is not modified outside of a single guest, because of
> > > > > locking provided by the guest kernel.
> > > > >
> > > >
> > > > Wrong. O_CREAT can legitimately fail with ENOENT if one element
> > >
> > > Let me make my  statement more precise:
> > >
> > > O_CREAT cannot fail with ENOENT if parent directory exists throughout
> > > the operation.
> > >
> >
> > True, but I still don't see what guarantees guest userspace that the
> > parent directory doesn't go away... I must have missed something.
> > Please elaborate.
>
> Generally there's no guarantee, however there can be certain
> situations where the caller can indeed rely on the existence of the
> parent (e.g. /tmp).

Example from the virtiofs repo:

https://gitlab.com/virtio-fs/ireg/-/blob/master/ireg.c#L315
https://gitlab.com/virtio-fs/ireg/-/blob/master/ireg.c#L382

In that case breaking O_CREAT would be harmless, since no two
instances are allowed anyway, so it would just give a confusing error.
But it is breakage and it probably wouldn't be hard to find much worse
breakage in real life applications.

Thanks,
Miklos
Greg Kurz Jan. 28, 2021, 12:14 p.m. UTC | #15
On Wed, 27 Jan 2021 16:52:56 +0100
Miklos Szeredi <mszeredi@redhat.com> wrote:

> On Wed, Jan 27, 2021 at 4:47 PM Miklos Szeredi <mszeredi@redhat.com> wrote:
> >
> > On Wed, Jan 27, 2021 at 4:35 PM Greg Kurz <groug@kaod.org> wrote:
> > >
> > > On Wed, 27 Jan 2021 16:22:49 +0100
> > > Miklos Szeredi <mszeredi@redhat.com> wrote:
> > >
> > > > On Wed, Jan 27, 2021 at 4:09 PM Greg Kurz <groug@kaod.org> wrote:
> > > > >
> > > > > On Wed, 27 Jan 2021 15:09:50 +0100
> > > > > Miklos Szeredi <mszeredi@redhat.com> wrote:
> > > > > > The semantics of O_CREATE are that it can fail neither because the
> > > > > > file exists nor because it doesn't.  This doesn't matter if the
> > > > > > exported tree is not modified outside of a single guest, because of
> > > > > > locking provided by the guest kernel.
> > > > > >
> > > > >
> > > > > Wrong. O_CREAT can legitimately fail with ENOENT if one element
> > > >
> > > > Let me make my  statement more precise:
> > > >
> > > > O_CREAT cannot fail with ENOENT if parent directory exists throughout
> > > > the operation.
> > > >
> > >
> > > True, but I still don't see what guarantees guest userspace that the
> > > parent directory doesn't go away... I must have missed something.
> > > Please elaborate.
> >
> > Generally there's no guarantee, however there can be certain
> > situations where the caller can indeed rely on the existence of the
> > parent (e.g. /tmp).
> 
> Example from the virtiofs repo:
> 
> https://gitlab.com/virtio-fs/ireg/-/blob/master/ireg.c#L315
> https://gitlab.com/virtio-fs/ireg/-/blob/master/ireg.c#L382
> 
> In that case breaking O_CREAT would be harmless, since no two
> instances are allowed anyway, so it would just give a confusing error.
> But it is breakage and it probably wouldn't be hard to find much worse
> breakage in real life applications.
> 

Ok, I get your point : a guest userspace application can't expect
to hit ENOENT when doing open("/some_dir/foo", O_CREAT) even if
someone else is doing unlink("/some_dir/foo"), which is a different
case than somebody doing rmdir("/some_dir").

But we still have a TOCTOU : the open(O_CREAT|O_EXCL) acts as
the check to use open(O_PATH) and retry+timeout can't fix it
reliably.

A possible fix for the case where the race comes from the
client itself would be to serialize FUSE requests that
create/remove paths in the same directory. I don't know
enough the code yet to assess if it's doable though.

Then this would leave the case where something else on
the host is racing with virtiofsd. IMHO this belongs to
the broader family of "bad things the host can do
in our back". This requires a bigger hammer than
adding band-aids here and there IMHO. So in the
scope of this patch, I don't think we should retry
and timeout, but just return whatever errno that
makes sense.

> Thanks,
> Miklos
>
Miklos Szeredi Jan. 28, 2021, 2 p.m. UTC | #16
On Thu, Jan 28, 2021 at 1:15 PM Greg Kurz <groug@kaod.org> wrote:
>
> On Wed, 27 Jan 2021 16:52:56 +0100
> Miklos Szeredi <mszeredi@redhat.com> wrote:
>
> > On Wed, Jan 27, 2021 at 4:47 PM Miklos Szeredi <mszeredi@redhat.com> wrote:
> > >
> > > On Wed, Jan 27, 2021 at 4:35 PM Greg Kurz <groug@kaod.org> wrote:
> > > >
> > > > On Wed, 27 Jan 2021 16:22:49 +0100
> > > > Miklos Szeredi <mszeredi@redhat.com> wrote:
> > > >
> > > > > On Wed, Jan 27, 2021 at 4:09 PM Greg Kurz <groug@kaod.org> wrote:
> > > > > >
> > > > > > On Wed, 27 Jan 2021 15:09:50 +0100
> > > > > > Miklos Szeredi <mszeredi@redhat.com> wrote:
> > > > > > > The semantics of O_CREATE are that it can fail neither because the
> > > > > > > file exists nor because it doesn't.  This doesn't matter if the
> > > > > > > exported tree is not modified outside of a single guest, because of
> > > > > > > locking provided by the guest kernel.
> > > > > > >
> > > > > >
> > > > > > Wrong. O_CREAT can legitimately fail with ENOENT if one element
> > > > >
> > > > > Let me make my  statement more precise:
> > > > >
> > > > > O_CREAT cannot fail with ENOENT if parent directory exists throughout
> > > > > the operation.
> > > > >
> > > >
> > > > True, but I still don't see what guarantees guest userspace that the
> > > > parent directory doesn't go away... I must have missed something.
> > > > Please elaborate.
> > >
> > > Generally there's no guarantee, however there can be certain
> > > situations where the caller can indeed rely on the existence of the
> > > parent (e.g. /tmp).
> >
> > Example from the virtiofs repo:
> >
> > https://gitlab.com/virtio-fs/ireg/-/blob/master/ireg.c#L315
> > https://gitlab.com/virtio-fs/ireg/-/blob/master/ireg.c#L382
> >
> > In that case breaking O_CREAT would be harmless, since no two
> > instances are allowed anyway, so it would just give a confusing error.
> > But it is breakage and it probably wouldn't be hard to find much worse
> > breakage in real life applications.
> >
>
> Ok, I get your point : a guest userspace application can't expect
> to hit ENOENT when doing open("/some_dir/foo", O_CREAT) even if
> someone else is doing unlink("/some_dir/foo"), which is a different
> case than somebody doing rmdir("/some_dir").
>
> But we still have a TOCTOU : the open(O_CREAT|O_EXCL) acts as
> the check to use open(O_PATH) and retry+timeout can't fix it
> reliably.

Right.

> A possible fix for the case where the race comes from the
> client itself would be to serialize FUSE requests that
> create/remove paths in the same directory. I don't know
> enough the code yet to assess if it's doable though.
>
> Then this would leave the case where something else on
> the host is racing with virtiofsd. IMHO this belongs to
> the broader family of "bad things the host can do
> in our back". This requires a bigger hammer than
> adding band-aids here and there IMHO. So in the
> scope of this patch, I don't think we should retry
> and timeout, but just return whatever errno that
> makes sense.

I never suggested a timeout, that would indeed be nonsense.

Just do a simple retry loop with a counter.  I'd set counter to a
small number (5 or whatever), so that basically any accidental races
are cared for, and the only case that would trigger the EIO is if the
file was constantly removed and recreated (and even in that case it
would be pretty difficult to trigger).  This would add only minimal
complexity or overhead.

The proper solution might be adding O_REGULAR, and it actually would
be useful for other O_CREAT users, since it's probably what they want
anyway (but existing semantics can't be changed).

Thanks,
Miklos
Greg Kurz Jan. 28, 2021, 2:26 p.m. UTC | #17
On Thu, 28 Jan 2021 15:00:58 +0100
Miklos Szeredi <mszeredi@redhat.com> wrote:

> On Thu, Jan 28, 2021 at 1:15 PM Greg Kurz <groug@kaod.org> wrote:
> >
> > On Wed, 27 Jan 2021 16:52:56 +0100
> > Miklos Szeredi <mszeredi@redhat.com> wrote:
> >
> > > On Wed, Jan 27, 2021 at 4:47 PM Miklos Szeredi <mszeredi@redhat.com> wrote:
> > > >
> > > > On Wed, Jan 27, 2021 at 4:35 PM Greg Kurz <groug@kaod.org> wrote:
> > > > >
> > > > > On Wed, 27 Jan 2021 16:22:49 +0100
> > > > > Miklos Szeredi <mszeredi@redhat.com> wrote:
> > > > >
> > > > > > On Wed, Jan 27, 2021 at 4:09 PM Greg Kurz <groug@kaod.org> wrote:
> > > > > > >
> > > > > > > On Wed, 27 Jan 2021 15:09:50 +0100
> > > > > > > Miklos Szeredi <mszeredi@redhat.com> wrote:
> > > > > > > > The semantics of O_CREATE are that it can fail neither because the
> > > > > > > > file exists nor because it doesn't.  This doesn't matter if the
> > > > > > > > exported tree is not modified outside of a single guest, because of
> > > > > > > > locking provided by the guest kernel.
> > > > > > > >
> > > > > > >
> > > > > > > Wrong. O_CREAT can legitimately fail with ENOENT if one element
> > > > > >
> > > > > > Let me make my  statement more precise:
> > > > > >
> > > > > > O_CREAT cannot fail with ENOENT if parent directory exists throughout
> > > > > > the operation.
> > > > > >
> > > > >
> > > > > True, but I still don't see what guarantees guest userspace that the
> > > > > parent directory doesn't go away... I must have missed something.
> > > > > Please elaborate.
> > > >
> > > > Generally there's no guarantee, however there can be certain
> > > > situations where the caller can indeed rely on the existence of the
> > > > parent (e.g. /tmp).
> > >
> > > Example from the virtiofs repo:
> > >
> > > https://gitlab.com/virtio-fs/ireg/-/blob/master/ireg.c#L315
> > > https://gitlab.com/virtio-fs/ireg/-/blob/master/ireg.c#L382
> > >
> > > In that case breaking O_CREAT would be harmless, since no two
> > > instances are allowed anyway, so it would just give a confusing error.
> > > But it is breakage and it probably wouldn't be hard to find much worse
> > > breakage in real life applications.
> > >
> >
> > Ok, I get your point : a guest userspace application can't expect
> > to hit ENOENT when doing open("/some_dir/foo", O_CREAT) even if
> > someone else is doing unlink("/some_dir/foo"), which is a different
> > case than somebody doing rmdir("/some_dir").
> >
> > But we still have a TOCTOU : the open(O_CREAT|O_EXCL) acts as
> > the check to use open(O_PATH) and retry+timeout can't fix it
> > reliably.
> 
> Right.
> 
> > A possible fix for the case where the race comes from the
> > client itself would be to serialize FUSE requests that
> > create/remove paths in the same directory. I don't know
> > enough the code yet to assess if it's doable though.
> >
> > Then this would leave the case where something else on
> > the host is racing with virtiofsd. IMHO this belongs to
> > the broader family of "bad things the host can do
> > in our back". This requires a bigger hammer than
> > adding band-aids here and there IMHO. So in the
> > scope of this patch, I don't think we should retry
> > and timeout, but just return whatever errno that
> > makes sense.
> 
> I never suggested a timeout, that would indeed be nonsense.
> 

Yeah sorry for that, by timeout I was lazily expressing "retry
a bit and bail out if it doesn't work".

> Just do a simple retry loop with a counter.  I'd set counter to a
> small number (5 or whatever), so that basically any accidental races
> are cared for, and the only case that would trigger the EIO is if the
> file was constantly removed and recreated (and even in that case it
> would be pretty difficult to trigger).  This would add only minimal
> complexity or overhead.
> 

I still don't like the counter thing very much but I can't think of
anything better that _works_ in all cases in the short term... so be
it.

> The proper solution might be adding O_REGULAR, and it actually would
> be useful for other O_CREAT users, since it's probably what they want
> anyway (but existing semantics can't be changed).
> 

Yeah only the kernel can handle this race gracefully and O_REGULAR
would be great, but it might take some time until this percolates
up to QEMU.

> Thanks,
> Miklos
>
diff mbox series

Patch

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 5fb36d9407..b722f43809 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -555,6 +555,30 @@  static int lo_fd(fuse_req_t req, fuse_ino_t ino)
     return fd;
 }
 
+/*
+ * Open a file descriptor for an inode. Returns -EBADF if the inode is not a
+ * regular file or a directory. Use this helper function instead of raw
+ * openat(2) to prevent security issues when a malicious client opens special
+ * files such as block device nodes. Symlink inodes are also rejected since
+ * symlinks must already have been traversed on the client side.
+ */
+static int lo_inode_open(struct lo_data *lo, struct lo_inode *inode,
+                         int open_flags)
+{
+    g_autofree char *fd_str = g_strdup_printf("%d", inode->fd);
+    int fd;
+
+    if (!S_ISREG(inode->filetype) && !S_ISDIR(inode->filetype)) {
+        return -EBADF;
+    }
+
+    fd = openat(lo->proc_self_fd, fd_str, open_flags);
+    if (fd < 0) {
+        return -errno;
+    }
+    return fd;
+}
+
 static void lo_init(void *userdata, struct fuse_conn_info *conn)
 {
     struct lo_data *lo = (struct lo_data *)userdata;
@@ -684,8 +708,7 @@  static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
         if (fi) {
             truncfd = fd;
         } else {
-            sprintf(procname, "%i", ifd);
-            truncfd = openat(lo->proc_self_fd, procname, O_RDWR);
+            truncfd = lo_inode_open(lo, inode, O_RDWR);
             if (truncfd < 0) {
                 goto out_err;
             }
@@ -1725,7 +1748,6 @@  static struct lo_inode_plock *lookup_create_plock_ctx(struct lo_data *lo,
                                                       pid_t pid, int *err)
 {
     struct lo_inode_plock *plock;
-    char procname[64];
     int fd;
 
     plock =
@@ -1742,12 +1764,10 @@  static struct lo_inode_plock *lookup_create_plock_ctx(struct lo_data *lo,
     }
 
     /* Open another instance of file which can be used for ofd locks. */
-    sprintf(procname, "%i", inode->fd);
-
     /* TODO: What if file is not writable? */
-    fd = openat(lo->proc_self_fd, procname, O_RDWR);
-    if (fd == -1) {
-        *err = errno;
+    fd = lo_inode_open(lo, inode, O_RDWR);
+    if (fd < 0) {
+        *err = -fd;
         free(plock);
         return NULL;
     }
@@ -1894,18 +1914,24 @@  static void lo_open(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
 {
     int fd;
     ssize_t fh;
-    char buf[64];
     struct lo_data *lo = lo_data(req);
+    struct lo_inode *inode = lo_inode(req, ino);
 
     fuse_log(FUSE_LOG_DEBUG, "lo_open(ino=%" PRIu64 ", flags=%d)\n", ino,
              fi->flags);
 
+    if (!inode) {
+        fuse_reply_err(req, EBADF);
+        return;
+    }
+
     update_open_flags(lo->writeback, lo->allow_direct_io, fi);
 
-    sprintf(buf, "%i", lo_fd(req, ino));
-    fd = openat(lo->proc_self_fd, buf, fi->flags & ~O_NOFOLLOW);
-    if (fd == -1) {
-        return (void)fuse_reply_err(req, errno);
+    fd = lo_inode_open(lo, inode, fi->flags & ~O_NOFOLLOW);
+    if (fd < 0) {
+        lo_inode_put(lo, &inode);
+        fuse_reply_err(req, -fd);
+        return;
     }
 
     pthread_mutex_lock(&lo->mutex);
@@ -1913,6 +1939,7 @@  static void lo_open(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
     pthread_mutex_unlock(&lo->mutex);
     if (fh == -1) {
         close(fd);
+        lo_inode_put(lo, &inode);
         fuse_reply_err(req, ENOMEM);
         return;
     }
@@ -1923,6 +1950,7 @@  static void lo_open(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
     } else if (lo->cache == CACHE_ALWAYS) {
         fi->keep_cache = 1;
     }
+    lo_inode_put(lo, &inode);
     fuse_reply_open(req, fi);
 }
 
@@ -1982,39 +2010,40 @@  static void lo_flush(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
 static void lo_fsync(fuse_req_t req, fuse_ino_t ino, int datasync,
                      struct fuse_file_info *fi)
 {
+    struct lo_inode *inode = lo_inode(req, ino);
+    struct lo_data *lo = lo_data(req);
     int res;
     int fd;
-    char *buf;
 
     fuse_log(FUSE_LOG_DEBUG, "lo_fsync(ino=%" PRIu64 ", fi=0x%p)\n", ino,
              (void *)fi);
 
+    if (!inode) {
+        fuse_reply_err(req, EBADF);
+        return;
+    }
+
     if (!fi) {
-        struct lo_data *lo = lo_data(req);
-
-        res = asprintf(&buf, "%i", lo_fd(req, ino));
-        if (res == -1) {
-            return (void)fuse_reply_err(req, errno);
-        }
-
-        fd = openat(lo->proc_self_fd, buf, O_RDWR);
-        free(buf);
-        if (fd == -1) {
-            return (void)fuse_reply_err(req, errno);
+        fd = lo_inode_open(lo, inode, O_RDWR);
+        if (fd < 0) {
+            res = -fd;
+            goto out;
         }
     } else {
         fd = lo_fi_fd(req, fi);
     }
 
     if (datasync) {
-        res = fdatasync(fd);
+        res = fdatasync(fd) == -1 ? errno : 0;
     } else {
-        res = fsync(fd);
+        res = fsync(fd) == -1 ? errno : 0;
     }
     if (!fi) {
         close(fd);
     }
-    fuse_reply_err(req, res == -1 ? errno : 0);
+out:
+    lo_inode_put(lo, &inode);
+    fuse_reply_err(req, res);
 }
 
 static void lo_read(fuse_req_t req, fuse_ino_t ino, size_t size, off_t offset,