Patchwork [v8,7/7] block: Enable qemu_open/close to work with fd sets

login
register
mail settings
Submitter Corey Bryant
Date Aug. 10, 2012, 2:10 a.m.
Message ID <1344564649-6272-8-git-send-email-coreyb@linux.vnet.ibm.com>
Download mbox | patch
Permalink /patch/176323/
State New
Headers show

Comments

Corey Bryant - Aug. 10, 2012, 2:10 a.m.
When qemu_open is passed a filename of the "/dev/fdset/nnn"
format (where nnn is the fdset ID), an fd with matching access
mode flags will be searched for within the specified monitor
fd set.  If the fd is found, a dup of the fd will be returned
from qemu_open.

Each fd set has a reference count.  The purpose of the reference
count is to determine if an fd set contains file descriptors that
have open dup() references that have not yet been closed.  It is
incremented on qemu_open and decremented on qemu_close.  It is
not until the refcount is zero that file desriptors in an fd set
can be closed.  If an fd set has dup() references open, then we
must keep the other fds in the fd set open in case a reopen
of the file occurs that requires an fd with a different access
mode.

Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
---
v2:
 -Get rid of file_open and move dup code to qemu_open
  (kwolf@redhat.com)
 -Use strtol wrapper instead of atoi (kwolf@redhat.com)

v3:
 -Add note about fd leakage (eblake@redhat.com)

v4
 -Moved patch to be later in series (lcapitulino@redhat.com)
 -Update qemu_open to check access mode flags and set flags that
  can be set (eblake@redhat.com, kwolf@redhat.com)

v5:
 -This patch was overhauled quite a bit in this version, with
  the addition of fd set and refcount support.
 -Use qemu_set_cloexec() on dup'd fd (eblake@redhat.com)
 -Modify flags set by fcntl on dup'd fd (eblake@redhat.com)
 -Reduce syscalls when setting flags for dup'd fd (eblake@redhat.com)
 -Fix O_RDWR, O_RDONLY, O_WRONLY checks (eblake@redhat.com)

v6:
 -Pass only the fd to qemu_close() and keep track of dup fds per fd
  set. (kwolf@redhat.com, eblake@redhat.com)
 -Handle refcount incr/decr in new dup_fd_add/remove fd functions.
 -Use qemu_set_cloexec() appropriately in qemu_dup() (kwolf@redhat.com)
 -Simplify setting of setfl_flags in qemu_dup() (kwolf@redhat.com)
 -Add preprocessor checks for F_DUPFD_CLOEXEC (eblake@redhat.com)
 -Simplify flag checking in monitor_fdset_get_fd() (kwolf@redhat.com)

v7:
 -Minor updates to reference global mon_fdsets, and to remove
  default_mon usage in osdep.c. (kwolf@redhat.com)

v8:
 -Use camel case for structures. (stefanha@linux.vnet.ibm.com)

 cutils.c      |    5 +++
 monitor.c     |   87 ++++++++++++++++++++++++++++++++++++++++++++
 monitor.h     |    5 +++
 osdep.c       |  112 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 qemu-common.h |    1 +
 qemu-tool.c   |   20 +++++++++++
 6 files changed, 230 insertions(+)
Eric Blake - Aug. 10, 2012, 6:16 a.m.
On 08/09/2012 08:10 PM, Corey Bryant wrote:
> When qemu_open is passed a filename of the "/dev/fdset/nnn"
> format (where nnn is the fdset ID), an fd with matching access
> mode flags will be searched for within the specified monitor
> fd set.  If the fd is found, a dup of the fd will be returned
> from qemu_open.
> 
> Each fd set has a reference count.  The purpose of the reference
> count is to determine if an fd set contains file descriptors that
> have open dup() references that have not yet been closed.  It is
> incremented on qemu_open and decremented on qemu_close.  It is
> not until the refcount is zero that file desriptors in an fd set

s/desriptors/descriptors/

> can be closed.  If an fd set has dup() references open, then we
> must keep the other fds in the fd set open in case a reopen
> of the file occurs that requires an fd with a different access
> mode.
> 

> +int monitor_fdset_get_fd(int64_t fdset_id, int flags)
> +{
> +    MonFdset *mon_fdset;
> +    MonFdsetFd *mon_fdset_fd;
> +    int mon_fd_flags;
> +
> +    QLIST_FOREACH(mon_fdset, &mon_fdsets, next) {
> +        if (mon_fdset->id != fdset_id) {
> +            continue;
> +        }
> +        QLIST_FOREACH(mon_fdset_fd, &mon_fdset->fds, next) {
> +            if (mon_fdset_fd->removed) {
> +                continue;
> +            }

Is this right?  According to the commit message, the whole point of
leaving an fd in the set is to allow the fd to be reused as a dup()
target for as long as the fdset is alive, even if the monitor no longer
cares about the existence of the fd.  But this will always skip over an
fd marked for removal.  Maybe this function needs a flag to say whether
this is an initial open driven by an explicit user string (in which
case, honor the removed flag - if the user removed the O_RDWR fd and
then tries a drive_add with the same fdset, the drive_add should fail
because from the user's perspective, there is no O_RDWR fd in the set);
vs. an internal usage due to a reopen (use an fd even if removed is
true, because we may be toggling between O_RDWR and O_RDONLY multiple
times long after the monitor has already removed the fdset, based on
actions that were not drive by an explicit /dev/fdset name.)

> +
> +            mon_fd_flags = fcntl(mon_fdset_fd->fd, F_GETFL);
> +            if (mon_fd_flags == -1) {
> +                return -1;
> +            }
> +
> +            if ((flags & O_ACCMODE) == (mon_fd_flags & O_ACCMODE)) {
> +                return mon_fdset_fd->fd;
> +            }

I still wonder if a request for O_RDONLY should be satisfied by an
existing O_RDWR fd, especially in light of the fact that libvirt would
rather pass in only one RDWR fd but qemu block opening currently opens
twice during probing.  But if it turns out to be a problem in practice,
and if libvirt can't really manage to pass two fds into the set, we can
hack that in later.  Meanwhile, I'm okay with this first round patch
requiring an exact match.

> @@ -87,6 +151,39 @@ int qemu_open(const char *name, int flags, ...)
>      int ret;
>      int mode = 0;
>  
> +#ifndef _WIN32
> +    const char *fdset_id_str;
> +
> +    /* Attempt dup of fd from fd set */
> +    if (strstart(name, "/dev/fdset/", &fdset_id_str)) {
> +        int64_t fdset_id;
> +        int fd, dupfd;
> +
> +        fdset_id = qemu_parse_fdset(fdset_id_str);
> +        if (fdset_id == -1) {
> +            errno = EINVAL;
> +            return -1;
> +        }
> +
> +        fd = monitor_fdset_get_fd(fdset_id, flags);
> +        if (fd == -1) {
> +            return -1;
> +        }
> +
> +        dupfd = qemu_dup(fd, flags);
> +        if (fd == -1) {
> +            return -1;
> +        }
> +
> +        ret = monitor_fdset_dup_fd_add(fdset_id, dupfd);
> +        if (ret == -1) {
> +            return -1;

Leaks dupfd (admittedly only on a corner-case failure, but still worth
addressing).
Corey Bryant - Aug. 10, 2012, 2:17 p.m.
On 08/10/2012 02:16 AM, Eric Blake wrote:
> On 08/09/2012 08:10 PM, Corey Bryant wrote:
>> When qemu_open is passed a filename of the "/dev/fdset/nnn"
>> format (where nnn is the fdset ID), an fd with matching access
>> mode flags will be searched for within the specified monitor
>> fd set.  If the fd is found, a dup of the fd will be returned
>> from qemu_open.
>>
>> Each fd set has a reference count.  The purpose of the reference
>> count is to determine if an fd set contains file descriptors that
>> have open dup() references that have not yet been closed.  It is
>> incremented on qemu_open and decremented on qemu_close.  It is
>> not until the refcount is zero that file desriptors in an fd set
>
> s/desriptors/descriptors/
>

Thanks

>> can be closed.  If an fd set has dup() references open, then we
>> must keep the other fds in the fd set open in case a reopen
>> of the file occurs that requires an fd with a different access
>> mode.
>>
>
>> +int monitor_fdset_get_fd(int64_t fdset_id, int flags)
>> +{
>> +    MonFdset *mon_fdset;
>> +    MonFdsetFd *mon_fdset_fd;
>> +    int mon_fd_flags;
>> +
>> +    QLIST_FOREACH(mon_fdset, &mon_fdsets, next) {
>> +        if (mon_fdset->id != fdset_id) {
>> +            continue;
>> +        }
>> +        QLIST_FOREACH(mon_fdset_fd, &mon_fdset->fds, next) {
>> +            if (mon_fdset_fd->removed) {
>> +                continue;
>> +            }
>
> Is this right?  According to the commit message, the whole point of
> leaving an fd in the set is to allow the fd to be reused as a dup()
> target for as long as the fdset is alive, even if the monitor no longer
> cares about the existence of the fd.  But this will always skip over an
> fd marked for removal.  Maybe this function needs a flag to say whether
> this is an initial open driven by an explicit user string (in which
> case, honor the removed flag - if the user removed the O_RDWR fd and
> then tries a drive_add with the same fdset, the drive_add should fail
> because from the user's perspective, there is no O_RDWR fd in the set);
> vs. an internal usage due to a reopen (use an fd even if removed is
> true, because we may be toggling between O_RDWR and O_RDONLY multiple
> times long after the monitor has already removed the fdset, based on
> actions that were not drive by an explicit /dev/fdset name.)
>

Hmm..  something needs to change here, either the commit message or the 
code.

For security purposes, I'm thinking that an fd should no longer be 
available for opening after libvirt issues a remove-fd command, with no 
exceptions.  That allows libvirt to have complete control over usage of 
an fd.  Note that other fds in the fd set could still be used on a 
reopen, assuming remove-fd hasn't been called for them.

Does that work for you?  In that case, the code will remain as-is.

Apologies if I've sent conflicting messages on this in the past.

>> +
>> +            mon_fd_flags = fcntl(mon_fdset_fd->fd, F_GETFL);
>> +            if (mon_fd_flags == -1) {
>> +                return -1;
>> +            }
>> +
>> +            if ((flags & O_ACCMODE) == (mon_fd_flags & O_ACCMODE)) {
>> +                return mon_fdset_fd->fd;
>> +            }
>
> I still wonder if a request for O_RDONLY should be satisfied by an
> existing O_RDWR fd, especially in light of the fact that libvirt would
> rather pass in only one RDWR fd but qemu block opening currently opens
> twice during probing.  But if it turns out to be a problem in practice,
> and if libvirt can't really manage to pass two fds into the set, we can
> hack that in later.  Meanwhile, I'm okay with this first round patch
> requiring an exact match.
>

I see your point but I think allowing subset access mode matches could 
allow for a client to get lazy and cause security implications (client 
only adds RW fd's and QEMU only needs R access in some cases).  So I'll 
keep this as is.

>> @@ -87,6 +151,39 @@ int qemu_open(const char *name, int flags, ...)
>>       int ret;
>>       int mode = 0;
>>
>> +#ifndef _WIN32
>> +    const char *fdset_id_str;
>> +
>> +    /* Attempt dup of fd from fd set */
>> +    if (strstart(name, "/dev/fdset/", &fdset_id_str)) {
>> +        int64_t fdset_id;
>> +        int fd, dupfd;
>> +
>> +        fdset_id = qemu_parse_fdset(fdset_id_str);
>> +        if (fdset_id == -1) {
>> +            errno = EINVAL;
>> +            return -1;
>> +        }
>> +
>> +        fd = monitor_fdset_get_fd(fdset_id, flags);
>> +        if (fd == -1) {
>> +            return -1;
>> +        }
>> +
>> +        dupfd = qemu_dup(fd, flags);
>> +        if (fd == -1) {
>> +            return -1;
>> +        }
>> +
>> +        ret = monitor_fdset_dup_fd_add(fdset_id, dupfd);
>> +        if (ret == -1) {
>> +            return -1;
>
> Leaks dupfd (admittedly only on a corner-case failure, but still worth
> addressing).
>

Thanks, I'll fix this.
Eric Blake - Aug. 10, 2012, 3:25 p.m.
On 08/10/2012 08:17 AM, Corey Bryant wrote:

>>> can be closed.  If an fd set has dup() references open, then we
>>> must keep the other fds in the fd set open in case a reopen
>>> of the file occurs that requires an fd with a different access
>>> mode.
>>>
>>
>>
>> Is this right?  According to the commit message, the whole point of
>> leaving an fd in the set is to allow the fd to be reused as a dup()
>> target for as long as the fdset is alive, even if the monitor no longer
>> cares about the existence of the fd.  But this will always skip over an
>> fd marked for removal.

> 
> For security purposes, I'm thinking that an fd should no longer be
> available for opening after libvirt issues a remove-fd command, with no
> exceptions.

If that's the case, then you don't need the removed flag.  Simply close
the original fd and forget about it at the point that the monitor
removes the fd.  The fdset will still remain as long as there is a
refcount, so that libvirt can then re-add to the set.  And that also
argues that query-fdsets MUST list empty fdsets, where all existing fds
have been removed, but where the set can still be added to again just
prior to the next reopen operation.

That is, consider this life cycle:

 libvirt> add-fd opaque="RDWR"
 qemu< fdset=1, fd=3; the set is in use because of the monitor but with
refcount 0
 libvirt> drive_add /dev/fdset/1
 qemu< success; internally, fd 4 was created as a dup of 3, and fdset 1
now has refcount 1 and tracks that fd 4 is tied to the set
 libvirt> remove-fd fdset=1 fd=3, since the drive_add is complete and
libvirt no longer needs to track what fd qemu is using, but does
remember internally that qemu is using fdset 1
 qemu< success; internally, fdset 1 is still refcount 1
 later on, libvirt prepares to do a snapshot, and knows that after the
snapshot, qemu will want to reopen the file RDONLY, as part of making it
the backing image
 libvirt> add-fd fdset=1 opaque="RDONLY"
 qemu< fdset=1, fd=3 (yes, fd 3 is available for use again, since the
earlier remove closed it out)
 libvirt> blockdev-snapshot-sync
 qemu< success; internally, the block device knows that /dev/fdset/1 is
now becoming a backing file, so it tries a reopen to convert its current
RDWR fd 4 into something RDONLY; the probe succeeds by returning fd 5 as
a dup of the readonly fd 3

> 
> Does that work for you?  In that case, the code will remain as-is.

I think that lifecycle will work (either libvirt leaves an fd in the set
for as long as it thinks qemu might need to do independent open
operations, or it removes the fd as soon as it knows that qemu is done
with open, but the fdset remains reserved, tracking all dup'd fds that
were created from the set.

For that matter, your refcount doesn't even have to be a separate field,
but can simply be the length of the list of dup'd fds that are tied to
the set until a call to qemu_close.  That is, a set must remain alive as
long as there is either an add-fd that has not yet been removed, or
there is at least one dup'd fd that has not been qemu_close()d.
Corey Bryant - Aug. 10, 2012, 3:44 p.m.
On 08/10/2012 11:25 AM, Eric Blake wrote:
> On 08/10/2012 08:17 AM, Corey Bryant wrote:
>
>>>> can be closed.  If an fd set has dup() references open, then we
>>>> must keep the other fds in the fd set open in case a reopen
>>>> of the file occurs that requires an fd with a different access
>>>> mode.
>>>>
>>>
>>>
>>> Is this right?  According to the commit message, the whole point of
>>> leaving an fd in the set is to allow the fd to be reused as a dup()
>>> target for as long as the fdset is alive, even if the monitor no longer
>>> cares about the existence of the fd.  But this will always skip over an
>>> fd marked for removal.
>
>>
>> For security purposes, I'm thinking that an fd should no longer be
>> available for opening after libvirt issues a remove-fd command, with no
>> exceptions.
>
> If that's the case, then you don't need the removed flag.  Simply close
> the original fd and forget about it at the point that the monitor

You're right.  Kevin also brought this up to me offline during a 
discussion a little while ago.  I'll plan on closing the fd immediately 
when remove-fd is issued.

> removes the fd.  The fdset will still remain as long as there is a
> refcount, so that libvirt can then re-add to the set.  And that also
> argues that query-fdsets MUST list empty fdsets, where all existing fds
> have been removed, but where the set can still be added to again just
> prior to the next reopen operation.
>
> That is, consider this life cycle:
>
>   libvirt> add-fd opaque="RDWR"
>   qemu< fdset=1, fd=3; the set is in use because of the monitor but with
> refcount 0
>   libvirt> drive_add /dev/fdset/1
>   qemu< success; internally, fd 4 was created as a dup of 3, and fdset 1
> now has refcount 1 and tracks that fd 4 is tied to the set
>   libvirt> remove-fd fdset=1 fd=3, since the drive_add is complete and
> libvirt no longer needs to track what fd qemu is using, but does
> remember internally that qemu is using fdset 1
>   qemu< success; internally, fdset 1 is still refcount 1
>   later on, libvirt prepares to do a snapshot, and knows that after the
> snapshot, qemu will want to reopen the file RDONLY, as part of making it
> the backing image
>   libvirt> add-fd fdset=1 opaque="RDONLY"
>   qemu< fdset=1, fd=3 (yes, fd 3 is available for use again, since the
> earlier remove closed it out)
>   libvirt> blockdev-snapshot-sync
>   qemu< success; internally, the block device knows that /dev/fdset/1 is
> now becoming a backing file, so it tries a reopen to convert its current
> RDWR fd 4 into something RDONLY; the probe succeeds by returning fd 5 as
> a dup of the readonly fd 3
>

This all makes sense to me.

>>
>> Does that work for you?  In that case, the code will remain as-is.
>
> I think that lifecycle will work (either libvirt leaves an fd in the set
> for as long as it thinks qemu might need to do independent open
> operations, or it removes the fd as soon as it knows that qemu is done
> with open, but the fdset remains reserved, tracking all dup'd fds that
> were created from the set.

Yep

>
> For that matter, your refcount doesn't even have to be a separate field,
> but can simply be the length of the list of dup'd fds that are tied to
> the set until a call to qemu_close.  That is, a set must remain alive as
> long as there is either an add-fd that has not yet been removed, or
> there is at least one dup'd fd that has not been qemu_close()d.
>

I agree.  I'll plan on dropping refcount and just checking the dup_fds list.
Kevin Wolf - Aug. 10, 2012, 4:34 p.m.
Am 10.08.2012 04:10, schrieb Corey Bryant:
> When qemu_open is passed a filename of the "/dev/fdset/nnn"
> format (where nnn is the fdset ID), an fd with matching access
> mode flags will be searched for within the specified monitor
> fd set.  If the fd is found, a dup of the fd will be returned
> from qemu_open.
> 
> Each fd set has a reference count.  The purpose of the reference
> count is to determine if an fd set contains file descriptors that
> have open dup() references that have not yet been closed.  It is
> incremented on qemu_open and decremented on qemu_close.  It is
> not until the refcount is zero that file desriptors in an fd set
> can be closed.  If an fd set has dup() references open, then we
> must keep the other fds in the fd set open in case a reopen
> of the file occurs that requires an fd with a different access
> mode.
> 
> Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com>

> @@ -78,6 +79,69 @@ int qemu_madvise(void *addr, size_t len, int advice)
>  #endif
>  }
>  
> +/*
> + * Dups an fd and sets the flags
> + */
> +static int qemu_dup(int fd, int flags)

qemu_dup() is probably not a good name. We'll want to use it when we
need to get a wrapper around dup(). And I suspect that we will need it
for making bdrv_reopen() compatible with fdset refcounting.

> +{
> +    int ret;
> +    int serrno;
> +    int dup_flags;
> +    int setfl_flags;
> +
> +    if (flags & O_CLOEXEC) {
> +#ifdef F_DUPFD_CLOEXEC
> +        ret = fcntl(fd, F_DUPFD_CLOEXEC, 0);
> +#else
> +        ret = dup(fd);
> +        if (ret != -1) {
> +            qemu_set_cloexec(ret);
> +        }
> +#endif
> +    } else {
> +        ret = dup(fd);
> +    }

qemu_open() is supposed to add O_CLOEXEC by itself, so I think we should
execute the then branch unconditionally (or we would have to change the
qemu_dup() call below to add it - but the fact that O_CLOEXEC isn't even
necessarily defined doesn't help with that).

Kevin
Corey Bryant - Aug. 10, 2012, 4:56 p.m.
On 08/10/2012 12:34 PM, Kevin Wolf wrote:
> Am 10.08.2012 04:10, schrieb Corey Bryant:
>> When qemu_open is passed a filename of the "/dev/fdset/nnn"
>> format (where nnn is the fdset ID), an fd with matching access
>> mode flags will be searched for within the specified monitor
>> fd set.  If the fd is found, a dup of the fd will be returned
>> from qemu_open.
>>
>> Each fd set has a reference count.  The purpose of the reference
>> count is to determine if an fd set contains file descriptors that
>> have open dup() references that have not yet been closed.  It is
>> incremented on qemu_open and decremented on qemu_close.  It is
>> not until the refcount is zero that file desriptors in an fd set
>> can be closed.  If an fd set has dup() references open, then we
>> must keep the other fds in the fd set open in case a reopen
>> of the file occurs that requires an fd with a different access
>> mode.
>>
>> Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
>
>> @@ -78,6 +79,69 @@ int qemu_madvise(void *addr, size_t len, int advice)
>>   #endif
>>   }
>>
>> +/*
>> + * Dups an fd and sets the flags
>> + */
>> +static int qemu_dup(int fd, int flags)
>
> qemu_dup() is probably not a good name. We'll want to use it when we
> need to get a wrapper around dup(). And I suspect that we will need it
> for making bdrv_reopen() compatible with fdset refcounting.
>

Do you you have a suggestion for a name?

>> +{
>> +    int ret;
>> +    int serrno;
>> +    int dup_flags;
>> +    int setfl_flags;
>> +
>> +    if (flags & O_CLOEXEC) {
>> +#ifdef F_DUPFD_CLOEXEC
>> +        ret = fcntl(fd, F_DUPFD_CLOEXEC, 0);
>> +#else
>> +        ret = dup(fd);
>> +        if (ret != -1) {
>> +            qemu_set_cloexec(ret);
>> +        }
>> +#endif
>> +    } else {
>> +        ret = dup(fd);
>> +    }
>
> qemu_open() is supposed to add O_CLOEXEC by itself, so I think we should
> execute the then branch unconditionally (or we would have to change the
> qemu_dup() call below to add it - but the fact that O_CLOEXEC isn't even
> necessarily defined doesn't help with that).

Sure I can modify this to always add O_CLOEXEC.  (I know you mentioned 
this before but I think I interpreted the comment as a question to others.)

Do you also want me to modify qemu_open to always add O_CLOEXEC?
Corey Bryant - Aug. 10, 2012, 5:03 p.m.
On 08/10/2012 12:56 PM, Corey Bryant wrote:
>
>
> On 08/10/2012 12:34 PM, Kevin Wolf wrote:
>> Am 10.08.2012 04:10, schrieb Corey Bryant:
>>> When qemu_open is passed a filename of the "/dev/fdset/nnn"
>>> format (where nnn is the fdset ID), an fd with matching access
>>> mode flags will be searched for within the specified monitor
>>> fd set.  If the fd is found, a dup of the fd will be returned
>>> from qemu_open.
>>>
>>> Each fd set has a reference count.  The purpose of the reference
>>> count is to determine if an fd set contains file descriptors that
>>> have open dup() references that have not yet been closed.  It is
>>> incremented on qemu_open and decremented on qemu_close.  It is
>>> not until the refcount is zero that file desriptors in an fd set
>>> can be closed.  If an fd set has dup() references open, then we
>>> must keep the other fds in the fd set open in case a reopen
>>> of the file occurs that requires an fd with a different access
>>> mode.
>>>
>>> Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
>>
>>> @@ -78,6 +79,69 @@ int qemu_madvise(void *addr, size_t len, int advice)
>>>   #endif
>>>   }
>>>
>>> +/*
>>> + * Dups an fd and sets the flags
>>> + */
>>> +static int qemu_dup(int fd, int flags)
>>
>> qemu_dup() is probably not a good name. We'll want to use it when we
>> need to get a wrapper around dup(). And I suspect that we will need it
>> for making bdrv_reopen() compatible with fdset refcounting.
>>
>
> Do you you have a suggestion for a name?
>
>>> +{
>>> +    int ret;
>>> +    int serrno;
>>> +    int dup_flags;
>>> +    int setfl_flags;
>>> +
>>> +    if (flags & O_CLOEXEC) {
>>> +#ifdef F_DUPFD_CLOEXEC
>>> +        ret = fcntl(fd, F_DUPFD_CLOEXEC, 0);
>>> +#else
>>> +        ret = dup(fd);
>>> +        if (ret != -1) {
>>> +            qemu_set_cloexec(ret);
>>> +        }
>>> +#endif
>>> +    } else {
>>> +        ret = dup(fd);
>>> +    }
>>
>> qemu_open() is supposed to add O_CLOEXEC by itself, so I think we should
>> execute the then branch unconditionally (or we would have to change the
>> qemu_dup() call below to add it - but the fact that O_CLOEXEC isn't even
>> necessarily defined doesn't help with that).
>
> Sure I can modify this to always add O_CLOEXEC.  (I know you mentioned
> this before but I think I interpreted the comment as a question to others.)
>
> Do you also want me to modify qemu_open to always add O_CLOEXEC?
>

My mistake.. it's always set in qemu_open already.

Patch

diff --git a/cutils.c b/cutils.c
index 9d4c570..8b0d2bb 100644
--- a/cutils.c
+++ b/cutils.c
@@ -382,3 +382,8 @@  int qemu_parse_fd(const char *param)
     }
     return fd;
 }
+
+int qemu_parse_fdset(const char *param)
+{
+    return qemu_parse_fd(param);
+}
diff --git a/monitor.c b/monitor.c
index 96aafec..63a43a2 100644
--- a/monitor.c
+++ b/monitor.c
@@ -155,6 +155,7 @@  struct MonFdset {
     int64_t id;
     int refcount;
     QLIST_HEAD(, MonFdsetFd) fds;
+    QLIST_HEAD(, MonFdsetFd) dup_fds;
     QLIST_ENTRY(MonFdset) next;
 };
 
@@ -2587,6 +2588,92 @@  FdsetInfoList *qmp_query_fdsets(Error **errp)
     return fdset_list;
 }
 
+int monitor_fdset_get_fd(int64_t fdset_id, int flags)
+{
+    MonFdset *mon_fdset;
+    MonFdsetFd *mon_fdset_fd;
+    int mon_fd_flags;
+
+    QLIST_FOREACH(mon_fdset, &mon_fdsets, next) {
+        if (mon_fdset->id != fdset_id) {
+            continue;
+        }
+        QLIST_FOREACH(mon_fdset_fd, &mon_fdset->fds, next) {
+            if (mon_fdset_fd->removed) {
+                continue;
+            }
+
+            mon_fd_flags = fcntl(mon_fdset_fd->fd, F_GETFL);
+            if (mon_fd_flags == -1) {
+                return -1;
+            }
+
+            if ((flags & O_ACCMODE) == (mon_fd_flags & O_ACCMODE)) {
+                return mon_fdset_fd->fd;
+            }
+        }
+        errno = EACCES;
+        return -1;
+    }
+    errno = ENOENT;
+    return -1;
+}
+
+int monitor_fdset_dup_fd_add(int64_t fdset_id, int dup_fd)
+{
+    MonFdset *mon_fdset;
+    MonFdsetFd *mon_fdset_fd_dup;
+
+    QLIST_FOREACH(mon_fdset, &mon_fdsets, next) {
+        if (mon_fdset->id != fdset_id) {
+            continue;
+        }
+        QLIST_FOREACH(mon_fdset_fd_dup, &mon_fdset->dup_fds, next) {
+            if (mon_fdset_fd_dup->fd == dup_fd) {
+                return -1;
+            }
+        }
+        mon_fdset_fd_dup = g_malloc0(sizeof(*mon_fdset_fd_dup));
+        mon_fdset_fd_dup->fd = dup_fd;
+        QLIST_INSERT_HEAD(&mon_fdset->dup_fds, mon_fdset_fd_dup, next);
+        mon_fdset->refcount++;
+        return 0;
+    }
+    return -1;
+}
+
+static int _monitor_fdset_dup_fd_find(int dup_fd, bool remove)
+{
+    MonFdset *mon_fdset;
+    MonFdsetFd *mon_fdset_fd_dup;
+
+    QLIST_FOREACH(mon_fdset, &mon_fdsets, next) {
+        QLIST_FOREACH(mon_fdset_fd_dup, &mon_fdset->dup_fds, next) {
+            if (mon_fdset_fd_dup->fd == dup_fd) {
+                if (remove) {
+                    QLIST_REMOVE(mon_fdset_fd_dup, next);
+                    mon_fdset->refcount--;
+                    if (mon_fdset->refcount == 0) {
+                        monitor_fdset_cleanup(mon_fdset);
+                    }
+                }
+                return mon_fdset->id;
+            }
+        }
+    }
+    return -1;
+}
+
+int monitor_fdset_dup_fd_find(int dup_fd)
+{
+    return _monitor_fdset_dup_fd_find(dup_fd, false);
+}
+
+int monitor_fdset_dup_fd_remove(int dup_fd)
+{
+    return _monitor_fdset_dup_fd_find(dup_fd, true);
+}
+
 /* mon_cmds and info_cmds would be sorted at runtime */
 static mon_cmd_t mon_cmds[] = {
 #include "hmp-commands.h"
diff --git a/monitor.h b/monitor.h
index 5f4de1b..30b660f 100644
--- a/monitor.h
+++ b/monitor.h
@@ -86,4 +86,9 @@  int qmp_qom_set(Monitor *mon, const QDict *qdict, QObject **ret);
 
 int qmp_qom_get(Monitor *mon, const QDict *qdict, QObject **ret);
 
+int monitor_fdset_get_fd(int64_t fdset_id, int flags);
+int monitor_fdset_dup_fd_add(int64_t fdset_id, int dup_fd);
+int monitor_fdset_dup_fd_remove(int dup_fd);
+int monitor_fdset_dup_fd_find(int dup_fd);
+
 #endif /* !MONITOR_H */
diff --git a/osdep.c b/osdep.c
index 7f876ae..9ff3561 100644
--- a/osdep.c
+++ b/osdep.c
@@ -48,6 +48,7 @@  extern int madvise(caddr_t, size_t, int);
 #include "qemu-common.h"
 #include "trace.h"
 #include "qemu_socket.h"
+#include "monitor.h"
 
 static bool fips_enabled = false;
 
@@ -78,6 +79,69 @@  int qemu_madvise(void *addr, size_t len, int advice)
 #endif
 }
 
+/*
+ * Dups an fd and sets the flags
+ */
+static int qemu_dup(int fd, int flags)
+{
+    int ret;
+    int serrno;
+    int dup_flags;
+    int setfl_flags;
+
+    if (flags & O_CLOEXEC) {
+#ifdef F_DUPFD_CLOEXEC
+        ret = fcntl(fd, F_DUPFD_CLOEXEC, 0);
+#else
+        ret = dup(fd);
+        if (ret != -1) {
+            qemu_set_cloexec(ret);
+        }
+#endif
+    } else {
+        ret = dup(fd);
+    }
+
+    if (ret == -1) {
+        goto fail;
+    }
+
+    dup_flags = fcntl(ret, F_GETFL);
+    if (dup_flags == -1) {
+        goto fail;
+    }
+
+    if ((flags & O_SYNC) != (dup_flags & O_SYNC)) {
+        errno = EINVAL;
+        goto fail;
+    }
+
+    /* Set/unset flags that we can with fcntl */
+    setfl_flags = O_APPEND | O_ASYNC | O_DIRECT | O_NOATIME | O_NONBLOCK;
+    dup_flags &= ~setfl_flags;
+    dup_flags |= (flags & setfl_flags);
+    if (fcntl(ret, F_SETFL, dup_flags) == -1) {
+        goto fail;
+    }
+
+    /* Truncate the file in the cases that open() would truncate it */
+    if (flags & O_TRUNC ||
+            ((flags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL))) {
+        if (ftruncate(ret, 0) == -1) {
+            goto fail;
+        }
+    }
+
+    return ret;
+
+fail:
+    serrno = errno;
+    if (ret != -1) {
+        close(ret);
+    }
+    errno = serrno;
+    return -1;
+}
 
 /*
  * Opens a file with FD_CLOEXEC set
@@ -87,6 +151,39 @@  int qemu_open(const char *name, int flags, ...)
     int ret;
     int mode = 0;
 
+#ifndef _WIN32
+    const char *fdset_id_str;
+
+    /* Attempt dup of fd from fd set */
+    if (strstart(name, "/dev/fdset/", &fdset_id_str)) {
+        int64_t fdset_id;
+        int fd, dupfd;
+
+        fdset_id = qemu_parse_fdset(fdset_id_str);
+        if (fdset_id == -1) {
+            errno = EINVAL;
+            return -1;
+        }
+
+        fd = monitor_fdset_get_fd(fdset_id, flags);
+        if (fd == -1) {
+            return -1;
+        }
+
+        dupfd = qemu_dup(fd, flags);
+        if (fd == -1) {
+            return -1;
+        }
+
+        ret = monitor_fdset_dup_fd_add(fdset_id, dupfd);
+        if (ret == -1) {
+            return -1;
+        }
+
+        return dupfd;
+    }
+#endif
+
     if (flags & O_CREAT) {
         va_list ap;
 
@@ -109,6 +206,21 @@  int qemu_open(const char *name, int flags, ...)
 
 int qemu_close(int fd)
 {
+    int64_t fdset_id;
+
+    /* Close fd that was dup'd from an fdset */
+    fdset_id = monitor_fdset_dup_fd_find(fd);
+    if (fdset_id != -1) {
+        int ret;
+
+        ret = close(fd);
+        if (ret == 0) {
+            monitor_fdset_dup_fd_remove(fd);
+        }
+
+        return ret;
+    }
+
     return close(fd);
 }
 
diff --git a/qemu-common.h b/qemu-common.h
index e53126d..9becb32 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -166,6 +166,7 @@  int qemu_fls(int i);
 int qemu_fdatasync(int fd);
 int fcntl_setfl(int fd, int flag);
 int qemu_parse_fd(const char *param);
+int qemu_parse_fdset(const char *param);
 
 /*
  * strtosz() suffixes used to specify the default treatment of an
diff --git a/qemu-tool.c b/qemu-tool.c
index 318c5fc..b7622f5 100644
--- a/qemu-tool.c
+++ b/qemu-tool.c
@@ -57,6 +57,26 @@  void monitor_protocol_event(MonitorEvent event, QObject *data)
 {
 }
 
+int monitor_fdset_get_fd(int64_t fdset_id, int flags)
+{
+    return -1;
+}
+
+int monitor_fdset_dup_fd_add(int64_t fdset_id, int dup_fd)
+{
+    return -1;
+}
+
+int monitor_fdset_dup_fd_remove(int dup_fd)
+{
+    return -1;
+}
+
+int monitor_fdset_dup_fd_find(int dup_fd)
+{
+    return -1;
+}
+
 int64_t cpu_get_clock(void)
 {
     return qemu_get_clock_ns(rt_clock);