diff mbox

[libvirt,v4,0/7] file descriptor passing using pass-fd

Message ID 20120626150349.GQ14451@redhat.com
State New
Headers show

Commit Message

Daniel P. Berrangé June 26, 2012, 3:03 p.m. UTC
On Tue, Jun 26, 2012 at 03:11:40PM +0100, Daniel P. Berrange wrote:
> On Tue, Jun 26, 2012 at 09:52:51AM -0400, Corey Bryant wrote:
> > >So now from a client's POV you'd have a flow like
> > >
> > >    * drive_add "file=/dev/fd/N"  FDSET={N}
> > 
> > IIUC then drive_add would loop and pass each fd in the set via SCM_RIGHTS?
> 
> Yes, you'd probably use the JSON to tell QEMU exactly
> how many FDs you're intending to pass with the command,
> and then once the command is received, you'd have N*SCM_RIGHTS
> messages sent/received.
> 
> > >
> > >And in QEMU you'd have something like
> > >
> > >    * handle_monitor_command
> > >         - recvmsg all FDs, and stash them in a thread local "FDContext"
> > >           context
> > >         - invoke monitor command handler
> > >               - Sees file=/dev/fd/N
> > >               - Fetch /dev/fd/N from "FDContext"
> > >               - If success remove /dev/fd/N from "FDContext"
> > >         - close() all FDs left in "FDContext"
> > >
> > >The key point with this is that because the FDs are directly
> > >associated with a monitor command, QEMU can /guarantee/ that
> > >FDs are never leaked, regardless of client behaviour.
> > 
> > Wouldn't this leak fds if libvirt crashed part way through sending
> > the set of fds?
> 
> No, because you've scoped the FDs to the current monitor instance,
> and the current command being processed you know to close all FDs
> when the associated command has finished, as well as closing them
> all if you see EOF on the monitor socket while in the middle of
> receiving a command.

Here is a quick proof of concept (ie untested) patch to demonstrate
what I mean. It relies on Cory's patch which converts everything
to use qemu_open. It is also still valuable to make the change
to qemu_open() to support "/dev/fd/N" for passing FDs during
QEMU initial startup for CLI args.

IMHO, what I propose here is preferrable for QMP clients that
our current plan of requiring use of 3 monitor commands (passfd,
XXXXX, closefd).

From 55a264b647d90a30c1cc00cb1896535246bcd9b7 Mon Sep 17 00:00:00 2001
From: "Daniel P. Berrange" <berrange@redhat.com>
Date: Tue, 26 Jun 2012 15:48:13 +0100
Subject: [PATCH] Support passing FDs with the monitor command that uses them

Add support for a syntax "/dev/fdlist/N", where 'N' refers to
the N'th FD that was passed along with the currently processing
command in this thread.

The QMP client sends

        {
          "execute": "drive_add",
          "arguments": {
               "filename": "/dev/fdlist/0",
               "backingfilename": "/dev/fdlist/1" },
          "nfds": 2
        }

followed by "nfds" * SCM_RIGHTS packets

The FDs received along with the "drive_add" command are placed
in a thread-local list, then the QMP handler function is invoked.
The qemu_open() function will resolve any path "/dev/fdlist/0"
to use an FD stored in the  thread-local list, and dup() it.
One the QMP handler function is finished, all FDs in the thread
local list are closed.

THe reason for choice of "/dev/fdlist/N", is to allow the
"/dev/fd/N" syntax to be used for FDs that are passed to QEMU
at startup time, which will be in the global FD number namespace,
not the thread-local index.

The reason for using a thread local is to ensure the FDs are
only made available to the currently executing monitor command
handler in this thread, and not any other threads.

The advantage over using a separate 'passfd' command, is that
this guarentees all FDs are closed in QEMU once the QMP command
handler is finished. There is no risk of leaks being triggered
by the client either intentionally, or via it crashing.

NB: this is a proof of concept demonstration of the overall
architectural design. The patch would need more work before
it was suitable for merging

 - Pick a better place/file for the fdlist APIs
 - Do proper error reporting in various places
 - Check for possibility that qemu_chr_fe_get_msgfd
   can block or return EAGAIN.

Comments

Corey Bryant June 26, 2012, 3:37 p.m. UTC | #1
On 06/26/2012 11:03 AM, Daniel P. Berrange wrote:
> On Tue, Jun 26, 2012 at 03:11:40PM +0100, Daniel P. Berrange wrote:
>> On Tue, Jun 26, 2012 at 09:52:51AM -0400, Corey Bryant wrote:
>>>> So now from a client's POV you'd have a flow like
>>>>
>>>>     * drive_add "file=/dev/fd/N"  FDSET={N}
>>>
>>> IIUC then drive_add would loop and pass each fd in the set via SCM_RIGHTS?
>>
>> Yes, you'd probably use the JSON to tell QEMU exactly
>> how many FDs you're intending to pass with the command,
>> and then once the command is received, you'd have N*SCM_RIGHTS
>> messages sent/received.
>>
>>>>
>>>> And in QEMU you'd have something like
>>>>
>>>>     * handle_monitor_command
>>>>          - recvmsg all FDs, and stash them in a thread local "FDContext"
>>>>            context
>>>>          - invoke monitor command handler
>>>>                - Sees file=/dev/fd/N
>>>>                - Fetch /dev/fd/N from "FDContext"
>>>>                - If success remove /dev/fd/N from "FDContext"
>>>>          - close() all FDs left in "FDContext"
>>>>
>>>> The key point with this is that because the FDs are directly
>>>> associated with a monitor command, QEMU can /guarantee/ that
>>>> FDs are never leaked, regardless of client behaviour.
>>>
>>> Wouldn't this leak fds if libvirt crashed part way through sending
>>> the set of fds?
>>
>> No, because you've scoped the FDs to the current monitor instance,
>> and the current command being processed you know to close all FDs
>> when the associated command has finished, as well as closing them
>> all if you see EOF on the monitor socket while in the middle of
>> receiving a command.
>
> Here is a quick proof of concept (ie untested) patch to demonstrate
> what I mean. It relies on Cory's patch which converts everything
> to use qemu_open. It is also still valuable to make the change
> to qemu_open() to support "/dev/fd/N" for passing FDs during
> QEMU initial startup for CLI args.
>
> IMHO, what I propose here is preferrable for QMP clients that
> our current plan of requiring use of 3 monitor commands (passfd,
> XXXXX, closefd).

Thanks for the PoC.

Two other required updates that I can think of would be:

1) Update change, block_stream, block_reopen, snapshot_blkdev, and 
perhaps other monitor commands to support receiving fd's via SCM_RIGHTS.

2) Support re-opening files with different access modes (O_RDONLY, 
O_WRONLY, or O_RDWR).  This would be similar to the force option for 
pass-fd.
Corey Bryant June 26, 2012, 6:40 p.m. UTC | #2
On 06/26/2012 11:37 AM, Corey Bryant wrote:
>
>
> On 06/26/2012 11:03 AM, Daniel P. Berrange wrote:
>> On Tue, Jun 26, 2012 at 03:11:40PM +0100, Daniel P. Berrange wrote:
>>> On Tue, Jun 26, 2012 at 09:52:51AM -0400, Corey Bryant wrote:
>>>>> So now from a client's POV you'd have a flow like
>>>>>
>>>>>     * drive_add "file=/dev/fd/N"  FDSET={N}
>>>>
>>>> IIUC then drive_add would loop and pass each fd in the set via
>>>> SCM_RIGHTS?
>>>
>>> Yes, you'd probably use the JSON to tell QEMU exactly
>>> how many FDs you're intending to pass with the command,
>>> and then once the command is received, you'd have N*SCM_RIGHTS
>>> messages sent/received.
>>>
>>>>>
>>>>> And in QEMU you'd have something like
>>>>>
>>>>>     * handle_monitor_command
>>>>>          - recvmsg all FDs, and stash them in a thread local
>>>>> "FDContext"
>>>>>            context
>>>>>          - invoke monitor command handler
>>>>>                - Sees file=/dev/fd/N
>>>>>                - Fetch /dev/fd/N from "FDContext"
>>>>>                - If success remove /dev/fd/N from "FDContext"
>>>>>          - close() all FDs left in "FDContext"
>>>>>
>>>>> The key point with this is that because the FDs are directly
>>>>> associated with a monitor command, QEMU can /guarantee/ that
>>>>> FDs are never leaked, regardless of client behaviour.
>>>>
>>>> Wouldn't this leak fds if libvirt crashed part way through sending
>>>> the set of fds?
>>>
>>> No, because you've scoped the FDs to the current monitor instance,
>>> and the current command being processed you know to close all FDs
>>> when the associated command has finished, as well as closing them
>>> all if you see EOF on the monitor socket while in the middle of
>>> receiving a command.
>>
>> Here is a quick proof of concept (ie untested) patch to demonstrate
>> what I mean. It relies on Cory's patch which converts everything
>> to use qemu_open. It is also still valuable to make the change
>> to qemu_open() to support "/dev/fd/N" for passing FDs during
>> QEMU initial startup for CLI args.
>>
>> IMHO, what I propose here is preferrable for QMP clients that
>> our current plan of requiring use of 3 monitor commands (passfd,
>> XXXXX, closefd).
>
> Thanks for the PoC.
>
> Two other required updates that I can think of would be:
>
> 1) Update change, block_stream, block_reopen, snapshot_blkdev, and
> perhaps other monitor commands to support receiving fd's via SCM_RIGHTS.
>

Nevermind my comment.  I see that your PoC supports passing nfds for any 
QMP command.

I'm curious what Kevin's thoughts are on this and the overall approach.

> 2) Support re-opening files with different access modes (O_RDONLY,
> O_WRONLY, or O_RDWR).  This would be similar to the force option for
> pass-fd.
>

I'm still not quite sure how we'd go about this.  We need a way to 
determine the existing QEMU fd that is to be re-associated with the new 
fd, when using a /dev/fdlist/0 filename.  In this new approach, 0 
corresponds to an index, not an fd.  The prior approach of using the 
/dev/fd/nnn, where nnn corresponded to an actual QEMU fd, made this easy.

>
>>
>>  From 55a264b647d90a30c1cc00cb1896535246bcd9b7 Mon Sep 17 00:00:00 2001
>> From: "Daniel P. Berrange" <berrange@redhat.com>
>> Date: Tue, 26 Jun 2012 15:48:13 +0100
>> Subject: [PATCH] Support passing FDs with the monitor command that
>> uses them
>>
>> Add support for a syntax "/dev/fdlist/N", where 'N' refers to
>> the N'th FD that was passed along with the currently processing
>> command in this thread.
>>
>> The QMP client sends
>>
>>          {
>>            "execute": "drive_add",
>>            "arguments": {
>>                 "filename": "/dev/fdlist/0",
>>                 "backingfilename": "/dev/fdlist/1" },
>>            "nfds": 2
>>          }
>>
>> followed by "nfds" * SCM_RIGHTS packets
>>
>> The FDs received along with the "drive_add" command are placed
>> in a thread-local list, then the QMP handler function is invoked.
>> The qemu_open() function will resolve any path "/dev/fdlist/0"
>> to use an FD stored in the  thread-local list, and dup() it.
>> One the QMP handler function is finished, all FDs in the thread
>> local list are closed.
>>
>> THe reason for choice of "/dev/fdlist/N", is to allow the
>> "/dev/fd/N" syntax to be used for FDs that are passed to QEMU
>> at startup time, which will be in the global FD number namespace,
>> not the thread-local index.
>>
>> The reason for using a thread local is to ensure the FDs are
>> only made available to the currently executing monitor command
>> handler in this thread, and not any other threads.
>>
>> The advantage over using a separate 'passfd' command, is that
>> this guarentees all FDs are closed in QEMU once the QMP command
>> handler is finished. There is no risk of leaks being triggered
>> by the client either intentionally, or via it crashing.
>>
>> NB: this is a proof of concept demonstration of the overall
>> architectural design. The patch would need more work before
>> it was suitable for merging
>>
>>   - Pick a better place/file for the fdlist APIs
>>   - Do proper error reporting in various places
>>   - Check for possibility that qemu_chr_fe_get_msgfd
>>     can block or return EAGAIN.
>>
>> diff --git a/monitor.c b/monitor.c
>> index f6107ba..a3a4bd4 100644
>> --- a/monitor.c
>> +++ b/monitor.c
>> @@ -4490,6 +4490,8 @@ static void handle_qmp_command(JSONMessageParser
>> *parser, QList *tokens)
>>       const mon_cmd_t *cmd;
>>       const char *cmd_name;
>>       Monitor *mon = cur_mon;
>> +    int nfds;
>> +    size_t i;
>>
>>       args = input = NULL;
>>
>> @@ -4535,6 +4537,17 @@ static void
>> handle_qmp_command(JSONMessageParser *parser, QList *tokens)
>>           goto err_out;
>>       }
>>
>> +    nfds = qdict_get_int(input, "nfds");
>> +    for (i = 0 ; i < nfds ; i++) {
>> +    int fd = qemu_chr_fe_get_msgfd(mon->chr);
>> +    if (fd == -1) {
>> +        qerror_report(QERR_FD_NOT_SUPPLIED);
>> +        goto err_out;
>> +    }
>> +
>> +    qemu_fdlist_add(fd);
>> +    }
>> +
>>       if (handler_is_async(cmd)) {
>>           err = qmp_async_cmd_handler(mon, cmd, args);
>>           if (err) {
>> @@ -4552,6 +4565,7 @@ err_out:
>>   out:
>>       QDECREF(input);
>>       QDECREF(args);
>> +    qemu_fdlist_closeall();
>>   }
>>
>>   /**
>> diff --git a/osdep.c b/osdep.c
>> index 03817f0..0849cb6 100644
>> --- a/osdep.c
>> +++ b/osdep.c
>> @@ -75,6 +75,81 @@ int qemu_madvise(void *addr, size_t len, int advice)
>>   #endif
>>   }
>>
>> +typedef struct QEMUFDList {
>> +    int *fds;
>> +    size_t nfds;
>> +} QEMUFDList;
>> +
>> +static pthread_key_t fdlist_key;
>> +
>> +static void qemu_fdlist_cleanup(void *data)
>> +{
>> +    size_t i;
>> +    QEMUFDList *fdlist = data;
>> +
>> +    if (!fdlist)
>> +    return;
>> +
>> +    for (i = 0 ; i < fdlist->nfds ; i++) {
>> +    close(fdlist->fds[i]);
>> +    }
>> +}
>> +
>> +int qemu_fdlist_init(void)
>> +{
>> +    if (pthread_key_create(&fdlist_key,
>> +               qemu_fdlist_cleanup) < 0)
>> +    return -1;
>> +
>> +    return 0;
>> +}
>> +
>> +static QEMUFDList *qemu_fdlist_for_thread(void)
>> +{
>> +    QEMUFDList *fdlist = pthread_getspecific(fdlist_key);
>> +
>> +    if (fdlist == NULL) {
>> +    fdlist = g_new0(QEMUFDList, 1);
>> +    pthread_setspecific(fdlist_key, fdlist);
>> +    }
>> +
>> +    return fdlist;
>> +}
>> +
>> +void qemu_fdlist_add(int fd)
>> +{
>> +    QEMUFDList *fdlist = qemu_fdlist_for_thread();
>> +
>> +    fdlist->fds = g_renew(int, fdlist->fds, fdlist->nfds + 1);
>> +    fdlist->fds[fdlist->nfds++] = fd;
>> +}
>> +
>> +int qemu_fdlist_dup(int idx)
>> +{
>> +    QEMUFDList *fdlist = qemu_fdlist_for_thread();
>> +
>> +    if (idx >= fdlist->nfds) {
>> +    errno = EINVAL;
>> +    return -1;
>> +    }
>> +
>> +    return dup(fdlist->fds[idx]);
>> +}
>> +
>> +void qemu_fdlist_closeall(void)
>> +{
>> +    QEMUFDList *fdlist = qemu_fdlist_for_thread();
>> +    size_t i;
>> +
>> +    for (i = 0 ; i < fdlist->nfds ; i++) {
>> +    close(fdlist->fds[i]);
>> +    }
>> +    g_free(fdlist->fds);
>> +    fdlist->fds = NULL;
>> +    fdlist->nfds = 0;
>> +}
>> +
>> +
>>
>>   /*
>>    * Opens a file with FD_CLOEXEC set
>> @@ -83,6 +158,26 @@ int qemu_open(const char *name, int flags, ...)
>>   {
>>       int ret;
>>       int mode = 0;
>> +    const char *p;
>> +
>> +    /* Attempt dup of fd for pre-opened file */
>> +    if (strstart(name, "/dev/fdlist/", &p)) {
>> +        int idx;
>> +    int fd;
>> +
>> +        idx = qemu_parse_fd(p);
>> +        if (idx == -1) {
>> +            return -1;
>> +        }
>> +
>> +    fd = qemu_fdlist_dup(idx);
>> +    if (fd == -1) {
>> +        return -1;
>> +    }
>> +
>> +    /* XXX verify flags match */
>> +    return fd;
>> +    }
>>
>>       if (flags & O_CREAT) {
>>           va_list ap;
>> diff --git a/qemu-common.h b/qemu-common.h
>> index 8f87e41..b6b7203 100644
>> --- a/qemu-common.h
>> +++ b/qemu-common.h
>> @@ -194,6 +194,11 @@ ssize_t qemu_send_full(int fd, const void *buf,
>> size_t count, int flags)
>>   ssize_t qemu_recv_full(int fd, void *buf, size_t count, int flags)
>>       QEMU_WARN_UNUSED_RESULT;
>>
>> +int qemu_fdlist_init(void);
>> +void qemu_fdlist_add(int fd);
>> +int qemu_fdlist_dup(int idx);
>> +void qemu_fdlist_closeall(void);
>> +
>>   #ifndef _WIN32
>>   int qemu_eventfd(int pipefd[2]);
>>   int qemu_pipe(int pipefd[2]);
>> diff --git a/vl.c b/vl.c
>> index 1329c30..e515c84 100644
>> --- a/vl.c
>> +++ b/vl.c
>> @@ -2321,6 +2321,11 @@ int main(int argc, char **argv, char **envp)
>>       QLIST_INIT (&vm_change_state_head);
>>       os_setup_early_signal_handling();
>>
>> +    if (qemu_fdlist_init() < 0) {
>> +    perror("fdlist");
>> +    exit(EXIT_FAILURE);
>> +    }
>> +
>>       module_call_init(MODULE_INIT_MACHINE);
>>       machine = find_default_machine();
>>       cpu_model = NULL;
>>
>
>
Daniel P. Berrangé June 26, 2012, 8:42 p.m. UTC | #3
On Tue, Jun 26, 2012 at 02:40:03PM -0400, Corey Bryant wrote:
> 
> 
> On 06/26/2012 11:37 AM, Corey Bryant wrote:
> >
> >
> >On 06/26/2012 11:03 AM, Daniel P. Berrange wrote:
> >>On Tue, Jun 26, 2012 at 03:11:40PM +0100, Daniel P. Berrange wrote:
> >>>On Tue, Jun 26, 2012 at 09:52:51AM -0400, Corey Bryant wrote:
> >>>>>So now from a client's POV you'd have a flow like
> >>>>>
> >>>>>    * drive_add "file=/dev/fd/N"  FDSET={N}
> >>>>
> >>>>IIUC then drive_add would loop and pass each fd in the set via
> >>>>SCM_RIGHTS?
> >>>
> >>>Yes, you'd probably use the JSON to tell QEMU exactly
> >>>how many FDs you're intending to pass with the command,
> >>>and then once the command is received, you'd have N*SCM_RIGHTS
> >>>messages sent/received.
> >>>
> >>>>>
> >>>>>And in QEMU you'd have something like
> >>>>>
> >>>>>    * handle_monitor_command
> >>>>>         - recvmsg all FDs, and stash them in a thread local
> >>>>>"FDContext"
> >>>>>           context
> >>>>>         - invoke monitor command handler
> >>>>>               - Sees file=/dev/fd/N
> >>>>>               - Fetch /dev/fd/N from "FDContext"
> >>>>>               - If success remove /dev/fd/N from "FDContext"
> >>>>>         - close() all FDs left in "FDContext"
> >>>>>
> >>>>>The key point with this is that because the FDs are directly
> >>>>>associated with a monitor command, QEMU can /guarantee/ that
> >>>>>FDs are never leaked, regardless of client behaviour.
> >>>>
> >>>>Wouldn't this leak fds if libvirt crashed part way through sending
> >>>>the set of fds?
> >>>
> >>>No, because you've scoped the FDs to the current monitor instance,
> >>>and the current command being processed you know to close all FDs
> >>>when the associated command has finished, as well as closing them
> >>>all if you see EOF on the monitor socket while in the middle of
> >>>receiving a command.
> >>
> >>Here is a quick proof of concept (ie untested) patch to demonstrate
> >>what I mean. It relies on Cory's patch which converts everything
> >>to use qemu_open. It is also still valuable to make the change
> >>to qemu_open() to support "/dev/fd/N" for passing FDs during
> >>QEMU initial startup for CLI args.
> >>
> >>IMHO, what I propose here is preferrable for QMP clients that
> >>our current plan of requiring use of 3 monitor commands (passfd,
> >>XXXXX, closefd).
> >
> >Thanks for the PoC.
> >
> >Two other required updates that I can think of would be:
> >
> >1) Update change, block_stream, block_reopen, snapshot_blkdev, and
> >perhaps other monitor commands to support receiving fd's via SCM_RIGHTS.
> >
> 
> Nevermind my comment.  I see that your PoC supports passing nfds for
> any QMP command.
> 
> I'm curious what Kevin's thoughts are on this and the overall approach.
> 
> >2) Support re-opening files with different access modes (O_RDONLY,
> >O_WRONLY, or O_RDWR).  This would be similar to the force option for
> >pass-fd.
> >
> 
> I'm still not quite sure how we'd go about this.  We need a way to
> determine the existing QEMU fd that is to be re-associated with the
> new fd, when using a /dev/fdlist/0 filename.  In this new approach,
> 0 corresponds to an index, not an fd.  The prior approach of using
> the /dev/fd/nnn, where nnn corresponded to an actual QEMU fd, made
> this easy.

Hmm, I'm not sure I understand what the use case is for the 'force'
parameter ? In my proof of concept I left out the  block of code
from qemu_open() you had for dup'ing the FD with various different
flags set, but that was just for brevity. I think it ought to fit
in the same way, unless you're refering to a different area of the
code.

>> >>@@ -83,6 +158,26 @@ int qemu_open(const char *name, int flags, ...)
> >>  {
> >>      int ret;
> >>      int mode = 0;
> >>+    const char *p;
> >>+
> >>+    /* Attempt dup of fd for pre-opened file */
> >>+    if (strstart(name, "/dev/fdlist/", &p)) {
> >>+        int idx;
> >>+    int fd;
> >>+
> >>+        idx = qemu_parse_fd(p);
> >>+        if (idx == -1) {
> >>+            return -1;
> >>+        }
> >>+
> >>+    fd = qemu_fdlist_dup(idx);
> >>+    if (fd == -1) {
> >>+        return -1;
> >>+    }
> >>+
> >>+    /* XXX verify flags match */

eg, this part of the code you had some work related to
'flags'

> >>+    return fd;
> >>+    }
> >>
> >>      if (flags & O_CREAT) {
> >>          va_list ap;

Daniel
Corey Bryant June 26, 2012, 10:46 p.m. UTC | #4
On 06/26/2012 04:42 PM, Daniel P. Berrange wrote:
> On Tue, Jun 26, 2012 at 02:40:03PM -0400, Corey Bryant wrote:
>>
>>
>> On 06/26/2012 11:37 AM, Corey Bryant wrote:
>>>
>>>
>>> On 06/26/2012 11:03 AM, Daniel P. Berrange wrote:
>>>> On Tue, Jun 26, 2012 at 03:11:40PM +0100, Daniel P. Berrange wrote:
>>>>> On Tue, Jun 26, 2012 at 09:52:51AM -0400, Corey Bryant wrote:
>>>>>>> So now from a client's POV you'd have a flow like
>>>>>>>
>>>>>>>     * drive_add "file=/dev/fd/N"  FDSET={N}
>>>>>>
>>>>>> IIUC then drive_add would loop and pass each fd in the set via
>>>>>> SCM_RIGHTS?
>>>>>
>>>>> Yes, you'd probably use the JSON to tell QEMU exactly
>>>>> how many FDs you're intending to pass with the command,
>>>>> and then once the command is received, you'd have N*SCM_RIGHTS
>>>>> messages sent/received.
>>>>>
>>>>>>>
>>>>>>> And in QEMU you'd have something like
>>>>>>>
>>>>>>>     * handle_monitor_command
>>>>>>>          - recvmsg all FDs, and stash them in a thread local
>>>>>>> "FDContext"
>>>>>>>            context
>>>>>>>          - invoke monitor command handler
>>>>>>>                - Sees file=/dev/fd/N
>>>>>>>                - Fetch /dev/fd/N from "FDContext"
>>>>>>>                - If success remove /dev/fd/N from "FDContext"
>>>>>>>          - close() all FDs left in "FDContext"
>>>>>>>
>>>>>>> The key point with this is that because the FDs are directly
>>>>>>> associated with a monitor command, QEMU can /guarantee/ that
>>>>>>> FDs are never leaked, regardless of client behaviour.
>>>>>>
>>>>>> Wouldn't this leak fds if libvirt crashed part way through sending
>>>>>> the set of fds?
>>>>>
>>>>> No, because you've scoped the FDs to the current monitor instance,
>>>>> and the current command being processed you know to close all FDs
>>>>> when the associated command has finished, as well as closing them
>>>>> all if you see EOF on the monitor socket while in the middle of
>>>>> receiving a command.
>>>>
>>>> Here is a quick proof of concept (ie untested) patch to demonstrate
>>>> what I mean. It relies on Cory's patch which converts everything
>>>> to use qemu_open. It is also still valuable to make the change
>>>> to qemu_open() to support "/dev/fd/N" for passing FDs during
>>>> QEMU initial startup for CLI args.
>>>>
>>>> IMHO, what I propose here is preferrable for QMP clients that
>>>> our current plan of requiring use of 3 monitor commands (passfd,
>>>> XXXXX, closefd).
>>>
>>> Thanks for the PoC.
>>>
>>> Two other required updates that I can think of would be:
>>>
>>> 1) Update change, block_stream, block_reopen, snapshot_blkdev, and
>>> perhaps other monitor commands to support receiving fd's via SCM_RIGHTS.
>>>
>>
>> Nevermind my comment.  I see that your PoC supports passing nfds for
>> any QMP command.
>>
>> I'm curious what Kevin's thoughts are on this and the overall approach.
>>
>>> 2) Support re-opening files with different access modes (O_RDONLY,
>>> O_WRONLY, or O_RDWR).  This would be similar to the force option for
>>> pass-fd.
>>>
>>
>> I'm still not quite sure how we'd go about this.  We need a way to
>> determine the existing QEMU fd that is to be re-associated with the
>> new fd, when using a /dev/fdlist/0 filename.  In this new approach,
>> 0 corresponds to an index, not an fd.  The prior approach of using
>> the /dev/fd/nnn, where nnn corresponded to an actual QEMU fd, made
>> this easy.
>
> Hmm, I'm not sure I understand what the use case is for the 'force'
> parameter ? In my proof of concept I left out the  block of code
> from qemu_open() you had for dup'ing the FD with various different
> flags set, but that was just for brevity. I think it ought to fit
> in the same way, unless you're refering to a different area of the
> code.
>

The access modes (O_RDONLY, O_WRONLY, or O_RDWR) can only be set on the 
open() call, which is performed by libvirt.  fcntl(F_SETFL) can't change 
them.  So the point of pass-fd's force option is to allow libvirt to 
open() the same file with a new access mode, pass it to qemu, and copy 
it to the existing qemu fd. For example:

fd1 = open("/path/to/file.img", O_RDONLY)
fd2 = open("/path/to/file.img", O_RDWR)
pass-fd disk0 (fd1) returns fd=3 (/dev/fd/3 in qemu has O_RDONLY)
drive_add file:/dev/fd/3
pass-fd disk0 (fd2) returns fd=3 (/dev/fd/3 in qemu now has O_RDWR)

This is required because the access_mode flags used by qemu_open() are 
the same as the passed fd's access_mode flags.  And qemu will re-open 
files with different access modes.

>>>>> @@ -83,6 +158,26 @@ int qemu_open(const char *name, int flags, ...)
>>>>   {
>>>>       int ret;
>>>>       int mode = 0;
>>>> +    const char *p;
>>>> +
>>>> +    /* Attempt dup of fd for pre-opened file */
>>>> +    if (strstart(name, "/dev/fdlist/", &p)) {
>>>> +        int idx;
>>>> +    int fd;
>>>> +
>>>> +        idx = qemu_parse_fd(p);
>>>> +        if (idx == -1) {
>>>> +            return -1;
>>>> +        }
>>>> +
>>>> +    fd = qemu_fdlist_dup(idx);
>>>> +    if (fd == -1) {
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    /* XXX verify flags match */
>
> eg, this part of the code you had some work related to
> 'flags'
>

Yes, no problem and thanks again for the code. :)  I'm hoping we can get 
some more input soon so I can get moving in one direction or another.

>>>> +    return fd;
>>>> +    }
>>>>
>>>>       if (flags & O_CREAT) {
>>>>           va_list ap;
>
> Daniel
>
Kevin Wolf June 27, 2012, 8:58 a.m. UTC | #5
Am 26.06.2012 20:40, schrieb Corey Bryant:
>>> Here is a quick proof of concept (ie untested) patch to demonstrate
>>> what I mean. It relies on Cory's patch which converts everything
>>> to use qemu_open. It is also still valuable to make the change
>>> to qemu_open() to support "/dev/fd/N" for passing FDs during
>>> QEMU initial startup for CLI args.
>>>
>>> IMHO, what I propose here is preferrable for QMP clients that
>>> our current plan of requiring use of 3 monitor commands (passfd,
>>> XXXXX, closefd).
>>
>> Thanks for the PoC.
>>
>> Two other required updates that I can think of would be:
>>
>> 1) Update change, block_stream, block_reopen, snapshot_blkdev, and
>> perhaps other monitor commands to support receiving fd's via SCM_RIGHTS.
>>
> 
> Nevermind my comment.  I see that your PoC supports passing nfds for any 
> QMP command.
> 
> I'm curious what Kevin's thoughts are on this and the overall approach.

I'm not against introducing this nfd thing as a general feature for QMP
commands instead of a separate pass-fd command. It's not obvious to me
that everyone would agree with that, so let's CC Luiz at least.

The that I'm unsure about is what we should do with qemu reopening the
file. If you close the fd immediately, you obviously can't do that any
more. Even worse, libvirt doesn't have a unique ID for each passed file
descriptor any more, so even though we have introduced a QMP feature for
file descriptor passing, we would still need to touch all commands to
allow assigning a new fd.

I think having one stable original fd that libvirt can refer to is much
nicer.

Kevin
Luiz Capitulino June 28, 2012, 1:53 p.m. UTC | #6
On Wed, 27 Jun 2012 10:58:01 +0200
Kevin Wolf <kwolf@redhat.com> wrote:

> Am 26.06.2012 20:40, schrieb Corey Bryant:
> >>> Here is a quick proof of concept (ie untested) patch to demonstrate
> >>> what I mean. It relies on Cory's patch which converts everything
> >>> to use qemu_open. It is also still valuable to make the change
> >>> to qemu_open() to support "/dev/fd/N" for passing FDs during
> >>> QEMU initial startup for CLI args.
> >>>
> >>> IMHO, what I propose here is preferrable for QMP clients that
> >>> our current plan of requiring use of 3 monitor commands (passfd,
> >>> XXXXX, closefd).
> >>
> >> Thanks for the PoC.
> >>
> >> Two other required updates that I can think of would be:
> >>
> >> 1) Update change, block_stream, block_reopen, snapshot_blkdev, and
> >> perhaps other monitor commands to support receiving fd's via SCM_RIGHTS.
> >>
> > 
> > Nevermind my comment.  I see that your PoC supports passing nfds for any 
> > QMP command.
> > 
> > I'm curious what Kevin's thoughts are on this and the overall approach.
> 
> I'm not against introducing this nfd thing as a general feature for QMP
> commands instead of a separate pass-fd command. It's not obvious to me
> that everyone would agree with that, so let's CC Luiz at least.

I don't have objections either, but I want to point out two things.

First, we've agreed on adding new commands instead of extending existing
ones. I feel that not everyone is agreement with this and maybe we could
revisit this topic, but in principle new commands should be added.

Secondly, this is just a small suggestion, but we're in a point in time
where several block commands are being added. I think it's a good moment
to think hard how the ideal block API would look like and try to design
new commands based on that.
diff mbox

Patch

diff --git a/monitor.c b/monitor.c
index f6107ba..a3a4bd4 100644
--- a/monitor.c
+++ b/monitor.c
@@ -4490,6 +4490,8 @@  static void handle_qmp_command(JSONMessageParser *parser, QList *tokens)
     const mon_cmd_t *cmd;
     const char *cmd_name;
     Monitor *mon = cur_mon;
+    int nfds;
+    size_t i;
 
     args = input = NULL;
 
@@ -4535,6 +4537,17 @@  static void handle_qmp_command(JSONMessageParser *parser, QList *tokens)
         goto err_out;
     }
 
+    nfds = qdict_get_int(input, "nfds");
+    for (i = 0 ; i < nfds ; i++) {
+	int fd = qemu_chr_fe_get_msgfd(mon->chr);
+	if (fd == -1) {
+	    qerror_report(QERR_FD_NOT_SUPPLIED);
+	    goto err_out;
+	}
+
+	qemu_fdlist_add(fd);
+    }
+
     if (handler_is_async(cmd)) {
         err = qmp_async_cmd_handler(mon, cmd, args);
         if (err) {
@@ -4552,6 +4565,7 @@  err_out:
 out:
     QDECREF(input);
     QDECREF(args);
+    qemu_fdlist_closeall();
 }
 
 /**
diff --git a/osdep.c b/osdep.c
index 03817f0..0849cb6 100644
--- a/osdep.c
+++ b/osdep.c
@@ -75,6 +75,81 @@  int qemu_madvise(void *addr, size_t len, int advice)
 #endif
 }
 
+typedef struct QEMUFDList {
+    int *fds;
+    size_t nfds;
+} QEMUFDList;
+
+static pthread_key_t fdlist_key;
+
+static void qemu_fdlist_cleanup(void *data)
+{
+    size_t i;
+    QEMUFDList *fdlist = data;
+
+    if (!fdlist)
+	return;
+
+    for (i = 0 ; i < fdlist->nfds ; i++) {
+	close(fdlist->fds[i]);
+    }
+}
+
+int qemu_fdlist_init(void)
+{
+    if (pthread_key_create(&fdlist_key,
+			   qemu_fdlist_cleanup) < 0)
+	return -1;
+
+    return 0;
+}
+
+static QEMUFDList *qemu_fdlist_for_thread(void)
+{
+    QEMUFDList *fdlist = pthread_getspecific(fdlist_key);
+
+    if (fdlist == NULL) {
+	fdlist = g_new0(QEMUFDList, 1);
+	pthread_setspecific(fdlist_key, fdlist);
+    }
+
+    return fdlist;
+}
+
+void qemu_fdlist_add(int fd)
+{
+    QEMUFDList *fdlist = qemu_fdlist_for_thread();
+
+    fdlist->fds = g_renew(int, fdlist->fds, fdlist->nfds + 1);
+    fdlist->fds[fdlist->nfds++] = fd;
+}
+
+int qemu_fdlist_dup(int idx)
+{
+    QEMUFDList *fdlist = qemu_fdlist_for_thread();
+
+    if (idx >= fdlist->nfds) {
+	errno = EINVAL;
+	return -1;
+    }
+
+    return dup(fdlist->fds[idx]);
+}
+
+void qemu_fdlist_closeall(void)
+{
+    QEMUFDList *fdlist = qemu_fdlist_for_thread();
+    size_t i;
+
+    for (i = 0 ; i < fdlist->nfds ; i++) {
+	close(fdlist->fds[i]);
+    }
+    g_free(fdlist->fds);
+    fdlist->fds = NULL;
+    fdlist->nfds = 0;
+}
+
+
 
 /*
  * Opens a file with FD_CLOEXEC set
@@ -83,6 +158,26 @@  int qemu_open(const char *name, int flags, ...)
 {
     int ret;
     int mode = 0;
+    const char *p;
+
+    /* Attempt dup of fd for pre-opened file */
+    if (strstart(name, "/dev/fdlist/", &p)) {
+        int idx;
+	int fd;
+
+        idx = qemu_parse_fd(p);
+        if (idx == -1) {
+            return -1;
+        }
+
+	fd = qemu_fdlist_dup(idx);
+	if (fd == -1) {
+	    return -1;
+	}
+
+	/* XXX verify flags match */
+	return fd;
+    }
 
     if (flags & O_CREAT) {
         va_list ap;
diff --git a/qemu-common.h b/qemu-common.h
index 8f87e41..b6b7203 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -194,6 +194,11 @@  ssize_t qemu_send_full(int fd, const void *buf, size_t count, int flags)
 ssize_t qemu_recv_full(int fd, void *buf, size_t count, int flags)
     QEMU_WARN_UNUSED_RESULT;
 
+int qemu_fdlist_init(void);
+void qemu_fdlist_add(int fd);
+int qemu_fdlist_dup(int idx);
+void qemu_fdlist_closeall(void);
+
 #ifndef _WIN32
 int qemu_eventfd(int pipefd[2]);
 int qemu_pipe(int pipefd[2]);
diff --git a/vl.c b/vl.c
index 1329c30..e515c84 100644
--- a/vl.c
+++ b/vl.c
@@ -2321,6 +2321,11 @@  int main(int argc, char **argv, char **envp)
     QLIST_INIT (&vm_change_state_head);
     os_setup_early_signal_handling();
 
+    if (qemu_fdlist_init() < 0) {
+	perror("fdlist");
+	exit(EXIT_FAILURE);
+    }
+
     module_call_init(MODULE_INIT_MACHINE);
     machine = find_default_machine();
     cpu_model = NULL;