Patchwork [v3,02/10] main-loop: switch to g_poll() on POSIX hosts

login
register
mail settings
Submitter Stefan Hajnoczi
Date Feb. 4, 2013, 12:12 p.m.
Message ID <1359979973-31338-3-git-send-email-stefanha@redhat.com>
Download mbox | patch
Permalink /patch/217914/
State New
Headers show

Comments

Stefan Hajnoczi - Feb. 4, 2013, 12:12 p.m.
Use g_poll(3) instead of select(2).  Well, this is kind of a cheat.
It's true that we're now using g_poll(3) on POSIX hosts but the *_fill()
and *_poll() functions are still using rfds/wfds/xfds.

We've set the scene to start converting *_fill() and *_poll() functions
step-by-step until no more rfds/wfds/xfds users remain.  Then we'll drop
the temporary gpollfds_from_select() and gpollfds_to_select() functions
and be left with native g_poll(2).

On Windows things are a little crazy: convert from rfds/wfds/xfds to
GPollFDs, back to rfds/wfds/xfds, call select(2), rfds/wfds/xfds back to
GPollFDs, and finally back to rfds/wfds/xfds again.  This is only
temporary and keeps the Windows build working through the following
patches.  We'll drop this excessive conversion later and be left with a
single GPollFDs -> select(2) -> GPollFDs sequence that allows Windows to
use select(2) while the rest of QEMU only knows about GPollFD.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 main-loop.c | 135 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 127 insertions(+), 8 deletions(-)
Laszlo Ersek - Feb. 4, 2013, 2:37 p.m.
Comments in-line.

On 02/04/13 13:12, Stefan Hajnoczi wrote:
> Use g_poll(3) instead of select(2).  Well, this is kind of a cheat.
> It's true that we're now using g_poll(3) on POSIX hosts but the *_fill()
> and *_poll() functions are still using rfds/wfds/xfds.
>
> We've set the scene to start converting *_fill() and *_poll() functions
> step-by-step until no more rfds/wfds/xfds users remain.  Then we'll drop
> the temporary gpollfds_from_select() and gpollfds_to_select() functions
> and be left with native g_poll(2).
>
> On Windows things are a little crazy: convert from rfds/wfds/xfds to
> GPollFDs, back to rfds/wfds/xfds, call select(2), rfds/wfds/xfds back to
> GPollFDs, and finally back to rfds/wfds/xfds again.  This is only
> temporary and keeps the Windows build working through the following
> patches.  We'll drop this excessive conversion later and be left with a
> single GPollFDs -> select(2) -> GPollFDs sequence that allows Windows to
> use select(2) while the rest of QEMU only knows about GPollFD.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  main-loop.c | 135 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 127 insertions(+), 8 deletions(-)

I assume this is complex because the pre-patch situation is overly
complex as well:

                             slirp -> fd_set
                         iohandler -> fd_set
             unix                                   windows
---------------------------------  --------------------------------------------
    before            after               before                after
--------------  -----------------  ---------------------  ---------------------
glib -> fd_set  glib -> fd_set     glib    -> GPollFD     glib    -> GPollFD
                fd_set -> GPollFD  WaitObj -> GPollFD     WaitObj -> GPollFD

SELECT          G_POLL             G_POLL (glib/WaitObj)  G_POLL (glib/WaitObj)

                                                          fd_set -> GPollFD
                                                          GPollFD -> fd_set

                                   SELECT (slirp/ioh.)    SELECT (slirp/ioh.)

(I'm ignoring the after-select / after-poll operations; they (should)
mirror the pre-select / pre-poll ones.)

No idea why the windows version has been mixing g_poll() and select().
I'd hope that this series kills select() for uniformity's sake, but the
00/10 subject and the commit msg for this patch indicate otherwise.

"main-loop.c" is full of global variables which makes it a *pain* to read.

>
> diff --git a/main-loop.c b/main-loop.c
> index d0d8fe4..f1dcd14 100644
> --- a/main-loop.c
> +++ b/main-loop.c
> @@ -117,6 +117,8 @@ void qemu_notify_event(void)
>      aio_notify(qemu_aio_context);
>  }
>
> +static GArray *gpollfds;
> +
>  int qemu_init_main_loop(void)
>  {
>      int ret;
> @@ -133,6 +135,7 @@ int qemu_init_main_loop(void)
>          return ret;
>      }
>
> +    gpollfds = g_array_new(FALSE, FALSE, sizeof(GPollFD));
>      qemu_aio_context = aio_context_new();
>      src = aio_get_g_source(qemu_aio_context);
>      g_source_attach(src, NULL);
> @@ -146,6 +149,62 @@ static GPollFD poll_fds[1024 * 2]; /* this is probably overkill */
>  static int n_poll_fds;
>  static int max_priority;
>
> +/* Load rfds/wfds/xfds into gpollfds.  Will be removed a few commits later. */
> +static void gpollfds_from_select(void)
> +{
> +    int fd;
> +    for (fd = 0; fd <= nfds; fd++) {

"nfds" is an example of a global variable being global ("file scope")
for no good reason. Anyway it does seem to have inclusive meaning
("highest file descriptor in all sets").

> +        int events = 0;
> +        if (FD_ISSET(fd, &rfds)) {
> +            events |= G_IO_IN | G_IO_HUP | G_IO_ERR;
> +        }

("=" would have sufficed)

> +        if (FD_ISSET(fd, &wfds)) {
> +            events |= G_IO_OUT | G_IO_ERR;
> +        }
> +        if (FD_ISSET(fd, &xfds)) {
> +            events |= G_IO_PRI;
> +        }
> +        if (events) {
> +            GPollFD pfd = {
> +                .fd = fd,
> +                .events = events,
> +            };
> +            g_array_append_val(gpollfds, pfd);
> +        }
> +    }
> +}

(I don't like "gpollfds" being global, but) this seems OK. It matches
the glib docs and also How the Mapping Should Be Done (TM).

This function corresponds to the "unix / after / fd_set -> GPollFD"
entry of the diagram, and as such it is composed with "glib -> fd_set"
(glib_select_fill()). I'll note that glib_select_fill() sets "xfds" for
G_IO_ERR but not G_IO_PRI. (According to SUSv4, in our case xfds is the
logical-or of "error pending" and "OOB/urgent pending".)

We can assume that each entry stored by g_main_context_query() will have
at least one of IN and OUT set in "events". Further assuming that glib
follows its own documentation, that implies that G_IO_ERR will be set
unconditionally (because it always accompanies each of IN and OUT).
glib_select_fill() will map that to xfds, which we then map here to
G_IO_PRI. The total effect is that G_IO_PRI is set on all file
descriptors, always, even if we only try to write.

I think ultimately support for OOB data should be killed throughout, as
a policy, including xfds & G_IO_PRI. Pending errors are returned by
read()/write()/etc just fine; see eg. libvirt commit d19149dd.

> +
> +/* Store gpollfds revents into rfds/wfds/xfds.  Will be removed a few commits
> + * later.
> + */
> +static void gpollfds_to_select(int ret)
> +{
> +    int i;
> +
> +    FD_ZERO(&rfds);
> +    FD_ZERO(&wfds);
> +    FD_ZERO(&xfds);
> +
> +    if (ret <= 0) {
> +        return;
> +    }

ie. for g_poll() timeouts, errors & signal interrupts, we clear the sets
and that's it. OK.

> +
> +    for (i = 0; i < gpollfds->len; i++) {
> +        int fd = g_array_index(gpollfds, GPollFD, i).fd;
> +        int revents = g_array_index(gpollfds, GPollFD, i).revents;
> +
> +        if (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR)) {
> +            FD_SET(fd, &rfds);
> +        }
> +        if (revents & (G_IO_OUT | G_IO_ERR)) {
> +            FD_SET(fd, &wfds);
> +        }
> +        if (revents & G_IO_PRI) {
> +            FD_SET(fd, &xfds);
> +        }
> +    }
> +}

Looks good.

> +
>  #ifndef _WIN32
>  static void glib_select_fill(int *max_fd, fd_set *rfds, fd_set *wfds,
>                               fd_set *xfds, uint32_t *cur_timeout)
> @@ -212,22 +271,22 @@ static void glib_select_poll(fd_set *rfds, fd_set *wfds, fd_set *xfds,
>
>  static int os_host_main_loop_wait(uint32_t timeout)
>  {
> -    struct timeval tv, *tvarg = NULL;
>      int ret;
>
>      glib_select_fill(&nfds, &rfds, &wfds, &xfds, &timeout);
>
> -    if (timeout < UINT32_MAX) {
> -        tvarg = &tv;
> -        tv.tv_sec = timeout / 1000;
> -        tv.tv_usec = (timeout % 1000) * 1000;
> -    }
> -
>      if (timeout > 0) {
>          qemu_mutex_unlock_iothread();
>      }
>
> -    ret = select(nfds + 1, &rfds, &wfds, &xfds, tvarg);
> +    /* We'll eventually drop fd_set completely.  But for now we still have
> +     * *_fill() and *_poll() functions that use rfds/wfds/xfds.
> +     */
> +    gpollfds_from_select();
> +
> +    ret = g_poll((GPollFD *)gpollfds->data, gpollfds->len, timeout);

This exploits for the timeout parameter that (int)UINT32_MAX equals -1
on all supported platforms.

Timeout values in [ INT_MAX+1, UINT_MAX32-1 ] lose their meaning, but I
guess that's no problem in practice.

> +
> +    gpollfds_to_select(ret);
>
>      if (timeout > 0) {
>          qemu_mutex_lock_iothread();
> @@ -327,6 +386,55 @@ void qemu_fd_register(int fd)
>                     FD_CONNECT | FD_WRITE | FD_OOB);
>  }
>
> +static int pollfds_fill(GArray *pollfds, fd_set *rfds, fd_set *wfds,
> +                        fd_set *xfds)
> +{
> +    int nfds = -1;

"nfds" shadows the global "nfds". The return value of pollfds_fill() is
then assigned to the global "nfds". Not very elegant (but it's all
rooted in the existense of the global nfds).

> +    int i;
> +
> +    for (i = 0; i < pollfds->len; i++) {
> +        GPollFD *pfd = &g_array_index(pollfds, GPollFD, i);
> +        int fd = pfd->fd;
> +        int events = pfd->events;
> +        if (events & (G_IO_IN | G_IO_HUP | G_IO_ERR)) {
> +            FD_SET(fd, rfds);
> +            nfds = MAX(nfds, fd);
> +        }
> +        if (events & (G_IO_OUT | G_IO_ERR)) {
> +            FD_SET(fd, wfds);
> +            nfds = MAX(nfds, fd);
> +        }
> +        if (events & G_IO_PRI) {
> +            FD_SET(fd, xfds);
> +            nfds = MAX(nfds, fd);
> +        }
> +    }
> +    return nfds;
> +}
> +
> +static void pollfds_poll(GArray *pollfds, int nfds, fd_set *rfds,
> +                         fd_set *wfds, fd_set *xfds)
> +{

The "nfds" parameter both shadows the global variable and is unused in
the function.

> +    int i;
> +
> +    for (i = 0; i < pollfds->len; i++) {
> +        GPollFD *pfd = &g_array_index(pollfds, GPollFD, i);
> +        int fd = pfd->fd;
> +        int revents = 0;
> +
> +        if (FD_ISSET(fd, rfds)) {
> +            revents |= G_IO_IN | G_IO_HUP | G_IO_ERR;
> +        }

("=" would have sufficed)

> +        if (FD_ISSET(fd, wfds)) {
> +            revents |= G_IO_OUT | G_IO_ERR;
> +        }
> +        if (FD_ISSET(fd, xfds)) {
> +            revents |= G_IO_PRI;
> +        }
> +        pfd->revents |= revents & pfd->events;

I find this |= vs. = more confusing than the other two. "pfd->revents"
is zero here.

> +    }
> +}
> +
>  static int os_host_main_loop_wait(uint32_t timeout)
>  {
>      GMainContext *context = g_main_context_default();
> @@ -382,12 +490,22 @@ static int os_host_main_loop_wait(uint32_t timeout)
>       * improve socket latency.
>       */
>
> +    /* This back-and-forth between GPollFDs and select(2) is temporary.  We'll
> +     * drop it in a couple of patches, I promise :).
> +     */
> +    gpollfds_from_select();
> +    FD_ZERO(&rfds);
> +    FD_ZERO(&wfds);
> +    FD_ZERO(&xfds);
> +    nfds = pollfds_fill(gpollfds, &rfds, &wfds, &xfds);
>      if (nfds >= 0) {
>          select_ret = select(nfds + 1, &rfds, &wfds, &xfds, &tv0);
>          if (select_ret != 0) {
>              timeout = 0;
> +            pollfds_poll(gpollfds, nfds, &rfds, &wfds, &xfds);

This function shouldn't be invoked for "select_ret == -1" (in case
that's possible here); "revents" will end up a copy of "events":

"On failure, the objects pointed to by the readfds, writefds, and
errorfds arguments shall not be modified."

>          }
>      }
> +    gpollfds_to_select(select_ret || g_poll_ret);

I can see this logical-or mimics the return value below; however I think
it's not robust. If one of the operands is -1, and the other is
non-positive, then gpollfds_to_select() should not traverse the
GPollFDs, but it will.

... Actually, "g_poll_ret" should have absolutely no effect on
gpollfds_to_select(): "g_poll_ret" corresponds to the "poll_fds" array,
which covers the glib-internal file descriptors and the WaitObjects.
Here (= in the windows case), "gpollfds" covers only slirp & iohandler,
thus only "select_ret" should be passed in.

>
>      return select_ret || g_poll_ret;

I find this return value unfathomable (how is (-1||0) equivalent to
(1||1)?), but it's out of scope for now.

>  }
> @@ -403,6 +521,7 @@ int main_loop_wait(int nonblocking)
>      }
>
>      /* poll any events */
> +    g_array_set_size(gpollfds, 0); /* reset for new iteration */
>      /* XXX: separate device handlers from system ones */
>      nfds = -1;
>      FD_ZERO(&rfds);

I can't decide (yet) if any of the above is important; probably not.
From a quick look at the tree at the series' end, most of it seems to be
gone by then. I'll let you decide.

Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Stefan Hajnoczi - Feb. 4, 2013, 3:07 p.m.
On Mon, Feb 04, 2013 at 03:37:39PM +0100, Laszlo Ersek wrote:
> I assume this is complex because the pre-patch situation is overly
> complex as well:
> 
>                              slirp -> fd_set
>                          iohandler -> fd_set
>              unix                                   windows
> ---------------------------------  --------------------------------------------
>     before            after               before                after
> --------------  -----------------  ---------------------  ---------------------
> glib -> fd_set  glib -> fd_set     glib    -> GPollFD     glib    -> GPollFD
>                 fd_set -> GPollFD  WaitObj -> GPollFD     WaitObj -> GPollFD
> 
> SELECT          G_POLL             G_POLL (glib/WaitObj)  G_POLL (glib/WaitObj)
> 
>                                                           fd_set -> GPollFD
>                                                           GPollFD -> fd_set
> 
>                                    SELECT (slirp/ioh.)    SELECT (slirp/ioh.)
> 
> (I'm ignoring the after-select / after-poll operations; they (should)
> mirror the pre-select / pre-poll ones.)
> 
> No idea why the windows version has been mixing g_poll() and select().
> I'd hope that this series kills select() for uniformity's sake, but the
> 00/10 subject and the commit msg for this patch indicate otherwise.

This is why I CCed Fabien.  I left the g_poll-followed-by-select
behavior.  It may be to do with Windows treating sockets different from
other objects.  Paolo may know the answer, too.

> "main-loop.c" is full of global variables which makes it a *pain* to read.

Yes.

> > +        if (FD_ISSET(fd, &wfds)) {
> > +            events |= G_IO_OUT | G_IO_ERR;
> > +        }
> > +        if (FD_ISSET(fd, &xfds)) {
> > +            events |= G_IO_PRI;
> > +        }
> > +        if (events) {
> > +            GPollFD pfd = {
> > +                .fd = fd,
> > +                .events = events,
> > +            };
> > +            g_array_append_val(gpollfds, pfd);
> > +        }
> > +    }
> > +}
> 
> (I don't like "gpollfds" being global, but) this seems OK. It matches
> the glib docs and also How the Mapping Should Be Done (TM).
> 
> This function corresponds to the "unix / after / fd_set -> GPollFD"
> entry of the diagram, and as such it is composed with "glib -> fd_set"
> (glib_select_fill()). I'll note that glib_select_fill() sets "xfds" for
> G_IO_ERR but not G_IO_PRI. (According to SUSv4, in our case xfds is the
> logical-or of "error pending" and "OOB/urgent pending".)
>
> We can assume that each entry stored by g_main_context_query() will have
> at least one of IN and OUT set in "events". Further assuming that glib
> follows its own documentation, that implies that G_IO_ERR will be set
> unconditionally (because it always accompanies each of IN and OUT).
> glib_select_fill() will map that to xfds, which we then map here to
> G_IO_PRI. The total effect is that G_IO_PRI is set on all file
> descriptors, always, even if we only try to write.
> 
> I think ultimately support for OOB data should be killed throughout, as
> a policy, including xfds & G_IO_PRI. Pending errors are returned by
> read()/write()/etc just fine; see eg. libvirt commit d19149dd.

slirp uses OOB because it implements TCP.  I don't think we can drop it.

However, later in the series we drop the glib_select_fill()
rfds/wfds/xfds conversion and use GPollFD directly.  That solves these
conversion issues.

> > +
> > +    gpollfds_to_select(ret);
> >
> >      if (timeout > 0) {
> >          qemu_mutex_lock_iothread();
> > @@ -327,6 +386,55 @@ void qemu_fd_register(int fd)
> >                     FD_CONNECT | FD_WRITE | FD_OOB);
> >  }
> >
> > +static int pollfds_fill(GArray *pollfds, fd_set *rfds, fd_set *wfds,
> > +                        fd_set *xfds)
> > +{
> > +    int nfds = -1;
> 
> "nfds" shadows the global "nfds". The return value of pollfds_fill() is
> then assigned to the global "nfds". Not very elegant (but it's all
> rooted in the existense of the global nfds).

Global nfds is dropped later in the series.  I had real trouble picking
good names due to the globals and ended up with temporary shadowing,
which is fixed later.

> > +        if (FD_ISSET(fd, wfds)) {
> > +            revents |= G_IO_OUT | G_IO_ERR;
> > +        }
> > +        if (FD_ISSET(fd, xfds)) {
> > +            revents |= G_IO_PRI;
> > +        }
> > +        pfd->revents |= revents & pfd->events;
> 
> I find this |= vs. = more confusing than the other two. "pfd->revents"
> is zero here.

Yes, it's an interesting feature of pollfds_poll().  It does not clobber
revents.  This means you can apply additional *_poll()-style functions
before calling pollfds_poll().  I thought this was neat but I guess we
can drop it.

> 
> > +    }
> > +}
> > +
> >  static int os_host_main_loop_wait(uint32_t timeout)
> >  {
> >      GMainContext *context = g_main_context_default();
> > @@ -382,12 +490,22 @@ static int os_host_main_loop_wait(uint32_t timeout)
> >       * improve socket latency.
> >       */
> >
> > +    /* This back-and-forth between GPollFDs and select(2) is temporary.  We'll
> > +     * drop it in a couple of patches, I promise :).
> > +     */
> > +    gpollfds_from_select();
> > +    FD_ZERO(&rfds);
> > +    FD_ZERO(&wfds);
> > +    FD_ZERO(&xfds);
> > +    nfds = pollfds_fill(gpollfds, &rfds, &wfds, &xfds);
> >      if (nfds >= 0) {
> >          select_ret = select(nfds + 1, &rfds, &wfds, &xfds, &tv0);
> >          if (select_ret != 0) {
> >              timeout = 0;
> > +            pollfds_poll(gpollfds, nfds, &rfds, &wfds, &xfds);
> 
> This function shouldn't be invoked for "select_ret == -1" (in case
> that's possible here); "revents" will end up a copy of "events":
> 
> "On failure, the objects pointed to by the readfds, writefds, and
> errorfds arguments shall not be modified."

Thanks, will fix.

> 
> >          }
> >      }
> > +    gpollfds_to_select(select_ret || g_poll_ret);
> 
> I can see this logical-or mimics the return value below; however I think
> it's not robust. If one of the operands is -1, and the other is
> non-positive, then gpollfds_to_select() should not traverse the
> GPollFDs, but it will.
> 
> ... Actually, "g_poll_ret" should have absolutely no effect on
> gpollfds_to_select(): "g_poll_ret" corresponds to the "poll_fds" array,
> which covers the glib-internal file descriptors and the WaitObjects.
> Here (= in the windows case), "gpollfds" covers only slirp & iohandler,
> thus only "select_ret" should be passed in.

Will take a look at this for the next version of this patch.

> I can't decide (yet) if any of the above is important; probably not.
> From a quick look at the tree at the series' end, most of it seems to be
> gone by then. I'll let you decide.

Thanks for the review.  If I need to respin I'll address comments.

Stefan
Laszlo Ersek - Feb. 4, 2013, 4:17 p.m.
On 02/04/13 16:07, Stefan Hajnoczi wrote:

> slirp uses OOB because it implements TCP.  I don't think we can drop it.

I considered slirp for that reason, and I did grep the tree for OOB and
saw some hits. I didn't track it down but I thought slirp would use
lower level sockets (SOCK_RAW?) for the TCP implementation. Ie. it would
provide OOB but not depend on it (because OOB doesn't make sense at the
IP level; the urgent pointer is in the TCP header). Anyway it's not
important.

> Yes, it's an interesting feature of pollfds_poll().  It does not clobber
> revents.  This means you can apply additional *_poll()-style functions
> before calling pollfds_poll().  I thought this was neat but I guess we
> can drop it.

Ah I didn't realize that. I think a comment should be enough, should you
respin it.

> Thanks for the review.  If I need to respin I'll address comments.

It's more clear now, thanks!

Laszlo
Paolo Bonzini - Feb. 4, 2013, 5:35 p.m.
Il 04/02/2013 16:07, Stefan Hajnoczi ha scritto:
>> > No idea why the windows version has been mixing g_poll() and select().
>> > I'd hope that this series kills select() for uniformity's sake, but the
>> > 00/10 subject and the commit msg for this patch indicate otherwise.
> This is why I CCed Fabien.  I left the g_poll-followed-by-select
> behavior.  It may be to do with Windows treating sockets different from
> other objects.  Paolo may know the answer, too.
> 

Yes, indeed.

Windows uses g_poll for HANDLEs and select for sockets.  All sockets are
configured to trigger an event handle when there is a change in the
socket availability, so g_poll exits and select is reexecuted.

Paolo
Fabien Chouteau - Feb. 4, 2013, 5:51 p.m.
On 02/04/2013 06:35 PM, Paolo Bonzini wrote:
> Il 04/02/2013 16:07, Stefan Hajnoczi ha scritto:
>>>> No idea why the windows version has been mixing g_poll() and select().
>>>> I'd hope that this series kills select() for uniformity's sake, but the
>>>> 00/10 subject and the commit msg for this patch indicate otherwise.
>> This is why I CCed Fabien.  I left the g_poll-followed-by-select
>> behavior.  It may be to do with Windows treating sockets different from
>> other objects.  Paolo may know the answer, too.
>>
>
> Yes, indeed.
>
> Windows uses g_poll for HANDLEs and select for sockets.  All sockets are
> configured to trigger an event handle when there is a change in the
> socket availability, so g_poll exits and select is reexecuted.
>

Yes, the reason for this select() call is that we know when an event
occurred on one socket, but we don't know what kind of event and which
socket.

Would it be different if we had a HANDLE for each socket, instead of
one for all?
Paolo Bonzini - Feb. 4, 2013, 6:11 p.m.
Il 04/02/2013 18:51, Fabien Chouteau ha scritto:
> On 02/04/2013 06:35 PM, Paolo Bonzini wrote:
>> Il 04/02/2013 16:07, Stefan Hajnoczi ha scritto:
>>>>> No idea why the windows version has been mixing g_poll() and select().
>>>>> I'd hope that this series kills select() for uniformity's sake, but the
>>>>> 00/10 subject and the commit msg for this patch indicate otherwise.
>>> This is why I CCed Fabien.  I left the g_poll-followed-by-select
>>> behavior.  It may be to do with Windows treating sockets different from
>>> other objects.  Paolo may know the answer, too.
>>
>> Windows uses g_poll for HANDLEs and select for sockets.  All sockets are
>> configured to trigger an event handle when there is a change in the
>> socket availability, so g_poll exits and select is reexecuted.
> 
> Yes, the reason for this select() call is that we know when an event
> occurred on one socket, but we don't know what kind of event and which
> socket.
> 
> Would it be different if we had a HANDLE for each socket, instead of
> one for all?

Nope, I think you cannot assign different events for read and write.

Paolo

Patch

diff --git a/main-loop.c b/main-loop.c
index d0d8fe4..f1dcd14 100644
--- a/main-loop.c
+++ b/main-loop.c
@@ -117,6 +117,8 @@  void qemu_notify_event(void)
     aio_notify(qemu_aio_context);
 }
 
+static GArray *gpollfds;
+
 int qemu_init_main_loop(void)
 {
     int ret;
@@ -133,6 +135,7 @@  int qemu_init_main_loop(void)
         return ret;
     }
 
+    gpollfds = g_array_new(FALSE, FALSE, sizeof(GPollFD));
     qemu_aio_context = aio_context_new();
     src = aio_get_g_source(qemu_aio_context);
     g_source_attach(src, NULL);
@@ -146,6 +149,62 @@  static GPollFD poll_fds[1024 * 2]; /* this is probably overkill */
 static int n_poll_fds;
 static int max_priority;
 
+/* Load rfds/wfds/xfds into gpollfds.  Will be removed a few commits later. */
+static void gpollfds_from_select(void)
+{
+    int fd;
+    for (fd = 0; fd <= nfds; fd++) {
+        int events = 0;
+        if (FD_ISSET(fd, &rfds)) {
+            events |= G_IO_IN | G_IO_HUP | G_IO_ERR;
+        }
+        if (FD_ISSET(fd, &wfds)) {
+            events |= G_IO_OUT | G_IO_ERR;
+        }
+        if (FD_ISSET(fd, &xfds)) {
+            events |= G_IO_PRI;
+        }
+        if (events) {
+            GPollFD pfd = {
+                .fd = fd,
+                .events = events,
+            };
+            g_array_append_val(gpollfds, pfd);
+        }
+    }
+}
+
+/* Store gpollfds revents into rfds/wfds/xfds.  Will be removed a few commits
+ * later.
+ */
+static void gpollfds_to_select(int ret)
+{
+    int i;
+
+    FD_ZERO(&rfds);
+    FD_ZERO(&wfds);
+    FD_ZERO(&xfds);
+
+    if (ret <= 0) {
+        return;
+    }
+
+    for (i = 0; i < gpollfds->len; i++) {
+        int fd = g_array_index(gpollfds, GPollFD, i).fd;
+        int revents = g_array_index(gpollfds, GPollFD, i).revents;
+
+        if (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR)) {
+            FD_SET(fd, &rfds);
+        }
+        if (revents & (G_IO_OUT | G_IO_ERR)) {
+            FD_SET(fd, &wfds);
+        }
+        if (revents & G_IO_PRI) {
+            FD_SET(fd, &xfds);
+        }
+    }
+}
+
 #ifndef _WIN32
 static void glib_select_fill(int *max_fd, fd_set *rfds, fd_set *wfds,
                              fd_set *xfds, uint32_t *cur_timeout)
@@ -212,22 +271,22 @@  static void glib_select_poll(fd_set *rfds, fd_set *wfds, fd_set *xfds,
 
 static int os_host_main_loop_wait(uint32_t timeout)
 {
-    struct timeval tv, *tvarg = NULL;
     int ret;
 
     glib_select_fill(&nfds, &rfds, &wfds, &xfds, &timeout);
 
-    if (timeout < UINT32_MAX) {
-        tvarg = &tv;
-        tv.tv_sec = timeout / 1000;
-        tv.tv_usec = (timeout % 1000) * 1000;
-    }
-
     if (timeout > 0) {
         qemu_mutex_unlock_iothread();
     }
 
-    ret = select(nfds + 1, &rfds, &wfds, &xfds, tvarg);
+    /* We'll eventually drop fd_set completely.  But for now we still have
+     * *_fill() and *_poll() functions that use rfds/wfds/xfds.
+     */
+    gpollfds_from_select();
+
+    ret = g_poll((GPollFD *)gpollfds->data, gpollfds->len, timeout);
+
+    gpollfds_to_select(ret);
 
     if (timeout > 0) {
         qemu_mutex_lock_iothread();
@@ -327,6 +386,55 @@  void qemu_fd_register(int fd)
                    FD_CONNECT | FD_WRITE | FD_OOB);
 }
 
+static int pollfds_fill(GArray *pollfds, fd_set *rfds, fd_set *wfds,
+                        fd_set *xfds)
+{
+    int nfds = -1;
+    int i;
+
+    for (i = 0; i < pollfds->len; i++) {
+        GPollFD *pfd = &g_array_index(pollfds, GPollFD, i);
+        int fd = pfd->fd;
+        int events = pfd->events;
+        if (events & (G_IO_IN | G_IO_HUP | G_IO_ERR)) {
+            FD_SET(fd, rfds);
+            nfds = MAX(nfds, fd);
+        }
+        if (events & (G_IO_OUT | G_IO_ERR)) {
+            FD_SET(fd, wfds);
+            nfds = MAX(nfds, fd);
+        }
+        if (events & G_IO_PRI) {
+            FD_SET(fd, xfds);
+            nfds = MAX(nfds, fd);
+        }
+    }
+    return nfds;
+}
+
+static void pollfds_poll(GArray *pollfds, int nfds, fd_set *rfds,
+                         fd_set *wfds, fd_set *xfds)
+{
+    int i;
+
+    for (i = 0; i < pollfds->len; i++) {
+        GPollFD *pfd = &g_array_index(pollfds, GPollFD, i);
+        int fd = pfd->fd;
+        int revents = 0;
+
+        if (FD_ISSET(fd, rfds)) {
+            revents |= G_IO_IN | G_IO_HUP | G_IO_ERR;
+        }
+        if (FD_ISSET(fd, wfds)) {
+            revents |= G_IO_OUT | G_IO_ERR;
+        }
+        if (FD_ISSET(fd, xfds)) {
+            revents |= G_IO_PRI;
+        }
+        pfd->revents |= revents & pfd->events;
+    }
+}
+
 static int os_host_main_loop_wait(uint32_t timeout)
 {
     GMainContext *context = g_main_context_default();
@@ -382,12 +490,22 @@  static int os_host_main_loop_wait(uint32_t timeout)
      * improve socket latency.
      */
 
+    /* This back-and-forth between GPollFDs and select(2) is temporary.  We'll
+     * drop it in a couple of patches, I promise :).
+     */
+    gpollfds_from_select();
+    FD_ZERO(&rfds);
+    FD_ZERO(&wfds);
+    FD_ZERO(&xfds);
+    nfds = pollfds_fill(gpollfds, &rfds, &wfds, &xfds);
     if (nfds >= 0) {
         select_ret = select(nfds + 1, &rfds, &wfds, &xfds, &tv0);
         if (select_ret != 0) {
             timeout = 0;
+            pollfds_poll(gpollfds, nfds, &rfds, &wfds, &xfds);
         }
     }
+    gpollfds_to_select(select_ret || g_poll_ret);
 
     return select_ret || g_poll_ret;
 }
@@ -403,6 +521,7 @@  int main_loop_wait(int nonblocking)
     }
 
     /* poll any events */
+    g_array_set_size(gpollfds, 0); /* reset for new iteration */
     /* XXX: separate device handlers from system ones */
     nfds = -1;
     FD_ZERO(&rfds);