diff mbox series

[RFC,v3,01/27] char-io: fix possible race on IOWatchPoll

Message ID 20171106094643.14881-2-peterx@redhat.com
State New
Headers show
Series [RFC,v3,01/27] char-io: fix possible race on IOWatchPoll | expand

Commit Message

Peter Xu Nov. 6, 2017, 9:46 a.m. UTC
This is not a problem if we are only having one single loop thread like
before.  However, after per-monitor thread is introduced, this is not
true any more, and the race can happen.

The race can be triggered with "make check -j8" sometimes:

  qemu-system-x86_64: /root/git/qemu/chardev/char-io.c:91:
  io_watch_poll_finalize: Assertion `iwp->src == NULL' failed.

This patch keeps the reference for the watch object when creating in
io_add_watch_poll(), so that the object will never be released in the
context main loop, especially when the context loop is running in
another standalone thread.  Meanwhile, when we want to remove the watch
object, we always first detach the watch object from its owner context,
then we continue with the cleanup.

Without this patch, calling io_remove_watch_poll() in main loop thread
is not thread-safe, since the other per-monitor thread may be modifying
the watch object at the same time.

Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 chardev/char-io.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

Comments

Fam Zheng Nov. 7, 2017, 6:43 a.m. UTC | #1
On Mon, 11/06 17:46, Peter Xu wrote:
> This is not a problem if we are only having one single loop thread like
> before.  However, after per-monitor thread is introduced, this is not
> true any more, and the race can happen.
> 
> The race can be triggered with "make check -j8" sometimes:
> 
>   qemu-system-x86_64: /root/git/qemu/chardev/char-io.c:91:
>   io_watch_poll_finalize: Assertion `iwp->src == NULL' failed.
> 
> This patch keeps the reference for the watch object when creating in
> io_add_watch_poll(), so that the object will never be released in the
> context main loop, especially when the context loop is running in
> another standalone thread.  Meanwhile, when we want to remove the watch
> object, we always first detach the watch object from its owner context,
> then we continue with the cleanup.
> 
> Without this patch, calling io_remove_watch_poll() in main loop thread
> is not thread-safe, since the other per-monitor thread may be modifying
> the watch object at the same time.

Looks sane,

Reviewed-by: Fam Zheng <famz@redhat.com>
Stefan Hajnoczi Nov. 13, 2017, 4:52 p.m. UTC | #2
On Mon, Nov 06, 2017 at 05:46:17PM +0800, Peter Xu wrote:
> This is not a problem if we are only having one single loop thread like
> before.  However, after per-monitor thread is introduced, this is not
> true any more, and the race can happen.
> 
> The race can be triggered with "make check -j8" sometimes:

Please mention a specific test case that fails.

> 
>   qemu-system-x86_64: /root/git/qemu/chardev/char-io.c:91:
>   io_watch_poll_finalize: Assertion `iwp->src == NULL' failed.
> 
> This patch keeps the reference for the watch object when creating in
> io_add_watch_poll(), so that the object will never be released in the
> context main loop, especially when the context loop is running in
> another standalone thread.  Meanwhile, when we want to remove the watch
> object, we always first detach the watch object from its owner context,
> then we continue with the cleanup.
> 
> Without this patch, calling io_remove_watch_poll() in main loop thread
> is not thread-safe, since the other per-monitor thread may be modifying
> the watch object at the same time.
> 
> Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  chardev/char-io.c | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/chardev/char-io.c b/chardev/char-io.c
> index f81052481a..50b5bac704 100644
> --- a/chardev/char-io.c
> +++ b/chardev/char-io.c
> @@ -122,7 +122,6 @@ GSource *io_add_watch_poll(Chardev *chr,
>      g_free(name);
>  
>      g_source_attach(&iwp->parent, context);
> -    g_source_unref(&iwp->parent);
>      return (GSource *)iwp;
>  }
>  
> @@ -131,12 +130,25 @@ static void io_remove_watch_poll(GSource *source)
>      IOWatchPoll *iwp;
>  
>      iwp = io_watch_poll_from_source(source);
> +
> +    /*
> +     * Here the order of destruction really matters.  We need to first
> +     * detach the IOWatchPoll object from the context (which may still
> +     * be running in another loop thread), only after that could we
> +     * continue to operate on iwp->src, or there may be race condition
> +     * between current thread and the context loop thread.
> +     *
> +     * Let's blame the glib bug mentioned in commit 2b316774f6
> +     * ("qemu-char: do not operate on sources from finalize
> +     * callbacks") for this extra complexity.

I don't understand how this bug is to blame.  Isn't the problem here a
race condition between two QEMU threads?

Why are two threads accessing the watch at the same time?

> +     */
> +    g_source_destroy(&iwp->parent);
>      if (iwp->src) {
>          g_source_destroy(iwp->src);
>          g_source_unref(iwp->src);
>          iwp->src = NULL;
>      }
> -    g_source_destroy(&iwp->parent);
> +    g_source_unref(&iwp->parent);
>  }
>  
>  void remove_fd_in_watch(Chardev *chr)
> -- 
> 2.13.5
>
Peter Xu Nov. 14, 2017, 6:09 a.m. UTC | #3
On Mon, Nov 13, 2017 at 04:52:11PM +0000, Stefan Hajnoczi wrote:
> On Mon, Nov 06, 2017 at 05:46:17PM +0800, Peter Xu wrote:
> > This is not a problem if we are only having one single loop thread like
> > before.  However, after per-monitor thread is introduced, this is not
> > true any more, and the race can happen.
> > 
> > The race can be triggered with "make check -j8" sometimes:
> 
> Please mention a specific test case that fails.

It was any of the check-qtest-$(TARGET)s that failed.  I'll mention
that in next post.

> 
> > 
> >   qemu-system-x86_64: /root/git/qemu/chardev/char-io.c:91:
> >   io_watch_poll_finalize: Assertion `iwp->src == NULL' failed.
> > 
> > This patch keeps the reference for the watch object when creating in
> > io_add_watch_poll(), so that the object will never be released in the
> > context main loop, especially when the context loop is running in
> > another standalone thread.  Meanwhile, when we want to remove the watch
> > object, we always first detach the watch object from its owner context,
> > then we continue with the cleanup.
> > 
> > Without this patch, calling io_remove_watch_poll() in main loop thread
> > is not thread-safe, since the other per-monitor thread may be modifying
> > the watch object at the same time.
> > 
> > Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  chardev/char-io.c | 16 ++++++++++++++--
> >  1 file changed, 14 insertions(+), 2 deletions(-)
> > 
> > diff --git a/chardev/char-io.c b/chardev/char-io.c
> > index f81052481a..50b5bac704 100644
> > --- a/chardev/char-io.c
> > +++ b/chardev/char-io.c
> > @@ -122,7 +122,6 @@ GSource *io_add_watch_poll(Chardev *chr,
> >      g_free(name);
> >  
> >      g_source_attach(&iwp->parent, context);
> > -    g_source_unref(&iwp->parent);
> >      return (GSource *)iwp;
> >  }
> >  
> > @@ -131,12 +130,25 @@ static void io_remove_watch_poll(GSource *source)
> >      IOWatchPoll *iwp;
> >  
> >      iwp = io_watch_poll_from_source(source);
> > +
> > +    /*
> > +     * Here the order of destruction really matters.  We need to first
> > +     * detach the IOWatchPoll object from the context (which may still
> > +     * be running in another loop thread), only after that could we
> > +     * continue to operate on iwp->src, or there may be race condition
> > +     * between current thread and the context loop thread.
> > +     *
> > +     * Let's blame the glib bug mentioned in commit 2b316774f6
> > +     * ("qemu-char: do not operate on sources from finalize
> > +     * callbacks") for this extra complexity.
> 
> I don't understand how this bug is to blame.  Isn't the problem here a
> race condition between two QEMU threads?

Yes, it is.

The problem is, we won't have the race condition if glib does not have
that bug mentioned.  Then the thread running GMainContext will have
full control of iwp->src destruction, and destruction of it would be
fairly straightforward (unref iwp->src in IOWatchPoll destructor).
Now IIUC we are doing this in a hacky way, say, we destroy iwp->src
explicitly from main thread before quitting (see [1] below, the whole
if clause).

> 
> Why are two threads accessing the watch at the same time?

Here is how I understand:

Firstly we need to tackle with that bug, by an explicit destruction of
iwp->src below; meanwhile when we are destroying it, the GMainContext
can still be running somewhere (it's not happening in current series
since I stopped iothread earlier than this point, however it can still
happen if in the future we don't do that), then we possibly want this
patch.

Again, without this patch, current series should work; however I do
hope this patch can be in, in case someday we want to provide complete
thread safety for Chardevs (now it is not really thread-safe).

> 
> > +     */
> > +    g_source_destroy(&iwp->parent);
> >      if (iwp->src) {
> >          g_source_destroy(iwp->src);
> >          g_source_unref(iwp->src);
> >          iwp->src = NULL;
> >      }

[1]

> > -    g_source_destroy(&iwp->parent);
> > +    g_source_unref(&iwp->parent);
> >  }
> >  
> >  void remove_fd_in_watch(Chardev *chr)
> > -- 
> > 2.13.5
> > 

Thanks,
Stefan Hajnoczi Nov. 14, 2017, 10:32 a.m. UTC | #4
On Tue, Nov 14, 2017 at 02:09:39PM +0800, Peter Xu wrote:
> On Mon, Nov 13, 2017 at 04:52:11PM +0000, Stefan Hajnoczi wrote:
> > On Mon, Nov 06, 2017 at 05:46:17PM +0800, Peter Xu wrote:
> > > This is not a problem if we are only having one single loop thread like
> > > before.  However, after per-monitor thread is introduced, this is not
> > > true any more, and the race can happen.
> > > 
> > > The race can be triggered with "make check -j8" sometimes:
> > 
> > Please mention a specific test case that fails.
> 
> It was any of the check-qtest-$(TARGET)s that failed.  I'll mention
> that in next post.
> 
> > 
> > > 
> > >   qemu-system-x86_64: /root/git/qemu/chardev/char-io.c:91:
> > >   io_watch_poll_finalize: Assertion `iwp->src == NULL' failed.
> > > 
> > > This patch keeps the reference for the watch object when creating in
> > > io_add_watch_poll(), so that the object will never be released in the
> > > context main loop, especially when the context loop is running in
> > > another standalone thread.  Meanwhile, when we want to remove the watch
> > > object, we always first detach the watch object from its owner context,
> > > then we continue with the cleanup.
> > > 
> > > Without this patch, calling io_remove_watch_poll() in main loop thread
> > > is not thread-safe, since the other per-monitor thread may be modifying
> > > the watch object at the same time.
> > > 
> > > Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  chardev/char-io.c | 16 ++++++++++++++--
> > >  1 file changed, 14 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/chardev/char-io.c b/chardev/char-io.c
> > > index f81052481a..50b5bac704 100644
> > > --- a/chardev/char-io.c
> > > +++ b/chardev/char-io.c
> > > @@ -122,7 +122,6 @@ GSource *io_add_watch_poll(Chardev *chr,
> > >      g_free(name);
> > >  
> > >      g_source_attach(&iwp->parent, context);
> > > -    g_source_unref(&iwp->parent);
> > >      return (GSource *)iwp;
> > >  }
> > >  
> > > @@ -131,12 +130,25 @@ static void io_remove_watch_poll(GSource *source)
> > >      IOWatchPoll *iwp;
> > >  
> > >      iwp = io_watch_poll_from_source(source);
> > > +
> > > +    /*
> > > +     * Here the order of destruction really matters.  We need to first
> > > +     * detach the IOWatchPoll object from the context (which may still
> > > +     * be running in another loop thread), only after that could we
> > > +     * continue to operate on iwp->src, or there may be race condition
> > > +     * between current thread and the context loop thread.
> > > +     *
> > > +     * Let's blame the glib bug mentioned in commit 2b316774f6
> > > +     * ("qemu-char: do not operate on sources from finalize
> > > +     * callbacks") for this extra complexity.
> > 
> > I don't understand how this bug is to blame.  Isn't the problem here a
> > race condition between two QEMU threads?
> 
> Yes, it is.
> 
> The problem is, we won't have the race condition if glib does not have
> that bug mentioned.  Then the thread running GMainContext will have
> full control of iwp->src destruction, and destruction of it would be
> fairly straightforward (unref iwp->src in IOWatchPoll destructor).
> Now IIUC we are doing this in a hacky way, say, we destroy iwp->src
> explicitly from main thread before quitting (see [1] below, the whole
> if clause).
> 
> > 
> > Why are two threads accessing the watch at the same time?
> 
> Here is how I understand:
> 
> Firstly we need to tackle with that bug, by an explicit destruction of
> iwp->src below; meanwhile when we are destroying it, the GMainContext
> can still be running somewhere (it's not happening in current series
> since I stopped iothread earlier than this point, however it can still
> happen if in the future we don't do that), then we possibly want this
> patch.
> 
> Again, without this patch, current series should work; however I do
> hope this patch can be in, in case someday we want to provide complete
> thread safety for Chardevs (now it is not really thread-safe).

You said qtests fail with "Assertion `iwp->src == NULL' failed" but then
you said "without this patch, current series should work".  How do you
reproduce the failure if it doesn't occur?

It looks like remove_fd_in_watch() -> io_remove_watch_poll() callers
fall into two categories: called from within the event loop and called
when a chardev is destroyed.  Do the thread-safety issues occur when the
chardev is destroyed by the QEMU main loop thread?  Or did I miss cases
where remove_fd_in_watch() is called from other threads?

> 
> > 
> > > +     */
> > > +    g_source_destroy(&iwp->parent);
> > >      if (iwp->src) {
> > >          g_source_destroy(iwp->src);
> > >          g_source_unref(iwp->src);
> > >          iwp->src = NULL;
> > >      }
> 
> [1]
> 
> > > -    g_source_destroy(&iwp->parent);
> > > +    g_source_unref(&iwp->parent);
> > >  }
> > >  
> > >  void remove_fd_in_watch(Chardev *chr)
> > > -- 
> > > 2.13.5
> > > 
> 
> Thanks,
> 
> -- 
> Peter Xu
Peter Xu Nov. 14, 2017, 11:31 a.m. UTC | #5
On Tue, Nov 14, 2017 at 10:32:19AM +0000, Stefan Hajnoczi wrote:
> On Tue, Nov 14, 2017 at 02:09:39PM +0800, Peter Xu wrote:
> > On Mon, Nov 13, 2017 at 04:52:11PM +0000, Stefan Hajnoczi wrote:
> > > On Mon, Nov 06, 2017 at 05:46:17PM +0800, Peter Xu wrote:
> > > > This is not a problem if we are only having one single loop thread like
> > > > before.  However, after per-monitor thread is introduced, this is not
> > > > true any more, and the race can happen.
> > > > 
> > > > The race can be triggered with "make check -j8" sometimes:
> > > 
> > > Please mention a specific test case that fails.
> > 
> > It was any of the check-qtest-$(TARGET)s that failed.  I'll mention
> > that in next post.
> > 
> > > 
> > > > 
> > > >   qemu-system-x86_64: /root/git/qemu/chardev/char-io.c:91:
> > > >   io_watch_poll_finalize: Assertion `iwp->src == NULL' failed.
> > > > 
> > > > This patch keeps the reference for the watch object when creating in
> > > > io_add_watch_poll(), so that the object will never be released in the
> > > > context main loop, especially when the context loop is running in
> > > > another standalone thread.  Meanwhile, when we want to remove the watch
> > > > object, we always first detach the watch object from its owner context,
> > > > then we continue with the cleanup.
> > > > 
> > > > Without this patch, calling io_remove_watch_poll() in main loop thread
> > > > is not thread-safe, since the other per-monitor thread may be modifying
> > > > the watch object at the same time.
> > > > 
> > > > Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > >  chardev/char-io.c | 16 ++++++++++++++--
> > > >  1 file changed, 14 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/chardev/char-io.c b/chardev/char-io.c
> > > > index f81052481a..50b5bac704 100644
> > > > --- a/chardev/char-io.c
> > > > +++ b/chardev/char-io.c
> > > > @@ -122,7 +122,6 @@ GSource *io_add_watch_poll(Chardev *chr,
> > > >      g_free(name);
> > > >  
> > > >      g_source_attach(&iwp->parent, context);
> > > > -    g_source_unref(&iwp->parent);
> > > >      return (GSource *)iwp;
> > > >  }
> > > >  
> > > > @@ -131,12 +130,25 @@ static void io_remove_watch_poll(GSource *source)
> > > >      IOWatchPoll *iwp;
> > > >  
> > > >      iwp = io_watch_poll_from_source(source);
> > > > +
> > > > +    /*
> > > > +     * Here the order of destruction really matters.  We need to first
> > > > +     * detach the IOWatchPoll object from the context (which may still
> > > > +     * be running in another loop thread), only after that could we
> > > > +     * continue to operate on iwp->src, or there may be race condition
> > > > +     * between current thread and the context loop thread.
> > > > +     *
> > > > +     * Let's blame the glib bug mentioned in commit 2b316774f6
> > > > +     * ("qemu-char: do not operate on sources from finalize
> > > > +     * callbacks") for this extra complexity.
> > > 
> > > I don't understand how this bug is to blame.  Isn't the problem here a
> > > race condition between two QEMU threads?
> > 
> > Yes, it is.
> > 
> > The problem is, we won't have the race condition if glib does not have
> > that bug mentioned.  Then the thread running GMainContext will have
> > full control of iwp->src destruction, and destruction of it would be
> > fairly straightforward (unref iwp->src in IOWatchPoll destructor).
> > Now IIUC we are doing this in a hacky way, say, we destroy iwp->src
> > explicitly from main thread before quitting (see [1] below, the whole
> > if clause).
> > 
> > > 
> > > Why are two threads accessing the watch at the same time?
> > 
> > Here is how I understand:
> > 
> > Firstly we need to tackle with that bug, by an explicit destruction of
> > iwp->src below; meanwhile when we are destroying it, the GMainContext
> > can still be running somewhere (it's not happening in current series
> > since I stopped iothread earlier than this point, however it can still
> > happen if in the future we don't do that), then we possibly want this
> > patch.
> > 
> > Again, without this patch, current series should work; however I do
> > hope this patch can be in, in case someday we want to provide complete
> > thread safety for Chardevs (now it is not really thread-safe).
> 
> You said qtests fail with "Assertion `iwp->src == NULL' failed" but then
> you said "without this patch, current series should work".  How do you
> reproduce the failure if it doesn't occur?

Actually it occurs in some old versions, but not in current version.
Current version destroys the iothread earlier (as Dan suggested), so
it can avoid the issue.  Sorry for not being clear.

> 
> It looks like remove_fd_in_watch() -> io_remove_watch_poll() callers
> fall into two categories: called from within the event loop and called
> when a chardev is destroyed.  Do the thread-safety issues occur when the
> chardev is destroyed by the QEMU main loop thread?  Or did I miss cases
> where remove_fd_in_watch() is called from other threads?

I think this can also be called in monitor iothread?  Even if so, it's
pretty safe since if the monitor iothread is calling
remove_fd_in_watch() then it must not be using it after all.  The race
can happen when we are destroying the IOWatchPoll while the other
event loop thread (which may not be the main thread) is still running,
just like what I did in my old series.

> 
> > 
> > > 
> > > > +     */
> > > > +    g_source_destroy(&iwp->parent);
> > > >      if (iwp->src) {
> > > >          g_source_destroy(iwp->src);
> > > >          g_source_unref(iwp->src);
> > > >          iwp->src = NULL;
> > > >      }
> > 
> > [1]
> > 
> > > > -    g_source_destroy(&iwp->parent);
> > > > +    g_source_unref(&iwp->parent);
> > > >  }
> > > >  
> > > >  void remove_fd_in_watch(Chardev *chr)
> > > > -- 
> > > > 2.13.5
> > > > 
> > 
> > Thanks,
> > 
> > -- 
> > Peter Xu
Stefan Hajnoczi Nov. 15, 2017, 9:37 a.m. UTC | #6
On Tue, Nov 14, 2017 at 07:31:10PM +0800, Peter Xu wrote:
> On Tue, Nov 14, 2017 at 10:32:19AM +0000, Stefan Hajnoczi wrote:
> > On Tue, Nov 14, 2017 at 02:09:39PM +0800, Peter Xu wrote:
> > > On Mon, Nov 13, 2017 at 04:52:11PM +0000, Stefan Hajnoczi wrote:
> > > > On Mon, Nov 06, 2017 at 05:46:17PM +0800, Peter Xu wrote:
> > > > > This is not a problem if we are only having one single loop thread like
> > > > > before.  However, after per-monitor thread is introduced, this is not
> > > > > true any more, and the race can happen.
> > > > > 
> > > > > The race can be triggered with "make check -j8" sometimes:
> > > > 
> > > > Please mention a specific test case that fails.
> > > 
> > > It was any of the check-qtest-$(TARGET)s that failed.  I'll mention
> > > that in next post.
> > > 
> > > > 
> > > > > 
> > > > >   qemu-system-x86_64: /root/git/qemu/chardev/char-io.c:91:
> > > > >   io_watch_poll_finalize: Assertion `iwp->src == NULL' failed.
> > > > > 
> > > > > This patch keeps the reference for the watch object when creating in
> > > > > io_add_watch_poll(), so that the object will never be released in the
> > > > > context main loop, especially when the context loop is running in
> > > > > another standalone thread.  Meanwhile, when we want to remove the watch
> > > > > object, we always first detach the watch object from its owner context,
> > > > > then we continue with the cleanup.
> > > > > 
> > > > > Without this patch, calling io_remove_watch_poll() in main loop thread
> > > > > is not thread-safe, since the other per-monitor thread may be modifying
> > > > > the watch object at the same time.
> > > > > 
> > > > > Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
> > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > > ---
> > > > >  chardev/char-io.c | 16 ++++++++++++++--
> > > > >  1 file changed, 14 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/chardev/char-io.c b/chardev/char-io.c
> > > > > index f81052481a..50b5bac704 100644
> > > > > --- a/chardev/char-io.c
> > > > > +++ b/chardev/char-io.c
> > > > > @@ -122,7 +122,6 @@ GSource *io_add_watch_poll(Chardev *chr,
> > > > >      g_free(name);
> > > > >  
> > > > >      g_source_attach(&iwp->parent, context);
> > > > > -    g_source_unref(&iwp->parent);
> > > > >      return (GSource *)iwp;
> > > > >  }
> > > > >  
> > > > > @@ -131,12 +130,25 @@ static void io_remove_watch_poll(GSource *source)
> > > > >      IOWatchPoll *iwp;
> > > > >  
> > > > >      iwp = io_watch_poll_from_source(source);
> > > > > +
> > > > > +    /*
> > > > > +     * Here the order of destruction really matters.  We need to first
> > > > > +     * detach the IOWatchPoll object from the context (which may still
> > > > > +     * be running in another loop thread), only after that could we
> > > > > +     * continue to operate on iwp->src, or there may be race condition
> > > > > +     * between current thread and the context loop thread.
> > > > > +     *
> > > > > +     * Let's blame the glib bug mentioned in commit 2b316774f6
> > > > > +     * ("qemu-char: do not operate on sources from finalize
> > > > > +     * callbacks") for this extra complexity.
> > > > 
> > > > I don't understand how this bug is to blame.  Isn't the problem here a
> > > > race condition between two QEMU threads?
> > > 
> > > Yes, it is.
> > > 
> > > The problem is, we won't have the race condition if glib does not have
> > > that bug mentioned.  Then the thread running GMainContext will have
> > > full control of iwp->src destruction, and destruction of it would be
> > > fairly straightforward (unref iwp->src in IOWatchPoll destructor).
> > > Now IIUC we are doing this in a hacky way, say, we destroy iwp->src
> > > explicitly from main thread before quitting (see [1] below, the whole
> > > if clause).
> > > 
> > > > 
> > > > Why are two threads accessing the watch at the same time?
> > > 
> > > Here is how I understand:
> > > 
> > > Firstly we need to tackle with that bug, by an explicit destruction of
> > > iwp->src below; meanwhile when we are destroying it, the GMainContext
> > > can still be running somewhere (it's not happening in current series
> > > since I stopped iothread earlier than this point, however it can still
> > > happen if in the future we don't do that), then we possibly want this
> > > patch.
> > > 
> > > Again, without this patch, current series should work; however I do
> > > hope this patch can be in, in case someday we want to provide complete
> > > thread safety for Chardevs (now it is not really thread-safe).
> > 
> > You said qtests fail with "Assertion `iwp->src == NULL' failed" but then
> > you said "without this patch, current series should work".  How do you
> > reproduce the failure if it doesn't occur?
> 
> Actually it occurs in some old versions, but not in current version.
> Current version destroys the iothread earlier (as Dan suggested), so
> it can avoid the issue.  Sorry for not being clear.
> 
> > 
> > It looks like remove_fd_in_watch() -> io_remove_watch_poll() callers
> > fall into two categories: called from within the event loop and called
> > when a chardev is destroyed.  Do the thread-safety issues occur when the
> > chardev is destroyed by the QEMU main loop thread?  Or did I miss cases
> > where remove_fd_in_watch() is called from other threads?
> 
> I think this can also be called in monitor iothread?

When I say "event loop", I mean any thread that is running an event loop
including IOThreads and the main loop thread.

What do you mean by "monitor iothread"?

> Even if so, it's
> pretty safe since if the monitor iothread is calling
> remove_fd_in_watch() then it must not be using it after all.  The race
> can happen when we are destroying the IOWatchPoll while the other
> event loop thread (which may not be the main thread) is still running,
> just like what I did in my old series.

The scenario this patch is trying to address doesn't make a lot of sense
since there will be further thread-safety problems if two threads are
modifying a Chardev at the same time.  A lock will probably be required
to protect the state and this patch might not be necessary then.

This patch seems very speculative and it's unclear what concrete
scenario it addresses.  I suggest dropping the patch from this series so
it is not a distraction from what you're actually trying to achieve.

Stefan
Peter Xu Nov. 15, 2017, 9:48 a.m. UTC | #7
On Wed, Nov 15, 2017 at 09:37:40AM +0000, Stefan Hajnoczi wrote:
> On Tue, Nov 14, 2017 at 07:31:10PM +0800, Peter Xu wrote:
> > On Tue, Nov 14, 2017 at 10:32:19AM +0000, Stefan Hajnoczi wrote:
> > > On Tue, Nov 14, 2017 at 02:09:39PM +0800, Peter Xu wrote:
> > > > On Mon, Nov 13, 2017 at 04:52:11PM +0000, Stefan Hajnoczi wrote:
> > > > > On Mon, Nov 06, 2017 at 05:46:17PM +0800, Peter Xu wrote:
> > > > > > This is not a problem if we are only having one single loop thread like
> > > > > > before.  However, after per-monitor thread is introduced, this is not
> > > > > > true any more, and the race can happen.
> > > > > > 
> > > > > > The race can be triggered with "make check -j8" sometimes:
> > > > > 
> > > > > Please mention a specific test case that fails.
> > > > 
> > > > It was any of the check-qtest-$(TARGET)s that failed.  I'll mention
> > > > that in next post.
> > > > 
> > > > > 
> > > > > > 
> > > > > >   qemu-system-x86_64: /root/git/qemu/chardev/char-io.c:91:
> > > > > >   io_watch_poll_finalize: Assertion `iwp->src == NULL' failed.
> > > > > > 
> > > > > > This patch keeps the reference for the watch object when creating in
> > > > > > io_add_watch_poll(), so that the object will never be released in the
> > > > > > context main loop, especially when the context loop is running in
> > > > > > another standalone thread.  Meanwhile, when we want to remove the watch
> > > > > > object, we always first detach the watch object from its owner context,
> > > > > > then we continue with the cleanup.
> > > > > > 
> > > > > > Without this patch, calling io_remove_watch_poll() in main loop thread
> > > > > > is not thread-safe, since the other per-monitor thread may be modifying
> > > > > > the watch object at the same time.
> > > > > > 
> > > > > > Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
> > > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > > > ---
> > > > > >  chardev/char-io.c | 16 ++++++++++++++--
> > > > > >  1 file changed, 14 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/chardev/char-io.c b/chardev/char-io.c
> > > > > > index f81052481a..50b5bac704 100644
> > > > > > --- a/chardev/char-io.c
> > > > > > +++ b/chardev/char-io.c
> > > > > > @@ -122,7 +122,6 @@ GSource *io_add_watch_poll(Chardev *chr,
> > > > > >      g_free(name);
> > > > > >  
> > > > > >      g_source_attach(&iwp->parent, context);
> > > > > > -    g_source_unref(&iwp->parent);
> > > > > >      return (GSource *)iwp;
> > > > > >  }
> > > > > >  
> > > > > > @@ -131,12 +130,25 @@ static void io_remove_watch_poll(GSource *source)
> > > > > >      IOWatchPoll *iwp;
> > > > > >  
> > > > > >      iwp = io_watch_poll_from_source(source);
> > > > > > +
> > > > > > +    /*
> > > > > > +     * Here the order of destruction really matters.  We need to first
> > > > > > +     * detach the IOWatchPoll object from the context (which may still
> > > > > > +     * be running in another loop thread), only after that could we
> > > > > > +     * continue to operate on iwp->src, or there may be race condition
> > > > > > +     * between current thread and the context loop thread.
> > > > > > +     *
> > > > > > +     * Let's blame the glib bug mentioned in commit 2b316774f6
> > > > > > +     * ("qemu-char: do not operate on sources from finalize
> > > > > > +     * callbacks") for this extra complexity.
> > > > > 
> > > > > I don't understand how this bug is to blame.  Isn't the problem here a
> > > > > race condition between two QEMU threads?
> > > > 
> > > > Yes, it is.
> > > > 
> > > > The problem is, we won't have the race condition if glib does not have
> > > > that bug mentioned.  Then the thread running GMainContext will have
> > > > full control of iwp->src destruction, and destruction of it would be
> > > > fairly straightforward (unref iwp->src in IOWatchPoll destructor).
> > > > Now IIUC we are doing this in a hacky way, say, we destroy iwp->src
> > > > explicitly from main thread before quitting (see [1] below, the whole
> > > > if clause).
> > > > 
> > > > > 
> > > > > Why are two threads accessing the watch at the same time?
> > > > 
> > > > Here is how I understand:
> > > > 
> > > > Firstly we need to tackle with that bug, by an explicit destruction of
> > > > iwp->src below; meanwhile when we are destroying it, the GMainContext
> > > > can still be running somewhere (it's not happening in current series
> > > > since I stopped iothread earlier than this point, however it can still
> > > > happen if in the future we don't do that), then we possibly want this
> > > > patch.
> > > > 
> > > > Again, without this patch, current series should work; however I do
> > > > hope this patch can be in, in case someday we want to provide complete
> > > > thread safety for Chardevs (now it is not really thread-safe).
> > > 
> > > You said qtests fail with "Assertion `iwp->src == NULL' failed" but then
> > > you said "without this patch, current series should work".  How do you
> > > reproduce the failure if it doesn't occur?
> > 
> > Actually it occurs in some old versions, but not in current version.
> > Current version destroys the iothread earlier (as Dan suggested), so
> > it can avoid the issue.  Sorry for not being clear.
> > 
> > > 
> > > It looks like remove_fd_in_watch() -> io_remove_watch_poll() callers
> > > fall into two categories: called from within the event loop and called
> > > when a chardev is destroyed.  Do the thread-safety issues occur when the
> > > chardev is destroyed by the QEMU main loop thread?  Or did I miss cases
> > > where remove_fd_in_watch() is called from other threads?
> > 
> > I think this can also be called in monitor iothread?
> 
> When I say "event loop", I mean any thread that is running an event loop
> including IOThreads and the main loop thread.
> 
> What do you mean by "monitor iothread"?

Ah, I see.  Yes, then I think it's true - the failure only happens
when remove_fd_in_watch() is called during destruction in main loop
thread.

> 
> > Even if so, it's
> > pretty safe since if the monitor iothread is calling
> > remove_fd_in_watch() then it must not be using it after all.  The race
> > can happen when we are destroying the IOWatchPoll while the other
> > event loop thread (which may not be the main thread) is still running,
> > just like what I did in my old series.
> 
> The scenario this patch is trying to address doesn't make a lot of sense
> since there will be further thread-safety problems if two threads are
> modifying a Chardev at the same time.  A lock will probably be required
> to protect the state and this patch might not be necessary then.
> 
> This patch seems very speculative and it's unclear what concrete
> scenario it addresses.  I suggest dropping the patch from this series so
> it is not a distraction from what you're actually trying to achieve.

Ok, then let me drop it.  Thanks,
diff mbox series

Patch

diff --git a/chardev/char-io.c b/chardev/char-io.c
index f81052481a..50b5bac704 100644
--- a/chardev/char-io.c
+++ b/chardev/char-io.c
@@ -122,7 +122,6 @@  GSource *io_add_watch_poll(Chardev *chr,
     g_free(name);
 
     g_source_attach(&iwp->parent, context);
-    g_source_unref(&iwp->parent);
     return (GSource *)iwp;
 }
 
@@ -131,12 +130,25 @@  static void io_remove_watch_poll(GSource *source)
     IOWatchPoll *iwp;
 
     iwp = io_watch_poll_from_source(source);
+
+    /*
+     * Here the order of destruction really matters.  We need to first
+     * detach the IOWatchPoll object from the context (which may still
+     * be running in another loop thread), only after that could we
+     * continue to operate on iwp->src, or there may be race condition
+     * between current thread and the context loop thread.
+     *
+     * Let's blame the glib bug mentioned in commit 2b316774f6
+     * ("qemu-char: do not operate on sources from finalize
+     * callbacks") for this extra complexity.
+     */
+    g_source_destroy(&iwp->parent);
     if (iwp->src) {
         g_source_destroy(iwp->src);
         g_source_unref(iwp->src);
         iwp->src = NULL;
     }
-    g_source_destroy(&iwp->parent);
+    g_source_unref(&iwp->parent);
 }
 
 void remove_fd_in_watch(Chardev *chr)