Patchwork [v2] posix-aio-compat: fix latency issues

Submitter Avi Kivity
Date Aug. 14, 2011, 4:04 a.m.
Message ID <1313294689-21572-1-git-send-email-avi@redhat.com>
Permalink /patch/109954/
State New

Comments

Avi Kivity - Aug. 14, 2011, 4:04 a.m.
In certain circumstances, posix-aio-compat can incur a lot of latency:
 - threads are created by vcpu threads, so if vcpu affinity is set,
   aio threads inherit vcpu affinity.  This can cause many aio threads
   to compete for one cpu.
 - we can create up to max_threads (64) aio threads in one go; since a
   pthread_create can take around 30μs, we have up to 2ms of cpu time
   under a global lock.

Fix by:
 - moving thread creation to the main thread, so we inherit the main
   thread's affinity instead of the vcpu thread's affinity.
 - if a thread is currently being created and we need to create yet
   another thread, let the thread being born create the new thread,
   reducing the amount of time we spend in the main thread.
 - drop the local lock while creating a thread (we may still hold the
   global mutex, though)

Note this doesn't eliminate latency completely; scheduler artifacts or
lack of host cpu resources can still cause it.  We may want pre-allocated
threads when this cannot be tolerated.

Thanks to Uli Obergfell of Red Hat for his excellent analysis and suggestions.

Signed-off-by: Avi Kivity <avi@redhat.com>
---
v2: simplify do_spawn_thread() locking

 posix-aio-compat.c |   44 ++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 42 insertions(+), 2 deletions(-)
Kevin Wolf - Aug. 22, 2011, 5:22 p.m.
Am 14.08.2011 06:04, schrieb Avi Kivity:
> In certain circumstances, posix-aio-compat can incur a lot of latency:
> [...]

Thanks, applied to the block branch.

Kevin
Jan Kiszka - Aug. 22, 2011, 5:29 p.m.
On 2011-08-14 06:04, Avi Kivity wrote:
> In certain circumstances, posix-aio-compat can incur a lot of latency:
> [...]

At this chance: What is the state of getting rid of the remaining delta
between upstream's version and qemu-kvm?

Jan
Stefan Hajnoczi - Aug. 23, 2011, 11:01 a.m.
On Mon, Aug 22, 2011 at 6:29 PM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> On 2011-08-14 06:04, Avi Kivity wrote:
>> In certain circumstances, posix-aio-compat can incur a lot of latency:
>> [...]
>
> At this chance: What is the state of getting rid of the remaining delta
> between upstream's version and qemu-kvm?

That would be nice.  qemu-kvm.git uses a signalfd to handle I/O
completion whereas qemu.git uses a signal, writes to a pipe from the
signal handler, and uses qemu_notify_event() to break the vcpu.  Once
the force iothread patch is merged we should be able to move to
qemu-kvm.git's signalfd approach.

Stefan
Anthony Liguori - Aug. 23, 2011, 12:40 p.m.
On 08/23/2011 06:01 AM, Stefan Hajnoczi wrote:
> On Mon, Aug 22, 2011 at 6:29 PM, Jan Kiszka<jan.kiszka@siemens.com>  wrote:
>> On 2011-08-14 06:04, Avi Kivity wrote:
>>> In certain circumstances, posix-aio-compat can incur a lot of latency:
>>> [...]
>>
>> At this chance: What is the state of getting rid of the remaining delta
>> between upstream's version and qemu-kvm?
>
> That would be nice.  qemu-kvm.git uses a signalfd to handle I/O
> completion whereas qemu.git uses a signal, writes to a pipe from the
> signal handler, and uses qemu_notify_event() to break the vcpu.  Once
> the force iothread patch is merged we should be able to move to
> qemu-kvm.git's signalfd approach.

No need to use a signal at all actually.  The use of a signal is 
historic and was required to work around the TCG race that I referred to 
in another thread.

You should be able to just use an eventfd or pipe.

Better yet, we should look at using GThreadPool to replace posix-aio-compat.

Regards,

Anthony Liguori

>
> Stefan
>
Jan Kiszka - Aug. 23, 2011, 1:02 p.m.
On 2011-08-23 14:40, Anthony Liguori wrote:
> On 08/23/2011 06:01 AM, Stefan Hajnoczi wrote:
>> On Mon, Aug 22, 2011 at 6:29 PM, Jan Kiszka<jan.kiszka@siemens.com>  wrote:
>>> On 2011-08-14 06:04, Avi Kivity wrote:
>>>> In certain circumstances, posix-aio-compat can incur a lot of latency:
>>>> [...]
>>>
>>> At this chance: What is the state of getting rid of the remaining delta
>>> between upstream's version and qemu-kvm?
>>
>> That would be nice.  qemu-kvm.git uses a signalfd to handle I/O
>> completion whereas qemu.git uses a signal, writes to a pipe from the
>> signal handler, and uses qemu_notify_event() to break the vcpu.  Once
>> the force iothread patch is merged we should be able to move to
>> qemu-kvm.git's signalfd approach.
> 
> No need to use a signal at all actually.  The use of a signal is 
> historic and was required to work around the TCG race that I referred to 
> in another thread.
> 
> You should be able to just use an eventfd or pipe.
> 
> Better yet, we should look at using GThreadPool to replace posix-aio-compat.

When interacting with the thread pool is part of some time-critical path
(easily possible with a real-time Linux guest), general-purpose
implementations like what glib offers are typically out of the game.
They do not provide sufficient customizability, specifically control
over their internal synchronization and allocation policies. That
applies to the other rather primitive glib threading and locking
services as well.

Jan
Anthony Liguori - Aug. 23, 2011, 2:02 p.m.
On 08/23/2011 08:02 AM, Jan Kiszka wrote:
> On 2011-08-23 14:40, Anthony Liguori wrote:
>> You should be able to just use an eventfd or pipe.
>>
>> Better yet, we should look at using GThreadPool to replace posix-aio-compat.
>
> When interacting with the thread pool is part of some time-critical path
> (easily possible with a real-time Linux guest), general-purpose
> implementations like what glib offers are typically out of the game.
> They do not provide sufficient customizability, specifically control
> over their internal synchronization and allocation policies. That
> applies to the other rather primitive glib threading and locking
> services as well.

We can certainly enhance glib.  glib is a cross platform library.  I 
don't see a compelling reason to invent a new cross platform library 
just for QEMU especially if the justification is future features, not 
current features.

Regards,

Anthony Liguori

>
> Jan
>
Jan Kiszka - Aug. 23, 2011, 2:10 p.m.
On 2011-08-23 16:02, Anthony Liguori wrote:
> On 08/23/2011 08:02 AM, Jan Kiszka wrote:
>> On 2011-08-23 14:40, Anthony Liguori wrote:
>>> [...]
> 
> We can certainly enhance glib.  glib is a cross platform library.  I 

Do you want to carry forked glib bits in QEMU?

> don't see a compelling reason to invent a new cross platform library 
> just for QEMU especially if the justification is future features, not 
> current features.

Tweaking affinity of aio threads is already a current requirement.

And we already have a working threading and locking system. One that is
growing beyond glib's level of support quickly (think of RCU).

Jan
Avi Kivity - Aug. 28, 2011, 8:09 a.m.
On 08/23/2011 05:10 PM, Jan Kiszka wrote:
> On 2011-08-23 16:02, Anthony Liguori wrote:
> >  On 08/23/2011 08:02 AM, Jan Kiszka wrote:
> >> [...]
> >
> >  We can certainly enhance glib.  glib is a cross platform library.  I
>
> Do you want to carry forked glib bits in QEMU?

We can make real-time depend on a newer glib version.

>
> >  don't see a compelling reason to invent a new cross platform library
> >  just for QEMU especially if the justification is future features, not
> >  current features.
>
> Tweaking affinity of aio threads is already a current requirement.
>
> And we already have a working threading and locking system. One that is
> growing beyond glib's level of support quickly (think of RCU).
>

glib will have to support RCU as well.  But for this topic, I agree with 
you for now.

Patch

diff --git a/posix-aio-compat.c b/posix-aio-compat.c
index 8dc00cb..c3febfb 100644
--- a/posix-aio-compat.c
+++ b/posix-aio-compat.c
@@ -30,6 +30,7 @@ 
 
 #include "block/raw-posix-aio.h"
 
+static void do_spawn_thread(void);
 
 struct qemu_paiocb {
     BlockDriverAIOCB common;
@@ -64,6 +65,9 @@  static pthread_attr_t attr;
 static int max_threads = 64;
 static int cur_threads = 0;
 static int idle_threads = 0;
+static int new_threads = 0;     /* backlog of threads we need to create */
+static int pending_threads = 0; /* threads created but not running yet */
+static QEMUBH *new_thread_bh;
 static QTAILQ_HEAD(, qemu_paiocb) request_list;
 
 #ifdef CONFIG_PREADV
@@ -311,6 +315,11 @@  static void *aio_thread(void *unused)
 
     pid = getpid();
 
+    mutex_lock(&lock);
+    pending_threads--;
+    mutex_unlock(&lock);
+    do_spawn_thread();
+
     while (1) {
         struct qemu_paiocb *aiocb;
         ssize_t ret = 0;
@@ -381,11 +390,20 @@  static void *aio_thread(void *unused)
     return NULL;
 }
 
-static void spawn_thread(void)
+static void do_spawn_thread(void)
 {
     sigset_t set, oldset;
 
-    cur_threads++;
+    mutex_lock(&lock);
+    if (!new_threads) {
+        mutex_unlock(&lock);
+        return;
+    }
+
+    new_threads--;
+    pending_threads++;
+
+    mutex_unlock(&lock);
 
     /* block all signals */
     if (sigfillset(&set)) die("sigfillset");
@@ -396,6 +414,27 @@  static void spawn_thread(void)
     if (sigprocmask(SIG_SETMASK, &oldset, NULL)) die("sigprocmask restore");
 }
 
+static void spawn_thread_bh_fn(void *opaque)
+{
+    do_spawn_thread();
+}
+
+static void spawn_thread(void)
+{
+    cur_threads++;
+    new_threads++;
+    /* If there are threads being created, they will spawn new workers, so
+     * we don't spend time creating many threads in a loop holding a mutex or
+     * starving the current vcpu.
+     *
+     * If there are no idle threads, ask the main thread to create one, so we
+     * inherit the correct affinity instead of the vcpu affinity.
+     */
+    if (!pending_threads) {
+        qemu_bh_schedule(new_thread_bh);
+    }
+}
+
 static void qemu_paio_submit(struct qemu_paiocb *aiocb)
 {
     aiocb->ret = -EINPROGRESS;
@@ -665,6 +704,7 @@  int paio_init(void)
         die2(ret, "pthread_attr_setdetachstate");
 
     QTAILQ_INIT(&request_list);
+    new_thread_bh = qemu_bh_new(spawn_thread_bh_fn, NULL);
 
     posix_aio_state = s;
     return 0;