Patchwork Add realtime option

Submitter Satoru Moriya
Date Nov. 3, 2012, 4:43 a.m.
Message ID <8631DC5930FA9E468F04F3FD3A5D007213990E4C@USINDEM103.corp.hds.com>
Permalink /patch/196802/
State New
Headers show

Comments

Satoru Moriya - Nov. 3, 2012, 4:43 a.m.
We have plans to migrate some old enterprise/control systems that
require low latency (millisecond order) to a KVM virtualized environment.
To satisfy these requirements, this patch adds a realtime option
to qemu:

 -realtime maxprio=<prio>,policy=<pol>

This option changes the scheduling policy and priority of the vcpu
threads to the realtime one specified by the arguments, and mlocks
all qemu and guest memory.

Of course, much more improvement is needed to keep latency low in a
qemu virtualized environment, and this is only a first step. OTOH, this
patch is enough to meet the requirements of our first migration project.

These are basic performance test results:

Host : 4 core, 4GB, 3.7.0-rc3
Guest: 1 core, 512MB, 3.6.3-1.fc17

Benchmark: cyclictest
https://rt.wiki.kernel.org/index.php/Cyclictest

Command:
 $ cyclictest -p 99 -n -m -q -l 100000

Results:
 - no load (1:normal qemu, 2:realtime qemu)
   1. T: 0 ( 544) P:99 I:1000 C:100000 Min: 11 Act: 32 Avg: 157 Max: 10029
   2. T: 0 ( 449) P:99 I:1000 C:100000 Min: 16 Act: 30 Avg:  29 Max:   540

 - load (heavy network traffic) (3:normal qemu, 4: realtime qemu)
   3. T: 0 (3455) P:99 I:1000 C:100000 Min: 10 Act: 38 Avg: 364 Max: 18394
   4. T: 0 ( 493) P:99 I:1000 C:100000 Min: 12 Act: 21 Avg:  76 Max: 10796

Signed-off-by: Satoru Moriya <satoru.moriya@hds.com>
---
 cpus.c          | 10 ++++++++++
 cpus.h          |  3 +++
 qemu-config.c   | 16 ++++++++++++++++
 qemu-options.hx |  9 +++++++++
 vl.c            | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 89 insertions(+)

                 }
                 numa_add(optarg);
                 break;
+            case QEMU_OPTION_realtime:
+                opts = qemu_opts_parse(qemu_find_opts("realtime"), optarg, 0);
+                if (!opts) {
+                    fprintf(stderr, "parse error: %s\n", optarg);
+                    exit(1);
+                }
+                configure_realtime(opts);
+                break;
             case QEMU_OPTION_display:
                 display_type = select_display(optarg);
                 break;
--
1.7.11.7
Jan Kiszka - Nov. 3, 2012, 7:45 a.m.
On 2012-11-03 05:43, Satoru Moriya wrote:
> We have some plans to migrate old enterprise/control systems which
> require low latency (msec order) to kvm virtualized environment.
> In order to satisfy the requirements, this patch adds realtime option
> to qemu:
> 
>  -realtime maxprio=<prio>,policy=<pol>
> 
> This option change the scheduling policy and priority to realtime one
> (only vcpu thread) as specified with argument and mlock all qemu and
> guest memory.

This patch breaks the win32 build. All the POSIX stuff has to be pushed
into os-posix.c; e.g., I'm introducing an os_prioritize() function for
that purpose, empty on win32.

Then another question is how to get the parameters around. I played with
many options, ending up so far with

/* called by os_prioritize */
void qemu_init_realtime(int rt_sched_policy, int max_sched_priority);
/* called by threaded subsystems */
bool qemu_realtime_is_enabled(void);
void qemu_realtime_get_parameters(int *policy, int *max_priority);

all hosted by qemu-thread-*.c (empty/aborting on win32). This allows
adjusting subsystems for realtime without pushing all the parameters
into global variables.

> 
> Of course, we need much more improvements to keep latency low in qemu
> virtualized environment and this is a first step. OTOH, we can meet the
> requirement of our first migration project with this patch.
> 
> These are basic performance test results:
> 
> Host : 4 core, 4GB, 3.7.0-rc3
> Guest: 1 core, 512MB, 3.6.3-1.fc17
> 
> Benchmark: cyclictest
> https://rt.wiki.kernel.org/index.php/Cyclictest
> 
> Command:
>  $ cyclictest -p 99 -n -m -q -l 100000
> 
> Results:
>  - no load (1:normal qemu, 2:realtime qemu)
>    1. T: 0 ( 544) P:99 I:1000 C:100000 Min: 11 Act: 32 Avg: 157 Max: 10029
>    2. T: 0 ( 449) P:99 I:1000 C:100000 Min: 16 Act: 30 Avg:  29 Max:   540
> 
>  - load (heavy network traffic) (3:normal qemu, 4: realtime qemu)
>    3. T: 0 (3455) P:99 I:1000 C:100000 Min: 10 Act: 38 Avg: 364 Max: 18394
>    4. T: 0 ( 493) P:99 I:1000 C:100000 Min: 12 Act: 21 Avg:  76 Max: 10796

What are the numbers of "chrt -f -p 99 <vcpu_tid>" compared to this?

My point is: this alone is not yet a good justification for the switch
and its current semantics. The approach of just raising the VCPU priority
is quite fragile without [V]CPU isolation. If you raise the VCPUs over
their event threads, specifically the iothread, you risk starvation, e.g.
during boot (the BIOS will poll endlessly for the PIT or disk). Yes, there
is /proc/sys/kernel/sched_rt_*, but that is what you typically disable
when doing realtime seriously, particularly if your guest doesn't idle
during operation.

The model I would propose for mainline first is different: maxprio goes
to the event threads, maxprio - 1 to all vcpus (which means that maxprio
must be > 1). This setup is less likely to starve and makes more sense
(interrupts must have higher priority than CPUs).

However, that's also not yet generic as we will have scenarios where
only part of the event sources and VCPUs will be prioritized and the
rest shall remain low prio / SCHED_OTHER. Besides defining a way to
express such configurations, the problem is that they may not work
during guest boot. So some realtime profile switching concept may also
be needed. I haven't made up my mind on these issues yet. Not to speak
of the horrible mess of configuring a PREEMPT-RT host...

What is clear, though, is that we need a reference show case for
realtime QEMU/KVM. One that is as easy to reproduce as possible, doesn't
depend on proprietary realtime guests and clearly shows the advantages
of all the needed changes for a reasonable use case. I'd like to discuss
this at the RT-KVM BoF at the KVM Forum next week. Will you and/or any
of your colleagues be there?

Jan
Satoru Moriya - Nov. 5, 2012, 11:49 p.m.
On 11/03/2012 03:45 AM, Jan Kiszka wrote:
> On 2012-11-03 05:43, Satoru Moriya wrote:
>> We have some plans to migrate old enterprise/control systems which
>> require low latency (msec order) to kvm virtualized environment.
>> In order to satisfy the requirements, this patch adds realtime option
>> to qemu:
>>
>>  -realtime maxprio=<prio>,policy=<pol>
>>
>> This option change the scheduling policy and priority to realtime one
>> (only vcpu thread) as specified with argument and mlock all qemu and
>> guest memory.
> 
> This patch breaks win32 build. All the POSIX stuff has to be pushed into
> os-posix.c e.g. I'm introducing some os_prioritize() function for that
> purpose, empty on win32.
>
> Then another question is how to get the parameters around. I played with
> many options, ending up so far with
> 
> /* called by os_prioritize */
> void qemu_init_realtime(int rt_sched_policy, int max_sched_priority);
> /* called by threaded subsystems */
> bool qemu_realtime_is_enabled(void);
> void qemu_realtime_get_parameters(int *policy, int *max_priority);
> 
> all hosted by qemu-thread-*.c (empty/aborting on win32). This allows to
> adjust subsystems to realtime without pushing all the parameters into
> global variables.

Thanks. I'll re-implement the patch based on your comment.

>> Benchmark: cyclictest
>> https://rt.wiki.kernel.org/index.php/Cyclictest
>>
>> Command:
>>  $ cyclictest -p 99 -n -m -q -l 100000
>>
>> Results:
>>  - no load (1:normal qemu, 2:realtime qemu)
>>    1. T: 0 ( 544) P:99 I:1000 C:100000 Min: 11 Act: 32 Avg: 157 Max: 10029
>>    2. T: 0 ( 449) P:99 I:1000 C:100000 Min: 16 Act: 30 Avg:  29 Max:   540
>>
>>  - load (heavy network traffic) (3:normal qemu, 4: realtime qemu)
>>    3. T: 0 (3455) P:99 I:1000 C:100000 Min: 10 Act: 38 Avg: 364 Max: 18394
>>    4. T: 0 ( 493) P:99 I:1000 C:100000 Min: 12 Act: 21 Avg:  76 Max: 10796
> 
> What are the numbers of "chrt -f -p 99 <vcpu_tid>" compared to this?

I'm afraid I don't have those results right now. I'll post them later
or with the next version.

> My point is: This alone is not yet a good justification for the switch
> and its current semantic. The approach of just raising the VCPU priority
> is quite fragile without [V]CPU isolation. If you raise the VCPU over
> its event threads, specifically the iothread, you risk starvation, e.g
> during boot (BIOS will poll endlessly for PIT or disk).

I think that doesn't happen if the host has enough CPU cores (at least
the number of VCPUs + 1). Is that wrong?

> Yes, there is
> /proc/sys/kernel/sched_rt_*, but this is what you typically disable when
> doing realtime seriously, particularly if your guest doesn't idle during
> operation.
>
> The model I would propose for mainline first is different: maxprio goes
> to the event threads, maxprio - 1 to all vcpus (means that maxprio must
> be > 1). This setup is less likely to starve and makes more sense
> (interrupts must have higher prio than CPUs).

Ok, I'll try your approach and test it.

> However, that's also not yet generic as we will have scenarios where
> only part of the event sources and VCPUs will be prioritized and the
> rest shall remain low prio / SCHED_OTHER. Besides defining a way to
> express such configurations, the problem is that they may not work
> during guest boot. So some realtime profile switching concept may also
> be needed. I haven't made up my mind on these issues yet. Not to speak
> of the horrible mess of configuring a PREEMPT-RT host...
>
> What is clear, though, is that we need a reference show case for
> realtime QEMU/KVM. One that is as easy to reproduce as possible, doesn't
> depend on proprietary realtime guests and clearly shows the advantages
> of all the needed changes for a reasonable use case. I'd like to discuss
> this at the RT-KVM BoF at the KVM Forum next week. Will you and/or any
> of your colleagues be there?

Yes, I'll attend the BoF.

Regards,
Satoru

Patch

diff --git a/cpus.c b/cpus.c
index d9c332f..456e6ea 100644
--- a/cpus.c
+++ b/cpus.c
@@ -734,6 +734,7 @@  static void *qemu_kvm_cpu_thread_fn(void *arg)
     CPUArchState *env = arg;
     CPUState *cpu = ENV_GET_CPU(env);
     int r;
+    struct sched_param sp;
 
     qemu_mutex_lock(&qemu_global_mutex);
     qemu_thread_get_self(cpu->thread);
@@ -746,6 +747,15 @@  static void *qemu_kvm_cpu_thread_fn(void *arg)
         exit(1);
     }
 
+    if (realtime) {
+        sp.sched_priority = realtime_prio;
+        r = sched_setscheduler(0, realtime_pol, &sp);
+        if (r < 0) {
+            perror("Setting realtime policy failed");
+            exit(1);
+        }
+    }
+
     qemu_kvm_init_cpu_signals(env);
 
     /* signal CPU creation */
diff --git a/cpus.h b/cpus.h
index 81bd817..a6b2688 100644
--- a/cpus.h
+++ b/cpus.h
@@ -16,6 +16,9 @@  void qtest_clock_warp(int64_t dest);
 /* vl.c */
 extern int smp_cores;
 extern int smp_threads;
+extern int realtime;
+extern int realtime_prio;
+extern int realtime_pol;
 void set_numa_modes(void);
 void set_cpu_log(const char *optarg);
 void set_cpu_log_filename(const char *optarg);
diff --git a/qemu-config.c b/qemu-config.c
index 3154cac..13290c6 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -658,6 +658,21 @@  QemuOptsList qemu_boot_opts = {
             .type = QEMU_OPT_STRING,
         },
         { /*End of list */ }
+    },
+};
+
+QemuOptsList qemu_realtime_opts = {
+    .name = "realtime",
+    .head = QTAILQ_HEAD_INITIALIZER(qemu_realtime_opts.head),
+    .desc = {
+        {
+            .name = "maxprio",
+            .type = QEMU_OPT_NUMBER,
+        }, {
+            .name = "policy",
+            .type = QEMU_OPT_STRING,
+        },
+        { /* End of List */ }
     },
 };
 
@@ -699,6 +714,7 @@  static QemuOptsList *vm_config_groups[32] = {
     &qemu_iscsi_opts,
     &qemu_sandbox_opts,
     &qemu_add_fd_opts,
+    &qemu_realtime_opts,
     NULL,
 };
 
diff --git a/qemu-options.hx b/qemu-options.hx
index fe8f15c..eb8ba05 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -2405,6 +2405,15 @@  STEXI
 Do not start CPU at startup (you must type 'c' in the monitor).
 ETEXI
 
+DEF("realtime", HAS_ARG, QEMU_OPTION_realtime,
+    "-realtime maxprio=prio[,policy=pol]\n",
+    QEMU_ARCH_ALL)
+STEXI
+@item -realtime maxprio=@var{prio}[,policy=@var{pol}]
+@findex -realtime
+run qemu as a realtime process with priority @var{prio} and policy @var{pol}.
+ETEXI
+
 DEF("gdb", HAS_ARG, QEMU_OPTION_gdb, \
     "-gdb dev        wait for gdb connection on 'dev'\n", QEMU_ARCH_ALL)
 STEXI
diff --git a/vl.c b/vl.c
index 0f5b07b..a08fe79 100644
--- a/vl.c
+++ b/vl.c
@@ -248,6 +248,10 @@  int nb_numa_nodes;
 uint64_t node_mem[MAX_NODES];
 unsigned long *node_cpumask[MAX_NODES];
 
+int realtime;
+int realtime_prio;
+int realtime_pol;
+
 uint8_t qemu_uuid[16];
 
 static QEMUBootSetHandler *boot_set_handler;
@@ -1151,6 +1155,45 @@  static void smp_parse(const char *optarg)
         max_cpus = smp_cpus;
 }
 
+static void configure_realtime(QemuOpts *opts) {
+    int prio, max_prio, min_prio;
+    const char *pol;
+
+    pol = qemu_opt_get(opts, "policy");
+    if (pol) {
+        if (!strcmp(pol, "rr")) {
+            realtime_pol = SCHED_RR;
+        } else if (!strcmp(pol, "fifo")) {
+            realtime_pol = SCHED_FIFO;
+        } else {
+            fprintf(stderr, "qemu: invalid option value '%s'\n", pol);
+            exit(1);
+        }
+    } else {
+        realtime_pol = SCHED_RR;
+    }
+    prio = qemu_opt_get_number(opts, "maxprio", 1);
+
+    min_prio = sched_get_priority_min(realtime_pol);
+    max_prio = sched_get_priority_max(realtime_pol);
+
+    if (prio < min_prio) {
+        realtime_prio = min_prio;
+    } else if (max_prio < prio) {
+        realtime_prio = max_prio;
+    } else {
+        realtime_prio = prio;
+    }
+
+    if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
+        perror("mlock");
+        exit(1);
+    }
+
+    realtime = 1;
+}
+
 /***********************************************************/
 /* USB devices */
 
@@ -2712,6 +2755,14 @@  int main(int argc, char **argv, char **envp)