
[v1,00/17] dataplane: optimization and multi virtqueue support

Message ID 20140810114624.0305b7af@tom-ThinkPad-T410

Commit Message

Ming Lei Aug. 10, 2014, 3:46 a.m. UTC
Hi Kevin, Paolo, Stefan and all,


On Wed, 6 Aug 2014 10:48:55 +0200
Kevin Wolf <kwolf@redhat.com> wrote:

> Am 06.08.2014 um 07:33 hat Ming Lei geschrieben:

> 
> Anyhow, the coroutine version of your benchmark is buggy, it leaks all
> coroutines instead of exiting them, so it can't make any use of the
> coroutine pool. On my laptop, I get this (where fixed coroutine is a
> version that simply removes the yield at the end):
> 
>                 | bypass        | fixed coro    | buggy coro
> ----------------+---------------+---------------+--------------
> time            | 1.09s         | 1.10s         | 1.62s
> L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
> insns per cycle | 2.39          | 2.39          | 1.90
> 
> Begs the question whether you see a similar effect on a real qemu and
> the coroutine pool is still not big enough? With correct use of
> coroutines, the difference seems to be barely measurable even without
> any I/O involved.

Now I have fixed the coroutine leak bug. The previous crypt benchmark
carried a fairly heavy per-iteration load, which kept operations per
second very low (~40K/sec), so I wrote a new and simpler one that can
generate hundreds of thousands of operations per second. That rate should
match some fast storage devices, and it does show that the coroutine
overhead is not small.

In the extreme case where each iteration runs only a getppid() syscall,
about 3M operations/sec can be reached with coroutines, while without
coroutines the number reaches 16M/sec, a difference of more than 4x!
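
Concretely, each iteration compares the following two paths (a minimal
sketch using the same coroutine API as the patch at the end of this mail;
do_workload() stands for getppid() or the file read):

    static void bench_entry(void *opaque)
    {
        struct co_data *data = opaque;

        data->sum += do_workload(data);
        /* returning here lets the coroutine terminate, so it can be
         * recycled through the coroutine pool; yielding at the end and
         * never re-entering it (the leak Kevin pointed out) prevents
         * that */
    }

    /* one benchmark iteration: coroutine path vs. bypass path */
    static void run_one_iteration(struct co_data *data, bool bypass)
    {
        if (!bypass) {
            /* coroutine path: one create + enter per iteration */
            Coroutine *co = qemu_coroutine_create(bench_entry);
            qemu_coroutine_enter(co, data);
        } else {
            /* bypass path: a plain function call */
            bench_entry(data);
        }
    }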

Another benchmark, the default file read one, does the following in each
iteration:

      open(file), read(fd, buf on the stack, 512), sum the buffer, close()

Without coroutines, operations per second increase by ~20% compared with
using coroutines. When reading 1024 bytes each time, the number still
increases by ~10%. The operations-per-second level is between 200K and
400K, which should match the IOPS in the dataplane test; the tests were
run on my Lenovo T410 notebook (CPU: 2.6GHz, dual core, four threads).

When reading 8192 bytes or more each time, no obvious difference between
using coroutines and not can be observed.

Surely the result depends on how fast the machine is, but even on a fast
machine I guess a similar result can still be observed by decreasing the
number of bytes read in each iteration.




Thanks,

Comments

Kevin Wolf Aug. 11, 2014, 2:03 p.m. UTC | #1
Am 10.08.2014 um 05:46 hat Ming Lei geschrieben:
> Hi Kevin, Paolo, Stefan and all,
> 
> 
> On Wed, 6 Aug 2014 10:48:55 +0200
> Kevin Wolf <kwolf@redhat.com> wrote:
> 
> > Am 06.08.2014 um 07:33 hat Ming Lei geschrieben:
> 
> > 
> > Anyhow, the coroutine version of your benchmark is buggy, it leaks all
> > coroutines instead of exiting them, so it can't make any use of the
> > coroutine pool. On my laptop, I get this (where fixed coroutine is a
> > version that simply removes the yield at the end):
> > 
> >                 | bypass        | fixed coro    | buggy coro
> > ----------------+---------------+---------------+--------------
> > time            | 1.09s         | 1.10s         | 1.62s
> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
> > insns per cycle | 2.39          | 2.39          | 1.90
> > 
> > Begs the question whether you see a similar effect on a real qemu and
> > the coroutine pool is still not big enough? With correct use of
> > coroutines, the difference seems to be barely measurable even without
> > any I/O involved.
> 
> Now I have fixed the coroutine leak bug. The previous crypt benchmark
> carried a fairly heavy per-iteration load, which kept operations per
> second very low (~40K/sec), so I wrote a new and simpler one that can
> generate hundreds of thousands of operations per second. That rate should
> match some fast storage devices, and it does show that the coroutine
> overhead is not small.
> 
> In the extreme case where each iteration runs only a getppid() syscall,
> about 3M operations/sec can be reached with coroutines, while without
> coroutines the number reaches 16M/sec, a difference of more than 4x!

I see that you're measuring a lot of things, but the one thing that is
unclear to me is what question those benchmarks are supposed to answer.

Basically I see two different, useful types of benchmark:

1. Look at coroutines in isolation and try to get a directly coroutine-
   related function (like create/destroy or yield/reenter) faster. This
   is what tests/test-coroutine does (see the sketch after this list).

   This is quite good at telling you what costs the coroutine functions
   have and where you need to optimise - without taking the practical
   benefits into account, so it's not suitable for comparison.

2. Look at the whole thing in its realistic environment. This should
   probably involve at least some asynchronous I/O, but ideally use the
   whole block layer. qemu-img bench tries to do this. For being even
   closer to the real environment you'd have to use the virtio-blk code
   as well, which you currently only get with a full VM (perhaps qtest
   could do something interesting here in theory).

   This is good for telling how big the costs are in relation to the
   total workload (and code saved elsewhere) in practice. This is the
   set of tests that can meaningfully be compared to a callback-based
   solution.
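
Schematically, such an isolated benchmark boils down to something like the
following (a simplified sketch of what the lifecycle test in
tests/test-coroutine does):

    static void empty_coroutine(void *opaque)
    {
    }

    /* nothing but coroutine lifecycle operations in a tight loop */
    static void perf_lifecycle(unsigned long maxcycles)
    {
        unsigned long i;

        for (i = 0; i < maxcycles; i++) {
            Coroutine *co = qemu_coroutine_create(empty_coroutine);
            qemu_coroutine_enter(co, NULL);
        }
    }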

Running arbitrary workloads like getppid() or open/read/close isn't as
useful as these. It doesn't isolate the coroutines as well as tests that
run literally nothing else than coroutine functions, and it is too
removed from the actual use case to get the relation between additional
costs, saving and total workload figured out for the real case.

> Another benchmark, the default file read one, does the following in each
> iteration:
> 
>       open(file), read(fd, buf on the stack, 512), sum the buffer, close()
> 
> Without coroutines, operations per second increase by ~20% compared with
> using coroutines. When reading 1024 bytes each time, the number still
> increases by ~10%. The operations-per-second level is between 200K and
> 400K, which should match the IOPS in the dataplane test; the tests were
> run on my Lenovo T410 notebook (CPU: 2.6GHz, dual core, four threads).
> 
> When reading 8192 bytes or more each time, no obvious difference between
> using coroutines and not can be observed.

All it tells you is that the variation of the workload can make the
coroutine cost disappear in the noise. It doesn't tell you much about
the real use case.

And you're comparing apples and oranges anyway: The real question in
qemu is whether you use coroutines or pass around heap-allocated state
between callbacks. Your benchmark doesn't have a single callback because
it hasn't got any asynchronous operations and doesn't need to allocate
and pass any state.

It does, however, have an unnecessary yield() for the coroutine case
because you felt that the real case is more complex and does yield
(which is true, but it's more complex for both coroutines and
callbacks).
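
To make that contrast concrete, here is a schematic sketch; submit_read(),
co_read() and complete_request() are made-up helpers, not actual block
layer functions:

    /* callback style: per-request state is heap-allocated and handed
     * from one completion callback to the next */
    typedef struct Request {
        int64_t sector;
        void *buf;
    } Request;

    static void read_done(void *opaque, int ret)
    {
        Request *req = opaque;

        complete_request(req->sector, req->buf, ret);
        g_free(req);
    }

    static void start_read(int64_t sector, void *buf)
    {
        Request *req = g_new0(Request, 1);

        req->sector = sector;
        req->buf = buf;
        submit_read(sector, buf, read_done, req);
    }

    /* coroutine style: the same state simply stays in local variables on
     * the coroutine stack; co_read() yields while the I/O is in flight
     * and the coroutine is re-entered on completion */
    static void coroutine_fn start_read_co(int64_t sector, void *buf)
    {
        int ret = co_read(sector, buf);

        complete_request(sector, buf, ret);
    }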

> Surely the result depends on how fast the machine is, but even on a fast
> machine I guess a similar result can still be observed by decreasing the
> number of bytes read in each iteration.

Yes, results looked similar on my laptop. (They just don't tell me
much.)


Let's have a look at some fio results from my laptop:

aggrb KB/s  | master    | coroutine | bypass
------------+-----------+-----------+------------
run 1       | 419934    | 449518    | 445823
run 2       | 444358    | 456365    | 448332
run 3       | 444076    | 455209    | 441552


And here from my lab test box:

aggrb KB/s  | master    | coroutine | bypass
------------+-----------+-----------+------------
run 1       | 25330     | 56378     | 53541
run 2       | 26041     | 55709     | 54136
run 3       | 25811     | 56829     | 49080

The improvement of the bypass patches is barely measurable on my laptop
(if it even exists), whereas it seems to be a pretty big thing for my
lab test box. In any case, the optimised coroutine code seems to beat
the bypass on both machines. (That is for random reads anyway. For
sequential, I get a much larger variation, and on my lab test box bypass
is ahead, whereas on my laptop both are roughly on the same level.)


Another thing I tried is creating the coroutine already in virtio-blk to
avoid the overhead of the bdrv_aio_* emulation. I don't quite understand
the result of my benchmarks there, maybe you have an idea: For random
reads, I see a significant improvement, for sequential however a clear
degradation.
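
Roughly, the idea looks like this (a simplified sketch rather than the
actual patch I tested; the request fields and helper names are
illustrative):

    /* instead of going through the bdrv_aio_readv() emulation, which
     * wraps the request in a coroutine inside the block layer, virtio-blk
     * creates the coroutine itself and calls the coroutine-native
     * interface directly */
    static void coroutine_fn virtio_blk_co_read_entry(void *opaque)
    {
        VirtIOBlockReq *req = opaque;
        int ret;

        ret = bdrv_co_readv(req->dev->bs, req->sector_num,
                            req->qiov.size / BDRV_SECTOR_SIZE, &req->qiov);
        virtio_blk_rw_complete(req, ret);
    }

    static void virtio_blk_submit_read(VirtIOBlockReq *req)
    {
        Coroutine *co = qemu_coroutine_create(virtio_blk_co_read_entry);
        qemu_coroutine_enter(co, req);
    }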

aggrb MB/s  | bypass    | coroutine | virtio-blk-created coroutine
------------+-----------+-----------+------------------------------
seq. read   | 738       | 738       | 694
random read | 442       | 459       | 475

I would appreciate any ideas about what's going on with sequential reads
here and how it can be fixed. Anyway, on my machines, coroutines don't
look like a lost case at all.

Kevin
Ming Lei Aug. 12, 2014, 7:53 a.m. UTC | #2
On Mon, Aug 11, 2014 at 10:03 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 10.08.2014 um 05:46 hat Ming Lei geschrieben:
>> Hi Kevin, Paolo, Stefan and all,
>>
>>
>> On Wed, 6 Aug 2014 10:48:55 +0200
>> Kevin Wolf <kwolf@redhat.com> wrote:
>>
>> > Am 06.08.2014 um 07:33 hat Ming Lei geschrieben:
>>
>> >
>> > Anyhow, the coroutine version of your benchmark is buggy, it leaks all
>> > coroutines instead of exiting them, so it can't make any use of the
>> > coroutine pool. On my laptop, I get this (where fixed coroutine is a
>> > version that simply removes the yield at the end):
>> >
>> >                 | bypass        | fixed coro    | buggy coro
>> > ----------------+---------------+---------------+--------------
>> > time            | 1.09s         | 1.10s         | 1.62s
>> > L1-dcache-loads | 921,836,360   | 932,781,747   | 1,298,067,438
>> > insns per cycle | 2.39          | 2.39          | 1.90
>> >
>> > Begs the question whether you see a similar effect on a real qemu and
>> > the coroutine pool is still not big enough? With correct use of
>> > coroutines, the difference seems to be barely measurable even without
>> > any I/O involved.
>>
>> Now I have fixed the coroutine leak bug. The previous crypt benchmark
>> carried a fairly heavy per-iteration load, which kept operations per
>> second very low (~40K/sec), so I wrote a new and simpler one that can
>> generate hundreds of thousands of operations per second. That rate should
>> match some fast storage devices, and it does show that the coroutine
>> overhead is not small.
>>
>> In the extreme case where each iteration runs only a getppid() syscall,
>> about 3M operations/sec can be reached with coroutines, while without
>> coroutines the number reaches 16M/sec, a difference of more than 4x!
>
> I see that you're measuring a lot of things, but the one thing that is
> unclear to me is what question those benchmarks are supposed to answer.
>
> Basically I see two different, useful types of benchmark:
>
> 1. Look at coroutines in isolation and try to get a directly coroutine-
>    related function (like create/destroy or yield/reenter) faster. This
>    is what tests/test-coroutine does.

Actually, tests/test-coroutine does tell us that the cost introduced by
coroutines is not small, according to Paolo's computation in his
environment [1]:

    - one yield takes 83ns
    - one enter takes 97ns
    - this introduces about 8.3% overhead from coroutines if the block
      device can reach 300K IOPS, like your case of loop over tmpfs
    - it may cause 13.8% overhead if the block device can reach 500K IOPS
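
Spelling that arithmetic out (my reading is that Paolo counts one enter on
submission, one yield, and one re-enter on completion per request):

    per request : 97ns (enter) + 83ns (yield) + 97ns (re-enter) ~= 277ns
    at 300K IOPS: 277ns * 300,000 ~= 0.083 CPU-seconds per second ~=  8.3%
    at 500K IOPS: 277ns * 500,000 ~= 0.139 CPU-seconds per second ~= 13.8%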

The cost may show up in IOPS, in CPU utilization, or both, depending on
how fast the CPU is.

The above computation assumes every coroutine allocation hits the pool,
and does not account for the effect of switching stacks. If both are taken
into account, the cost certainly grows.

[1], https://lists.nongnu.org/archive/html/qemu-devel/2014-08/msg01544.html

>    This is quite good at telling you what costs the coroutine functions
>    have and where you need to optimise - without taking the practical
>    benefits into account, so it's not suitable for comparison.
>
> 2. Look at the whole thing in its realistic environment. This should
>    probably involve at least some asynchronous I/O, but ideally use the
>    whole block layer. qemu-img bench tries to do this. For being even
>    closer to the real environment you'd have to use the virtio-blk code
>    as well, which you currently only get with a full VM (perhaps qtest
>    could do something interesting here in theory).
>
>    This is good for telling how big the costs are in relation to the
>    total workload (and code saved elsewhere) in practice. This is the
>    set of tests that can meaningfully be compared to a callback-based
>    solution.
>
> Running arbitrary workloads like getppid() or open/read/close isn't as
> useful as these. It doesn't isolate the coroutines as well as tests that
> run literally nothing else than coroutine functions, and it is too
> removed from the actual use case to get the relation between additional
> costs, saving and total workload figured out for the real case.

If you think getppid() doesn't isolate the coroutine cost, just run a nop
instead; then you will find the cost can reach 90%. Basically it has
nothing to do with what the load does; it is all about how fast the load
runs. The quicker the load, the more cost coroutines introduce - please
see the computation in the link above.

Another reason I use getppid() is that:

     After I/O plug & unplug was introduced, bdrv_aio_readv/bdrv_aio_writev
     became much quicker, because most of the time they only queue the I/O
     request into the I/O queue, with no io_submit involved at all. Even
     though coroutine operations take little time (<100ns), they can still
     make a difference compared with the time for merely queuing I/O, at
     least for high-speed I/O such as the >300K IOPS in your case.
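
In other words, with plug/unplug the submission path looks roughly like
this (a sketch, not the actual virtio-blk dataplane code; BenchReq and
complete_cb are illustrative):

    typedef struct BenchReq {
        int64_t sector;
        int nb_sectors;
        QEMUIOVector qiov;
    } BenchReq;

    static void submit_batch(BlockDriverState *bs, BenchReq *reqs,
                             int num_reqs)
    {
        int i;

        bdrv_io_plug(bs);
        for (i = 0; i < num_reqs; i++) {
            /* only queues the request; no io_submit() happens here */
            bdrv_aio_readv(bs, reqs[i].sector, &reqs[i].qiov,
                           reqs[i].nb_sectors, complete_cb, &reqs[i]);
        }
        /* a single io_submit() then covers the whole batch */
        bdrv_io_unplug(bs);
    }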

>> Another benchmark, the default file read one, does the following in each
>> iteration:
>>
>>       open(file), read(fd, buf on the stack, 512), sum the buffer, close()
>>
>> Without coroutines, operations per second increase by ~20% compared with
>> using coroutines. When reading 1024 bytes each time, the number still
>> increases by ~10%. The operations-per-second level is between 200K and
>> 400K, which should match the IOPS in the dataplane test; the tests were
>> run on my Lenovo T410 notebook (CPU: 2.6GHz, dual core, four threads).
>>
>> When reading 8192 bytes or more each time, no obvious difference between
>> using coroutines and not can be observed.
>
> All it tells you is that the variation of the workload can make the
> coroutine cost disappear in the noise. It doesn't tell you much about
> the real use case.

By the time the cost disappears, the IOPS has already become very low.
That also puts in question whether coroutines fit the high-speed I/O case.

> And you're comparing apples and oranges anyway: The real question in
> qemu is whether you use coroutines or pass around heap-allocated state
> between callbacks. Your benchmark doesn't have a single callback because
> it hasn't got any asynchronous operations and doesn't need to allocate
> and pass any state.
>
> It does, however, have an unnecessary yield() for the coroutine case
> because you felt that the real case is more complex and does yield
> (which is true, but it's more complex for both coroutines and
> callbacks).
>
>> Surely the result depends on how fast the machine is, but even on a fast
>> machine I guess a similar result can still be observed by decreasing the
>> number of bytes read in each iteration.
>
> Yes, results looked similar on my laptop. (They just don't tell me
> much.)
>
>
> Let's have a look at some fio results from my laptop:
>
> aggrb KB/s  | master    | coroutine | bypass
> ------------+-----------+-----------+------------
> run 1       | 419934    | 449518    | 445823
> run 2       | 444358    | 456365    | 448332
> run 3       | 444076    | 455209    | 441552
>
>
> And here from my lab test box:
>
> aggrb KB/s  | master    | coroutine | bypass
> ------------+-----------+-----------+------------
> run 1       | 25330     | 56378     | 53541
> run 2       | 26041     | 55709     | 54136
> run 3       | 25811     | 56829     | 49080
>
> The improvement of the bypass patches is barely measurable on my laptop
> (if it even exists), whereas it seems to be a pretty big thing for my
> lab test box. In any case, the optimised coroutine code seems to beat
> the bypass on both machines. (That is for random reads anyway. For
> sequential, I get a much larger variation, and on my lab test box bypass
> is ahead, whereas on my laptop both are roughly on the same level.)
>
> Another thing I tried is creating the coroutine already in virtio-blk to
> avoid the overhead of the bdrv_aio_* emulation. I don't quite understand
> the result of my benchmarks there, maybe you have an idea: For random
> reads, I see a significant improvement, for sequential however a clear
> degradation.
>
> aggrb MB/s  | bypass    | coroutine | virtio-blk-created coroutine
> ------------+-----------+-----------+------------------------------
> seq. read   | 738       | 738       | 694
> random read | 442       | 459       | 475
>
> I would appreciate any ideas about what's going on with sequential reads
> here and how it can be fixed. Anyway, on my machines, coroutines don't
> look like a lost case at all.

Firstly, I hope you can run a test that bypasses only the coroutine, that
is, one that uses the same code path except for the coroutine operations,
so that the effect of the coroutine itself can be observed.

Secondly, maybe your machine is fast enough that the IOPS difference can't
easily be observed, but there should still be a difference in CPU
utilization, since the computation above tells us the coroutine cost does
exist. The faster the block device, the bigger the cost.


Thanks,

Patch

diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index ae64b3d..78c3b60 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -15,6 +15,12 @@  STEXI
 @item bench [-q] [-f @var{fmt]} [-n] [-t @var{cache}] filename
 ETEXI
 
+DEF("co_bench", co_bench,
+    "co_bench -c count -f read_file_name -s read_size -q -b")
+STEXI
+@item co_bench [-c @var{count}] [-f @var{filename}] [-s @var{read_size}] [-b] [-q]
+ETEXI
+
 DEF("check", img_check,
     "check [-q] [-f fmt] [--output=ofmt]  [-r [leaks | all]] filename")
 STEXI
diff --git a/qemu-img.c b/qemu-img.c
index 3e1b7c4..c9c7ac3 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -366,6 +366,138 @@  static int add_old_style_options(const char *fmt, QemuOpts *opts,
     return 0;
 }
 
+struct co_data {
+    const char *file_name;
+    unsigned long sum;
+    int read_size;
+    bool bypass;
+};
+
+static unsigned long file_bench(struct co_data *co)
+{
+    const int size = co->read_size;
+    int fd = open(co->file_name, O_RDONLY);
+    char buf[size];
+    int len, i;
+    unsigned long sum = 0;
+
+    if (fd < 0) {
+        perror("open file failed");
+        exit(-1);
+    }
+
+    /* the 1st page should already be in page cache, so reading it won't block */
+    len = read(fd, buf, size);
+    if (len != size) {
+        perror("read file failed");
+        exit(-1);
+    }
+    close(fd);
+
+    for (i = 0; i < len; i++) {
+        sum += buf[i];
+    }
+
+    return sum;
+}
+
+static void syscall_bench(void *opaque)
+{
+    struct co_data *data = opaque;
+
+#if 0
+    /*
+     * Doing only getppid() shows operations per sec increasing about 5
+     * times on my T410 notebook when the coroutine is bypassed.
+     */
+    data->sum += getppid();
+#else
+    /*
+     * open, read 1024 bytes and close shows a ~10% increase on my
+     * T410 notebook when the coroutine is bypassed.
+     *
+     * open, read 512 bytes and close shows a ~20% increase on my
+     * T410 notebook when the coroutine is bypassed.
+     *
+     * Below link provides 'perf stat' on several hw events:
+     *
+     *       http://pastebin.com/5s750m8C
+     *
+     * With the coroutine bypassed, dcache loads decrease, insns per
+     * cycle increases by 0.7, the branch-miss ratio decreases by 0.4%,
+     * and dTLB loads decrease too.
+     */
+    data->sum += file_bench(data);
+#endif
+
+    if (!data->bypass) {
+        qemu_coroutine_yield();
+    }
+}
+
+static int co_bench(int argc, char **argv)
+{
+    int c;
+    unsigned long cnt = 1;
+    int num = 1;
+    unsigned long i;
+    struct co_data data = {
+        .file_name = argv[-1],
+        .sum = 0,
+        .read_size = 1024,
+        .bypass = false,
+    };
+    Coroutine *co, *last_co = NULL;
+    struct timeval t1, t2;
+    unsigned long tv = 0;
+
+    for (;;) {
+        c = getopt(argc, argv, "bc:s:f:");
+        if (c == -1) {
+            break;
+        }
+        switch (c) {
+        case 'b':
+            data.bypass = true;
+            break;
+        case 'c':
+            num = atoi(optarg);
+            break;
+        case 's':
+            data.read_size = atoi(optarg);
+            break;
+        case 'f':
+            data.file_name = optarg;
+            break;
+        }
+    }
+
+    printf("%s: iterations %d, bypass: %s, file %s, read_size: %d\n",
+           __func__, num,
+           data.bypass ? "yes" : "no",
+           data.file_name, data.read_size);
+    gettimeofday(&t1, NULL);
+    for (i = 0; i < num * cnt; i++) {
+        if (!data.bypass) {
+            if (last_co) {
+                qemu_coroutine_enter(last_co, NULL);
+            }
+            co = qemu_coroutine_create(syscall_bench);
+            last_co = co;
+            qemu_coroutine_enter(co, &data);
+        } else {
+            syscall_bench(&data);
+        }
+    }
+    gettimeofday(&t2, NULL);
+    tv = (t2.tv_sec - t1.tv_sec) * 1000000 +
+        (t2.tv_usec - t1.tv_usec);
+    printf("\ttotal time: %lums, %5.0fK ops per sec\n", tv / 1000,
+           (double)((cnt * num * 1000) / tv));
+
+    return (int)data.sum;
+}
+
 static int img_create(int argc, char **argv)
 {
     int c;