From patchwork Mon Dec 1 07:46:03 2014
X-Patchwork-Submitter: Ming Lei
X-Patchwork-Id: 416320
From: Ming Lei
To: Peter Lieven
Cc: Kevin Wolf, Paolo Bonzini, qemu-devel, Stefan Hajnoczi
Subject: Re: [Qemu-devel] [PATCH 0/7] coroutine: optimizations
Date: Mon, 1 Dec 2014 15:46:03 +0800
Message-ID: <20141201154603.2e6a0565@tom-ThinkPad-T410>
In-Reply-To: <547C132D.3070303@kamp.de>
References: <1417183941-26329-1-git-send-email-pbonzini@redhat.com>
 <547C132D.3070303@kamp.de>

On Mon, 01 Dec 2014 08:05:17 +0100
Peter Lieven wrote:

> On 01.12.2014 06:55, Ming Lei wrote:
> > On Fri, Nov 28, 2014 at 10:12 PM, Paolo Bonzini wrote:
> >> As discussed in the other thread, this brings speedups from
> >> dropping the coroutine mutex (which serializes multiple iothreads,
> >> too) and using ELF thread-local storage.
> >>
> >> The speedup in perf/cost is about 30% (190->145). Windows port tested
> >> with tests/test-coroutine.exe under Wine.
> > The data is very nice, and on my laptop 'perf cost' decreases
> > from 244ns to 174ns.
> >
> > BTW, the cost of running a function via a coroutine does not come only
> > from these helpers (*_yield, *_enter, *_create; perf/cost measures just
> > this part), but also from some implicit/invisible overhead. I have some
> > test cases which can show the problem. If anyone is interested,
> > I can post them to the list.
> > Of course, maybe the problem can be solved or mitigated.

OK, please try the patch below:

From 917d5cc0a273f9825b10abd52152c54e08c81ef8 Mon Sep 17 00:00:00 2001
From: Ming Lei
Date: Mon, 1 Dec 2014 11:11:23 +0800
Subject: [PATCH] test-coroutine: introduce perf-cost-with-load

The perf/cost test case only covers the explicit cost of using a
coroutine. This patch adds an open/close-file test case which shows
that there is also some implicit or invisible cost beyond what
/perf/cost measures.
In my environment, the following is the test result after applying this
patch and running perf/cost and perf/cost-with-load:

{*LOG(start):{/perf/cost}:LOG*}
/perf/cost: {*LOG(message):{Run operation 40000000 iterations 7.539413 s, 5305K operations/s, 188ns per coroutine}:LOG*}
OK
{*LOG(stop):(0;0;7.539497):LOG*}
{*LOG(start):{/perf/cost-with-load}:LOG*}
/perf/cost-with-load: {*LOG(message):{Run operation 1000000 iterations 2.648014 s, 377K operations/s, 2648ns per operation without using coroutine}:LOG*}
{*LOG(message):{Run operation 1000000 iterations 2.919133 s, 342K operations/s, 2919ns per operation, 271ns(cost introduced by coroutine) per operation with using coroutine}:LOG*}
OK
{*LOG(stop):(0;0;5.567333):LOG*}

From the above data, /perf/cost reports 188ns for running one coroutine,
but in /perf/cost-with-load the cost actually introduced is 271ns, so the
extra 83ns of cost is invisible and implicit.

A similar result can be seen in the following test cases too:

- read from /dev/nullb0 opened with O_DIRECT (a rough simulation of an
  aio read; /dev/nullbX needs a 3.13+ kernel with 'modprobe null_blk';
  this case shows ~150ns of extra cost)

- the statvfs() syscall: there is ~30ns of extra cost for running one
  statvfs() inside a coroutine
---
 tests/test-coroutine.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/tests/test-coroutine.c b/tests/test-coroutine.c
index 27d1b6f..7323a91 100644
--- a/tests/test-coroutine.c
+++ b/tests/test-coroutine.c
@@ -311,6 +311,72 @@ static void perf_baseline(void)
                    maxcycles, duration);
 }
 
+static void perf_cost_load_worker(void *opaque)
+{
+    int fd;
+
+    fd = open("/proc/self/exe", O_RDONLY);
+    assert(fd >= 0);
+    close(fd);
+}
+
+static __attribute__((noinline)) void perf_cost_load_func(void *opaque)
+{
+    perf_cost_load_worker(opaque);
+    qemu_coroutine_yield();
+}
+
+static double perf_cost_load(unsigned long maxcycles, bool use_co)
+{
+    unsigned long i = 0;
+    double duration;
+
+    g_test_timer_start();
+    if (use_co) {
+        Coroutine *co;
+        while (i++ < maxcycles) {
+            co = qemu_coroutine_create(perf_cost_load_func);
+            qemu_coroutine_enter(co, &i);
+            qemu_coroutine_enter(co, NULL);
+        }
+    } else {
+        while (i++ < maxcycles) {
+            perf_cost_load_worker(&i);
+        }
+    }
+    duration = g_test_timer_elapsed();
+
+    return duration;
+}
+
+static void perf_cost_with_load(void)
+{
+    const unsigned long maxcycles = 1000000;
+    double duration;
+    unsigned long ops;
+    unsigned long cost_co, cost;
+
+    duration = perf_cost_load(maxcycles, false);
+    ops = (long)(maxcycles / (duration * 1000));
+    cost = (unsigned long)(1000000000.0 * duration / maxcycles);
+    g_test_message("Run operation %lu iterations %f s, %luK operations/s, "
+                   "%luns per operation without using coroutine",
+                   maxcycles, duration, ops, cost);
+
+    duration = perf_cost_load(maxcycles, true);
+    ops = (long)(maxcycles / (duration * 1000));
+    cost_co = (unsigned long)(1000000000.0 * duration / maxcycles);
+    g_test_message("Run operation %lu iterations %f s, %luK operations/s, "
+                   "%luns per operation, "
+                   "%luns(cost introduced by coroutine) per operation "
+                   "with using coroutine",
+                   maxcycles, duration, ops, cost_co, cost_co - cost);
+}
+
 static __attribute__((noinline)) void perf_cost_func(void *opaque)
 {
     qemu_coroutine_yield();
@@ -355,6 +421,7 @@ int main(int argc, char **argv)
         g_test_add_func("/perf/yield", perf_yield);
         g_test_add_func("/perf/function-call", perf_baseline);
         g_test_add_func("/perf/cost", perf_cost);
+        g_test_add_func("/perf/cost-with-load", perf_cost_with_load);
     }
     return g_test_run();
 }