From patchwork Thu Aug 28 15:40:27 2014
From: Richard Henderson
X-Patchwork-Id: 383942
Message-ID: <53FF4D6B.2050202@twiddle.net>
Date: Thu, 28 Aug 2014 08:40:27 -0700
To: qemu-devel
Subject: [Qemu-devel] [RFC] Use of host vector operations in host helper functions

Most of the time, guest vector operations are rare enough that it
doesn't really matter that we implement them with a loop around
integer operations.  But for target-alpha, there's one vector
comparison operation that appears in every guest string operation,
and is used heavily enough that it's in the top 10 functions in the
profile: cmpbge (compare bytes greater or equal).
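As a reference point: cmpbge compares each of the eight bytes of op1
against the corresponding byte of op2, unsigned, and returns an 8-bit
mask of the results.  A minimal scalar sketch -- the name cmpbge_ref
is illustrative, and the body is equivalent to the byte loop kept
under #else in the patch at the end of this message:

#include <stdint.h>

/* Bit i of the result is set iff byte i of op1 is >= byte i of op2,
   comparing as unsigned values.  */
static uint64_t cmpbge_ref(uint64_t op1, uint64_t op2)
{
    uint64_t res = 0;
    int i;

    for (i = 0; i < 8; i++) {
        uint8_t a = op1 >> (i * 8);
        uint8_t b = op2 >> (i * 8);
        if (a >= b) {
            res |= 1u << i;
        }
    }
    return res;
}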
I did some experiments in which I rewrote the function using gcc's
"generic" vector types and builtin operations.  Irritatingly, gcc
won't use a wider vector insn to implement a narrower operation, so I
needed to widen by hand in order to get vectorization for SSE2, but:

----------------------------------------------------------------------
uint64_t helper_cmpbge(uint64_t op1, uint64_t op2)
{
    typedef uint64_t Q __attribute__((vector_size(16)));
    typedef uint8_t B __attribute__((vector_size(16)));
    uint64_t r;

    Q q1 = (Q){ op1, 0 };
    Q q2 = (Q){ op2, 0 };

    q1 = (Q)((B)q1 >= (B)q2);
    r = q1[0];

    /* Select one bit from each byte and collect them into the
       bottom byte.  */
    r &= 0x0101010101010101;
    r |= r >> (8 - 1);
    r |= r >> (16 - 2);
    r |= r >> (32 - 4);
    return r & 0xff;
}
----------------------------------------------------------------------

allows very good optimization on x86_64:

0000000000000120 <helper_cmpbge>:
 120:  48 89 7c 24 e8          mov    %rdi,-0x18(%rsp)
 125:  48 b8 01 01 01 01 01    movabs $0x101010101010101,%rax
 12c:  01 01 01
 12f:  f3 0f 7e 5c 24 e8       movq   -0x18(%rsp),%xmm3
 135:  48 89 74 24 e8          mov    %rsi,-0x18(%rsp)
 13a:  f3 0f 7e 64 24 e8       movq   -0x18(%rsp),%xmm4
 140:  f3 0f 7e c3             movq   %xmm3,%xmm0
 144:  f3 0f 7e cc             movq   %xmm4,%xmm1
 148:  66 0f 6f d1             movdqa %xmm1,%xmm2
 14c:  66 0f d8 d0             psubusb %xmm0,%xmm2
 150:  66 0f ef c0             pxor   %xmm0,%xmm0
 154:  66 0f 74 c2             pcmpeqb %xmm2,%xmm0
 158:  66 0f 7f 44 24 e8       movdqa %xmm0,-0x18(%rsp)
 15e:  48 8b 54 24 e8          mov    -0x18(%rsp),%rdx
 163:  48 21 c2                and    %rax,%rdx
 166:  48 89 d0                mov    %rdx,%rax
 169:  48 c1 e8 07             shr    $0x7,%rax
 16d:  48 09 d0                or     %rdx,%rax
 170:  48 89 c2                mov    %rax,%rdx
 173:  48 c1 ea 0e             shr    $0xe,%rdx
 177:  48 09 c2                or     %rax,%rdx
 17a:  48 89 d0                mov    %rdx,%rax
 17d:  48 c1 e8 1c             shr    $0x1c,%rax
 181:  48 09 d0                or     %rdx,%rax
 184:  0f b6 c0                movzbl %al,%eax
 187:  c3                      retq

which is just about as good as you could hope for (modulo two extra
movq insns).  In a profile of a (guest) compilation of glibc,
helper_cmpbge is reduced from 3% to 0.8% of emulation time, and drops
from 7th to 11th in the ranking.

GCC doesn't do a half-bad job on other hosts either:

aarch64:

  b4:  4f000400  movi  v0.4s, #0x0
  b8:  4ea01c01  mov   v1.16b, v0.16b
  bc:  4e081c00  mov   v0.d[0], x0
  c0:  4e081c21  mov   v1.d[0], x1
  c4:  6e213c00  cmhs  v0.16b, v0.16b, v1.16b
  c8:  4e083c00  mov   x0, v0.d[0]
  cc:  9200c000  and   x0, x0, #0x101010101010101
  d0:  aa401c00  orr   x0, x0, x0, lsr #7
  d4:  aa403800  orr   x0, x0, x0, lsr #14
  d8:  aa407000  orr   x0, x0, x0, lsr #28
  dc:  53001c00  uxtb  w0, w0
  e0:  d65f03c0  ret

Of course aarch64 *does* have an 8-byte vector size that gcc knows
how to use.  If I adjust the patch above to use it, only the first
two insns are eliminated -- surely not a measurable difference.

power7:

  ...
  vcmpgtub 13,0,1
  vcmpequb 0,0,1
  xxlor    32,45,32
  ...

But I guess the larger question here is: how much of this should we
accept?

(0) Ignore this and do nothing?

(1) No general infrastructure.  Special-case this one insn with
    #ifdef __SSE2__ and ignore anything else.

(2) Put in just enough infrastructure to know if compiler support for
    general vectors is available, and then use it ad hoc when such
    functions are shown to be high on the profile?  (A sketch of such
    a check follows.)

(3) Put in more infrastructure and allow it to be used to implement
    most guest vector operations, possibly tidying their
    implementations?
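For (2), the infrastructure could start as a single feature-test
header.  A minimal sketch, assuming gcc >= 4.7 (where comparisons on
generic vector types became usable) or a newer clang; the macro and
typedef names are illustrative only, not existing QEMU interfaces:

#include <stdint.h>

/* Hypothetical probe: define HOST_HAS_GENERIC_VECTORS to 1 when the
   compiler supports __attribute__((vector_size)) together with
   whole-vector comparisons.  */
#if defined(__clang__) || \
    (defined(__GNUC__) && (__GNUC__ > 4 || \
                           (__GNUC__ == 4 && __GNUC_MINOR__ >= 7)))
# define HOST_HAS_GENERIC_VECTORS 1
typedef uint8_t  host_vec_b16 __attribute__((vector_size(16)));
typedef uint64_t host_vec_q16 __attribute__((vector_size(16)));
#else
# define HOST_HAS_GENERIC_VECTORS 0
#endif

Helpers would then test HOST_HAS_GENERIC_VECTORS and keep the scalar
loop as the fallback, rather than testing per-ISA macros like
__SSE2__ in each helper.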
r~

diff --git a/target-alpha/int_helper.c b/target-alpha/int_helper.c
index c023fa1..ec71c17 100644
--- a/target-alpha/int_helper.c
+++ b/target-alpha/int_helper.c
@@ -60,6 +60,42 @@ uint64_t helper_zap(uint64_t val, uint64_t mask)
 
 uint64_t helper_cmpbge(uint64_t op1, uint64_t op2)
 {
+#if 1
+    uint64_t r;
+
+    /* The cmpbge instruction is heavily used in the implementation of
+       every string function on Alpha.  We can do much better than either
+       the default loop below, or even an unrolled version by using the
+       native vector support.  */
+    {
+        typedef uint64_t Q __attribute__((vector_size(16)));
+        typedef uint8_t B __attribute__((vector_size(16)));
+
+        Q q1 = (Q){ op1, 0 };
+        Q q2 = (Q){ op2, 0 };
+
+        q1 = (Q)((B)q1 >= (B)q2);
+
+        r = q1[0];
+    }
+
+    /* Select only one bit from each byte.  */
+    r &= 0x0101010101010101;
+
+    /* Collect the bits into the bottom byte.  */
+    /* .......A.......B.......C.......D.......E.......F.......G.......H */
+    r |= r >> (8 - 1);
+
+    /* .......A......AB......BC......CD......DE......EF......FG......GH */
+    r |= r >> (16 - 2);
+
+    /* .......A......AB.....ABC....ABCD....BCDE....CDEF....DEFG....EFGH */
+    r |= r >> (32 - 4);
+
+    /* .......A......AB.....ABC....ABCD...ABCDE..ABCDEF.ABCDEFGABCDEFGH */
+    /* Return only the low 8 bits.  */
+    return r & 0xff;
+#else
     uint8_t opa, opb, res;
     int i;
 
@@ -72,6 +108,7 @@ uint64_t helper_cmpbge(uint64_t op1, uint64_t op2)
         }
     }
     return res;
+#endif
 }
 
 uint64_t helper_minub8(uint64_t op1, uint64_t op2)
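To sanity-check the rewrite against the scalar semantics, here is a
hypothetical standalone harness; both function bodies are copies of
code shown earlier in this message, and none of these names exist in
the tree:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint64_t Q __attribute__((vector_size(16)));
typedef uint8_t B __attribute__((vector_size(16)));

/* Copy of the generic-vector version from the patch above.  */
static uint64_t cmpbge_vec(uint64_t op1, uint64_t op2)
{
    uint64_t r;
    Q q1 = (Q){ op1, 0 };
    Q q2 = (Q){ op2, 0 };

    q1 = (Q)((B)q1 >= (B)q2);
    r = q1[0];

    r &= 0x0101010101010101;
    r |= r >> (8 - 1);
    r |= r >> (16 - 2);
    r |= r >> (32 - 4);
    return r & 0xff;
}

/* Copy of the scalar reference sketched near the top of this message.  */
static uint64_t cmpbge_ref(uint64_t op1, uint64_t op2)
{
    uint64_t res = 0;
    int i;

    for (i = 0; i < 8; i++) {
        uint8_t a = op1 >> (i * 8);
        uint8_t b = op2 >> (i * 8);
        if (a >= b) {
            res |= 1u << i;
        }
    }
    return res;
}

int main(void)
{
    int i;

    srand(1);  /* fixed seed for reproducibility */
    for (i = 0; i < 1000000; i++) {
        uint64_t a = ((uint64_t)rand() << 33) ^ ((uint64_t)rand() << 11)
                     ^ (uint64_t)rand();
        uint64_t b = ((uint64_t)rand() << 33) ^ ((uint64_t)rand() << 11)
                     ^ (uint64_t)rand();
        assert(cmpbge_vec(a, b) == cmpbge_ref(a, b));
    }
    printf("cmpbge: vector and scalar versions agree\n");
    return 0;
}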