From patchwork Wed Aug 18 21:49:51 2010
X-Patchwork-Submitter: Richard Henderson
X-Patchwork-Id: 62091
Message-ID: <4C6C557F.7050806@redhat.com>
In-Reply-To: <4C6C39DB.8070409@redhat.com>
Date: Wed, 18 Aug 2010 14:49:51 -0700
From: Richard Henderson
To: luisgpm@linux.vnet.ibm.com
CC: GCC Patches, meissner@linux.vnet.ibm.com
Subject: Re: [CFT, v4] Vectorized _cpp_clean_line

On 08/18/2010 12:51 PM, Richard Henderson wrote:
> Before:
>
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  ms/call  ms/call  name
>  29.41      0.05     0.05   168709     0.00     0.00  ._cpp_clean_line
>  17.65      0.08     0.03   606267     0.00     0.00  ._cpp_lex_direct
>  11.76      0.10     0.02   168081     0.00     0.00  .linemap_line_start
>  11.76      0.12     0.02                             .variably_modified_type_p
>   5.88      0.13     0.01   503900     0.00     0.00  ._cpp_lex_token
>
> After:
>
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  ms/call  ms/call  name
>  20.83      0.05     0.05   606267     0.00     0.00  ._cpp_lex_direct
>  20.83      0.10     0.05   228345     0.00     0.00  .ht_lookup_with_hash
>  16.67      0.14     0.04                             .variably_modified_type_p
>   4.17      0.15     0.01   503900     0.00     0.00  ._cpp_lex_token
>   4.17      0.16     0.01   304199     0.00     0.00  ._cpp_lex_identifier
>   4.17      0.17     0.01   304199     0.00     0.00  .cpp_token_as_text
>   4.17      0.18     0.01   168709     0.00     0.00  ._cpp_clean_line
>
> Note that ._cpp_clean_line is about 5 times faster.
>
> Is the cpu you're testing on (power7 or what?) just that much
> better with the original integer code?

Not to overload you too much, but here's an incremental patch to test
as well.
You do have to add "-maltivec" to {BOOT,STAGE}_CFLAGS at the moment,
since the ppc backend does not yet support the kind of target mixing
and matching that the i386 port does.

On that same G5 I get:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 37.50      0.06     0.06   606267     0.00     0.00  ._cpp_lex_direct
 12.50      0.08     0.02   504424     0.00     0.00  .cpp_get_token_with_location
 12.50      0.10     0.02   503900     0.00     0.00  ._cpp_lex_token
 12.50      0.12     0.02                             .variably_modified_type_p
  6.25      0.13     0.01   457580     0.00     0.00  .cpp_output_token
  6.25      0.14     0.01   217473     0.00     0.00  .init_pp_output
  6.25      0.15     0.01    40842     0.00     0.00  ._cpp_test_assertion
  6.25      0.16     0.01        1    10.00   140.00  .preprocess_file
  0.00      0.16     0.00   633800     0.00     0.00  .cpp_get_token
  0.00      0.16     0.00   505969     0.00     0.00  .linemap_lookup
  0.00      0.16     0.00   304199     0.00     0.00  ._cpp_lex_identifier
  0.00      0.16     0.00   304199     0.00     0.00  .cpp_token_as_text
  0.00      0.16     0.00   228345     0.00     0.00  .ht_lookup_with_hash
  0.00      0.16     0.00   180752     0.00     0.00  ._cpp_get_fresh_line
  0.00      0.16     0.00   173870     0.00     0.00  ._cpp_init_tokenrun
  0.00      0.16     0.00   168709     0.00     0.00  ._cpp_clean_line

I.e. _cpp_clean_line has essentially vanished off the radar.  It seems
likely that oprofile across an entire bootstrap stage would be better
able to pick out how much time it really takes.


r~

diff --git a/libcpp/lex.c b/libcpp/lex.c
index 8e56784..1e8e847 100644
--- a/libcpp/lex.c
+++ b/libcpp/lex.c
@@ -220,7 +220,7 @@ acc_char_index (word_type cmp ATTRIBUTE_UNUSED,
    and branches without increasing the number of arithmetic operations.
    It's almost certainly going to be a win with 64-bit word size.  */
 
-static bool
+static bool ATTRIBUTE_UNUSED
 search_line_acc_char (const uchar *s, const uchar *end, const uchar **out)
 {
   const word_type repl_nl = acc_char_replicate ('\n');
@@ -497,6 +497,116 @@ search_line_fast (const uchar *s, const uchar *end, const uchar **out)
   return search_line_acc_char (s, end, out);
 }
 
+#elif defined(__ALTIVEC__)
+
+static bool
+search_line_fast (const uchar *s, const uchar *end, const uchar **out)
+{
+  typedef __vector unsigned char vc;
+
+  const vc repl_nl = {
+    '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n',
+    '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n'
+  };
+  const vc repl_cr = {
+    '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r',
+    '\r', '\r', '\r', '\r', '\r', '\r', '\r', '\r'
+  };
+  const vc repl_bs = {
+    '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\',
+    '\\', '\\', '\\', '\\', '\\', '\\', '\\', '\\'
+  };
+  const vc repl_qm = {
+    '?', '?', '?', '?', '?', '?', '?', '?',
+    '?', '?', '?', '?', '?', '?', '?', '?',
+  };
+  const vc ones = {
+    -1, -1, -1, -1, -1, -1, -1, -1,
+    -1, -1, -1, -1, -1, -1, -1, -1,
+  };
+  const vc zero = { 0 };
+
+  ptrdiff_t left;
+  vc data, vsl, t;
+
+  left = end - s;
+  data = __builtin_vec_ld(0, (const vc *)s);
+  vsl = __builtin_vec_lvsr(0, s);
+  t = __builtin_vec_perm(zero, ones, vsl);
+  data &= t;
+
+  left += (uintptr_t)s & 15;
+  s = (const uchar *)((uintptr_t)s & -16);
+  goto start;
+
+  do
+    {
+      vc m_nl, m_cr, m_bs, m_qm;
+
+      left -= 16;
+      s += 16;
+      if (__builtin_expect (left <= 0, 0))
+        {
+          *out = s;
+          return false;
+        }
+      data = __builtin_vec_ld(0, (const vc *)s);
+
+    start:
+      m_nl = (vc) __builtin_vec_cmpeq(data, repl_nl);
+      m_cr = (vc) __builtin_vec_cmpeq(data, repl_cr);
+      m_bs = (vc) __builtin_vec_cmpeq(data, repl_bs);
+      m_qm = (vc) __builtin_vec_cmpeq(data, repl_qm);
+      t = (m_nl | m_cr) | (m_bs | m_qm);
+    }
+  while (!__builtin_vec_vcmpeq_p(/*__CR6_LT_REV*/3, t, zero));
+
+  /* A match somewhere.  Scan T for the match.  */
+  {
+#define N  (sizeof(vc) / sizeof(long))
+
+    union {
+      vc v;
+      unsigned long l[N];
+    } u;
+    typedef char check_count[(N == 2 || N == 4) * 2 - 1];
+
+    unsigned long l, i = 0;
+
+    u.v = t;
+    switch (N)
+      {
+      case 4:
+        l = u.l[i++];
+        if (l != 0)
+          break;
+        s += sizeof(unsigned long);
+        l = u.l[i++];
+        if (l != 0)
+          break;
+        s += sizeof(unsigned long);
+      case 2:
+        l = u.l[i++];
+        if (l != 0)
+          break;
+        s += sizeof(unsigned long);
+        l = u.l[i];
+      }
+
+    l = __builtin_clzl(l);
+    l /= 8;
+    *out = s + l;
+    return true;
+
+#undef N
+  }
+}
+
+void
+init_vectorized_lexer (void)
+{
+}
+
 #else
 /* We only have one accellerated alternative.  Use a direct call so that