From patchwork Fri Apr 22 01:52:27 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 1620546 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: bilbo.ozlabs.org; dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.a=rsa-sha256 header.s=default header.b=S5tPH4tD; dkim-atps=neutral Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=sourceware.org (client-ip=8.43.85.97; helo=sourceware.org; envelope-from=libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org; receiver=) Received: from sourceware.org (ip-8-43-85-97.sourceware.org [8.43.85.97]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by bilbo.ozlabs.org (Postfix) with ESMTPS id 4Kky9C5gg8z9s0w for ; Fri, 22 Apr 2022 11:53:02 +1000 (AEST) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 24A7F3857359 for ; Fri, 22 Apr 2022 01:53:00 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 24A7F3857359 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1650592380; bh=4fWP89Tizyd7BdLvN05iGONORiynP373ZJkxQVG0h1U=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; b=S5tPH4tDZxB5B/Qem5RWpA2YfHHMi5dhz5mR16iZ+MFThB19pk4JRjPm8pDv4dUQZ 4o86CdThHbuJZ0Cq5XGSqlxOMV+/+YE3+b37hhuTzqBZMz+11PDtisEHLtIGD6YLob uFs6DRR6q0LvV+K/xh1kjDvJHemF4NHUvjyr53vs= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-qk1-x72e.google.com (mail-qk1-x72e.google.com [IPv6:2607:f8b0:4864:20::72e]) by sourceware.org (Postfix) with ESMTPS id 8359C3857820 for ; Fri, 22 Apr 2022 01:52:36 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 8359C3857820 Received: by mail-qk1-x72e.google.com with SMTP id e128so4912230qkd.7 for ; Thu, 21 Apr 2022 18:52:36 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=4fWP89Tizyd7BdLvN05iGONORiynP373ZJkxQVG0h1U=; b=GpcWlhTQgANKPqlRcAIAZtSxEP6QDR1V2huLzjxCgwVYBRjO8az0b0TtHiip1yEpOM kg18apv2cPDXmJ8KpJoLkDXrSIm8z1U4CkR6mYl9blmo55u7N1wN7TuTKYCRWacaiosZ a146jRfg+POg9RJ20toHpgcHB8Z2sNqUo09iZyVLWeJkfxlRpIyy7knlL1Egp1Mig5IZ kygberc1bifKL4MNe9DCZr/zp6aBHmSHgm/dKPL9dZIp9Zvda7IoJOnz97xT5M5gZUrk 0Qb9/1JFGQ0u8MGvd7RW9ue2tJ18qBSReDFXwZ2H/HrIxqbacIeKnX767aR0RCwT6xXK hfYw== X-Gm-Message-State: AOAM5336QVN7ypUIN22gmlHqzu1YqnuuTS2qtHIovOl2dXGxnbMe082r GTchh5Rx8Kp6Q7S2RKSLwSAzCWk4Vm4= X-Google-Smtp-Source: ABdhPJw3jyJOADA7IeeNornTeedJJOWVfMnIG4bjbx+tx+LZLWGCzG0TxD4gkXwgAMVqSJg0OHCPEA== X-Received: by 2002:a37:a552:0:b0:69f:10de:ad8 with SMTP id o79-20020a37a552000000b0069f10de0ad8mr1420349qke.347.1650592355739; Thu, 21 Apr 2022 18:52:35 -0700 (PDT) Received: from localhost.localdomain ([173.245.203.170]) by smtp.googlemail.com with ESMTPSA id n11-20020a05622a11cb00b002f344f11849sm433268qtk.71.2022.04.21.18.52.35 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 21 Apr 2022 18:52:35 -0700 (PDT) To: libc-alpha@sourceware.org Subject: [PATCH v3 1/4] benchtests: Improve bench-strrchr Date: Thu, 21 Apr 2022 20:52:27 -0500 Message-Id: <20220422015230.3241772-1-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220421031410.2142238-1-goldstein.w.n@gmail.com> References: <20220421031410.2142238-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-11.9 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Noah Goldstein via Libc-alpha From: Noah Goldstein Reply-To: Noah Goldstein Errors-To: libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org Sender: "Libc-alpha" 1. Use json-lib for printing results. 2. Expose all parameters (before pos, seek_char, and max_char where not printed). 3. Add benchmarks that test multiple occurence of seek_char in the string. --- benchtests/bench-strrchr.c | 124 ++++++++++++++++++++++++------------- 1 file changed, 80 insertions(+), 44 deletions(-) diff --git a/benchtests/bench-strrchr.c b/benchtests/bench-strrchr.c index abdae60c51..ce4307a098 100644 --- a/benchtests/bench-strrchr.c +++ b/benchtests/bench-strrchr.c @@ -23,6 +23,7 @@ # define TEST_NAME "strrchr" #endif #include "bench-string.h" +#include "json-lib.h" #define BIG_CHAR MAX_CHAR @@ -53,7 +54,8 @@ SIMPLE_STRRCHR (const CHAR *s, int c) } static void -do_one_test (impl_t *impl, const CHAR *s, int c, CHAR *exp_res) +do_one_test (json_ctx_t *json_ctx, impl_t *impl, const CHAR *s, int c, + CHAR *exp_res) { CHAR *res = CALL (impl, s, c); size_t i, iters = INNER_LOOP_ITERS8; @@ -61,8 +63,8 @@ do_one_test (impl_t *impl, const CHAR *s, int c, CHAR *exp_res) if (res != exp_res) { - error (0, 0, "Wrong result in function %s %p %p", impl->name, - res, exp_res); + error (0, 0, "Wrong result in function %s %p %p", impl->name, res, + exp_res); ret = 1; return; } @@ -73,23 +75,25 @@ do_one_test (impl_t *impl, const CHAR *s, int c, CHAR *exp_res) CALL (impl, s, c); } TIMING_NOW (stop); - TIMING_DIFF (cur, start, stop); - TIMING_PRINT_MEAN ((double) cur, (double) iters); + json_element_double (json_ctx, (double) cur / (double) iters); } static void -do_test (size_t align, size_t pos, size_t len, int seek_char, int max_char) +do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, + int seek_char, int max_char, size_t freq) /* For wcsrchr: align here means align not in bytes, but in wchar_ts, in bytes it will equal to align * (sizeof (wchar_t)) len for wcschr here isn't in bytes but it's number of wchar_t symbols. */ { size_t i; + size_t pos_chunk_sz = freq ? (pos / freq) : pos; + size_t last_pos = len; CHAR *result; CHAR *buf = (CHAR *) buf1; - align &= 7; + align &= (getpagesize () - 1); if ((align + len) * sizeof (CHAR) >= page_size) return; @@ -103,6 +107,16 @@ do_test (size_t align, size_t pos, size_t len, int seek_char, int max_char) if ((i > pos || pos >= len) && buf[align + i] == seek_char) buf[align + i] = seek_char + 10 + (random () & 15); } + + if (pos_chunk_sz == 0 && pos) + pos_chunk_sz = 1; + + for (i = pos_chunk_sz; i < pos && i < len; i += pos_chunk_sz) + { + buf[align + i] = seek_char; + last_pos = i; + } + buf[align + len] = 0; if (pos < len) @@ -110,66 +124,88 @@ do_test (size_t align, size_t pos, size_t len, int seek_char, int max_char) buf[align + pos] = seek_char; result = (CHAR *) (buf + align + pos); } + else if (last_pos < len) + result = (CHAR *) (buf + align + last_pos); else if (seek_char == 0) result = (CHAR *) (buf + align + len); else result = NULL; - printf ("Length %4zd, alignment in bytes %2zd:", len, align * sizeof (CHAR)); + json_element_object_begin (json_ctx); + json_attr_uint (json_ctx, "len", len); + json_attr_uint (json_ctx, "pos", pos); + json_attr_uint (json_ctx, "align", align); + json_attr_uint (json_ctx, "freq", freq); + json_attr_uint (json_ctx, "seek", seek_char); + json_attr_uint (json_ctx, "max_char", max_char); + json_array_begin (json_ctx, "timings"); FOR_EACH_IMPL (impl, 0) - do_one_test (impl, (CHAR *) (buf + align), seek_char, result); + do_one_test (json_ctx, impl, (CHAR *) (buf + align), seek_char, result); - putchar ('\n'); + json_array_end (json_ctx); + json_element_object_end (json_ctx); } int test_main (void) { - size_t i; + json_ctx_t json_ctx; + size_t i, j; + int seek; test_init (); + json_init (&json_ctx, 0, stdout); - printf ("%20s", ""); - FOR_EACH_IMPL (impl, 0) - printf ("\t%s", impl->name); - putchar ('\n'); - - for (i = 1; i < 8; ++i) - { - do_test (0, 16 << i, 2048, 23, SMALL_CHAR); - do_test (i, 16 << i, 2048, 23, SMALL_CHAR); - } + json_document_begin (&json_ctx); + json_attr_string (&json_ctx, "timing_type", TIMING_TYPE); - for (i = 1; i < 8; ++i) - { - do_test (i, 64, 256, 23, SMALL_CHAR); - do_test (i, 64, 256, 23, BIG_CHAR); - } + json_attr_object_begin (&json_ctx, "functions"); + json_attr_object_begin (&json_ctx, TEST_NAME); + json_attr_string (&json_ctx, "bench-variant", ""); - for (i = 0; i < 32; ++i) - { - do_test (0, i, i + 1, 23, SMALL_CHAR); - do_test (0, i, i + 1, 23, BIG_CHAR); - } + json_array_begin (&json_ctx, "ifuncs"); + FOR_EACH_IMPL (impl, 0) + json_element_string (&json_ctx, impl->name); + json_array_end (&json_ctx); - for (i = 1; i < 8; ++i) - { - do_test (0, 16 << i, 2048, 0, SMALL_CHAR); - do_test (i, 16 << i, 2048, 0, SMALL_CHAR); - } + json_array_begin (&json_ctx, "results"); - for (i = 1; i < 8; ++i) + for (seek = 0; seek <= 23; seek += 23) { - do_test (i, 64, 256, 0, SMALL_CHAR); - do_test (i, 64, 256, 0, BIG_CHAR); + for (j = 1; j < 32; j += j) + { + for (i = 1; i < 9; ++i) + { + do_test (&json_ctx, 0, 16 << i, 2048, seek, SMALL_CHAR, j); + do_test (&json_ctx, i, 16 << i, 2048, seek, SMALL_CHAR, j); + } + + for (i = 1; i < 8; ++i) + { + do_test (&json_ctx, i, 64, 256, seek, SMALL_CHAR, j); + do_test (&json_ctx, i, 64, 256, seek, BIG_CHAR, j); + + do_test (&json_ctx, i * 15, 64, 256, seek, SMALL_CHAR, j); + do_test (&json_ctx, i * 15, 64, 256, seek, BIG_CHAR, j); + } + + for (i = 0; i < 32; ++i) + { + do_test (&json_ctx, 0, i, i + 1, seek, SMALL_CHAR, j); + do_test (&json_ctx, 0, i, i + 1, seek, BIG_CHAR, j); + } + if (seek == 0) + { + break; + } + } } - for (i = 0; i < 32; ++i) - { - do_test (0, i, i + 1, 0, SMALL_CHAR); - do_test (0, i, i + 1, 0, BIG_CHAR); - } + json_array_end (&json_ctx); + json_attr_object_end (&json_ctx); + json_attr_object_end (&json_ctx); + json_document_end (&json_ctx); return ret; } From patchwork Fri Apr 22 01:52:28 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 1620547 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: bilbo.ozlabs.org; dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.a=rsa-sha256 header.s=default header.b=tbbTn/k5; dkim-atps=neutral Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=sourceware.org (client-ip=2620:52:3:1:0:246e:9693:128c; helo=sourceware.org; envelope-from=libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org; receiver=) Received: from sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by bilbo.ozlabs.org (Postfix) with ESMTPS id 4KkyB51Zvfz9s0w for ; Fri, 22 Apr 2022 11:53:49 +1000 (AEST) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id ABAFF3856253 for ; Fri, 22 Apr 2022 01:53:46 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org ABAFF3856253 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1650592426; bh=K1mFFuGSwdtO4lTvTf7qnrZCrSHeuIRwzDF6Xl7IQg4=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; b=tbbTn/k5XlqAb4ou0jvOkAnwB/2d94gRRtjhUbKY715SgvIiRobOJeLXH6dhHxvh/ B5ZE/j8buyM6KgaF4Kg7fBtLxGf8fV8BjRd5ULpsVs6S5TDuqSRRCJ2mcfsi6pq3sN OiV/0KNLrGXrxuHYbCJNbhHShf0VURWxHUBWx6/Q= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-qk1-x735.google.com (mail-qk1-x735.google.com [IPv6:2607:f8b0:4864:20::735]) by sourceware.org (Postfix) with ESMTPS id 73BFB3857359 for ; Fri, 22 Apr 2022 01:52:38 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 73BFB3857359 Received: by mail-qk1-x735.google.com with SMTP id 204so4920868qkg.5 for ; Thu, 21 Apr 2022 18:52:38 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=K1mFFuGSwdtO4lTvTf7qnrZCrSHeuIRwzDF6Xl7IQg4=; b=8RyrNxmUSdXEyJUJsNwDZXHsVOEIhTUiuumfm85nbwnuuhyw86+CsoSof0zPW0JYI0 c1whOMpy5JvNWyTxshzfgzTRcuJr+6oXgY3SpePu1dRiR4mzOSpqAynvdHR4pT2tH1qK z4ZqAkeb46pBmz8dUr9ee6abBARFnCDwfkuel4j7xl9kaysT+T1KYvpySIx5qkgwU4mm 3YsQV3BSYQBGm1wSjCiwHug8NJf6/1NWPfBJwBtCiQhwr/q3N+eUTuXvW1DKfOsysWVM rh47DYXjGWriDRmW3XiAgPMyrUnkbh7gNikFGE7g+ZZDVhavFChw2sQSeWAgKkcCE6sa OZFg== X-Gm-Message-State: AOAM531Si+6bzdaqLYnWcB2wUixMVTX52H4+tWxOyiDAfnzSTwglAT26 0X21SvKkLpk8sZJ+Ze9UgawfQK7Q9ak= X-Google-Smtp-Source: ABdhPJxATuxlPSn4L+J7pjwkr3/w0xxT2CL7VfB5KxY9/5ESm+ZWPEN9POuAe3sLIxOt/fKmAiKtNw== X-Received: by 2002:a37:98c4:0:b0:69a:e14:16a2 with SMTP id a187-20020a3798c4000000b0069a0e1416a2mr1393501qke.610.1650592357512; Thu, 21 Apr 2022 18:52:37 -0700 (PDT) Received: from localhost.localdomain ([173.245.203.170]) by smtp.googlemail.com with ESMTPSA id n11-20020a05622a11cb00b002f344f11849sm433268qtk.71.2022.04.21.18.52.36 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 21 Apr 2022 18:52:37 -0700 (PDT) To: libc-alpha@sourceware.org Subject: [PATCH v3 2/4] x86: Optimize {str|wcs}rchr-sse2 Date: Thu, 21 Apr 2022 20:52:28 -0500 Message-Id: <20220422015230.3241772-2-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220422015230.3241772-1-goldstein.w.n@gmail.com> References: <20220421031410.2142238-1-goldstein.w.n@gmail.com> <20220422015230.3241772-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-10.6 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, KAM_NUMSUBJECT, KAM_SHORT, RCVD_IN_DNSWL_NONE, SCC_10_SHORT_WORD_LINES, SCC_5_SHORT_WORD_LINES, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Noah Goldstein via Libc-alpha From: Noah Goldstein Reply-To: Noah Goldstein Errors-To: libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org Sender: "Libc-alpha" The new code unrolls the main loop slightly without adding too much overhead and minimizes the comparisons for the search CHAR. Geometric Mean of all benchmarks New / Old: 0.741 See email for all results. Full xcheck passes on x86_64 with and without multiarch enabled. Reviewed-by: H.J. Lu --- sysdeps/x86_64/multiarch/strrchr-sse2.S | 2 +- sysdeps/x86_64/multiarch/wcsrchr-sse2.S | 3 +- sysdeps/x86_64/strrchr.S | 510 +++++++++++++++--------- sysdeps/x86_64/wcsrchr.S | 268 +------------ 4 files changed, 339 insertions(+), 444 deletions(-) diff --git a/sysdeps/x86_64/multiarch/strrchr-sse2.S b/sysdeps/x86_64/multiarch/strrchr-sse2.S index db1b44c23c..866396e947 100644 --- a/sysdeps/x86_64/multiarch/strrchr-sse2.S +++ b/sysdeps/x86_64/multiarch/strrchr-sse2.S @@ -17,7 +17,7 @@ . */ #if IS_IN (libc) -# define strrchr __strrchr_sse2 +# define STRRCHR __strrchr_sse2 # undef weak_alias # define weak_alias(strrchr, rindex) diff --git a/sysdeps/x86_64/multiarch/wcsrchr-sse2.S b/sysdeps/x86_64/multiarch/wcsrchr-sse2.S index 78d1ca6553..69d2f3cdb1 100644 --- a/sysdeps/x86_64/multiarch/wcsrchr-sse2.S +++ b/sysdeps/x86_64/multiarch/wcsrchr-sse2.S @@ -17,7 +17,6 @@ . */ #if IS_IN (libc) -# define wcsrchr __wcsrchr_sse2 +# define STRRCHR __wcsrchr_sse2 #endif - #include "../wcsrchr.S" diff --git a/sysdeps/x86_64/strrchr.S b/sysdeps/x86_64/strrchr.S index 50d886713e..4d7ba4ceb2 100644 --- a/sysdeps/x86_64/strrchr.S +++ b/sysdeps/x86_64/strrchr.S @@ -19,210 +19,360 @@ #include +#ifndef STRRCHR +# define STRRCHR strrchr +#endif + +#ifdef USE_AS_WCSRCHR +# define PCMPEQ pcmpeqd +# define CHAR_SIZE 4 +# define PMINU pminud +#else +# define PCMPEQ pcmpeqb +# define CHAR_SIZE 1 +# define PMINU pminub +#endif + +#define PAGE_SIZE 4096 +#define VEC_SIZE 16 + .text -ENTRY (strrchr) - movd %esi, %xmm1 +ENTRY(STRRCHR) + movd %esi, %xmm0 movq %rdi, %rax - andl $4095, %eax - punpcklbw %xmm1, %xmm1 - cmpq $4032, %rax - punpcklwd %xmm1, %xmm1 - pshufd $0, %xmm1, %xmm1 + andl $(PAGE_SIZE - 1), %eax +#ifndef USE_AS_WCSRCHR + punpcklbw %xmm0, %xmm0 + punpcklwd %xmm0, %xmm0 +#endif + pshufd $0, %xmm0, %xmm0 + cmpl $(PAGE_SIZE - VEC_SIZE), %eax ja L(cross_page) - movdqu (%rdi), %xmm0 + +L(cross_page_continue): + movups (%rdi), %xmm1 pxor %xmm2, %xmm2 - movdqa %xmm0, %xmm3 - pcmpeqb %xmm1, %xmm0 - pcmpeqb %xmm2, %xmm3 - pmovmskb %xmm0, %ecx - pmovmskb %xmm3, %edx - testq %rdx, %rdx - je L(next_48_bytes) - leaq -1(%rdx), %rax - xorq %rdx, %rax - andq %rcx, %rax - je L(exit) - bsrq %rax, %rax + PCMPEQ %xmm1, %xmm2 + pmovmskb %xmm2, %ecx + testl %ecx, %ecx + jz L(aligned_more) + + PCMPEQ %xmm0, %xmm1 + pmovmskb %xmm1, %eax + leal -1(%rcx), %edx + xorl %edx, %ecx + andl %ecx, %eax + jz L(ret0) + bsrl %eax, %eax addq %rdi, %rax + /* We are off by 3 for wcsrchr if search CHAR is non-zero. If + search CHAR is zero we are correct. Either way `andq + -CHAR_SIZE, %rax` gets the correct result. */ +#ifdef USE_AS_WCSRCHR + andq $-CHAR_SIZE, %rax +#endif +L(ret0): ret + /* Returns for first vec x1/x2 have hard coded backward search + path for earlier matches. */ .p2align 4 -L(next_48_bytes): - movdqu 16(%rdi), %xmm4 - movdqa %xmm4, %xmm5 - movdqu 32(%rdi), %xmm3 - pcmpeqb %xmm1, %xmm4 - pcmpeqb %xmm2, %xmm5 - movdqu 48(%rdi), %xmm0 - pmovmskb %xmm5, %edx - movdqa %xmm3, %xmm5 - pcmpeqb %xmm1, %xmm3 - pcmpeqb %xmm2, %xmm5 - pcmpeqb %xmm0, %xmm2 - salq $16, %rdx - pmovmskb %xmm3, %r8d - pmovmskb %xmm5, %eax - pmovmskb %xmm2, %esi - salq $32, %r8 - salq $32, %rax - pcmpeqb %xmm1, %xmm0 - orq %rdx, %rax - movq %rsi, %rdx - pmovmskb %xmm4, %esi - salq $48, %rdx - salq $16, %rsi - orq %r8, %rsi - orq %rcx, %rsi - pmovmskb %xmm0, %ecx - salq $48, %rcx - orq %rcx, %rsi - orq %rdx, %rax - je L(loop_header2) - leaq -1(%rax), %rcx - xorq %rax, %rcx - andq %rcx, %rsi - je L(exit) - bsrq %rsi, %rsi - leaq (%rdi,%rsi), %rax +L(first_vec_x0_test): + PCMPEQ %xmm0, %xmm1 + pmovmskb %xmm1, %eax + testl %eax, %eax + jz L(ret0) + bsrl %eax, %eax + addq %r8, %rax +#ifdef USE_AS_WCSRCHR + andq $-CHAR_SIZE, %rax +#endif ret .p2align 4 -L(loop_header2): - testq %rsi, %rsi - movq %rdi, %rcx - je L(no_c_found) -L(loop_header): - addq $64, %rdi - pxor %xmm7, %xmm7 - andq $-64, %rdi - jmp L(loop_entry) +L(first_vec_x1): + PCMPEQ %xmm0, %xmm2 + pmovmskb %xmm2, %eax + leal -1(%rcx), %edx + xorl %edx, %ecx + andl %ecx, %eax + jz L(first_vec_x0_test) + bsrl %eax, %eax + leaq (VEC_SIZE)(%rdi, %rax), %rax +#ifdef USE_AS_WCSRCHR + andq $-CHAR_SIZE, %rax +#endif + ret .p2align 4 -L(loop64): - testq %rdx, %rdx - cmovne %rdx, %rsi - cmovne %rdi, %rcx - addq $64, %rdi -L(loop_entry): - movdqa 32(%rdi), %xmm3 - pxor %xmm6, %xmm6 - movdqa 48(%rdi), %xmm2 - movdqa %xmm3, %xmm0 - movdqa 16(%rdi), %xmm4 - pminub %xmm2, %xmm0 - movdqa (%rdi), %xmm5 - pminub %xmm4, %xmm0 - pminub %xmm5, %xmm0 - pcmpeqb %xmm7, %xmm0 - pmovmskb %xmm0, %eax - movdqa %xmm5, %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %r9d - movdqa %xmm4, %xmm0 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %edx - movdqa %xmm3, %xmm0 - pcmpeqb %xmm1, %xmm0 - salq $16, %rdx - pmovmskb %xmm0, %r10d - movdqa %xmm2, %xmm0 - pcmpeqb %xmm1, %xmm0 - salq $32, %r10 - orq %r10, %rdx - pmovmskb %xmm0, %r8d - orq %r9, %rdx - salq $48, %r8 - orq %r8, %rdx +L(first_vec_x1_test): + PCMPEQ %xmm0, %xmm2 + pmovmskb %xmm2, %eax testl %eax, %eax - je L(loop64) - pcmpeqb %xmm6, %xmm4 - pcmpeqb %xmm6, %xmm3 - pcmpeqb %xmm6, %xmm5 - pmovmskb %xmm4, %eax - pmovmskb %xmm3, %r10d - pcmpeqb %xmm6, %xmm2 - pmovmskb %xmm5, %r9d - salq $32, %r10 - salq $16, %rax - pmovmskb %xmm2, %r8d - orq %r10, %rax - orq %r9, %rax - salq $48, %r8 - orq %r8, %rax - leaq -1(%rax), %r8 - xorq %rax, %r8 - andq %r8, %rdx - cmovne %rdi, %rcx - cmovne %rdx, %rsi - bsrq %rsi, %rsi - leaq (%rcx,%rsi), %rax + jz L(first_vec_x0_test) + bsrl %eax, %eax + leaq (VEC_SIZE)(%rdi, %rax), %rax +#ifdef USE_AS_WCSRCHR + andq $-CHAR_SIZE, %rax +#endif + ret + + .p2align 4 +L(first_vec_x2): + PCMPEQ %xmm0, %xmm3 + pmovmskb %xmm3, %eax + leal -1(%rcx), %edx + xorl %edx, %ecx + andl %ecx, %eax + jz L(first_vec_x1_test) + bsrl %eax, %eax + leaq (VEC_SIZE * 2)(%rdi, %rax), %rax +#ifdef USE_AS_WCSRCHR + andq $-CHAR_SIZE, %rax +#endif + ret + + .p2align 4 +L(aligned_more): + /* Save original pointer if match was in VEC 0. */ + movq %rdi, %r8 + andq $-VEC_SIZE, %rdi + + movaps VEC_SIZE(%rdi), %xmm2 + pxor %xmm3, %xmm3 + PCMPEQ %xmm2, %xmm3 + pmovmskb %xmm3, %ecx + testl %ecx, %ecx + jnz L(first_vec_x1) + + movaps (VEC_SIZE * 2)(%rdi), %xmm3 + pxor %xmm4, %xmm4 + PCMPEQ %xmm3, %xmm4 + pmovmskb %xmm4, %ecx + testl %ecx, %ecx + jnz L(first_vec_x2) + + addq $VEC_SIZE, %rdi + /* Save pointer again before realigning. */ + movq %rdi, %rsi + andq $-(VEC_SIZE * 2), %rdi + .p2align 4 +L(first_loop): + /* Do 2x VEC at a time. */ + movaps (VEC_SIZE * 2)(%rdi), %xmm4 + movaps (VEC_SIZE * 3)(%rdi), %xmm5 + /* Since SSE2 no pminud so wcsrchr needs seperate logic for + detecting zero. Note if this is found to be a bottleneck it + may be worth adding an SSE4.1 wcsrchr implementation. */ +#ifdef USE_AS_WCSRCHR + movaps %xmm5, %xmm6 + pxor %xmm8, %xmm8 + + PCMPEQ %xmm8, %xmm5 + PCMPEQ %xmm4, %xmm8 + por %xmm5, %xmm8 +#else + movaps %xmm5, %xmm6 + PMINU %xmm4, %xmm5 +#endif + + movaps %xmm4, %xmm9 + PCMPEQ %xmm0, %xmm4 + PCMPEQ %xmm0, %xmm6 + movaps %xmm6, %xmm7 + por %xmm4, %xmm6 +#ifndef USE_AS_WCSRCHR + pxor %xmm8, %xmm8 + PCMPEQ %xmm5, %xmm8 +#endif + pmovmskb %xmm8, %ecx + pmovmskb %xmm6, %eax + + addq $(VEC_SIZE * 2), %rdi + /* Use `addl` 1) so we can undo it with `subl` and 2) it can + macro-fuse with `jz`. */ + addl %ecx, %eax + jz L(first_loop) + + /* Check if there is zero match. */ + testl %ecx, %ecx + jz L(second_loop_match) + + /* Check if there was a match in last iteration. */ + subl %ecx, %eax + jnz L(new_match) + +L(first_loop_old_match): + PCMPEQ %xmm0, %xmm2 + PCMPEQ %xmm0, %xmm3 + pmovmskb %xmm2, %ecx + pmovmskb %xmm3, %eax + addl %eax, %ecx + jz L(first_vec_x0_test) + /* NB: We could move this shift to before the branch and save a + bit of code size / performance on the fall through. The + branch leads to the null case which generally seems hotter + than char in first 3x VEC. */ + sall $16, %eax + orl %ecx, %eax + + bsrl %eax, %eax + addq %rsi, %rax +#ifdef USE_AS_WCSRCHR + andq $-CHAR_SIZE, %rax +#endif + ret + + .p2align 4 +L(new_match): + pxor %xmm6, %xmm6 + PCMPEQ %xmm9, %xmm6 + pmovmskb %xmm6, %eax + sall $16, %ecx + orl %eax, %ecx + + /* We can't reuse either of the old comparisons as since we mask + of zeros after first zero (instead of using the full + comparison) we can't gurantee no interference between match + after end of string and valid match. */ + pmovmskb %xmm4, %eax + pmovmskb %xmm7, %edx + sall $16, %edx + orl %edx, %eax + + leal -1(%ecx), %edx + xorl %edx, %ecx + andl %ecx, %eax + jz L(first_loop_old_match) + bsrl %eax, %eax + addq %rdi, %rax +#ifdef USE_AS_WCSRCHR + andq $-CHAR_SIZE, %rax +#endif ret + /* Save minimum state for getting most recent match. We can + throw out all previous work. */ .p2align 4 -L(no_c_found): - movl $1, %esi - xorl %ecx, %ecx - jmp L(loop_header) +L(second_loop_match): + movq %rdi, %rsi + movaps %xmm4, %xmm2 + movaps %xmm7, %xmm3 .p2align 4 -L(exit): - xorl %eax, %eax +L(second_loop): + movaps (VEC_SIZE * 2)(%rdi), %xmm4 + movaps (VEC_SIZE * 3)(%rdi), %xmm5 + /* Since SSE2 no pminud so wcsrchr needs seperate logic for + detecting zero. Note if this is found to be a bottleneck it + may be worth adding an SSE4.1 wcsrchr implementation. */ +#ifdef USE_AS_WCSRCHR + movaps %xmm5, %xmm6 + pxor %xmm8, %xmm8 + + PCMPEQ %xmm8, %xmm5 + PCMPEQ %xmm4, %xmm8 + por %xmm5, %xmm8 +#else + movaps %xmm5, %xmm6 + PMINU %xmm4, %xmm5 +#endif + + movaps %xmm4, %xmm9 + PCMPEQ %xmm0, %xmm4 + PCMPEQ %xmm0, %xmm6 + movaps %xmm6, %xmm7 + por %xmm4, %xmm6 +#ifndef USE_AS_WCSRCHR + pxor %xmm8, %xmm8 + PCMPEQ %xmm5, %xmm8 +#endif + + pmovmskb %xmm8, %ecx + pmovmskb %xmm6, %eax + + addq $(VEC_SIZE * 2), %rdi + /* Either null term or new occurence of CHAR. */ + addl %ecx, %eax + jz L(second_loop) + + /* No null term so much be new occurence of CHAR. */ + testl %ecx, %ecx + jz L(second_loop_match) + + + subl %ecx, %eax + jnz L(second_loop_new_match) + +L(second_loop_old_match): + pmovmskb %xmm2, %ecx + pmovmskb %xmm3, %eax + sall $16, %eax + orl %ecx, %eax + bsrl %eax, %eax + addq %rsi, %rax +#ifdef USE_AS_WCSRCHR + andq $-CHAR_SIZE, %rax +#endif ret .p2align 4 +L(second_loop_new_match): + pxor %xmm6, %xmm6 + PCMPEQ %xmm9, %xmm6 + pmovmskb %xmm6, %eax + sall $16, %ecx + orl %eax, %ecx + + /* We can't reuse either of the old comparisons as since we mask + of zeros after first zero (instead of using the full + comparison) we can't gurantee no interference between match + after end of string and valid match. */ + pmovmskb %xmm4, %eax + pmovmskb %xmm7, %edx + sall $16, %edx + orl %edx, %eax + + leal -1(%ecx), %edx + xorl %edx, %ecx + andl %ecx, %eax + jz L(second_loop_old_match) + bsrl %eax, %eax + addq %rdi, %rax +#ifdef USE_AS_WCSRCHR + andq $-CHAR_SIZE, %rax +#endif + ret + + .p2align 4,, 4 L(cross_page): - movq %rdi, %rax - pxor %xmm0, %xmm0 - andq $-64, %rax - movdqu (%rax), %xmm5 - movdqa %xmm5, %xmm6 - movdqu 16(%rax), %xmm4 - pcmpeqb %xmm1, %xmm5 - pcmpeqb %xmm0, %xmm6 - movdqu 32(%rax), %xmm3 - pmovmskb %xmm6, %esi - movdqa %xmm4, %xmm6 - movdqu 48(%rax), %xmm2 - pcmpeqb %xmm1, %xmm4 - pcmpeqb %xmm0, %xmm6 - pmovmskb %xmm6, %edx - movdqa %xmm3, %xmm6 - pcmpeqb %xmm1, %xmm3 - pcmpeqb %xmm0, %xmm6 - pcmpeqb %xmm2, %xmm0 - salq $16, %rdx - pmovmskb %xmm3, %r9d - pmovmskb %xmm6, %r8d - pmovmskb %xmm0, %ecx - salq $32, %r9 - salq $32, %r8 - pcmpeqb %xmm1, %xmm2 - orq %r8, %rdx - salq $48, %rcx - pmovmskb %xmm5, %r8d - orq %rsi, %rdx - pmovmskb %xmm4, %esi - orq %rcx, %rdx - pmovmskb %xmm2, %ecx - salq $16, %rsi - salq $48, %rcx - orq %r9, %rsi - orq %r8, %rsi - orq %rcx, %rsi + movq %rdi, %rsi + andq $-VEC_SIZE, %rsi + movaps (%rsi), %xmm1 + pxor %xmm2, %xmm2 + PCMPEQ %xmm1, %xmm2 + pmovmskb %xmm2, %edx movl %edi, %ecx - subl %eax, %ecx - shrq %cl, %rdx - shrq %cl, %rsi - testq %rdx, %rdx - je L(loop_header2) - leaq -1(%rdx), %rax - xorq %rdx, %rax - andq %rax, %rsi - je L(exit) - bsrq %rsi, %rax + andl $(VEC_SIZE - 1), %ecx + sarl %cl, %edx + jz L(cross_page_continue) + PCMPEQ %xmm0, %xmm1 + pmovmskb %xmm1, %eax + sarl %cl, %eax + leal -1(%rdx), %ecx + xorl %edx, %ecx + andl %ecx, %eax + jz L(ret1) + bsrl %eax, %eax addq %rdi, %rax +#ifdef USE_AS_WCSRCHR + andq $-CHAR_SIZE, %rax +#endif +L(ret1): ret -END (strrchr) +END(STRRCHR) -weak_alias (strrchr, rindex) -libc_hidden_builtin_def (strrchr) +#ifndef USE_AS_WCSRCHR + weak_alias (STRRCHR, rindex) + libc_hidden_builtin_def (STRRCHR) +#endif diff --git a/sysdeps/x86_64/wcsrchr.S b/sysdeps/x86_64/wcsrchr.S index 61552954de..2b80efc5ef 100644 --- a/sysdeps/x86_64/wcsrchr.S +++ b/sysdeps/x86_64/wcsrchr.S @@ -1,4 +1,4 @@ -/* wcsrchr with SSSE3 +/* wcsrchr optimized with SSE2. Copyright (C) 2011-2022 Free Software Foundation, Inc. This file is part of the GNU C Library. @@ -16,266 +16,12 @@ License along with the GNU C Library; if not, see . */ -#include - .text -ENTRY (wcsrchr) +#define USE_AS_WCSRCHR 1 +#define NO_PMINU 1 - movd %rsi, %xmm1 - mov %rdi, %rcx - punpckldq %xmm1, %xmm1 - pxor %xmm2, %xmm2 - punpckldq %xmm1, %xmm1 - and $63, %rcx - cmp $48, %rcx - ja L(crosscache) +#ifndef STRRCHR +# define STRRCHR wcsrchr +#endif - movdqu (%rdi), %xmm0 - pcmpeqd %xmm0, %xmm2 - pcmpeqd %xmm1, %xmm0 - pmovmskb %xmm2, %rcx - pmovmskb %xmm0, %rax - add $16, %rdi - - test %rax, %rax - jnz L(unaligned_match1) - - test %rcx, %rcx - jnz L(return_null) - - and $-16, %rdi - xor %r8, %r8 - jmp L(loop) - - .p2align 4 -L(unaligned_match1): - test %rcx, %rcx - jnz L(prolog_find_zero_1) - - mov %rax, %r8 - mov %rdi, %rsi - and $-16, %rdi - jmp L(loop) - - .p2align 4 -L(crosscache): - and $15, %rcx - and $-16, %rdi - pxor %xmm3, %xmm3 - movdqa (%rdi), %xmm0 - pcmpeqd %xmm0, %xmm3 - pcmpeqd %xmm1, %xmm0 - pmovmskb %xmm3, %rdx - pmovmskb %xmm0, %rax - shr %cl, %rdx - shr %cl, %rax - add $16, %rdi - - test %rax, %rax - jnz L(unaligned_match) - - test %rdx, %rdx - jnz L(return_null) - - xor %r8, %r8 - jmp L(loop) - - .p2align 4 -L(unaligned_match): - test %rdx, %rdx - jnz L(prolog_find_zero) - - mov %rax, %r8 - lea (%rdi, %rcx), %rsi - -/* Loop start on aligned string. */ - .p2align 4 -L(loop): - movdqa (%rdi), %xmm0 - pcmpeqd %xmm0, %xmm2 - add $16, %rdi - pcmpeqd %xmm1, %xmm0 - pmovmskb %xmm2, %rcx - pmovmskb %xmm0, %rax - or %rax, %rcx - jnz L(matches) - - movdqa (%rdi), %xmm3 - pcmpeqd %xmm3, %xmm2 - add $16, %rdi - pcmpeqd %xmm1, %xmm3 - pmovmskb %xmm2, %rcx - pmovmskb %xmm3, %rax - or %rax, %rcx - jnz L(matches) - - movdqa (%rdi), %xmm4 - pcmpeqd %xmm4, %xmm2 - add $16, %rdi - pcmpeqd %xmm1, %xmm4 - pmovmskb %xmm2, %rcx - pmovmskb %xmm4, %rax - or %rax, %rcx - jnz L(matches) - - movdqa (%rdi), %xmm5 - pcmpeqd %xmm5, %xmm2 - add $16, %rdi - pcmpeqd %xmm1, %xmm5 - pmovmskb %xmm2, %rcx - pmovmskb %xmm5, %rax - or %rax, %rcx - jz L(loop) - - .p2align 4 -L(matches): - test %rax, %rax - jnz L(match) -L(return_value): - test %r8, %r8 - jz L(return_null) - mov %r8, %rax - mov %rsi, %rdi - - test $15 << 4, %ah - jnz L(match_fourth_wchar) - test %ah, %ah - jnz L(match_third_wchar) - test $15 << 4, %al - jnz L(match_second_wchar) - lea -16(%rdi), %rax - ret - - .p2align 4 -L(match): - pmovmskb %xmm2, %rcx - test %rcx, %rcx - jnz L(find_zero) - mov %rax, %r8 - mov %rdi, %rsi - jmp L(loop) - - .p2align 4 -L(find_zero): - test $15, %cl - jnz L(find_zero_in_first_wchar) - test %cl, %cl - jnz L(find_zero_in_second_wchar) - test $15, %ch - jnz L(find_zero_in_third_wchar) - - and $1 << 13 - 1, %rax - jz L(return_value) - - test $15 << 4, %ah - jnz L(match_fourth_wchar) - test %ah, %ah - jnz L(match_third_wchar) - test $15 << 4, %al - jnz L(match_second_wchar) - lea -16(%rdi), %rax - ret - - .p2align 4 -L(find_zero_in_first_wchar): - test $1, %rax - jz L(return_value) - lea -16(%rdi), %rax - ret - - .p2align 4 -L(find_zero_in_second_wchar): - and $1 << 5 - 1, %rax - jz L(return_value) - - test $15 << 4, %al - jnz L(match_second_wchar) - lea -16(%rdi), %rax - ret - - .p2align 4 -L(find_zero_in_third_wchar): - and $1 << 9 - 1, %rax - jz L(return_value) - - test %ah, %ah - jnz L(match_third_wchar) - test $15 << 4, %al - jnz L(match_second_wchar) - lea -16(%rdi), %rax - ret - - .p2align 4 -L(prolog_find_zero): - add %rcx, %rdi - mov %rdx, %rcx -L(prolog_find_zero_1): - test $15, %cl - jnz L(prolog_find_zero_in_first_wchar) - test %cl, %cl - jnz L(prolog_find_zero_in_second_wchar) - test $15, %ch - jnz L(prolog_find_zero_in_third_wchar) - - and $1 << 13 - 1, %rax - jz L(return_null) - - test $15 << 4, %ah - jnz L(match_fourth_wchar) - test %ah, %ah - jnz L(match_third_wchar) - test $15 << 4, %al - jnz L(match_second_wchar) - lea -16(%rdi), %rax - ret - - .p2align 4 -L(prolog_find_zero_in_first_wchar): - test $1, %rax - jz L(return_null) - lea -16(%rdi), %rax - ret - - .p2align 4 -L(prolog_find_zero_in_second_wchar): - and $1 << 5 - 1, %rax - jz L(return_null) - - test $15 << 4, %al - jnz L(match_second_wchar) - lea -16(%rdi), %rax - ret - - .p2align 4 -L(prolog_find_zero_in_third_wchar): - and $1 << 9 - 1, %rax - jz L(return_null) - - test %ah, %ah - jnz L(match_third_wchar) - test $15 << 4, %al - jnz L(match_second_wchar) - lea -16(%rdi), %rax - ret - - .p2align 4 -L(match_second_wchar): - lea -12(%rdi), %rax - ret - - .p2align 4 -L(match_third_wchar): - lea -8(%rdi), %rax - ret - - .p2align 4 -L(match_fourth_wchar): - lea -4(%rdi), %rax - ret - - .p2align 4 -L(return_null): - xor %rax, %rax - ret - -END (wcsrchr) +#include "../strrchr.S" From patchwork Fri Apr 22 01:52:29 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 1620549 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: bilbo.ozlabs.org; dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.a=rsa-sha256 header.s=default header.b=Vp3epTx9; dkim-atps=neutral Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=sourceware.org (client-ip=8.43.85.97; helo=sourceware.org; envelope-from=libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org; receiver=) Received: from sourceware.org (server2.sourceware.org [8.43.85.97]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by bilbo.ozlabs.org (Postfix) with ESMTPS id 4KkyC46HP5z9s0w for ; Fri, 22 Apr 2022 11:54:40 +1000 (AEST) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 0C4CE3857359 for ; Fri, 22 Apr 2022 01:54:39 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 0C4CE3857359 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1650592479; bh=oUJaiyNz3FTvqPYorVRCmm9R8MGw3qnKOlTV5M8ZUg8=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; b=Vp3epTx9aXFlsHnwIctjEOuNM9MserAE5fS/H3pLpT8EOYxYjaoE3vGKqqe8FCWgr D23pckAuUS0HfhHwTPBcj2RQzKW50lGrCO2BXyXYlaqIUWN1KHY2H0yUzthJyK2Ibi DaH9cNpJGS9oh6yj23W8R7PiBaxcb9DZnVp+Pd9Q= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-qt1-x82c.google.com (mail-qt1-x82c.google.com [IPv6:2607:f8b0:4864:20::82c]) by sourceware.org (Postfix) with ESMTPS id 3C23F3856DD8 for ; Fri, 22 Apr 2022 01:52:40 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 3C23F3856DD8 Received: by mail-qt1-x82c.google.com with SMTP id ay11so4608428qtb.4 for ; Thu, 21 Apr 2022 18:52:40 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=oUJaiyNz3FTvqPYorVRCmm9R8MGw3qnKOlTV5M8ZUg8=; b=T7ptnY5JIHPavyddu5Pou+0wwlxu3XqOT1+RxFfyMPoeJNoo8L42G6lZEoqHg6BtV7 G97jyoC8ZPMMeOcFL6PsE765NURuL/waFraR5nbQPHGuk88XpvsOitHxBCp+a6Wc2NKh 9h6SroUyu+qtdFIBQ6OzUaTe8Gd8p3sGa/YcD7H8pL7XEDfG8rgOFztlrg7MRqpMqi+I lVARzOW4kKh6YyXe+TeLdBdYA24yDdlmfwWVWj74CmysgHdQMV2jg0exWEnBoTP11m7G 9h2oFS7zSjlBP0+/kfQj+Z1yZr1iMQHVRkzxVTxOvLSy8y7tAMdDaBOXsVsJJlugE6/Z X+IA== X-Gm-Message-State: AOAM530/oPggE40rUnm9xLuIX2xEk2Co+9ILcXgn1ksjU5GNel+GJP3C w4h19wq6e4YJv1ddoGUPY7v9XQetcQM= X-Google-Smtp-Source: ABdhPJxqqyWBThLhwh3hy0Abghm8pSEw6Jy5ulSMR/VR7M5dY3bBiFE0vKV3spFTGzFhDZZzYopTdA== X-Received: by 2002:ac8:5905:0:b0:2f2:480:ef2e with SMTP id 5-20020ac85905000000b002f20480ef2emr1673190qty.272.1650592359000; Thu, 21 Apr 2022 18:52:39 -0700 (PDT) Received: from localhost.localdomain ([173.245.203.170]) by smtp.googlemail.com with ESMTPSA id n11-20020a05622a11cb00b002f344f11849sm433268qtk.71.2022.04.21.18.52.38 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 21 Apr 2022 18:52:38 -0700 (PDT) To: libc-alpha@sourceware.org Subject: [PATCH v3 3/4] x86: Optimize {str|wcs}rchr-avx2 Date: Thu, 21 Apr 2022 20:52:29 -0500 Message-Id: <20220422015230.3241772-3-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220422015230.3241772-1-goldstein.w.n@gmail.com> References: <20220421031410.2142238-1-goldstein.w.n@gmail.com> <20220422015230.3241772-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-11.6 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, KAM_NUMSUBJECT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Noah Goldstein via Libc-alpha From: Noah Goldstein Reply-To: Noah Goldstein Errors-To: libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org Sender: "Libc-alpha" The new code unrolls the main loop slightly without adding too much overhead and minimizes the comparisons for the search CHAR. Geometric Mean of all benchmarks New / Old: 0.832 See email for all results. Full xcheck passes on x86_64 with and without multiarch enabled. Reviewed-by: H.J. Lu --- sysdeps/x86_64/multiarch/strrchr-avx2.S | 426 +++++++++++++++--------- 1 file changed, 269 insertions(+), 157 deletions(-) diff --git a/sysdeps/x86_64/multiarch/strrchr-avx2.S b/sysdeps/x86_64/multiarch/strrchr-avx2.S index 1df2adfad0..bd26ba80d5 100644 --- a/sysdeps/x86_64/multiarch/strrchr-avx2.S +++ b/sysdeps/x86_64/multiarch/strrchr-avx2.S @@ -27,9 +27,13 @@ # ifdef USE_AS_WCSRCHR # define VPBROADCAST vpbroadcastd # define VPCMPEQ vpcmpeqd +# define VPMIN vpminud +# define CHAR_SIZE 4 # else # define VPBROADCAST vpbroadcastb # define VPCMPEQ vpcmpeqb +# define VPMIN vpminub +# define CHAR_SIZE 1 # endif # ifndef VZEROUPPER @@ -41,196 +45,304 @@ # endif # define VEC_SIZE 32 +# define PAGE_SIZE 4096 - .section SECTION(.text),"ax",@progbits -ENTRY (STRRCHR) - movd %esi, %xmm4 - movl %edi, %ecx + .section SECTION(.text), "ax", @progbits +ENTRY(STRRCHR) + movd %esi, %xmm7 + movl %edi, %eax /* Broadcast CHAR to YMM4. */ - VPBROADCAST %xmm4, %ymm4 + VPBROADCAST %xmm7, %ymm7 vpxor %xmm0, %xmm0, %xmm0 - /* Check if we may cross page boundary with one vector load. */ - andl $(2 * VEC_SIZE - 1), %ecx - cmpl $VEC_SIZE, %ecx - ja L(cros_page_boundary) + /* Shift here instead of `andl` to save code size (saves a fetch + block). */ + sall $20, %eax + cmpl $((PAGE_SIZE - VEC_SIZE) << 20), %eax + ja L(cross_page) +L(page_cross_continue): vmovdqu (%rdi), %ymm1 - VPCMPEQ %ymm1, %ymm0, %ymm2 - VPCMPEQ %ymm1, %ymm4, %ymm3 - vpmovmskb %ymm2, %ecx - vpmovmskb %ymm3, %eax - addq $VEC_SIZE, %rdi + /* Check end of string match. */ + VPCMPEQ %ymm1, %ymm0, %ymm6 + vpmovmskb %ymm6, %ecx + testl %ecx, %ecx + jz L(aligned_more) + + /* Only check match with search CHAR if needed. */ + VPCMPEQ %ymm1, %ymm7, %ymm1 + vpmovmskb %ymm1, %eax + /* Check if match before first zero. */ + blsmskl %ecx, %ecx + andl %ecx, %eax + jz L(ret0) + bsrl %eax, %eax + addq %rdi, %rax + /* We are off by 3 for wcsrchr if search CHAR is non-zero. If + search CHAR is zero we are correct. Either way `andq + -CHAR_SIZE, %rax` gets the correct result. */ +# ifdef USE_AS_WCSRCHR + andq $-CHAR_SIZE, %rax +# endif +L(ret0): +L(return_vzeroupper): + ZERO_UPPER_VEC_REGISTERS_RETURN + + /* Returns for first vec x1/x2 have hard coded backward search + path for earlier matches. */ + .p2align 4,, 10 +L(first_vec_x1): + VPCMPEQ %ymm2, %ymm7, %ymm6 + vpmovmskb %ymm6, %eax + blsmskl %ecx, %ecx + andl %ecx, %eax + jnz L(first_vec_x1_return) + + .p2align 4,, 4 +L(first_vec_x0_test): + VPCMPEQ %ymm1, %ymm7, %ymm6 + vpmovmskb %ymm6, %eax + testl %eax, %eax + jz L(ret1) + bsrl %eax, %eax + addq %r8, %rax +# ifdef USE_AS_WCSRCHR + andq $-CHAR_SIZE, %rax +# endif +L(ret1): + VZEROUPPER_RETURN + .p2align 4,, 10 +L(first_vec_x0_x1_test): + VPCMPEQ %ymm2, %ymm7, %ymm6 + vpmovmskb %ymm6, %eax + /* Check ymm2 for search CHAR match. If no match then check ymm1 + before returning. */ testl %eax, %eax - jnz L(first_vec) + jz L(first_vec_x0_test) + .p2align 4,, 4 +L(first_vec_x1_return): + bsrl %eax, %eax + leaq 1(%rdi, %rax), %rax +# ifdef USE_AS_WCSRCHR + andq $-CHAR_SIZE, %rax +# endif + VZEROUPPER_RETURN - testl %ecx, %ecx - jnz L(return_null) - andq $-VEC_SIZE, %rdi - xorl %edx, %edx - jmp L(aligned_loop) + .p2align 4,, 10 +L(first_vec_x2): + VPCMPEQ %ymm3, %ymm7, %ymm6 + vpmovmskb %ymm6, %eax + blsmskl %ecx, %ecx + /* If no in-range search CHAR match in ymm3 then need to check + ymm1/ymm2 for an earlier match (we delay checking search + CHAR matches until needed). */ + andl %ecx, %eax + jz L(first_vec_x0_x1_test) + bsrl %eax, %eax + leaq (VEC_SIZE + 1)(%rdi, %rax), %rax +# ifdef USE_AS_WCSRCHR + andq $-CHAR_SIZE, %rax +# endif + VZEROUPPER_RETURN + .p2align 4 -L(first_vec): - /* Check if there is a nul CHAR. */ +L(aligned_more): + /* Save original pointer if match was in VEC 0. */ + movq %rdi, %r8 + + /* Align src. */ + orq $(VEC_SIZE - 1), %rdi + vmovdqu 1(%rdi), %ymm2 + VPCMPEQ %ymm2, %ymm0, %ymm6 + vpmovmskb %ymm6, %ecx testl %ecx, %ecx - jnz L(char_and_nul_in_first_vec) + jnz L(first_vec_x1) - /* Remember the match and keep searching. */ - movl %eax, %edx - movq %rdi, %rsi - andq $-VEC_SIZE, %rdi - jmp L(aligned_loop) + vmovdqu (VEC_SIZE + 1)(%rdi), %ymm3 + VPCMPEQ %ymm3, %ymm0, %ymm6 + vpmovmskb %ymm6, %ecx + testl %ecx, %ecx + jnz L(first_vec_x2) + /* Save pointer again before realigning. */ + movq %rdi, %rsi + addq $(VEC_SIZE + 1), %rdi + andq $-(VEC_SIZE * 2), %rdi .p2align 4 -L(cros_page_boundary): - andl $(VEC_SIZE - 1), %ecx - andq $-VEC_SIZE, %rdi - vmovdqa (%rdi), %ymm1 - VPCMPEQ %ymm1, %ymm0, %ymm2 - VPCMPEQ %ymm1, %ymm4, %ymm3 - vpmovmskb %ymm2, %edx - vpmovmskb %ymm3, %eax - shrl %cl, %edx - shrl %cl, %eax - addq $VEC_SIZE, %rdi - - /* Check if there is a CHAR. */ +L(first_aligned_loop): + /* Do 2x VEC at a time. Any more and the cost of finding the + match outweights loop benefit. */ + vmovdqa (VEC_SIZE * 0)(%rdi), %ymm4 + vmovdqa (VEC_SIZE * 1)(%rdi), %ymm5 + + VPCMPEQ %ymm4, %ymm7, %ymm6 + VPMIN %ymm4, %ymm5, %ymm8 + VPCMPEQ %ymm5, %ymm7, %ymm10 + vpor %ymm6, %ymm10, %ymm5 + VPCMPEQ %ymm8, %ymm0, %ymm8 + vpor %ymm5, %ymm8, %ymm9 + + vpmovmskb %ymm9, %eax + addq $(VEC_SIZE * 2), %rdi + /* No zero or search CHAR. */ testl %eax, %eax - jnz L(found_char) - - testl %edx, %edx - jnz L(return_null) + jz L(first_aligned_loop) - jmp L(aligned_loop) - - .p2align 4 -L(found_char): - testl %edx, %edx - jnz L(char_and_nul) + /* If no zero CHAR then go to second loop (this allows us to + throw away all prior work). */ + vpmovmskb %ymm8, %ecx + testl %ecx, %ecx + jz L(second_aligned_loop_prep) - /* Remember the match and keep searching. */ - movl %eax, %edx - leaq (%rdi, %rcx), %rsi + /* Search char could be zero so we need to get the true match. + */ + vpmovmskb %ymm5, %eax + testl %eax, %eax + jnz L(first_aligned_loop_return) - .p2align 4 -L(aligned_loop): - vmovdqa (%rdi), %ymm1 - VPCMPEQ %ymm1, %ymm0, %ymm2 - addq $VEC_SIZE, %rdi - VPCMPEQ %ymm1, %ymm4, %ymm3 - vpmovmskb %ymm2, %ecx - vpmovmskb %ymm3, %eax - orl %eax, %ecx - jnz L(char_nor_null) - - vmovdqa (%rdi), %ymm1 - VPCMPEQ %ymm1, %ymm0, %ymm2 - add $VEC_SIZE, %rdi - VPCMPEQ %ymm1, %ymm4, %ymm3 - vpmovmskb %ymm2, %ecx + .p2align 4,, 4 +L(first_vec_x1_or_x2): + VPCMPEQ %ymm3, %ymm7, %ymm3 + VPCMPEQ %ymm2, %ymm7, %ymm2 vpmovmskb %ymm3, %eax - orl %eax, %ecx - jnz L(char_nor_null) - - vmovdqa (%rdi), %ymm1 - VPCMPEQ %ymm1, %ymm0, %ymm2 - addq $VEC_SIZE, %rdi - VPCMPEQ %ymm1, %ymm4, %ymm3 - vpmovmskb %ymm2, %ecx - vpmovmskb %ymm3, %eax - orl %eax, %ecx - jnz L(char_nor_null) - - vmovdqa (%rdi), %ymm1 - VPCMPEQ %ymm1, %ymm0, %ymm2 - addq $VEC_SIZE, %rdi - VPCMPEQ %ymm1, %ymm4, %ymm3 - vpmovmskb %ymm2, %ecx - vpmovmskb %ymm3, %eax - orl %eax, %ecx - jz L(aligned_loop) - - .p2align 4 -L(char_nor_null): - /* Find a CHAR or a nul CHAR in a loop. */ - testl %eax, %eax - jnz L(match) -L(return_value): - testl %edx, %edx - jz L(return_null) - movl %edx, %eax - movq %rsi, %rdi + vpmovmskb %ymm2, %edx + /* Use add for macro-fusion. */ + addq %rax, %rdx + jz L(first_vec_x0_test) + /* NB: We could move this shift to before the branch and save a + bit of code size / performance on the fall through. The + branch leads to the null case which generally seems hotter + than char in first 3x VEC. */ + salq $32, %rax + addq %rdx, %rax + bsrq %rax, %rax + leaq 1(%rsi, %rax), %rax +# ifdef USE_AS_WCSRCHR + andq $-CHAR_SIZE, %rax +# endif + VZEROUPPER_RETURN + .p2align 4,, 8 +L(first_aligned_loop_return): + VPCMPEQ %ymm4, %ymm0, %ymm4 + vpmovmskb %ymm4, %edx + salq $32, %rcx + orq %rdx, %rcx + + vpmovmskb %ymm10, %eax + vpmovmskb %ymm6, %edx + salq $32, %rax + orq %rdx, %rax + blsmskq %rcx, %rcx + andq %rcx, %rax + jz L(first_vec_x1_or_x2) + + bsrq %rax, %rax + leaq -(VEC_SIZE * 2)(%rdi, %rax), %rax # ifdef USE_AS_WCSRCHR - /* Keep the first bit for each matching CHAR for bsr. */ - andl $0x11111111, %eax + andq $-CHAR_SIZE, %rax # endif - bsrl %eax, %eax - leaq -VEC_SIZE(%rdi, %rax), %rax -L(return_vzeroupper): - ZERO_UPPER_VEC_REGISTERS_RETURN + VZEROUPPER_RETURN + /* Search char cannot be zero. */ .p2align 4 -L(match): - /* Find a CHAR. Check if there is a nul CHAR. */ - vpmovmskb %ymm2, %ecx - testl %ecx, %ecx - jnz L(find_nul) - - /* Remember the match and keep searching. */ - movl %eax, %edx +L(second_aligned_loop_set_furthest_match): + /* Save VEC and pointer from most recent match. */ +L(second_aligned_loop_prep): movq %rdi, %rsi - jmp L(aligned_loop) + vmovdqu %ymm6, %ymm2 + vmovdqu %ymm10, %ymm3 .p2align 4 -L(find_nul): -# ifdef USE_AS_WCSRCHR - /* Keep the first bit for each matching CHAR for bsr. */ - andl $0x11111111, %ecx - andl $0x11111111, %eax -# endif - /* Mask out any matching bits after the nul CHAR. */ - movl %ecx, %r8d - subl $1, %r8d - xorl %ecx, %r8d - andl %r8d, %eax +L(second_aligned_loop): + /* Search 2x at at time. */ + vmovdqa (VEC_SIZE * 0)(%rdi), %ymm4 + vmovdqa (VEC_SIZE * 1)(%rdi), %ymm5 + + VPCMPEQ %ymm4, %ymm7, %ymm6 + VPMIN %ymm4, %ymm5, %ymm1 + VPCMPEQ %ymm5, %ymm7, %ymm10 + vpor %ymm6, %ymm10, %ymm5 + VPCMPEQ %ymm1, %ymm0, %ymm1 + vpor %ymm5, %ymm1, %ymm9 + + vpmovmskb %ymm9, %eax + addq $(VEC_SIZE * 2), %rdi testl %eax, %eax - /* If there is no CHAR here, return the remembered one. */ - jz L(return_value) - bsrl %eax, %eax - leaq -VEC_SIZE(%rdi, %rax), %rax - VZEROUPPER_RETURN - - .p2align 4 -L(char_and_nul): - /* Find both a CHAR and a nul CHAR. */ - addq %rcx, %rdi - movl %edx, %ecx -L(char_and_nul_in_first_vec): -# ifdef USE_AS_WCSRCHR - /* Keep the first bit for each matching CHAR for bsr. */ - andl $0x11111111, %ecx - andl $0x11111111, %eax -# endif - /* Mask out any matching bits after the nul CHAR. */ - movl %ecx, %r8d - subl $1, %r8d - xorl %ecx, %r8d - andl %r8d, %eax + jz L(second_aligned_loop) + vpmovmskb %ymm1, %ecx + testl %ecx, %ecx + jz L(second_aligned_loop_set_furthest_match) + vpmovmskb %ymm5, %eax testl %eax, %eax - /* Return null pointer if the nul CHAR comes first. */ - jz L(return_null) - bsrl %eax, %eax - leaq -VEC_SIZE(%rdi, %rax), %rax + jnz L(return_new_match) + + /* This is the hot patch. We know CHAR is inbounds and that + ymm3/ymm2 have latest match. */ + .p2align 4,, 4 +L(return_old_match): + vpmovmskb %ymm3, %eax + vpmovmskb %ymm2, %edx + salq $32, %rax + orq %rdx, %rax + bsrq %rax, %rax + /* Search char cannot be zero so safe to just use lea for + wcsrchr. */ + leaq (VEC_SIZE * -2 -(CHAR_SIZE - 1))(%rsi, %rax), %rax VZEROUPPER_RETURN - .p2align 4 -L(return_null): - xorl %eax, %eax + /* Last iteration also potentially has a match. */ + .p2align 4,, 8 +L(return_new_match): + VPCMPEQ %ymm4, %ymm0, %ymm4 + vpmovmskb %ymm4, %edx + salq $32, %rcx + orq %rdx, %rcx + + vpmovmskb %ymm10, %eax + vpmovmskb %ymm6, %edx + salq $32, %rax + orq %rdx, %rax + blsmskq %rcx, %rcx + andq %rcx, %rax + jz L(return_old_match) + bsrq %rax, %rax + /* Search char cannot be zero so safe to just use lea for + wcsrchr. */ + leaq (VEC_SIZE * -2 -(CHAR_SIZE - 1))(%rdi, %rax), %rax VZEROUPPER_RETURN -END (STRRCHR) + .p2align 4,, 4 +L(cross_page): + movq %rdi, %rsi + andq $-VEC_SIZE, %rsi + vmovdqu (%rsi), %ymm1 + VPCMPEQ %ymm1, %ymm0, %ymm6 + vpmovmskb %ymm6, %ecx + /* Shift out zero CHAR matches that are before the begining of + src (rdi). */ + shrxl %edi, %ecx, %ecx + testl %ecx, %ecx + jz L(page_cross_continue) + VPCMPEQ %ymm1, %ymm7, %ymm1 + vpmovmskb %ymm1, %eax + + /* Shift out search CHAR matches that are before the begining of + src (rdi). */ + shrxl %edi, %eax, %eax + blsmskl %ecx, %ecx + /* Check if any search CHAR match in range. */ + andl %ecx, %eax + jz L(ret2) + bsrl %eax, %eax + addq %rdi, %rax +# ifdef USE_AS_WCSRCHR + andq $-CHAR_SIZE, %rax +# endif +L(ret2): + VZEROUPPER_RETURN +END(STRRCHR) #endif From patchwork Fri Apr 22 01:52:30 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 1620550 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: bilbo.ozlabs.org; dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.a=rsa-sha256 header.s=default header.b=p9ho0Zu0; dkim-atps=neutral Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=sourceware.org (client-ip=2620:52:3:1:0:246e:9693:128c; helo=sourceware.org; envelope-from=libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org; receiver=) Received: from sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by bilbo.ozlabs.org (Postfix) with ESMTPS id 4KkyD03Qppz9sFq for ; Fri, 22 Apr 2022 11:55:28 +1000 (AEST) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 5D476385B804 for ; Fri, 22 Apr 2022 01:55:26 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 5D476385B804 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1650592526; bh=yF7+j+/lhDrtS3W+awcXehGzPKcVa5XN00WjZ2tI2xs=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; b=p9ho0Zu0pRUU+rFGZprrwfe/bAFmSUNrAygpvMdCUG/EDyWsdJHId+o4TJwQNPxh5 6cU7vPvart2FWv69ABGTuvNUPgQM/bTlZ2XGVQyz/YyupJ+9gIgmS30NaPdgVQit4a tlwDiLqY8Thmsj4tzrIvlsJz8TEDaaZrdwMDzo4A= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-qt1-x835.google.com (mail-qt1-x835.google.com [IPv6:2607:f8b0:4864:20::835]) by sourceware.org (Postfix) with ESMTPS id 804783856DF0 for ; Fri, 22 Apr 2022 01:52:41 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 804783856DF0 Received: by mail-qt1-x835.google.com with SMTP id t26so4608531qtn.6 for ; Thu, 21 Apr 2022 18:52:41 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=yF7+j+/lhDrtS3W+awcXehGzPKcVa5XN00WjZ2tI2xs=; b=XtbshquVWr3CusNwNuG7TVVN1q9Np6sPDsdI1vFp7Apx0+YjJFa3z+tUQQk36WLVVJ sEwqoLBfWEwEbUhGRZ0k+cPRw84QWdKIQFxTt98cdOaa0U7M9bITCLSN2OWTxIEgAUyY F0cQnUrzMEtuv91PXwPDRo6+z9uSQbxDbmw0LQGvHNFExcH+aCgPduc4kHhOFfHbUQRZ hXNY9rxo7LUqGDw8Pu/SeTdBYpg9TtGLV+wyZvDDcx7OBxoiCAzEsEePunru5gsFi8Yg v9slDe5ZWTR5p0QC1zwPyirrHA6UVAAmwhv8XqLrnxa7VoUe4UdjQjSDpXLQXoZ8AK8B 6Btg== X-Gm-Message-State: AOAM530EcZtkYChvTwtXcDdC63vLgCTgC7hJazkq4D1jfbxpj4h1k7JI SY4PUb3YYXbNl+ZrMisG9b+nwo+u930= X-Google-Smtp-Source: ABdhPJw9i8OeGNsoF2rKwS2KuoK41hvuTDhqRAVHEVCiWfZRTHMdulzduKUL1hl2GkWNLvquqYArFw== X-Received: by 2002:ac8:5f07:0:b0:2e1:d695:d857 with SMTP id x7-20020ac85f07000000b002e1d695d857mr1720787qta.40.1650592360623; Thu, 21 Apr 2022 18:52:40 -0700 (PDT) Received: from localhost.localdomain ([173.245.203.170]) by smtp.googlemail.com with ESMTPSA id n11-20020a05622a11cb00b002f344f11849sm433268qtk.71.2022.04.21.18.52.39 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 21 Apr 2022 18:52:40 -0700 (PDT) To: libc-alpha@sourceware.org Subject: [PATCH v3 4/4] x86: Optimize {str|wcs}rchr-evex Date: Thu, 21 Apr 2022 20:52:30 -0500 Message-Id: <20220422015230.3241772-4-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220422015230.3241772-1-goldstein.w.n@gmail.com> References: <20220421031410.2142238-1-goldstein.w.n@gmail.com> <20220422015230.3241772-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-11.8 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Noah Goldstein via Libc-alpha From: Noah Goldstein Reply-To: Noah Goldstein Errors-To: libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org Sender: "Libc-alpha" The new code unrolls the main loop slightly without adding too much overhead and minimizes the comparisons for the search CHAR. Geometric Mean of all benchmarks New / Old: 0.755 See email for all results. Full xcheck passes on x86_64 with and without multiarch enabled. Reviewed-by: H.J. Lu --- sysdeps/x86_64/multiarch/strrchr-evex.S | 471 +++++++++++++++--------- 1 file changed, 290 insertions(+), 181 deletions(-) diff --git a/sysdeps/x86_64/multiarch/strrchr-evex.S b/sysdeps/x86_64/multiarch/strrchr-evex.S index adeddaed32..8014c285b3 100644 --- a/sysdeps/x86_64/multiarch/strrchr-evex.S +++ b/sysdeps/x86_64/multiarch/strrchr-evex.S @@ -24,242 +24,351 @@ # define STRRCHR __strrchr_evex # endif -# define VMOVU vmovdqu64 -# define VMOVA vmovdqa64 +# define VMOVU vmovdqu64 +# define VMOVA vmovdqa64 # ifdef USE_AS_WCSRCHR +# define SHIFT_REG esi + +# define kunpck kunpckbw +# define kmov_2x kmovd +# define maskz_2x ecx +# define maskm_2x eax +# define CHAR_SIZE 4 +# define VPMIN vpminud +# define VPTESTN vptestnmd # define VPBROADCAST vpbroadcastd -# define VPCMP vpcmpd -# define SHIFT_REG r8d +# define VPCMP vpcmpd # else +# define SHIFT_REG edi + +# define kunpck kunpckdq +# define kmov_2x kmovq +# define maskz_2x rcx +# define maskm_2x rax + +# define CHAR_SIZE 1 +# define VPMIN vpminub +# define VPTESTN vptestnmb # define VPBROADCAST vpbroadcastb -# define VPCMP vpcmpb -# define SHIFT_REG ecx +# define VPCMP vpcmpb # endif # define XMMZERO xmm16 # define YMMZERO ymm16 # define YMMMATCH ymm17 -# define YMM1 ymm18 +# define YMMSAVE ymm18 + +# define YMM1 ymm19 +# define YMM2 ymm20 +# define YMM3 ymm21 +# define YMM4 ymm22 +# define YMM5 ymm23 +# define YMM6 ymm24 +# define YMM7 ymm25 +# define YMM8 ymm26 -# define VEC_SIZE 32 - .section .text.evex,"ax",@progbits -ENTRY (STRRCHR) - movl %edi, %ecx +# define VEC_SIZE 32 +# define PAGE_SIZE 4096 + .section .text.evex, "ax", @progbits +ENTRY(STRRCHR) + movl %edi, %eax /* Broadcast CHAR to YMMMATCH. */ VPBROADCAST %esi, %YMMMATCH - vpxorq %XMMZERO, %XMMZERO, %XMMZERO - - /* Check if we may cross page boundary with one vector load. */ - andl $(2 * VEC_SIZE - 1), %ecx - cmpl $VEC_SIZE, %ecx - ja L(cros_page_boundary) + andl $(PAGE_SIZE - 1), %eax + cmpl $(PAGE_SIZE - VEC_SIZE), %eax + jg L(cross_page_boundary) +L(page_cross_continue): VMOVU (%rdi), %YMM1 - - /* Each bit in K0 represents a null byte in YMM1. */ - VPCMP $0, %YMMZERO, %YMM1, %k0 - /* Each bit in K1 represents a CHAR in YMM1. */ - VPCMP $0, %YMMMATCH, %YMM1, %k1 + /* k0 has a 1 for each zero CHAR in YMM1. */ + VPTESTN %YMM1, %YMM1, %k0 kmovd %k0, %ecx - kmovd %k1, %eax - - addq $VEC_SIZE, %rdi - - testl %eax, %eax - jnz L(first_vec) - testl %ecx, %ecx - jnz L(return_null) - - andq $-VEC_SIZE, %rdi - xorl %edx, %edx - jmp L(aligned_loop) - - .p2align 4 -L(first_vec): - /* Check if there is a null byte. */ - testl %ecx, %ecx - jnz L(char_and_nul_in_first_vec) - - /* Remember the match and keep searching. */ - movl %eax, %edx - movq %rdi, %rsi - andq $-VEC_SIZE, %rdi - jmp L(aligned_loop) - - .p2align 4 -L(cros_page_boundary): - andl $(VEC_SIZE - 1), %ecx - andq $-VEC_SIZE, %rdi + jz L(aligned_more) + /* fallthrough: zero CHAR in first VEC. */ + /* K1 has a 1 for each search CHAR match in YMM1. */ + VPCMP $0, %YMMMATCH, %YMM1, %k1 + kmovd %k1, %eax + /* Build mask up until first zero CHAR (used to mask of + potential search CHAR matches past the end of the string). + */ + blsmskl %ecx, %ecx + andl %ecx, %eax + jz L(ret0) + /* Get last match (the `andl` removed any out of bounds + matches). */ + bsrl %eax, %eax # ifdef USE_AS_WCSRCHR - /* NB: Divide shift count by 4 since each bit in K1 represent 4 - bytes. */ - movl %ecx, %SHIFT_REG - sarl $2, %SHIFT_REG + leaq (%rdi, %rax, CHAR_SIZE), %rax +# else + addq %rdi, %rax # endif +L(ret0): + ret - VMOVA (%rdi), %YMM1 - - /* Each bit in K0 represents a null byte in YMM1. */ - VPCMP $0, %YMMZERO, %YMM1, %k0 - /* Each bit in K1 represents a CHAR in YMM1. */ + /* Returns for first vec x1/x2/x3 have hard coded backward + search path for earlier matches. */ + .p2align 4,, 6 +L(first_vec_x1): + VPCMP $0, %YMMMATCH, %YMM2, %k1 + kmovd %k1, %eax + blsmskl %ecx, %ecx + /* eax non-zero if search CHAR in range. */ + andl %ecx, %eax + jnz L(first_vec_x1_return) + + /* fallthrough: no match in YMM2 then need to check for earlier + matches (in YMM1). */ + .p2align 4,, 4 +L(first_vec_x0_test): VPCMP $0, %YMMMATCH, %YMM1, %k1 - kmovd %k0, %edx kmovd %k1, %eax - - shrxl %SHIFT_REG, %edx, %edx - shrxl %SHIFT_REG, %eax, %eax - addq $VEC_SIZE, %rdi - - /* Check if there is a CHAR. */ testl %eax, %eax - jnz L(found_char) - - testl %edx, %edx - jnz L(return_null) - - jmp L(aligned_loop) - - .p2align 4 -L(found_char): - testl %edx, %edx - jnz L(char_and_nul) - - /* Remember the match and keep searching. */ - movl %eax, %edx - leaq (%rdi, %rcx), %rsi + jz L(ret1) + bsrl %eax, %eax +# ifdef USE_AS_WCSRCHR + leaq (%rsi, %rax, CHAR_SIZE), %rax +# else + addq %rsi, %rax +# endif +L(ret1): + ret - .p2align 4 -L(aligned_loop): - VMOVA (%rdi), %YMM1 - addq $VEC_SIZE, %rdi + .p2align 4,, 10 +L(first_vec_x1_or_x2): + VPCMP $0, %YMM3, %YMMMATCH, %k3 + VPCMP $0, %YMM2, %YMMMATCH, %k2 + /* K2 and K3 have 1 for any search CHAR match. Test if any + matches between either of them. Otherwise check YMM1. */ + kortestd %k2, %k3 + jz L(first_vec_x0_test) + + /* Guranteed that YMM2 and YMM3 are within range so merge the + two bitmasks then get last result. */ + kunpck %k2, %k3, %k3 + kmovq %k3, %rax + bsrq %rax, %rax + leaq (VEC_SIZE)(%r8, %rax, CHAR_SIZE), %rax + ret - /* Each bit in K0 represents a null byte in YMM1. */ - VPCMP $0, %YMMZERO, %YMM1, %k0 - /* Each bit in K1 represents a CHAR in YMM1. */ - VPCMP $0, %YMMMATCH, %YMM1, %k1 - kmovd %k0, %ecx + .p2align 4,, 6 +L(first_vec_x3): + VPCMP $0, %YMMMATCH, %YMM4, %k1 kmovd %k1, %eax - orl %eax, %ecx - jnz L(char_nor_null) + blsmskl %ecx, %ecx + /* If no search CHAR match in range check YMM1/YMM2/YMM3. */ + andl %ecx, %eax + jz L(first_vec_x1_or_x2) + bsrl %eax, %eax + leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax + ret - VMOVA (%rdi), %YMM1 - add $VEC_SIZE, %rdi + .p2align 4,, 6 +L(first_vec_x0_x1_test): + VPCMP $0, %YMMMATCH, %YMM2, %k1 + kmovd %k1, %eax + /* Check YMM2 for last match first. If no match try YMM1. */ + testl %eax, %eax + jz L(first_vec_x0_test) + .p2align 4,, 4 +L(first_vec_x1_return): + bsrl %eax, %eax + leaq (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %rax + ret - /* Each bit in K0 represents a null byte in YMM1. */ - VPCMP $0, %YMMZERO, %YMM1, %k0 - /* Each bit in K1 represents a CHAR in YMM1. */ - VPCMP $0, %YMMMATCH, %YMM1, %k1 - kmovd %k0, %ecx + .p2align 4,, 10 +L(first_vec_x2): + VPCMP $0, %YMMMATCH, %YMM3, %k1 kmovd %k1, %eax - orl %eax, %ecx - jnz L(char_nor_null) + blsmskl %ecx, %ecx + /* Check YMM3 for last match first. If no match try YMM2/YMM1. + */ + andl %ecx, %eax + jz L(first_vec_x0_x1_test) + bsrl %eax, %eax + leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax + ret - VMOVA (%rdi), %YMM1 - addq $VEC_SIZE, %rdi - /* Each bit in K0 represents a null byte in YMM1. */ - VPCMP $0, %YMMZERO, %YMM1, %k0 - /* Each bit in K1 represents a CHAR in YMM1. */ - VPCMP $0, %YMMMATCH, %YMM1, %k1 + .p2align 4 +L(aligned_more): + /* Need to keep original pointer incase YMM1 has last match. */ + movq %rdi, %rsi + andq $-VEC_SIZE, %rdi + VMOVU VEC_SIZE(%rdi), %YMM2 + VPTESTN %YMM2, %YMM2, %k0 kmovd %k0, %ecx - kmovd %k1, %eax - orl %eax, %ecx - jnz L(char_nor_null) + testl %ecx, %ecx + jnz L(first_vec_x1) - VMOVA (%rdi), %YMM1 - addq $VEC_SIZE, %rdi + VMOVU (VEC_SIZE * 2)(%rdi), %YMM3 + VPTESTN %YMM3, %YMM3, %k0 + kmovd %k0, %ecx + testl %ecx, %ecx + jnz L(first_vec_x2) - /* Each bit in K0 represents a null byte in YMM1. */ - VPCMP $0, %YMMZERO, %YMM1, %k0 - /* Each bit in K1 represents a CHAR in YMM1. */ - VPCMP $0, %YMMMATCH, %YMM1, %k1 + VMOVU (VEC_SIZE * 3)(%rdi), %YMM4 + VPTESTN %YMM4, %YMM4, %k0 kmovd %k0, %ecx - kmovd %k1, %eax - orl %eax, %ecx - jz L(aligned_loop) + movq %rdi, %r8 + testl %ecx, %ecx + jnz L(first_vec_x3) + andq $-(VEC_SIZE * 2), %rdi .p2align 4 -L(char_nor_null): - /* Find a CHAR or a null byte in a loop. */ +L(first_aligned_loop): + /* Preserve YMM1, YMM2, YMM3, and YMM4 until we can gurantee + they don't store a match. */ + VMOVA (VEC_SIZE * 4)(%rdi), %YMM5 + VMOVA (VEC_SIZE * 5)(%rdi), %YMM6 + + VPCMP $0, %YMM5, %YMMMATCH, %k2 + vpxord %YMM6, %YMMMATCH, %YMM7 + + VPMIN %YMM5, %YMM6, %YMM8 + VPMIN %YMM8, %YMM7, %YMM7 + + VPTESTN %YMM7, %YMM7, %k1 + subq $(VEC_SIZE * -2), %rdi + kortestd %k1, %k2 + jz L(first_aligned_loop) + + VPCMP $0, %YMM6, %YMMMATCH, %k3 + VPTESTN %YMM8, %YMM8, %k1 + ktestd %k1, %k1 + jz L(second_aligned_loop_prep) + + kortestd %k2, %k3 + jnz L(return_first_aligned_loop) + + .p2align 4,, 6 +L(first_vec_x1_or_x2_or_x3): + VPCMP $0, %YMM4, %YMMMATCH, %k4 + kmovd %k4, %eax testl %eax, %eax - jnz L(match) -L(return_value): - testl %edx, %edx - jz L(return_null) - movl %edx, %eax - movq %rsi, %rdi + jz L(first_vec_x1_or_x2) bsrl %eax, %eax -# ifdef USE_AS_WCSRCHR - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - leaq -VEC_SIZE(%rdi, %rax, 4), %rax -# else - leaq -VEC_SIZE(%rdi, %rax), %rax -# endif + leaq (VEC_SIZE * 3)(%r8, %rax, CHAR_SIZE), %rax ret - .p2align 4 -L(match): - /* Find a CHAR. Check if there is a null byte. */ - kmovd %k0, %ecx - testl %ecx, %ecx - jnz L(find_nul) + .p2align 4,, 8 +L(return_first_aligned_loop): + VPTESTN %YMM5, %YMM5, %k0 + kunpck %k0, %k1, %k0 + kmov_2x %k0, %maskz_2x + + blsmsk %maskz_2x, %maskz_2x + kunpck %k2, %k3, %k3 + kmov_2x %k3, %maskm_2x + and %maskz_2x, %maskm_2x + jz L(first_vec_x1_or_x2_or_x3) - /* Remember the match and keep searching. */ - movl %eax, %edx + bsr %maskm_2x, %maskm_2x + leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax + ret + + .p2align 4 + /* We can throw away the work done for the first 4x checks here + as we have a later match. This is the 'fast' path persay. + */ +L(second_aligned_loop_prep): +L(second_aligned_loop_set_furthest_match): movq %rdi, %rsi - jmp L(aligned_loop) + kunpck %k2, %k3, %k4 .p2align 4 -L(find_nul): - /* Mask out any matching bits after the null byte. */ - movl %ecx, %r8d - subl $1, %r8d - xorl %ecx, %r8d - andl %r8d, %eax - testl %eax, %eax - /* If there is no CHAR here, return the remembered one. */ - jz L(return_value) - bsrl %eax, %eax +L(second_aligned_loop): + VMOVU (VEC_SIZE * 4)(%rdi), %YMM1 + VMOVU (VEC_SIZE * 5)(%rdi), %YMM2 + + VPCMP $0, %YMM1, %YMMMATCH, %k2 + vpxord %YMM2, %YMMMATCH, %YMM3 + + VPMIN %YMM1, %YMM2, %YMM4 + VPMIN %YMM3, %YMM4, %YMM3 + + VPTESTN %YMM3, %YMM3, %k1 + subq $(VEC_SIZE * -2), %rdi + kortestd %k1, %k2 + jz L(second_aligned_loop) + + VPCMP $0, %YMM2, %YMMMATCH, %k3 + VPTESTN %YMM4, %YMM4, %k1 + ktestd %k1, %k1 + jz L(second_aligned_loop_set_furthest_match) + + kortestd %k2, %k3 + /* branch here because there is a significant advantage interms + of output dependency chance in using edx. */ + jnz L(return_new_match) +L(return_old_match): + kmovq %k4, %rax + bsrq %rax, %rax + leaq (VEC_SIZE * 2)(%rsi, %rax, CHAR_SIZE), %rax + ret + +L(return_new_match): + VPTESTN %YMM1, %YMM1, %k0 + kunpck %k0, %k1, %k0 + kmov_2x %k0, %maskz_2x + + blsmsk %maskz_2x, %maskz_2x + kunpck %k2, %k3, %k3 + kmov_2x %k3, %maskm_2x + and %maskz_2x, %maskm_2x + jz L(return_old_match) + + bsr %maskm_2x, %maskm_2x + leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax + ret + +L(cross_page_boundary): + /* eax contains all the page offset bits of src (rdi). `xor rdi, + rax` sets pointer will all page offset bits cleared so + offset of (PAGE_SIZE - VEC_SIZE) will get last aligned VEC + before page cross (guranteed to be safe to read). Doing this + as opposed to `movq %rdi, %rax; andq $-VEC_SIZE, %rax` saves + a bit of code size. */ + xorq %rdi, %rax + VMOVU (PAGE_SIZE - VEC_SIZE)(%rax), %YMM1 + VPTESTN %YMM1, %YMM1, %k0 + kmovd %k0, %ecx + + /* Shift out zero CHAR matches that are before the begining of + src (rdi). */ # ifdef USE_AS_WCSRCHR - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - leaq -VEC_SIZE(%rdi, %rax, 4), %rax -# else - leaq -VEC_SIZE(%rdi, %rax), %rax + movl %edi, %esi + andl $(VEC_SIZE - 1), %esi + shrl $2, %esi # endif - ret + shrxl %SHIFT_REG, %ecx, %ecx - .p2align 4 -L(char_and_nul): - /* Find both a CHAR and a null byte. */ - addq %rcx, %rdi - movl %edx, %ecx -L(char_and_nul_in_first_vec): - /* Mask out any matching bits after the null byte. */ - movl %ecx, %r8d - subl $1, %r8d - xorl %ecx, %r8d - andl %r8d, %eax - testl %eax, %eax - /* Return null pointer if the null byte comes first. */ - jz L(return_null) + testl %ecx, %ecx + jz L(page_cross_continue) + + /* Found zero CHAR so need to test for search CHAR. */ + VPCMP $0, %YMMMATCH, %YMM1, %k1 + kmovd %k1, %eax + /* Shift out search CHAR matches that are before the begining of + src (rdi). */ + shrxl %SHIFT_REG, %eax, %eax + + /* Check if any search CHAR match in range. */ + blsmskl %ecx, %ecx + andl %ecx, %eax + jz L(ret3) bsrl %eax, %eax # ifdef USE_AS_WCSRCHR - /* NB: Multiply wchar_t count by 4 to get the number of bytes. */ - leaq -VEC_SIZE(%rdi, %rax, 4), %rax + leaq (%rdi, %rax, CHAR_SIZE), %rax # else - leaq -VEC_SIZE(%rdi, %rax), %rax + addq %rdi, %rax # endif +L(ret3): ret - .p2align 4 -L(return_null): - xorl %eax, %eax - ret - -END (STRRCHR) +END(STRRCHR) #endif