From patchwork Tue Oct 22 09:40:03 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Xuelei Zhang
X-Patchwork-Id: 1181184
From: Xuelei Zhang
Subject: [PATCH v2] aarch64: Optimized implementation of strnlen
Date: Tue, 22 Oct 2019 17:40:03 +0800
Message-ID: <20191022094003.9612-1-zhangxuelei4@huawei.com>

Optimize the strnlen implementation by using vector operations and
loop unrolling in the main loop. Compared to aarch64/strnlen.S, it
reduces the latency of cases in bench-strnlen by 11%~24% when the length
of src is greater than 64 bytes, with gains throughout the benchmark.
---
 sysdeps/aarch64/strnlen.S | 52 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 51 insertions(+), 1 deletion(-)

diff --git a/sysdeps/aarch64/strnlen.S b/sysdeps/aarch64/strnlen.S
index 70283c80749..a57753b0a28 100644
--- a/sysdeps/aarch64/strnlen.S
+++ b/sysdeps/aarch64/strnlen.S
@@ -45,6 +45,11 @@
 #define pos		x13
 #define limit_wd	x14
 
+#define dataq		q2
+#define datav		v2
+#define datab2		b3
+#define dataq2		q3
+#define datav2		v3
 #define REP8_01 0x0101010101010101
 #define REP8_7f 0x7f7f7f7f7f7f7f7f
 #define REP8_80 0x8080808080808080
@@ -71,7 +76,7 @@ ENTRY_ALIGN_AND_PAD (__strnlen, 6, 9)
 	   cycle, as we get much better parallelism out of the operations.  */
 
 	/* Start of critial section -- keep to one 64Byte cache line.  */
-L(loop):
+
 	ldp	data1, data2, [src], #16
 L(realigned):
 	sub	tmp1, data1, zeroones
@@ -119,6 +124,51 @@ L(nul_in_data2):
 	csel	len, len, limit, ls		/* Return the lower value.  */
 	RET
 
+L(loop):
+	ldr	dataq, [src], #16
+	uminv	datab2, datav.16b
+	mov	tmp1, datav2.d[0]
+	subs	limit_wd, limit_wd, #1
+	ccmp	tmp1, #0, #4, pl	/* NZCV = 0100  */
+	b.eq	L(loop_end)
+	ldr	dataq, [src], #16
+	uminv	datab2, datav.16b
+	mov	tmp1, datav2.d[0]
+	subs	limit_wd, limit_wd, #1
+	ccmp	tmp1, #0, #4, pl	/* NZCV = 0100  */
+	b.ne	L(loop)
+L(loop_end):
+	/* End of critical section -- keep to one 64Byte cache line.  */
+
+	cbnz	tmp1, L(hit_limit)	/* No null in final Qword.  */
+
+	/* We know there's a null in the final Qword.  The easiest thing
+	   to do now is work out the length of the string and return
+	   MIN (len, limit).  */
+
+#ifdef __AARCH64EB__
+	rev64	datav.16b, datav.16b
+#endif
+	/* Set the NULL byte as 0xff and the rest as 0x00, move the data into a
+	   pair of scalars and then compute the length from the earliest NULL
+	   byte.  */
+
+	cmeq	datav.16b, datav.16b, #0
+	mov	data1, datav.d[0]
+	mov	data2, datav.d[1]
+	cmp	data1, 0
+	csel	data1, data1, data2, ne
+	sub	len, src, srcin
+	sub	len, len, #16
+	rev	data1, data1
+	add	tmp2, len, 8
+	clz	tmp1, data1
+	csel	len, len, tmp2, ne
+	add	len, len, tmp1, lsr 3
+	cmp	len, limit
+	csel	len, len, limit, ls		/* Return the lower value.  */
+	RET
+
 L(misaligned):
 	/* Deal with a partial first word.
 	   We're doing two things in parallel here;