From patchwork Wed Nov 17 14:12:13 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Wilco Dijkstra
X-Patchwork-Id: 1556179
To: 'GNU C Library'
Subject: [PATCH] AArch64: Optimize memcmp
Date: Wed, 17 Nov 2021 14:12:13 +0000
List-Id: Libc-alpha mailing list
From: Wilco Dijkstra
Reply-To: Wilco Dijkstra

Rewrite memcmp to improve performance. On small and medium inputs
performance is 10-20% better. Large inputs use a SIMD loop processing
64 bytes per iteration, which is 30-50% faster depending on the size.

Passes regress, OK for commit?

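For readers who want the idea behind the 64-byte loop without reading the
assembly, here is a rough portable C model. It is only a sketch and not the
code in this patch: the names sketch_memcmp and block64_differs are
hypothetical, the XOR/OR reduction over 64-bit words is a scalar stand-in for
the eor/umaxp sequence, and the first difference is resolved with a simple
byte loop rather than the rev/clz sequence the assembly uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero if the two 64-byte blocks differ.
   ORs together the XOR of eight 64-bit word pairs, a scalar stand-in
   for the vector reduction in the patch's 64-byte loop.  */
static int
block64_differs (const unsigned char *a, const unsigned char *b)
{
  uint64_t acc = 0;
  for (size_t i = 0; i < 64; i += 8)
    {
      uint64_t x, y;
      memcpy (&x, a + i, 8);	/* memcpy keeps unaligned loads well defined.  */
      memcpy (&y, b + i, 8);
      acc |= x ^ y;
    }
  return acc != 0;
}

/* Hypothetical reference model with memcmp semantics: returns a
   negative, zero or positive value.  */
static int
sketch_memcmp (const void *s1, const void *s2, size_t n)
{
  const unsigned char *a = s1, *b = s2;
  size_t i = 0;

  /* Skip whole 64-byte blocks that are identical.  */
  while (n - i >= 64 && !block64_differs (a + i, b + i))
    i += 64;

  /* Resolve the first difference (or the remaining tail) byte by byte,
     comparing as unsigned char as memcmp requires.  */
  for (; i < n; i++)
    if (a[i] != b[i])
      return a[i] < b[i] ? -1 : 1;
  return 0;
}
```

Calling sketch_memcmp (p, q, n) should agree in sign with memcmp (p, q, n);
the model says nothing about performance, which depends on the alignment and
load scheduling done in the assembly.
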
diff --git a/sysdeps/aarch64/memcmp.S b/sysdeps/aarch64/memcmp.S
index c1937f6f5c103a6f74383d7d40aeca1b5ad6ff59..c7d56a8af01b4f5e7fd5cc45407d52dbdfa91e98 100644
--- a/sysdeps/aarch64/memcmp.S
+++ b/sysdeps/aarch64/memcmp.S
@@ -22,105 +22,79 @@
 
 /* Assumptions:
  *
- * ARMv8-a, AArch64, unaligned accesses.
+ * ARMv8-a, AArch64, Advanced SIMD, unaligned accesses.
  */
 
-/* Parameters and result. */
-#define src1	x0
-#define src2	x1
-#define limit	x2
-#define result	w0
-
-/* Internal variables. */
-#define data1	x3
-#define data1w	w3
-#define data1h	x4
-#define data2	x5
-#define data2w	w5
-#define data2h	x6
-#define tmp1	x7
-#define tmp2	x8
-
-ENTRY_ALIGN (memcmp, 6)
+#define src1	x0
+#define src2	x1
+#define limit	x2
+#define result	w0
+
+#define data1	x3
+#define data1w	w3
+#define data2	x4
+#define data2w	w4
+#define data3	x5
+#define data3w	w5
+#define data4	x6
+#define data4w	w6
+#define tmp	x6
+#define src1end	x7
+#define src2end	x8
+
+
+ENTRY (memcmp)
 	PTR_ARG (0)
 	PTR_ARG (1)
 	SIZE_ARG (2)
 
-	subs	limit, limit, 16
+	cmp	limit, 16
 	b.lo	L(less16)
-
-	ldp	data1, data1h, [src1], 16
-	ldp	data2, data2h, [src2], 16
+	ldp	data1, data3, [src1]
+	ldp	data2, data4, [src2]
 	ccmp	data1, data2, 0, ne
-	ccmp	data1h, data2h, 0, eq
-	b.ne	L(return64)
+	ccmp	data3, data4, 0, eq
+	b.ne	L(return2)
 
-	subs	limit, limit, 16
+	add	src1end, src1, limit
+	add	src2end, src2, limit
+	cmp	limit, 32
 	b.ls	L(last_bytes)
-	cmp	limit, 112
-	b.lo	L(loop16)
-
-	and	tmp1, src1, 15
-	add	limit, limit, tmp1
-	sub	src1, src1, tmp1
-	sub	src2, src2, tmp1
-	subs	limit, limit, 48
+	cmp	limit, 160
+	b.hs	L(loop_align)
+	sub	limit, limit, 32
 
-	/* Compare 128 up bytes using aligned access. */
 	.p2align 4
-L(loop64):
-	ldp	data1, data1h, [src1]
-	ldp	data2, data2h, [src2]
-	cmp	data1, data2
-	ccmp	data1h, data2h, 0, eq
-	b.ne	L(return64)
-
-	ldp	data1, data1h, [src1, 16]
-	ldp	data2, data2h, [src2, 16]
-	cmp	data1, data2
-	ccmp	data1h, data2h, 0, eq
-	b.ne	L(return64)
-
-	ldp	data1, data1h, [src1, 32]
-	ldp	data2, data2h, [src2, 32]
-	cmp	data1, data2
-	ccmp	data1h, data2h, 0, eq
-	b.ne	L(return64)
-
-	ldp	data1, data1h, [src1, 48]
-	ldp	data2, data2h, [src2, 48]
+L(loop32):
+	ldp	data1, data3, [src1, 16]
+	ldp	data2, data4, [src2, 16]
 	cmp	data1, data2
-	ccmp	data1h, data2h, 0, eq
-	b.ne	L(return64)
+	ccmp	data3, data4, 0, eq
+	b.ne	L(return2)
+	cmp	limit, 16
+	b.ls	L(last_bytes)
 
-	subs	limit, limit, 64
-	add	src1, src1, 64
-	add	src2, src2, 64
-	b.pl	L(loop64)
-	adds	limit, limit, 48
-	b.lo	L(last_bytes)
-
-L(loop16):
-	ldp	data1, data1h, [src1], 16
-	ldp	data2, data2h, [src2], 16
+	ldp	data1, data3, [src1, 32]
+	ldp	data2, data4, [src2, 32]
 	cmp	data1, data2
-	ccmp	data1h, data2h, 0, eq
-	b.ne	L(return64)
+	ccmp	data3, data4, 0, eq
+	b.ne	L(return2)
+	add	src1, src1, 32
+	add	src2, src2, 32
+L(last64):
+	subs	limit, limit, 32
+	b.hi	L(loop32)
 
-	subs	limit, limit, 16
-	b.hi	L(loop16)
 	/* Compare last 1-16 bytes using unaligned access. */
 L(last_bytes):
-	add	src1, src1, limit
-	add	src2, src2, limit
-	ldp	data1, data1h, [src1]
-	ldp	data2, data2h, [src2]
+	ldp	data1, data3, [src1end, -16]
+	ldp	data2, data4, [src2end, -16]
+L(return2):
+	cmp	data1, data2
+	csel	data1, data1, data3, ne
+	csel	data2, data2, data4, ne
 
 	/* Compare data bytes and set return value to 0, -1 or 1. */
-L(return64):
-	cmp	data1, data2
-	csel	data1, data1, data1h, ne
-	csel	data2, data2, data2h, ne
 L(return):
 #ifndef __AARCH64EB__
 	rev	data1, data1
@@ -133,45 +107,98 @@ L(return):
 
 	.p2align 4
 L(less16):
-	adds	limit, limit, 8
-	b.lo	L(less8)	//lo:<
+	add	src1end, src1, limit
+	add	src2end, src2, limit
+	tbz	limit, 3, L(less8)
 	ldr	data1, [src1]
 	ldr	data2, [src2]
-	/* equal 8 optimized */
-	ccmp	data1, data2, 0, ne
-	b.ne	L(return)
-
-	ldr	data1, [src1, limit]
-	ldr	data2, [src2, limit]
-	b	L(return)
+	ldr	data3, [src1end, -8]
+	ldr	data4, [src2end, -8]
+	b	L(return2)
 
 	.p2align 4
 L(less8):
-	adds	limit, limit, 4
-	b.lo	L(less4)
+	tbz	limit, 2, L(less4)
 	ldr	data1w, [src1]
 	ldr	data2w, [src2]
-	ccmp	data1w, data2w, 0, ne
-	b.ne	L(return)
-	ldr	data1w, [src1, limit]
-	ldr	data2w, [src2, limit]
-	b	L(return)
+	ldr	data3w, [src1end, -4]
+	ldr	data4w, [src2end, -4]
+	b	L(return2)
 
-	.p2align 4
 L(less4):
-	adds	limit, limit, 4
-	b.eq	L(ret_0)
-
-L(byte_loop):
-	ldrb	data1w, [src1], 1
-	ldrb	data2w, [src2], 1
-	subs	limit, limit, 1
-	ccmp	data1w, data2w, 0, ne	/* NZCV = 0b0000. */
-	b.eq	L(byte_loop)
+	tbz	limit, 1, L(less2)
+	ldrh	data1w, [src1]
+	ldrh	data2w, [src2]
+	cmp	data1w, data2w
+	b.ne	L(return)
+L(less2):
+	mov	result, 0
+	tbz	limit, 0, L(return_zero)
+	ldrb	data1w, [src1end, -1]
+	ldrb	data2w, [src2end, -1]
 	sub	result, data1w, data2w
+L(return_zero):
 	ret
-L(ret_0):
-	mov	result, 0
+
+L(loop_align):
+	ldp	data1, data3, [src1, 16]
+	ldp	data2, data4, [src2, 16]
+	cmp	data1, data2
+	ccmp	data3, data4, 0, eq
+	b.ne	L(return2)
+
+	/* Align src2 and adjust src1, src2 and limit. */
+	and	tmp, src2, 15
+	sub	tmp, tmp, 16
+	sub	src2, src2, tmp
+	add	limit, limit, tmp
+	sub	src1, src1, tmp
+	sub	limit, limit, 64 + 16
+
+	.p2align 4
+L(loop64):
+	ldr	q0, [src1, 16]
+	ldr	q1, [src2, 16]
+	subs	limit, limit, 64
+	ldr	q2, [src1, 32]
+	ldr	q3, [src2, 32]
+	eor	v0.16b, v0.16b, v1.16b
+	eor	v1.16b, v2.16b, v3.16b
+	ldr	q2, [src1, 48]
+	ldr	q3, [src2, 48]
+	umaxp	v0.16b, v0.16b, v1.16b
+	ldr	q4, [src1, 64]!
+	ldr	q5, [src2, 64]!
+	eor	v1.16b, v2.16b, v3.16b
+	eor	v2.16b, v4.16b, v5.16b
+	umaxp	v1.16b, v1.16b, v2.16b
+	umaxp	v0.16b, v0.16b, v1.16b
+	umaxp	v0.16b, v0.16b, v0.16b
+	fmov	tmp, d0
+	ccmp	tmp, 0, 0, hi
+	b.eq	L(loop64)
+
+	/* If equal, process last 1-64 bytes using scalar loop. */
+	add	limit, limit, 64 + 16
+	cbz	tmp, L(last64)
+
+	/* Determine the 8-byte aligned offset of the first difference. */
+#ifdef __AARCH64EB__
+	rev16	tmp, tmp
+#endif
+	rev	tmp, tmp
+	clz	tmp, tmp
+	bic	tmp, tmp, 7
+	sub	tmp, tmp, 48
+	ldr	data1, [src1, tmp]
+	ldr	data2, [src2, tmp]
+#ifndef __AARCH64EB__
+	rev	data1, data1
+	rev	data2, data2
+#endif
+	mov	result, 1
+	cmp	data1, data2
+	cneg	result, result, lo
 	ret
 
 END (memcmp)