From patchwork Mon Mar 25 21:52:22 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Anton Youdkevitch
X-Patchwork-Id: 1064847
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
From: Anton Youdkevitch
Subject: [PATCH v4] aarch64: thunderx2 memcpy optimizations for ext-based code path
To: Wilco.Dijkstra@arm.com, libc-alpha@sourceware.org
Message-ID: <5C994D96.6000303@bell-sw.com>
Date: Tue, 26 Mar 2019 00:52:22 +0300

Wilco,

I appreciate your comments very much. Here is the patch
considering the points you made.

1. The always-taken conditional branch at the beginning is
   removed.

2. The epilogue code is placed after the end of the loop to
   reduce the number of branches.

3. The redundant "mov" instructions inside the loop are gone
   due to the changed order of the registers in the ext
   instructions inside the loop.

4. Invariant code in the loop epilogue is no longer repeated
   for each ext chunk.

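For readers less familiar with ext, here is a rough C model of the
byte movement the loop relies on: each ext picks a 16-byte window
out of two adjacent 16-byte chunks, so every loaded chunk feeds two
consecutive windows. This is only an illustrative sketch. The names
ext_model and shifted_copy are hypothetical and do not appear in the
patch, and the code models data movement only, not the register
allocation or instruction scheduling of the assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of AArch64 "ext Vd.16b, Vn.16b, Vm.16b, #imm": the result is
   bytes imm..imm+15 of the 32-byte concatenation in which N supplies
   the low 16 bytes and M the high 16 bytes.  */
static void
ext_model (uint8_t d[16], const uint8_t n[16], const uint8_t m[16],
           unsigned imm)
{
  uint8_t pair[32];

  assert (imm < 16);
  memcpy (pair, n, 16);
  memcpy (pair + 16, m, 16);
  memcpy (d, pair + imm, 16);
}

/* Hypothetical helper: slide a 16-byte window, offset by IMM, over
   NCHUNKS consecutive 16-byte chunks.  It writes 16 * (NCHUNKS - 1)
   bytes to DST, which must not overlap CHUNKS.  */
static void
shifted_copy (uint8_t *dst, const uint8_t *chunks, size_t nchunks,
              unsigned imm)
{
  for (size_t i = 0; i + 1 < nchunks; i++)
    ext_model (dst + 16 * i, chunks + 16 * i, chunks + 16 * (i + 1), imm);
}
```

With an 80-byte input (five chunks) and imm = 5, shifted_copy writes
the same 64 bytes as memcpy (dst, chunks + 5, 64).
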
make check shows no regression

diff --git a/sysdeps/aarch64/multiarch/memcpy_thunderx2.S b/sysdeps/aarch64/multiarch/memcpy_thunderx2.S
index b2215c1..c8c5e8b 100644
--- a/sysdeps/aarch64/multiarch/memcpy_thunderx2.S
+++ b/sysdeps/aarch64/multiarch/memcpy_thunderx2.S
@@ -382,7 +382,8 @@ L(bytes_0_to_3):
 	strb	A_lw, [dstin]
 	strb	B_lw, [dstin, tmp1]
 	strb	A_hw, [dstend, -1]
-L(end): ret
+L(end):
+	ret
 
 	.p2align 4
 
@@ -544,6 +545,7 @@ L(dst_unaligned):
 	str	C_q, [dst], #16
 	ldp	F_q, G_q, [src], #32
 	bic	dst, dst, 15
+	subs	count, count, 32
 	adrp	tmp2, L(ext_table)
 	add	tmp2, tmp2, :lo12:L(ext_table)
 	add	tmp2, tmp2, tmp1, LSL #2
@@ -556,31 +558,22 @@ L(dst_unaligned):
 L(ext_size_ ## shft):;\
 	ext	A_v.16b, C_v.16b, D_v.16b, 16-shft;\
 	ext	B_v.16b, D_v.16b, E_v.16b, 16-shft;\
-	subs	count, count, 32;\
-	b.ge	2f;\
-1:;\
-	stp	A_q, B_q, [dst], #32;\
 	ext	H_v.16b, E_v.16b, F_v.16b, 16-shft;\
-	ext	I_v.16b, F_v.16b, G_v.16b, 16-shft;\
-	stp	H_q, I_q, [dst], #16;\
-	add	dst, dst, tmp1;\
-	str	G_q, [dst], #16;\
-	b	L(copy_long_check32);\
-2:;\
+1:;\
 	stp	A_q, B_q, [dst], #32;\
 	prfm	pldl1strm, [src, MEMCPY_PREFETCH_LDR];\
-	ldp	D_q, J_q, [src], #32;\
-	ext	H_v.16b, E_v.16b, F_v.16b, 16-shft;\
+	ldp	C_q, D_q, [src], #32;\
 	ext	I_v.16b, F_v.16b, G_v.16b, 16-shft;\
-	mov	C_v.16b, G_v.16b;\
 	stp	H_q, I_q, [dst], #32;\
+	ext	A_v.16b, G_v.16b, C_v.16b, 16-shft;\
+	ext	B_v.16b, C_v.16b, D_v.16b, 16-shft;\
 	ldp	F_q, G_q, [src], #32;\
-	ext	A_v.16b, C_v.16b, D_v.16b, 16-shft;\
-	ext	B_v.16b, D_v.16b, J_v.16b, 16-shft;\
-	mov	E_v.16b, J_v.16b;\
+	ext	H_v.16b, D_v.16b, F_v.16b, 16-shft;\
 	subs	count, count, 64;\
-	b.ge	2b;\
-	b	1b;\
+	b.ge	1b;\
+2:;\
+	ext	I_v.16b, F_v.16b, G_v.16b, 16-shft;\
+	b	L(ext_tail);
 
 EXT_CHUNK(1)
 EXT_CHUNK(2)
@@ -598,6 +591,14 @@ EXT_CHUNK(13)
 EXT_CHUNK(14)
 EXT_CHUNK(15)
 
+L(ext_tail):
+	stp	A_q, B_q, [dst], #32
+	stp	H_q, I_q, [dst], #16
+	add	dst, dst, tmp1
+	str	G_q, [dst], #16
+	b	L(copy_long_check32)
+
+
 END (MEMCPY)
 	.section	.rodata
 	.p2align	4