From patchwork Tue Oct 26 21:33:11 2010
X-Patchwork-Submitter: Doug Kwan (關振德)
X-Patchwork-Id: 69293
Subject: Re: [PATCH][ARM] Optimized 64-bit multiplication for THUMB-1
From: Doug Kwan (關振德)
To: Paul Brook
Cc: gcc-patches, Nick Clifton, Richard Earnshaw
Date: Tue, 26 Oct 2010 14:33:11 -0700
In-Reply-To: <201010260036.04736.paul@codesourcery.com>
References: <201010221920.37094.paul@codesourcery.com>
 <201010260036.04736.paul@codesourcery.com>
Mailing-List: gcc-patches@gcc.gnu.org

Hi,

    I looked at the definition of the ARM_FUNC_START macro. The cases in
which the macro does not force the use of ARM mode are:

- __thumb2__ is defined: the macro is defined, but no .arm is used.
- __ARM_ARCH_6M__ is defined: the macro is not defined.

In both of these cases the code protected by the test is not assembled, so
no problem is observed. I can add an explicit .arm, as in the attached
patch. Would that be better?

-Doug

On Mon, Oct 25, 2010 at 4:36 PM, Paul Brook wrote:
>> Hi Paul,
>>
>> Thank you very much for your review and comments. I have fixed the
>> push/pop and the use of 2-argument code in 32-bit code. I am not quite
>> sure what the problem with the __thumb2__ test is. I built arm-eabi-gcc
>> for armv4, armv5te, armv7-a and with no arch, and all builds were
>> successful. I did change the test so that forcing ARM mode is only
>> done if:
>
> No. You're missing the point. ARM_FUNC_START does not force the use of ARM
> mode. See the comments near the definition of that macro.
>
> Paul

Index: gcc/gcc/config/arm/lib1funcs.asm
===================================================================
--- gcc/gcc/config/arm/lib1funcs.asm	(revision 165462)
+++ gcc/gcc/config/arm/lib1funcs.asm	(working copy)
@@ -1274,6 +1274,91 @@ LSYM(Lover12):
 #endif
 #endif /* L_dvmd_lnx */
+
+#ifdef L_muldi3
+
+/* ------------------------------------------------------------------------ */
+/* Dword multiplication operation.
+
+   The THUMB ISA lacks an instruction to compute the higher half of the
+   64-bit result of a 32-bit by 32-bit multiplication.  This makes 64-bit
+   multiplication difficult to implement efficiently.  The ARM ISAs after
+   V3M have UMULL and MLA, which can be used to implement 64-bit
+   multiplication efficiently.  On a target that supports both the ARM V3M+
+   and THUMB ISAs (but not THUMB2), we want to use the ARM version of
+   _muldi3 in the THUMB libgcc.
+
+   We do not need to use the ARM version for THUMB2 targets, as THUMB2
+   targets also support MLA and UMULL.  */
+
+/* We cannot use the faster version in the following situations:
+
+   - ARM architectures older than V3M lack the UMULL instruction.
+   - The target is ARMv6-M, which does not run ARM code.  */
+
+#undef USE_FAST_MULDI3
+#if (__ARM_ARCH__ > 3 || defined(__ARM_ARCH_3M__)) && !defined(__ARM_ARCH_6M__)
+#define USE_FAST_MULDI3
+#endif
+
+/* Force using ARM code if:
+   1. ARM mode has UMULL (i.e. USE_FAST_MULDI3 is defined) and
+   2. this is THUMB-1 mode and
+   3. interworking is enabled.  */
+
+#if defined(USE_FAST_MULDI3) \
+    && (defined(__thumb__) && !defined(__thumb2__)) \
+    && defined(__THUMB_INTERWORK__)
+	ARM_FUNC_START muldi3
+	ARM_FUNC_ALIAS aeabi_lmul muldi3
+	.arm
+#else
+	FUNC_START muldi3
+	FUNC_ALIAS aeabi_lmul muldi3
+#endif
+
+#if defined(USE_FAST_MULDI3)
+	/* Fast version for ARM with UMULL, and for THUMB2.  */
+	mul	xxh, xxh, yyl
+	mla	yyh, xxl, yyh, xxh
+	umull	xxl, xxh, yyl, xxl
+	add	xxh, xxh, yyh
+	RET
+#else
+	/* Slow version for both THUMB and older ARMs lacking UMULL.  */
+	mul	xxh, yyl		/* xxh := AH*BL */
+	do_push	{r4, r5, r6, r7}
+	mul	yyh, xxl		/* yyh := AL*BH */
+	ldr	r4, .L_mask
+	lsr	r5, xxl, #16		/* r5 := AL>>16 */
+	lsr	r6, yyl, #16		/* r6 := BL>>16 */
+	lsr	r7, xxl, #16		/* r7 := AL>>16 */
+	mul	r5, r6			/* r5 = (AL>>16) * (BL>>16) */
+	and	xxl, r4			/* xxl = AL & 0xffff */
+	and	yyl, r4			/* yyl = BL & 0xffff */
+	add	xxh, yyh		/* xxh = AH*BL + AL*BH */
+	mul	r6, xxl			/* r6 = (AL&0xffff) * (BL>>16) */
+	mul	r7, yyl			/* r7 = (AL>>16) * (BL&0xffff) */
+	add	xxh, r5
+	mul	xxl, yyl		/* xxl = (AL&0xffff) * (BL&0xffff) */
+	mov	r4, #0
+	adds	r6, r7			/* partial sum into result[47:16].  */
+	adc	r4, r4			/* carry into result[48].  */
+	lsr	yyh, r6, #16
+	lsl	r4, r4, #16
+	lsl	yyl, r6, #16
+	add	xxh, r4
+	adds	xxl, yyl
+	adc	xxh, yyh
+	do_pop	{r4, r5, r6, r7}
+	RET
+	.align	2
+.L_mask:
+	.word	65535
+#endif
+
+	FUNC_END muldi3
+#endif
+
 #ifdef L_clear_cache
 #if defined __ARM_EABI__ && defined __linux__
 @ EABI GNU/Linux call to cacheflush syscall.

Index: gcc/gcc/config/arm/t-strongarm-elf
===================================================================
--- gcc/gcc/config/arm/t-strongarm-elf	(revision 165462)
+++ gcc/gcc/config/arm/t-strongarm-elf	(working copy)
@@ -16,7 +16,8 @@
 # along with GCC; see the file COPYING3.  If not see
 # <http://www.gnu.org/licenses/>.
 
-LIB1ASMFUNCS += _udivsi3 _divsi3 _umodsi3 _modsi3 _dvmd_tls _bb_init_func _clzsi2 _clzdi2
+LIB1ASMFUNCS += _udivsi3 _divsi3 _umodsi3 _modsi3 _dvmd_tls _bb_init_func \
+	_clzsi2 _clzdi2 _muldi3
 
 # We want fine grained libraries, so use the new code to build the
 # floating point emulation libraries.

Index: gcc/gcc/config/arm/t-vxworks
===================================================================
--- gcc/gcc/config/arm/t-vxworks	(revision 165462)
+++ gcc/gcc/config/arm/t-vxworks	(working copy)
@@ -16,7 +16,8 @@
 # along with GCC; see the file COPYING3.  If not see
 # <http://www.gnu.org/licenses/>.
 
-LIB1ASMFUNCS += _udivsi3 _divsi3 _umodsi3 _modsi3 _dvmd_tls _bb_init_func _call_via_rX _interwork_call_via_rX _clzsi2 _clzdi2
+LIB1ASMFUNCS += _udivsi3 _divsi3 _umodsi3 _modsi3 _dvmd_tls _bb_init_func \
+	_call_via_rX _interwork_call_via_rX _clzsi2 _clzdi2 _muldi3
 
 # We want fine grained libraries, so use the new code to build the
 # floating point emulation libraries.

Index: gcc/gcc/config/arm/t-pe
===================================================================
--- gcc/gcc/config/arm/t-pe	(revision 165462)
+++ gcc/gcc/config/arm/t-pe	(working copy)
@@ -17,7 +17,7 @@
 # along with GCC; see the file COPYING3.  If not see
 # <http://www.gnu.org/licenses/>.
 
-LIB1ASMFUNCS += _udivsi3 _divsi3 _umodsi3 _modsi3 _dvmd_tls _call_via_rX _interwork_call_via_rX _clzsi2 _clzdi2
+LIB1ASMFUNCS += _udivsi3 _divsi3 _umodsi3 _modsi3 _dvmd_tls _call_via_rX _interwork_call_via_rX _clzsi2 _clzdi2 _muldi3
 
 # We want fine grained libraries, so use the new code to build the
 # floating point emulation libraries.

Index: gcc/gcc/config/arm/t-arm-elf
===================================================================
--- gcc/gcc/config/arm/t-arm-elf	(revision 165462)
+++ gcc/gcc/config/arm/t-arm-elf	(working copy)
@@ -29,7 +29,7 @@ LIB1ASMFUNCS += _udivsi3 _divsi3 _umodsi3 _modsi3
 	_arm_truncdfsf2 _arm_negsf2 _arm_addsubsf3 _arm_muldivsf3 \
 	_arm_cmpsf2 _arm_unordsf2 _arm_fixsfsi _arm_fixunssfsi \
 	_arm_floatdidf _arm_floatdisf _arm_floatundidf _arm_floatundisf \
-	_clzsi2 _clzdi2
+	_clzsi2 _clzdi2 _muldi3
 
 MULTILIB_OPTIONS = marm/mthumb
 MULTILIB_DIRNAMES = arm thumb

Index: gcc/gcc/config/arm/t-linux
===================================================================
--- gcc/gcc/config/arm/t-linux	(revision 165462)
+++ gcc/gcc/config/arm/t-linux	(working copy)
@@ -23,7 +23,7 @@
 TARGET_LIBGCC2_CFLAGS = -fomit-frame-pointer -fPIC
 LIB1ASMSRC = arm/lib1funcs.asm
 LIB1ASMFUNCS = _udivsi3 _divsi3 _umodsi3 _modsi3 _dvmd_lnx _clzsi2 _clzdi2 \
-	_arm_addsubdf3 _arm_addsubsf3
+	_arm_addsubdf3 _arm_addsubsf3 _muldi3
 
 # MULTILIB_OPTIONS = mhard-float/msoft-float
 # MULTILIB_DIRNAMES = hard-float soft-float

Index: gcc/gcc/config/arm/t-symbian
===================================================================
--- gcc/gcc/config/arm/t-symbian	(revision 165462)
+++ gcc/gcc/config/arm/t-symbian	(working copy)
@@ -16,7 +16,8 @@
 # along with GCC; see the file COPYING3.  If not see
 # <http://www.gnu.org/licenses/>.
 
-LIB1ASMFUNCS += _bb_init_func _call_via_rX _interwork_call_via_rX _clzsi2 _clzdi2
+LIB1ASMFUNCS += _bb_init_func _call_via_rX _interwork_call_via_rX _clzsi2 \
+	_clzdi2 _muldi3
 
 # These functions have __aeabi equivalents and will never be called by GCC.
 # By putting them in LIB1ASMFUNCS, we avoid the standard libgcc2.c code being

Index: gcc/gcc/config/arm/t-wince-pe
===================================================================
--- gcc/gcc/config/arm/t-wince-pe	(revision 165462)
+++ gcc/gcc/config/arm/t-wince-pe	(working copy)
@@ -16,7 +16,8 @@
 # along with GCC; see the file COPYING3.  If not see
 # <http://www.gnu.org/licenses/>.
 
-LIB1ASMFUNCS += _udivsi3 _divsi3 _umodsi3 _modsi3 _dvmd_tls _call_via_rX _interwork_call_via_rX _clzsi2 _clzdi2
+LIB1ASMFUNCS += _udivsi3 _divsi3 _umodsi3 _modsi3 _dvmd_tls _call_via_rX \
+	_interwork_call_via_rX _clzsi2 _clzdi2 _muldi3
 
 # We want fine grained libraries, so use the new code to build the
 # floating point emulation libraries.