From patchwork Wed Sep 11 22:51:07 2013
X-Patchwork-Submitter: Wei Mi
X-Patchwork-Id: 274387
Date: Wed, 11 Sep 2013 15:51:07 -0700
Subject: [PATCH] disable use_vector_fp_converts for m_CORE_ALL
From: Wei Mi
To: GCC Patches
Cc: David Li, "Zamyatin, Igor"

For the following testcase 1.c, on Westmere and Sandy Bridge, performance
with the option -mtune-ctrl=^use_vector_fp_converts is better (run time
improves from 3.46s to 2.83s). This suggests that cvtss2sd is often better
than unpcklps+cvtps2pd on recent x86 platforms.

1.c:
float total = 0.2;
int k = 5;

int main()
{
  int i;

  for (i = 0; i < 1000000000; i++) {
    total += (0.5 + k);
  }

  return total == 0.3;
}

Assembly generated by gcc-r201963 without -mtune-ctrl=^use_vector_fp_converts:
.L2:
        unpcklps        %xmm0, %xmm0
        subl    $1, %eax
        cvtps2pd        %xmm0, %xmm0
        addsd   %xmm1, %xmm0
        unpcklpd        %xmm0, %xmm0
        cvtpd2ps        %xmm0, %xmm0
        jne     .L2

Assembly generated by gcc-r201963 with -mtune-ctrl=^use_vector_fp_converts:
.L2:
        cvtss2sd        %xmm0, %xmm0
        subl    $1, %eax
        addsd   %xmm1, %xmm0
        cvtsd2ss        %xmm0, %xmm0
        jne     .L2

But for testcase 2.c (thanks to Igor Zamyatin for the testcase), performance
with the option -mtune-ctrl=^use_vector_fp_converts is worse. Analysis of the
assembly shows that the degradation comes from a partial register stall caused
by cvtsd2ss. Adding pxor %xmm0, %xmm0 before cvtsd2ss b(,%rdx,8), %xmm0 gets
the performance back.

2.c:
double b[1024];
float a[1024];

int main()
{
  int i;

  for (i = 0; i < 1024 * 1024 * 256; i++)
    a[i & 1023] = a[i & 1023] * (float)b[i & 1023];

  return (int)a[512];
}

Without -mtune-ctrl=^use_vector_fp_converts:
.L2:
        movl    %eax, %edx
        addl    $1, %eax
        andl    $1023, %edx
        cmpl    $268435456, %eax
        movsd   b(,%rdx,8), %xmm0
        cvtpd2ps        %xmm0, %xmm0    ==> without partial reg stall because of movsd.
        mulss   a(,%rdx,4), %xmm0
        movss   %xmm0, a(,%rdx,4)
        jne     .L2

With -mtune-ctrl=^use_vector_fp_converts:
.L2:
        movl    %eax, %edx
        addl    $1, %eax
        andl    $1023, %edx
        cmpl    $268435456, %eax
        cvtsd2ss        b(,%rdx,8), %xmm0    ==> with partial reg stall. Needs "pxor %xmm0, %xmm0" inserted before this insn.
        mulss   a(,%rdx,4), %xmm0
        movss   %xmm0, a(,%rdx,4)
        jne     .L2

So the patch turns off use_vector_fp_converts for m_CORE_ALL so that
cvtss2sd/cvtsd2ss are used directly, and adds "pxor %xmmreg, %xmmreg" before
cvtss2sd/cvtsd2ss to break the partial register stall (similar to what r201308
does for cvtsi2ss/cvtsi2sd).

Bootstrap and regression tests pass. OK for trunk?

Thanks,
Wei Mi.

2013-09-11  Wei Mi

	* config/i386/x86-tune.def (DEF_TUNE): Remove m_CORE_ALL.
	* config/i386/i386.md: Add define_peephole2 to break partial reg
	stall for cvtss2sd/cvtsd2ss.

Index: config/i386/x86-tune.def
===================================================================
--- config/i386/x86-tune.def (revision 201963)
+++ config/i386/x86-tune.def (working copy)
@@ -189,7 +189,7 @@ DEF_TUNE (X86_TUNE_NOT_VECTORMODE, "not_
 /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
    from FP to FP. */
 DEF_TUNE (X86_TUNE_USE_VECTOR_FP_CONVERTS, "use_vector_fp_converts",
-          m_CORE_ALL | m_AMDFAM10 | m_GENERIC)
+          m_AMDFAM10 | m_GENERIC)
 /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
    from integer to FP. */
 DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS, "use_vector_converts", m_AMDFAM10)
Index: config/i386/i386.md
===================================================================
--- config/i386/i386.md (revision 201963)
+++ config/i386/i386.md (working copy)
@@ -5075,6 +5075,63 @@
   emit_move_insn (operands[0], CONST0_RTX (mode));
 })
 
+;; Break partial reg stall for cvtsd2ss.
+
+(define_peephole2
+  [(set (match_operand:SF 0 "register_operand")
+        (float_truncate:SF
+          (match_operand:DF 1 "nonimmediate_operand")))]
+  "TARGET_SSE2 && TARGET_SSE_MATH
+   && TARGET_SSE_PARTIAL_REG_DEPENDENCY
+   && optimize_function_for_speed_p (cfun)
+   && reload_completed && SSE_REG_P (operands[0])
+   && peep2_reg_dead_p (0, operands[0])
+   && (!SSE_REG_P (operands[1])
+       || REGNO (operands[0]) != REGNO (operands[1]))"
+  [(set (match_dup 0)
+        (vec_merge:V4SF
+          (vec_duplicate:V4SF
+            (float_truncate:V2SF
+              (match_dup 1)))
+          (match_dup 0)
+          (const_int 1)))]
+{
+  operands[0] = simplify_gen_subreg (V4SFmode, operands[0],
+                                     SFmode, 0);
+  operands[1] = simplify_gen_subreg (V2DFmode, operands[1],
+                                     DFmode, 0);
+  emit_move_insn (operands[0], CONST0_RTX (V4SFmode));
+})
+
+;; Break partial reg stall for cvtss2sd.
+
+(define_peephole2
+  [(set (match_operand:DF 0 "register_operand")
+        (float_extend:DF
+          (match_operand:SF 1 "nonimmediate_operand")))]
+  "TARGET_SSE2 && TARGET_SSE_MATH
+   && TARGET_SSE_PARTIAL_REG_DEPENDENCY
+   && optimize_function_for_speed_p (cfun)
+   && reload_completed && SSE_REG_P (operands[0])
+   && peep2_reg_dead_p (0, operands[0])
+   && (!SSE_REG_P (operands[1])
+       || REGNO (operands[0]) != REGNO (operands[1]))"
+  [(set (match_dup 0)
+        (vec_merge:V2DF
+          (float_extend:V2DF
+            (vec_select:V2SF
+              (match_dup 1)
+              (parallel [(const_int 0) (const_int 1)])))
+          (match_dup 0)
+          (const_int 1)))]
+{
+  operands[0] = simplify_gen_subreg (V2DFmode, operands[0],
+                                     DFmode, 0);
+  operands[1] = simplify_gen_subreg (V4SFmode, operands[1],
+                                     SFmode, 0);
+  emit_move_insn (operands[0], CONST0_RTX (V2DFmode));
+})
+
 ;; Avoid store forwarding (partial memory) stall penalty
 ;; by passing DImode value through XMM registers. */
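
For readers who want to try the idea outside the compiler, the "zero the destination, then convert" idiom that the two peepholes emit can be approximated in C with SSE2 intrinsics. This is only a sketch, not part of the patch: the helper names narrow_zeroed and widen_zeroed are hypothetical, the non-SSE2 fallback is an assumption added for portability, and whether a given compiler actually emits pxor/xorps followed by cvtsd2ss/cvtss2sd for these intrinsics depends on its version and tuning.

```c
/* Sketch only: narrow_zeroed and widen_zeroed are hypothetical helpers
   illustrating the idiom; they are not part of the patch.  */
#if defined(__SSE2__)
#include <emmintrin.h>

/* double -> float.  cvtsd2ss writes only the low 32 bits of its
   destination, so the merge source is given as an all-zero register
   to avoid depending on the register's previous contents.  */
static float narrow_zeroed(const double *p)
{
    __m128 zero = _mm_setzero_ps();              /* typically pxor/xorps */
    __m128 r = _mm_cvtsd_ss(zero, _mm_load_sd(p));
    return _mm_cvtss_f32(r);
}

/* float -> double.  Same idea for cvtss2sd, which writes only the
   low 64 bits of its destination.  */
static double widen_zeroed(const float *p)
{
    __m128d zero = _mm_setzero_pd();
    __m128d r = _mm_cvtss_sd(zero, _mm_load_ss(p));
    return _mm_cvtsd_f64(r);
}
#else
/* Assumed fallback for targets without SSE2: plain scalar conversions.  */
static float narrow_zeroed(const double *p) { return (float)*p; }
static double widen_zeroed(const float *p) { return (double)*p; }
#endif
```

Replacing (float)b[i & 1023] in 2.c with narrow_zeroed(&b[i & 1023]) is one way to compare timings by hand; the numerical result is the same as the plain cast, only the register dependency differs.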