From patchwork Wed Sep 11 22:51:07 2013
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Wei Mi <wmi@google.com>
X-Patchwork-Id: 274387
Return-Path: 
 <gcc-patches-return-348880-incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from sourceware.org (server1.sourceware.org [209.132.180.131])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(Client did not present a certificate)
	by ozlabs.org (Postfix) with ESMTPS id D65862C008A
	for <incoming@patchwork.ozlabs.org>;
	Thu, 12 Sep 2013 08:51:17 +1000 (EST)
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender
	:mime-version:date:message-id:subject:from:to:cc:content-type;
	q=dns; s=default; b=atW79uQdu0NMc7IQPssdfGfTNBI7JLew2U/aF3JeHcf
	uWl1V/ePXFO3AlpOBrnysCT2rhSOapmHAuBSAt0ewiEB/77VbDfWQQF+5V5d+CFx
	S/NosNXCpEbE0iAPgsODgscsmO29x/nON5CTj9qk8aws+VtHd11UjdKoBx1GDCgc
	=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender
	:mime-version:date:message-id:subject:from:to:cc:content-type;
	s=default; bh=OdgLck8mhsLtnIK7VW07e0QArbA=; b=IILjLdDmqW5Hx2jH8
	R+WvRJgXPwQk1CJKQm6N7jyPNkPAzfHP+pmrnIiaHJ9Gg4swgzU2BnAmzB+yY+Jr
	Sm1mufKTYcRDXnAp8aKJaPaMPHDgJDYqakeXGKdno+JWUwp36nGsJe9oR9FIAUH0
	Ag1DrapBuQ2hpnRcbmGnkG2N1Y=
Received: (qmail 11617 invoked by alias); 11 Sep 2013 22:51:10 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <mailto:gcc-patches-unsubscribe-##L=##H@gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Delivered-To: mailing list gcc-patches@gcc.gnu.org
Received: (qmail 11602 invoked by uid 89); 11 Sep 2013 22:51:09 -0000
Received: from mail-ie0-f178.google.com (HELO mail-ie0-f178.google.com)
	(209.85.223.178) by sourceware.org
	(qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-SHA encrypted)
	ESMTPS; Wed, 11 Sep 2013 22:51:09 +0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-0.7 required=5.0 tests=AWL, BAYES_00,
	LOTS_OF_MONEY, NO_RELAYS autolearn=no version=3.3.2
X-HELO: mail-ie0-f178.google.com
Received: by mail-ie0-f178.google.com with SMTP id to1so10568568ieb.9 for
	<gcc-patches@gcc.gnu.org>; Wed, 11 Sep 2013 15:51:07 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net;
	s=20130820;
	h=x-gm-message-state:mime-version:date:message-id:subject:from:to:cc
	:content-type; bh=2pBzz32lkwn7Ye2YqFJ33cl6ran3UWGHrpHLl7O3IEU=;
	b=AJKjMADvsu/OufJweqCTNOpTwkh3+E9ZxoUIsTWn+PW+qHDv+7F45AL9sDu24V+QDx
	8Idha6XUnj8cx9k9IeLurdzbnZ2BpiUc7YpoegCKw4tGfiuVIkpYNAI3Urd+Beu1lH8X
	FO6FhcUJEMtIJzHwmTTlNbQixN97I4rF1aIN7xqtTPKwSOk3oNK7F07UGcnup5qo9T4L
	2SVgZGobJ1W6HnX6QvBF7tKhh48IwOCcUs8JWAeuxiP2sNWNFeibOLV8svpaRB3OJPlR
	/Z8hcrW6itk9RVJ9R+fbEMuM+M7c+eOn1Xt7r9dsXjomDTZ2gWtJrABIvGvCuVyl7VFN
	qWlw==
X-Gm-Message-State: 
 ALoCoQmRqpbUpfhykAKnI7K6DIikwzW3/qYLobMvG10Uj+ZSm8O8412GQKa7+O3pc3dIx9bdCNz2rdjawEUnfUeUBHh2jHEvDSSuPREvh0FG2JD76IjJh5AEyStvqD5/8XQr/dmH38tbymHTAlc9YdPa89X/+oueaV2EVglv+tGSJeGjt+F0qqMPU7tKg5qTKOqiygRJPujMC9KSdb4bMaVQifXwrqaAWQ==
MIME-Version: 1.0
X-Received: by 10.50.6.106 with SMTP id z10mr12813796igz.9.1378939867302;
	Wed, 11 Sep 2013 15:51:07 -0700 (PDT)
Received: by 10.64.35.193 with HTTP; Wed, 11 Sep 2013 15:51:07 -0700 (PDT)
Date: Wed, 11 Sep 2013 15:51:07 -0700
Message-ID: 
 <CA+4CFy6AWcvddHi-S8N1pzae7ChCx00EF7+mPM88pdQuyy3Zow@mail.gmail.com>
Subject: [PATCH] disable use_vector_fp_converts for m_CORE_ALL
From: Wei Mi <wmi@google.com>
To: GCC Patches <gcc-patches@gcc.gnu.org>
Cc: David Li <davidxl@google.com>, "Zamyatin, Igor" <igor.zamyatin@intel.com>

For the following testcase 1.c, on Westmere and Sandy Bridge,
performance with the option -mtune-ctrl=^use_vector_fp_converts is
better (run time improves from 3.46s to 2.83s). This suggests that
cvtss2sd is often better than unpcklps+cvtps2pd on recent x86
platforms.

1.c:
float total = 0.2;
int k = 5;

int main() {
 int i;

 for (i = 0; i < 1000000000; i++) {
   total += (0.5 + k);
 }

 return total == 0.3;
}

Assembly generated by gcc-r201963 without -mtune-ctrl=^use_vector_fp_converts:
.L2:
        unpcklps        %xmm0, %xmm0
        subl    $1, %eax
        cvtps2pd        %xmm0, %xmm0
        addsd   %xmm1, %xmm0
        unpcklpd        %xmm0, %xmm0
        cvtpd2ps        %xmm0, %xmm0
        jne     .L2

Assembly generated by gcc-r201963 with -mtune-ctrl=^use_vector_fp_converts:
.L2:
        cvtss2sd        %xmm0, %xmm0
        subl    $1, %eax
        addsd   %xmm1, %xmm0
        cvtsd2ss        %xmm0, %xmm0
        jne     .L2
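
The two conversion sequences in the listings above compute the same
value for the low element. A minimal sketch with SSE2 intrinsics,
assuming an x86 target with SSE2 enabled; the helper names
extend_scalar and extend_vector are hypothetical, and the compiler is
free to pick different instructions than the ones named in the
comments:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Scalar form: corresponds to a single cvtss2sd.  The upper half of
   the result is taken from the first argument, zeroed here.  */
static double extend_scalar(float x)
{
    __m128 v = _mm_set_ss(x);
    __m128d r = _mm_cvtss_sd(_mm_setzero_pd(), v);
    return _mm_cvtsd_f64(r);
}

/* Vector form: corresponds to unpcklps followed by cvtps2pd.  Both
   low floats are converted; only the low double is used.  */
static double extend_vector(float x)
{
    __m128 v = _mm_set_ss(x);
    v = _mm_unpacklo_ps(v, v);
    __m128d r = _mm_cvtps_pd(v);
    return _mm_cvtsd_f64(r);
}
```

Both helpers return the same double for any float input, since float
to double conversion is exact; only the instruction sequence differs.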

But for testcase 2.c (thanks to Igor Zamyatin for the testcase),
performance with the option -mtune-ctrl=^use_vector_fp_converts is
worse. Analysis of the assembly shows that the degradation comes from
a partial reg stall caused by cvtsd2ss, which writes only the low
part of its destination register. Adding "pxor %xmm0, %xmm0" before
"cvtsd2ss b(,%rdx,8), %xmm0" gets the performance back.

2.c:
double b[1024];

float a[1024];

int main()
{
    int i;
    for(i = 0 ; i < 1024 * 1024 * 256; i++)
      a[i & 1023] = a[i & 1023] * (float)b[i & 1023];
    return (int)a[512];
}

without -mtune-ctrl=^use_vector_fp_converts
.L2:
        movl    %eax, %edx
        addl    $1, %eax
        andl    $1023, %edx
        cmpl    $268435456, %eax
        movsd   b(,%rdx,8), %xmm0
        cvtpd2ps        %xmm0, %xmm0    ==> without partial reg stall because of movsd.
        mulss   a(,%rdx,4), %xmm0
        movss   %xmm0, a(,%rdx,4)
        jne     .L2

with -mtune-ctrl=^use_vector_fp_converts
.L2:
        movl    %eax, %edx
        addl    $1, %eax
        andl    $1023, %edx
        cmpl    $268435456, %eax
        cvtsd2ss        b(,%rdx,8), %xmm0   ==> with partial reg stall. Needs to insert "pxor %xmm0, %xmm0" before current insn.
        mulss   a(,%rdx,4), %xmm0
        movss   %xmm0, a(,%rdx,4)
        jne     .L2
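
The pxor fix described for 2.c can be sketched with SSE2 intrinsics:
zero the destination register first, then do the scalar conversion
into it. This is only an illustration, assuming an x86 target with
SSE2; the helper name truncate_with_zeroed_dest is hypothetical, and
the compiler may schedule or select instructions differently from
what the comments suggest:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Convert *p to float.  The destination is zeroed first, so the
   merging conversion does not depend on an older value of the
   register.  */
static float truncate_with_zeroed_dest(const double *p)
{
    __m128 dst = _mm_setzero_ps();    /* zeroing idiom, e.g. pxor/xorps */
    __m128d src = _mm_load_sd(p);     /* load the double to convert */
    dst = _mm_cvtsd_ss(dst, src);     /* cvtsd2ss: writes low float only */
    return _mm_cvtss_f32(dst);
}
```

The result is the same as a plain (float) cast; the zeroing only
removes the dependency on whatever previously wrote the destination
register.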

So the patch turns off use_vector_fp_converts for m_CORE_ALL, so that
cvtss2sd/cvtsd2ss are used directly, and adds "pxor %xmmreg, %xmmreg"
before cvtss2sd/cvtsd2ss to break the partial reg stall (similar to
what r201308 does for cvtsi2ss/cvtsi2sd). Bootstrap and regression
tests pass. OK for trunk?

Thanks,
Wei Mi.

2013-09-11  Wei Mi  <wmi@google.com>

        * config/i386/x86-tune.def (DEF_TUNE): Remove
        m_CORE_ALL.
        * config/i386/i386.md: Add define_peephole2 to
        break partial reg stall for cvtss2sd/cvtsd2ss.

Index: config/i386/x86-tune.def
===================================================================
--- config/i386/x86-tune.def    (revision 201963)
+++ config/i386/x86-tune.def    (working copy)
@@ -189,7 +189,7 @@ DEF_TUNE (X86_TUNE_NOT_VECTORMODE, "not_
 /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
    from FP to FP. */
 DEF_TUNE (X86_TUNE_USE_VECTOR_FP_CONVERTS, "use_vector_fp_converts",
-          m_CORE_ALL | m_AMDFAM10 | m_GENERIC)
+          m_AMDFAM10 | m_GENERIC)
 /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
    from integer to FP. */
 DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS, "use_vector_converts", m_AMDFAM10)
Index: config/i386/i386.md
===================================================================
--- config/i386/i386.md (revision 201963)
+++ config/i386/i386.md (working copy)
@@ -5075,6 +5075,63 @@
   emit_move_insn (operands[0], CONST0_RTX (<ssevecmode>mode));
 })

+;; Break partial reg stall for cvtsd2ss.
+
+(define_peephole2
+  [(set (match_operand:SF 0 "register_operand")
+        (float_truncate:SF
+         (match_operand:DF 1 "nonimmediate_operand")))]
+  "TARGET_SSE2 && TARGET_SSE_MATH
+   && TARGET_SSE_PARTIAL_REG_DEPENDENCY
+   && optimize_function_for_speed_p (cfun)
+   && reload_completed && SSE_REG_P (operands[0])
+   && peep2_reg_dead_p (0, operands[0])
+   && (!SSE_REG_P (operands[1])
+       || REGNO (operands[0]) != REGNO (operands[1]))"
+  [(set (match_dup 0)
+       (vec_merge:V4SF
+         (vec_duplicate:V4SF
+           (float_truncate:V2SF
+             (match_dup 1)))
+         (match_dup 0)
+         (const_int 1)))]
+{
+  operands[0] = simplify_gen_subreg (V4SFmode, operands[0],
+                                    SFmode, 0);
+  operands[1] = simplify_gen_subreg (V2DFmode, operands[1],
+                                    DFmode, 0);
+  emit_move_insn (operands[0], CONST0_RTX (V4SFmode));
+})
+
+;; Break partial reg stall for cvtss2sd.
+
+(define_peephole2
+  [(set (match_operand:DF 0 "register_operand")
+        (float_extend:DF
+          (match_operand:SF 1 "nonimmediate_operand")))]
+  "TARGET_SSE2 && TARGET_SSE_MATH
+   && TARGET_SSE_PARTIAL_REG_DEPENDENCY
+   && optimize_function_for_speed_p (cfun)
+   && reload_completed && SSE_REG_P (operands[0])
+   && peep2_reg_dead_p (0, operands[0])
+   && (!SSE_REG_P (operands[1])
+       || REGNO (operands[0]) != REGNO (operands[1]))"
+  [(set (match_dup 0)
+        (vec_merge:V2DF
+          (float_extend:V2DF
+            (vec_select:V2SF
+              (match_dup 1)
+              (parallel [(const_int 0) (const_int 1)])))
+          (match_dup 0)
+          (const_int 1)))]
+{
+  operands[0] = simplify_gen_subreg (V2DFmode, operands[0],
+                                    DFmode, 0);
+  operands[1] = simplify_gen_subreg (V4SFmode, operands[1],
+                                    SFmode, 0);
+  emit_move_insn (operands[0], CONST0_RTX (V2DFmode));
+})
+
 ;; Avoid store forwarding (partial memory) stall penalty
 ;; by passing DImode value through XMM registers.  */