From patchwork Mon Apr  4 16:13:11 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Evandro Menezes <e.menezes@samsung.com>
X-Patchwork-Id: 605908
Return-Path: 
 <gcc-patches-return-424345-incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from sourceware.org (server1.sourceware.org [209.132.180.131])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256
	bits)) (No client certificate requested)
	by ozlabs.org (Postfix) with ESMTPS id 3qdxqh4wbpz9s8d
	for <incoming@patchwork.ozlabs.org>;
	Tue,  5 Apr 2016 02:13:35 +1000 (AEST)
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key;
	unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org
	header.b=qRUYbyFt; dkim-atps=neutral
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender
	:subject:to:references:cc:from:message-id:date:mime-version
	:in-reply-to:content-type; q=dns; s=default; b=o0PkIs2jUOUitBaIu
	12z26V5rl4Dj60mG3hYPXxdZ88AbhW/ZmMxVF++vg0yaYL1JtIWY6OF2z27zalFJ
	nko7R1yWbCko319ws3ZmwCmLtTd64tBl42Ov+aiOP27z8SbV7ogOhGgbDWEqkWWC
	oh1DV4Zi6Yt3mVEjqQhnPnS7nY=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender
	:subject:to:references:cc:from:message-id:date:mime-version
	:in-reply-to:content-type; s=default; bh=Vyj7lt/U6Zj2aoAPxgg0lZt
	o91E=; b=qRUYbyFtfdH0LfXPVVX2mzOqAbxRcx16KmVUYklu9kBg5UqjMQx27Bo
	naR6hCvdf7qNT70tgVMhcdoZwtrVXdQ6LKsxnjEMvrjTe16vn47mm+w3pPdZwaAp
	H00k6UcJyrKR+5ZIWAXAsIlV0iZU2bfbBNEl0iWOvuRoXmYzk+YI=
Received: (qmail 87660 invoked by alias); 4 Apr 2016 16:13:26 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Unsubscribe: 
 <mailto:gcc-patches-unsubscribe-incoming=patchwork.ozlabs.org@gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Delivered-To: mailing list gcc-patches@gcc.gnu.org
Received: (qmail 87628 invoked by uid 89); 4 Apr 2016 16:13:26 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.1 required=5.0 tests=AWL, BAYES_00,
	KAM_LAZY_DOMAIN_SECURITY,
	RP_MATCHES_RCVD autolearn=ham version=3.3.2 spammy=markets,
	granted, TYPE_MODE, type_mode
X-HELO: usmailout2.samsung.com
Received: from mailout2.w2.samsung.com (HELO usmailout2.samsung.com)
	(211.189.100.12) by sourceware.org
	(qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-SHA encrypted)
	ESMTPS; Mon, 04 Apr 2016 16:13:15 +0000
Received: from uscpsbgm2.samsung.com (u115.gpu85.samsung.co.kr
	[203.254.195.115]) by mailout2.w2.samsung.com (Oracle
	Communications Messaging Server 7.0.5.31.0 64bit (built May 5
	2014)) with ESMTP id <0O54006O9AE1GC60@mailout2.w2.samsung.com> for
	gcc-patches@gcc.gnu.org; Mon, 04 Apr 2016 12:13:13 -0400 (EDT)
Received: from ussync3.samsung.com ( [203.254.195.83])	by
	uscpsbgm2.samsung.com (USCPMTA) with SMTP id
	A8.B5.07641.99292075; Mon, 4 Apr 2016 12:13:13 -0400 (EDT)
Received: from [172.31.207.194] ([105.140.31.10]) by ussync3.samsung.com
	(Oracle Communications Messaging Server 7.0.5.31.0 64bit
	(built May 5 2014)) with ESMTPA id
	<0O54008B1AE03S30@ussync3.samsung.com>;
	Mon, 04 Apr 2016 12:13:13 -0400 (EDT)
Subject: Re: [AArch64] Add more precision choices for the reciprocal square
	root approximation
To: Wilco Dijkstra <Wilco.Dijkstra@arm.com>,
	GCC Patches <gcc-patches@gcc.gnu.org>
References: <56EB2BDC.30209@samsung.com>
	<AM3PR08MB00883C48B491A1BA92CD0783838C0@AM3PR08MB0088.eurprd08.prod.outlook.com>
	<56EC2A91.2030604@samsung.com>
	<AM3PR08MB0088D90F31B84E852FF3100C838C0@AM3PR08MB0088.eurprd08.prod.outlook.com>
	<56EC8870.1030108@samsung.com> <56FDA338.4050108@samsung.com>
	<AM3PR08MB00889651F672A4F0157BDE17839A0@AM3PR08MB0088.eurprd08.prod.outlook.com>
	<56FE8B0B.1060303@samsung.com> <56FECE90.9@samsung.com>
	<AM3PR08MB008867649DBF969AADABAD03839A0@AM3PR08MB0088.eurprd08.prod.outlook.com>
Cc: James Greenhalgh <James.Greenhalgh@arm.com>,
	Andrew Pinski <pinskia@gmail.com>, nd <nd@arm.com>
From: Evandro Menezes <e.menezes@samsung.com>
Message-id: <57029297.2050908@samsung.com>
Date: Mon, 04 Apr 2016 11:13:11 -0500
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
	rv:38.0) Gecko/20100101 Thunderbird/38.6.0
MIME-version: 1.0
In-reply-to: 
 <AM3PR08MB008867649DBF969AADABAD03839A0@AM3PR08MB0088.eurprd08.prod.outlook.com>
Content-type: multipart/mixed; boundary=------------000304080202060708010509
X-IsSubscribed: yes

On 04/01/16 18:08, Wilco Dijkstra wrote:
> Evandro Menezes wrote:
>> I hope that this gets in the ballpark of what's been discussed previously.
> Yes that's very close to what I had in mind. A minor issue is that the vector
> modes cannot work as they start at MAX_MODE_FLOAT (which is > 32):
>
> +/* Control approximate alternatives to certain FP operators.  */
> +#define AARCH64_APPROX_MODE(MODE) \
> +  ((MIN_MODE_FLOAT <= (MODE) && (MODE) <= MAX_MODE_FLOAT) \
> +   ? (1 << ((MODE) - MIN_MODE_FLOAT)) \
> +   : (MIN_MODE_VECTOR_FLOAT <= (MODE) && (MODE) <= MAX_MODE_VECTOR_FLOAT) \
> +     ? (1 << ((MODE) - MIN_MODE_VECTOR_FLOAT + MAX_MODE_FLOAT + 1)) \
> +     : (0))
>
> That should be:
>
> +     ? (1 << ((MODE) - MIN_MODE_VECTOR_FLOAT + MAX_MODE_FLOAT - MIN_MODE_FLOAT + 1)) \
>
> It would be worth testing all the obvious cases to be sure they work.
>
> Also I don't think it is a good idea to enable all modes on Exynos-M1 and XGene-1 -
> I haven't seen any evidence that shows it gives a speedup on real code for all modes
> (or at least on a good micro benchmark like the unit vector test I suggested - a simple
> throughput test does not count!).

This approximation does benefit M1 in general across several 
benchmarks.  As for my choice for XGene-1, it preserves the original 
setting.  I believe that, with this more granular option, developers can 
fine-tune their targets.
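
To double-check the corrected vector offset, here is a minimal 
standalone sketch of the mask layout.  The enum below is a hypothetical 
stand-in for machine_mode (the real values come from the generated 
insn-modes.h and are different), and approx_enabled_p is an illustrative 
helper, not part of the patch.  It also uses an unsigned 1 in the shifts, 
which the patch itself does not.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for GCC's machine_mode enum.  Only the layout is
   assumed: scalar float modes are contiguous, and vector float modes are
   contiguous and come later.  */
enum toy_mode
{
  QImode, HImode, SImode, DImode,
  HFmode, SFmode, DFmode, TFmode,	/* Scalar float class.  */
  V8QImode, V4SImode,
  V2SFmode, V4SFmode, V2DFmode		/* Vector float class.  */
};

#define MIN_MODE_FLOAT HFmode
#define MAX_MODE_FLOAT TFmode
#define MIN_MODE_VECTOR_FLOAT V2SFmode
#define MAX_MODE_VECTOR_FLOAT V2DFmode

/* Same shape as AARCH64_APPROX_MODE: scalar float modes take the low bits,
   vector float modes take the bits right after them.  */
#define APPROX_MODE(MODE) \
  ((MIN_MODE_FLOAT <= (MODE) && (MODE) <= MAX_MODE_FLOAT) \
   ? (1u << ((MODE) - MIN_MODE_FLOAT)) \
   : (MIN_MODE_VECTOR_FLOAT <= (MODE) && (MODE) <= MAX_MODE_VECTOR_FLOAT) \
     ? (1u << ((MODE) - MIN_MODE_VECTOR_FLOAT \
	       + MAX_MODE_FLOAT - MIN_MODE_FLOAT + 1)) \
     : 0u)

#define APPROX_SP (APPROX_MODE (SFmode) \
		   | APPROX_MODE (V2SFmode) \
		   | APPROX_MODE (V4SFmode))
#define APPROX_DP (APPROX_MODE (DFmode) \
		   | APPROX_MODE (V2DFmode))

/* Mirrors the tuning test in use_rsqrt_p, minus the flag checks.  */
bool
approx_enabled_p (unsigned int modes, enum toy_mode mode)
{
  return (modes & APPROX_MODE (mode)) != 0;
}
```

With this toy layout, a tuning that sets only APPROX_SP enables SFmode, 
V2SFmode and V4SFmode, and leaves DFmode and V2DFmode on the regular 
sqrt and division path.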

> The issue is it hides performance gains from an improved divider/sqrt on new revisions
> or microarchitectures. That means you should only enable cases where there is evidence
> of a major speedup that cannot be matched by a future improved divider/sqrt.

I did notice that, in some benchmarks with heavy use of multiplication 
or multiply-accumulation, the series may be detrimental, since it 
increases the competition for the unit(s) that perform such operations.

But those micro-architectures that get a better unit for division or 
sqrt() are free to add their own tuning parameters.  Granted, I assume 
that running legacy code is a non-issue in only a few markets.

Thank you,

From 63a39df80104c504ffdfba698aab9dc2f73221a1 Mon Sep 17 00:00:00 2001
From: Evandro Menezes <e.menezes@samsung.com>
Date: Thu, 3 Mar 2016 18:13:46 -0600
Subject: [PATCH 1/2] [AArch64] Add more choices for the reciprocal square root
 approximation

Allow a target to prefer this approximation depending on the operation mode.

gcc/
	* config/aarch64/aarch64-protos.h
	(AARCH64_APPROX_MODE): New macro.
	(AARCH64_APPROX_{NONE,SP,DP,DFORM,QFORM,SCALAR,VECTOR,ALL}): Likewise.
	(tune_params): New member "approx_rsqrt_modes".
	* config/aarch64/aarch64-tuning-flags.def
	(AARCH64_EXTRA_TUNE_APPROX_RSQRT): Remove macro.
	* config/aarch64/aarch64.c
	(generic_tunings): New member "approx_rsqrt_modes".
	(cortexa35_tunings): Likewise.
	(cortexa53_tunings): Likewise.
	(cortexa57_tunings): Likewise.
	(cortexa72_tunings): Likewise.
	(exynosm1_tunings): Likewise.
	(thunderx_tunings): Likewise.
	(xgene1_tunings): Likewise.
	(use_rsqrt_p): New argument for the mode and use new member
	"approx_rsqrt_modes" from "tune_params".
	(aarch64_builtin_reciprocal): Derive mode from builtin.
	(aarch64_optab_supported_p): New argument for the mode.
---
 gcc/config/aarch64/aarch64-protos.h         | 30 ++++++++++++++++++++++
 gcc/config/aarch64/aarch64-tuning-flags.def |  2 --
 gcc/config/aarch64/aarch64.c                | 39 ++++++++++++++++++-----------
 3 files changed, 55 insertions(+), 16 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 58c9d0d..a31ee35 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -178,6 +178,32 @@ struct cpu_branch_cost
   const int unpredictable;  /* Unpredictable branch or optimizing for speed.  */
 };
 
+/* Control approximate alternatives to certain FP operators.  */
+#define AARCH64_APPROX_MODE(MODE) \
+  ((MIN_MODE_FLOAT <= (MODE) && (MODE) <= MAX_MODE_FLOAT) \
+   ? (1 << ((MODE) - MIN_MODE_FLOAT)) \
+   : (MIN_MODE_VECTOR_FLOAT <= (MODE) && (MODE) <= MAX_MODE_VECTOR_FLOAT) \
+     ? (1 << ((MODE) - MIN_MODE_VECTOR_FLOAT \
+	      + MAX_MODE_FLOAT - MIN_MODE_FLOAT + 1)) \
+     : (0))
+#define AARCH64_APPROX_NONE (0)
+#define AARCH64_APPROX_SP (AARCH64_APPROX_MODE (SFmode) \
+			   | AARCH64_APPROX_MODE (V2SFmode) \
+			   | AARCH64_APPROX_MODE (V4SFmode))
+#define AARCH64_APPROX_DP (AARCH64_APPROX_MODE (DFmode) \
+			   | AARCH64_APPROX_MODE (V2DFmode))
+#define AARCH64_APPROX_DFORM (AARCH64_APPROX_MODE (SFmode) \
+			      | AARCH64_APPROX_MODE (DFmode) \
+			      | AARCH64_APPROX_MODE (V2SFmode))
+#define AARCH64_APPROX_QFORM (AARCH64_APPROX_MODE (V4SFmode) \
+			      | AARCH64_APPROX_MODE (V2DFmode))
+#define AARCH64_APPROX_SCALAR (AARCH64_APPROX_MODE (SFmode) \
+			       | AARCH64_APPROX_MODE (DFmode))
+#define AARCH64_APPROX_VECTOR (AARCH64_APPROX_MODE (V2SFmode) \
+                               | AARCH64_APPROX_MODE (V4SFmode) \
+			       | AARCH64_APPROX_MODE (V2DFmode))
+#define AARCH64_APPROX_ALL (-1)
+
 struct tune_params
 {
   const struct cpu_cost_table *insn_extra_cost;
@@ -218,6 +244,7 @@ struct tune_params
   } autoprefetcher_model;
 
   unsigned int extra_tuning_flags;
+  unsigned int approx_rsqrt_modes;
 };
 
 #define AARCH64_FUSION_PAIR(x, name) \
@@ -263,6 +290,9 @@ enum aarch64_extra_tuning_flags
 };
 #undef AARCH64_EXTRA_TUNING_OPTION
 
+#define AARCH64_EXTRA_TUNE_APPROX_RSQRT \
+  (AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF | AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF)
+
 extern struct tune_params aarch64_tune_params;
 
 HOST_WIDE_INT aarch64_initial_elimination_offset (unsigned, unsigned);
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 7e45a0c..048c2a3 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -29,5 +29,3 @@
      AARCH64_TUNE_ to give an enum name. */
 
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
-AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrt", APPROX_RSQRT)
-
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index b7086dd..b0ee11e 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -38,6 +38,7 @@
 #include "recog.h"
 #include "diagnostic.h"
 #include "insn-attr.h"
+#include "insn-modes.h"
 #include "alias.h"
 #include "fold-const.h"
 #include "stor-layout.h"
@@ -414,7 +415,8 @@ static const struct tune_params generic_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
 
 static const struct tune_params cortexa35_tunings =
@@ -439,7 +441,8 @@ static const struct tune_params cortexa35_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
 
 static const struct tune_params cortexa53_tunings =
@@ -464,7 +467,8 @@ static const struct tune_params cortexa53_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
 
 static const struct tune_params cortexa57_tunings =
@@ -489,7 +493,8 @@ static const struct tune_params cortexa57_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
 
 static const struct tune_params cortexa72_tunings =
@@ -514,7 +519,8 @@ static const struct tune_params cortexa72_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
 
 static const struct tune_params exynosm1_tunings =
@@ -538,7 +544,8 @@ static const struct tune_params exynosm1_tunings =
   48,	/* max_case_values.  */
   64,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_APPROX_RSQRT) /* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE), /* tune_flags.  */
+  (AARCH64_APPROX_ALL) /* approx_rsqrt_modes.  */
 };
 
 static const struct tune_params thunderx_tunings =
@@ -562,7 +569,8 @@ static const struct tune_params thunderx_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };
 
 static const struct tune_params xgene1_tunings =
@@ -586,7 +594,8 @@ static const struct tune_params xgene1_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_APPROX_RSQRT)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_ALL)	/* approx_rsqrt_modes.  */
 };
 
 /* Support for fine-grained override of the tuning structures.  */
@@ -7452,12 +7461,12 @@ aarch64_memory_move_cost (machine_mode mode ATTRIBUTE_UNUSED,
    to optimize 1.0/sqrt.  */
 
 static bool
-use_rsqrt_p (void)
+use_rsqrt_p (machine_mode mode)
 {
   return (!flag_trapping_math
 	  && flag_unsafe_math_optimizations
-	  && ((aarch64_tune_params.extra_tuning_flags
-	       & AARCH64_EXTRA_TUNE_APPROX_RSQRT)
+	  && ((aarch64_tune_params.approx_rsqrt_modes
+	       & AARCH64_APPROX_MODE (mode))
 	      || flag_mrecip_low_precision_sqrt));
 }
 
@@ -7467,7 +7476,9 @@ use_rsqrt_p (void)
 static tree
 aarch64_builtin_reciprocal (tree fndecl)
 {
-  if (!use_rsqrt_p ())
+  machine_mode mode = TYPE_MODE (TREE_TYPE (fndecl));
+
+  if (!use_rsqrt_p (mode))
     return NULL_TREE;
   return aarch64_builtin_rsqrt (DECL_FUNCTION_CODE (fndecl));
 }
@@ -13964,13 +13975,13 @@ aarch64_promoted_type (const_tree t)
 /* Implement the TARGET_OPTAB_SUPPORTED_P hook.  */
 
 static bool
-aarch64_optab_supported_p (int op, machine_mode, machine_mode,
+aarch64_optab_supported_p (int op, machine_mode mode1, machine_mode,
 			   optimization_type opt_type)
 {
   switch (op)
     {
     case rsqrt_optab:
-      return opt_type == OPTIMIZE_FOR_SPEED && use_rsqrt_p ();
+      return opt_type == OPTIMIZE_FOR_SPEED && use_rsqrt_p (mode1);
 
     default:
       return true;
-- 
1.9.1