From patchwork Mon Apr 4 16:13:11 2016
X-Patchwork-Submitter: Evandro Menezes
X-Patchwork-Id: 605908
Subject: Re: [AArch64] Add more precision choices for the reciprocal square root approximation
To: Wilco Dijkstra, GCC Patches
References: <56EB2BDC.30209@samsung.com> <56EC2A91.2030604@samsung.com> <56EC8870.1030108@samsung.com> <56FDA338.4050108@samsung.com> <56FE8B0B.1060303@samsung.com> <56FECE90.9@samsung.com>
Cc: James Greenhalgh, Andrew Pinski, nd
From: Evandro Menezes
Message-id: <57029297.2050908@samsung.com>
Date: Mon, 04 Apr 2016 11:13:11 -0500

On 04/01/16 18:08, Wilco Dijkstra wrote:
> Evandro Menezes wrote:
>> I hope that this gets in the ballpark of what's been discussed previously.
> Yes that's very close to what I had in mind. A minor issue is that the vector
> modes cannot work as they start at MAX_MODE_FLOAT (which is > 32):
>
> +/* Control approximate alternatives to certain FP operators.  */
> +#define AARCH64_APPROX_MODE(MODE) \
> +  ((MIN_MODE_FLOAT <= (MODE) && (MODE) <= MAX_MODE_FLOAT) \
> +   ? (1 << ((MODE) - MIN_MODE_FLOAT)) \
> +   : (MIN_MODE_VECTOR_FLOAT <= (MODE) && (MODE) <= MAX_MODE_VECTOR_FLOAT) \
> +     ? (1 << ((MODE) - MIN_MODE_VECTOR_FLOAT + MAX_MODE_FLOAT + 1)) \
> +     : (0))
>
> That should be:
>
> +     ? (1 << ((MODE) - MIN_MODE_VECTOR_FLOAT + MAX_MODE_FLOAT - MIN_MODE_FLOAT + 1)) \
>
> It would be worth testing all the obvious cases to be sure they work.
>
> Also I don't think it is a good idea to enable all modes on Exynos-M1 and XGene-1 -
> I haven't seen any evidence that shows it gives a speedup on real code for all modes
> (or at least on a good micro benchmark like the unit vector test I suggested - a simple
> throughput test does not count!).

This approximation does benefit M1 in general across several benchmarks.
As for my choice for Xgene1, it preserves the original setting.  I
believe that, with this more granular option, developers can fine-tune
their targets.

> The issue is it hides performance gains from an improved divider/sqrt on new revisions
> or microarchitectures. That means you should only enable cases where there is evidence
> of a major speedup that cannot be matched by a future improved divider/sqrt.

I did notice that, in some benchmarks with heavy use of multiplication or
multiply-accumulation, the series may be detrimental, since it increases
the competition for the units that perform such operations.  But
micro-architectures that get a better unit for division or sqrt() are
free to add their own tuning parameters.  Granted, I assume that running
legacy code is a minor concern in only a few markets.
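To make the point about multiplier pressure concrete, here is a rough sketch of the kind of series in question, assuming a Newton-Raphson refinement of an initial estimate.  It is not the code the compiler emits: rsqrt_estimate, rsqrt_step and approx_rsqrtf are hypothetical names, and the bit-level estimate merely stands in for a hardware estimate instruction.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Crude initial estimate of 1/sqrt(x).  This is the well-known
   magic-constant trick, used here only as a stand-in for a hardware
   estimate instruction (an assumption made for this sketch).  */
static float
rsqrt_estimate (float x)
{
  uint32_t bits;
  float y;

  memcpy (&bits, &x, sizeof bits);
  bits = 0x5f3759dfu - (bits >> 1);
  memcpy (&y, &bits, sizeof y);
  return y;
}

/* One Newton-Raphson step: y' = y * (3 - x * y * y) / 2.
   Note that every step costs several multiplications.  */
static float
rsqrt_step (float x, float y)
{
  return y * (1.5f - 0.5f * x * y * y);
}

/* Approximate 1/sqrt(x) for a positive, normal, finite x using STEPS
   refinement steps on top of the initial estimate.  */
float
approx_rsqrtf (float x, int steps)
{
  assert (x >= FLT_MIN && x <= FLT_MAX);

  float y = rsqrt_estimate (x);
  for (int i = 0; i < steps; i++)
    y = rsqrt_step (x, y);
  return y;
}
```

Calling approx_rsqrtf (4.0f, 2) should land close to 0.5.  Each extra step buys precision at the cost of more multiplies, which is why code already saturating the multiply or multiply-accumulate units may not benefit.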
Thank you,

From 63a39df80104c504ffdfba698aab9dc2f73221a1 Mon Sep 17 00:00:00 2001
From: Evandro Menezes
Date: Thu, 3 Mar 2016 18:13:46 -0600
Subject: [PATCH 1/2] [AArch64] Add more choices for the reciprocal square root
 approximation

Allow a target to prefer such an operation depending on the operation mode.

gcc/
	* config/aarch64/aarch64-protos.h
	(AARCH64_APPROX_MODE): New macro.
	(AARCH64_APPROX_{NONE,SP,DP,DFORM,QFORM,SCALAR,VECTOR,ALL}): Likewise.
	(tune_params): New member "approx_rsqrt_modes".
	* config/aarch64/aarch64-tuning-flags.def
	(AARCH64_EXTRA_TUNE_APPROX_RSQRT): Remove macro.
	* config/aarch64/aarch64.c
	(generic_tunings): New member "approx_rsqrt_modes".
	(cortexa35_tunings): Likewise.
	(cortexa53_tunings): Likewise.
	(cortexa57_tunings): Likewise.
	(cortexa72_tunings): Likewise.
	(exynosm1_tunings): Likewise.
	(thunderx_tunings): Likewise.
	(xgene1_tunings): Likewise.
	(use_rsqrt_p): New argument for the mode and use new member
	"approx_rsqrt_modes" from "tune_params".
	(aarch64_builtin_reciprocal): Derive mode from builtin.
	(aarch64_optab_supported_p): New argument for the mode.
---
 gcc/config/aarch64/aarch64-protos.h         | 30 ++++++++++++++++++++++
 gcc/config/aarch64/aarch64-tuning-flags.def |  2 --
 gcc/config/aarch64/aarch64.c                | 39 ++++++++++++++++++-----------
 3 files changed, 55 insertions(+), 16 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 58c9d0d..a31ee35 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -178,6 +178,32 @@ struct cpu_branch_cost
   const int unpredictable;  /* Unpredictable branch or optimizing for speed.  */
 };

+/* Control approximate alternatives to certain FP operators.  */
+#define AARCH64_APPROX_MODE(MODE) \
+  ((MIN_MODE_FLOAT <= (MODE) && (MODE) <= MAX_MODE_FLOAT) \
+   ? (1 << ((MODE) - MIN_MODE_FLOAT)) \
+   : (MIN_MODE_VECTOR_FLOAT <= (MODE) && (MODE) <= MAX_MODE_VECTOR_FLOAT) \
+     ? (1 << ((MODE) - MIN_MODE_VECTOR_FLOAT \
+	      + MAX_MODE_FLOAT - MIN_MODE_FLOAT + 1)) \
+     : (0))
+#define AARCH64_APPROX_NONE (0)
+#define AARCH64_APPROX_SP (AARCH64_APPROX_MODE (SFmode) \
+			   | AARCH64_APPROX_MODE (V2SFmode) \
+			   | AARCH64_APPROX_MODE (V4SFmode))
+#define AARCH64_APPROX_DP (AARCH64_APPROX_MODE (DFmode) \
+			   | AARCH64_APPROX_MODE (V2DFmode))
+#define AARCH64_APPROX_DFORM (AARCH64_APPROX_MODE (SFmode) \
+			      | AARCH64_APPROX_MODE (DFmode) \
+			      | AARCH64_APPROX_MODE (V2SFmode))
+#define AARCH64_APPROX_QFORM (AARCH64_APPROX_MODE (V4SFmode) \
+			      | AARCH64_APPROX_MODE (V2DFmode))
+#define AARCH64_APPROX_SCALAR (AARCH64_APPROX_MODE (SFmode) \
+			       | AARCH64_APPROX_MODE (DFmode))
+#define AARCH64_APPROX_VECTOR (AARCH64_APPROX_MODE (V2SFmode) \
+			       | AARCH64_APPROX_MODE (V4SFmode) \
+			       | AARCH64_APPROX_MODE (V2DFmode))
+#define AARCH64_APPROX_ALL (-1)
+
 struct tune_params
 {
   const struct cpu_cost_table *insn_extra_cost;
@@ -218,6 +244,7 @@ struct tune_params
   } autoprefetcher_model;

   unsigned int extra_tuning_flags;
+  unsigned int approx_rsqrt_modes;
 };

 #define AARCH64_FUSION_PAIR(x, name) \
@@ -263,6 +290,9 @@ enum aarch64_extra_tuning_flags
 };
 #undef AARCH64_EXTRA_TUNING_OPTION

+#define AARCH64_EXTRA_TUNE_APPROX_RSQRT \
+  (AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF | AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF)
+
 extern struct tune_params aarch64_tune_params;

 HOST_WIDE_INT aarch64_initial_elimination_offset (unsigned, unsigned);
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 7e45a0c..048c2a3 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -29,5 +29,3 @@
      AARCH64_TUNE_ to give an enum name. */

 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
-AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrt", APPROX_RSQRT)
-
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index b7086dd..b0ee11e 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -38,6 +38,7 @@
 #include "recog.h"
 #include "diagnostic.h"
 #include "insn-attr.h"
+#include "insn-modes.h"
 #include "alias.h"
 #include "fold-const.h"
 #include "stor-layout.h"
@@ -414,7 +415,8 @@ static const struct tune_params generic_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };

 static const struct tune_params cortexa35_tunings =
@@ -439,7 +441,8 @@ static const struct tune_params cortexa35_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };

 static const struct tune_params cortexa53_tunings =
@@ -464,7 +467,8 @@ static const struct tune_params cortexa53_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };

 static const struct tune_params cortexa57_tunings =
@@ -489,7 +493,8 @@ static const struct tune_params cortexa57_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };

 static const struct tune_params cortexa72_tunings =
@@ -514,7 +519,8 @@ static const struct tune_params cortexa72_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };

 static const struct tune_params exynosm1_tunings =
@@ -538,7 +544,8 @@ static const struct tune_params exynosm1_tunings =
   48,	/* max_case_values.  */
   64,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_APPROX_RSQRT)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_ALL)	/* approx_rsqrt_modes.  */
 };

 static const struct tune_params thunderx_tunings =
@@ -562,7 +569,8 @@ static const struct tune_params thunderx_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_NONE)	/* approx_rsqrt_modes.  */
 };

 static const struct tune_params xgene1_tunings =
@@ -586,7 +594,8 @@ static const struct tune_params xgene1_tunings =
   0,	/* max_case_values.  */
   0,	/* cache_line_size.  */
   tune_params::AUTOPREFETCHER_OFF,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_APPROX_RSQRT)	/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_NONE),	/* tune_flags.  */
+  (AARCH64_APPROX_ALL)	/* approx_rsqrt_modes.  */
 };

 /* Support for fine-grained override of the tuning structures.  */
@@ -7452,12 +7461,12 @@ aarch64_memory_move_cost (machine_mode mode ATTRIBUTE_UNUSED,
    to optimize 1.0/sqrt.  */

 static bool
-use_rsqrt_p (void)
+use_rsqrt_p (machine_mode mode)
 {
   return (!flag_trapping_math
 	  && flag_unsafe_math_optimizations
-	  && ((aarch64_tune_params.extra_tuning_flags
-	       & AARCH64_EXTRA_TUNE_APPROX_RSQRT)
+	  && ((aarch64_tune_params.approx_rsqrt_modes
+	       & AARCH64_APPROX_MODE (mode))
 	      || flag_mrecip_low_precision_sqrt));
 }

@@ -7467,7 +7476,9 @@ use_rsqrt_p (void)
 static tree
 aarch64_builtin_reciprocal (tree fndecl)
 {
-  if (!use_rsqrt_p ())
+  machine_mode mode = TYPE_MODE (TREE_TYPE (fndecl));
+
+  if (!use_rsqrt_p (mode))
     return NULL_TREE;
   return aarch64_builtin_rsqrt (DECL_FUNCTION_CODE (fndecl));
 }
@@ -13964,13 +13975,13 @@ aarch64_promoted_type (const_tree t)
 /* Implement the TARGET_OPTAB_SUPPORTED_P hook.  */

 static bool
-aarch64_optab_supported_p (int op, machine_mode, machine_mode,
+aarch64_optab_supported_p (int op, machine_mode mode1, machine_mode,
 			   optimization_type opt_type)
 {
   switch (op)
     {
     case rsqrt_optab:
-      return opt_type == OPTIMIZE_FOR_SPEED && use_rsqrt_p ();
+      return opt_type == OPTIMIZE_FOR_SPEED && use_rsqrt_p (mode1);

     default:
       return true;
-- 
1.9.1
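
For anyone wanting to sanity-check the bit layout that AARCH64_APPROX_MODE produces, the following standalone sketch may help.  The mode numbers are made up for illustration (the real ones come from the generated insn-modes.h), and every DEMO_/demo_ name is hypothetical.  It has the same shape as the macro in the patch and also computes the shift amount that the earlier, uncorrected formula would have used.

```c
#include <assert.h>

/* Hypothetical mode numbering, for illustration only; the real values
   are generated by GCC and differ from these.  */
enum demo_mode
{
  DEMO_SImode = 10,   /* Not a floating-point mode.  */
  DEMO_SFmode = 28,
  DEMO_DFmode = 29,
  DEMO_TFmode = 30,
  DEMO_V2SFmode = 70,
  DEMO_V4SFmode = 71,
  DEMO_V2DFmode = 72
};

#define DEMO_MIN_MODE_FLOAT DEMO_SFmode
#define DEMO_MAX_MODE_FLOAT DEMO_TFmode
#define DEMO_MIN_MODE_VECTOR_FLOAT DEMO_V2SFmode
#define DEMO_MAX_MODE_VECTOR_FLOAT DEMO_V2DFmode

/* Same shape as the macro in the patch: scalar FP modes take the low
   bits, vector FP modes take the bits right after them.  */
#define DEMO_APPROX_MODE(MODE) \
  ((DEMO_MIN_MODE_FLOAT <= (MODE) && (MODE) <= DEMO_MAX_MODE_FLOAT) \
   ? (1u << ((MODE) - DEMO_MIN_MODE_FLOAT)) \
   : (DEMO_MIN_MODE_VECTOR_FLOAT <= (MODE) \
      && (MODE) <= DEMO_MAX_MODE_VECTOR_FLOAT) \
     ? (1u << ((MODE) - DEMO_MIN_MODE_VECTOR_FLOAT \
	       + DEMO_MAX_MODE_FLOAT - DEMO_MIN_MODE_FLOAT + 1)) \
     : 0u)

#define DEMO_APPROX_SCALAR (DEMO_APPROX_MODE (DEMO_SFmode) \
			    | DEMO_APPROX_MODE (DEMO_DFmode))

/* Bit assigned to MODE, or 0 if MODE is not a floating-point mode.  */
unsigned int
demo_approx_bit (int mode)
{
  return DEMO_APPROX_MODE (mode);
}

/* Shift amount the uncorrected formula would use for a vector mode.  */
int
demo_uncorrected_shift (int mode)
{
  assert (DEMO_MIN_MODE_VECTOR_FLOAT <= mode
	  && mode <= DEMO_MAX_MODE_VECTOR_FLOAT);
  return mode - DEMO_MIN_MODE_VECTOR_FLOAT + DEMO_MAX_MODE_FLOAT + 1;
}

/* Mode-gated check in the spirit of use_rsqrt_p, without the flags.  */
int
demo_use_approx_p (unsigned int enabled_modes, int mode)
{
  return (enabled_modes & DEMO_APPROX_MODE (mode)) != 0;
}
```

With these made-up numbers, the scalar modes occupy bits 0 to 2 and the vector modes bits 3 to 5, whereas the uncorrected shift for the first vector mode would already be 31 and the next ones would not fit in a 32-bit mask.  The real layout depends on how many floating-point modes the target defines.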