
[AArch64] Add precision choices for the reciprocal square root approximation

Message ID 56EB2BDC.30209@samsung.com
State New

Commit Message

Evandro Menezes March 17, 2016, 10:12 p.m. UTC
Add precision choices for the reciprocal square root approximation

         Allow a target to prefer such an operation depending on the FP
         precision.

         gcc/
             * config/aarch64/aarch64-protos.h
             (AARCH64_EXTRA_TUNE_APPROX_RSQRT): New macro.
             * config/aarch64/aarch64-tuning-flags.def
             (AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF): New mask.
             (AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF): Likewise.
             * config/aarch64/aarch64.c
             (use_rsqrt_p): New argument for the mode.
             (aarch64_builtin_reciprocal): Devise mode from builtin.
             (aarch64_optab_supported_p): New argument for the mode.

This patch allows a target to choose for which FP precision the 
reciprocal square root approximation is used.

For example, though this approximation improves the performance 
noticeably for DF on A57, for SF, not so much, if at all.

Feedback appreciated.

Thank you,

Comments

Wilco Dijkstra March 18, 2016, 3:21 p.m. UTC | #1
Hi Evandro,

> For example, though this approximation improves the performance
> noticeably for DF on A57, for SF, not so much, if at all.

I'm still skeptical that you ever can get any gain on scalars. I bet the only gain is on
4x vectorized floats.

So what I would like to see is this implemented in a more general way. We should
be able to choose whether to expand depending on the mode - including whether it is
vectorized. For example enable on V4SFmode and maybe V2DFmode, but not 
on any scalars. 

Then we'd add new CPU tuning settings for division, sqrt and rsqrt (rather than adding lots
of extra tune flags). Note the md file should call a function in aarch64.c to decide whether to
expand or not (your division approximation patch makes the decision in the md file which
does not seem a good idea).

Wilco
Evandro Menezes March 18, 2016, 4:19 p.m. UTC | #2
On 03/18/16 10:21, Wilco Dijkstra wrote:
> Hi Evandro,
>
>> For example, though this approximation improves the performance
>> noticeably for DF on A57, for SF, not so much, if at all.
> I'm still skeptical that you ever can get any gain on scalars. I bet the only gain is on
> 4x vectorized floats.

I created a simple test that loops around an inline asm version of the 
Newton series using scalar insns and got these results on A57:

    1/sqrt(x):    18290898/s
    Fast:         45896823/s

    1/sqrtf(x):   69618490/s
    Fast:         61865874/s
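For reference, such a scalar Newton series for 1/sqrt(x) can be sketched in
plain C. This is a hypothetical reconstruction, not the inline asm from the
test: on AArch64 the seed and refinement would be FRSQRTE and FRSQRTS, which
are approximated here by a bit-level initial estimate and explicit Newton
steps:

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Crude initial estimate of 1/sqrt(x), standing in for FRSQRTE.  */
static float
rsqrt_estimate (float x)
{
  uint32_t i;
  float y;
  memcpy (&i, &x, sizeof i);
  i = 0x5f3759dfu - (i >> 1);	/* classic bit-level seed */
  memcpy (&y, &i, sizeof y);
  return y;
}

/* One Newton-Raphson step, y' = y * (1.5 - 0.5 * x * y * y),
   the refinement that FRSQRTS computes.  */
static float
rsqrt_step (float x, float y)
{
  return y * (1.5f - 0.5f * x * y * y);
}

float
fast_rsqrtf (float x)
{
  float y = rsqrt_estimate (x);
  y = rsqrt_step (x, y);	/* two steps suffice for SF precision */
  y = rsqrt_step (x, y);
  return y;
}
```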


> So what I would like to see is this implemented in a more general way. We should
> be able to choose whether to expand depending on the mode - including whether it is
> vectorized. For example enable on V4SFmode and maybe V2DFmode, but not
> on any scalars.
>
> Then we'd add new CPU tuning settings for division, sqrt and rsqrt (rather than adding lots
> of extra tune flags).

If I understood you correctly, would something like coarse tuning flags 
along with target-specific cost or parameter tables be what you have in 
mind?

> Note the md file should call a function in aarch64.c to decide whether to
> expand or not (your division approximation patch makes the decision in the md file which
> does not seem a good idea).

I agree.  Will modify it.

Thank you,
Wilco Dijkstra March 18, 2016, 10:20 p.m. UTC | #3
Evandro Menezes <e.menezes@samsung.com> wrote:
> On 03/18/16 10:21, Wilco Dijkstra wrote:
> > Hi Evandro,
> >
> >> For example, though this approximation improves the performance
> >> noticeably for DF on A57, for SF, not so much, if at all.
> > I'm still skeptical that you ever can get any gain on scalars. I bet the only gain is on
> > 4x vectorized floats.
>
> I created a simple test that loops around an inline asm version of the
> Newton series using scalar insns and got these results on A57:

That's pure max throughput rather than answering the question whether
it speeds up code that does real work. A test that loads an array of vectors and
writes back the unit vectors would be a more realistic scenario.

Note our testing showed rsqrt slows down various benchmarks:
https://gcc.gnu.org/ml/gcc-patches/2016-01/msg00574.html.

> If I understood you correctly, would something like coarse tuning flags
> along with target-specific cost or parameters tables be what you have in
> mind?

Yes, the magic tuning flags can be coarse (on/off is good enough). If we can
agree that these expansions are really only useful for 4x vectorized code and
not much else then all we need is a function that enables it for those modes. 
Otherwise we would need per-CPU settings that select which expansions are
enabled for which modes (not just single/double).

Wilco
Evandro Menezes March 18, 2016, 11 p.m. UTC | #4
On 03/18/16 17:20, Wilco Dijkstra wrote:
> Evandro Menezes <e.menezes@samsung.com> wrote:
>> On 03/18/16 10:21, Wilco Dijkstra wrote:
>>> Hi Evandro,
>>>
>>>> For example, though this approximation improves the performance
>>>> noticeably for DF on A57, for SF, not so much, if at all.
>>> I'm still skeptical that you ever can get any gain on scalars. I bet the only gain is on
>>> 4x vectorized floats.
>> I created a simple test that loops around an inline asm version of the
>> Newton series using scalar insns and got these results on A57:
> That's pure max throughput rather than answering the question whether
> it speeds up code that does real work. A test that loads an array of vectors and
> writes back the unit vectors would be a more realistic scenario.
>
> Note our testing showed rsqrt slows down various benchmarks:
> https://gcc.gnu.org/ml/gcc-patches/2016-01/msg00574.html.

I remember having seen that, but my point is that if A57 enabled this 
for only DF, it might be an overall improvement.

>> If I understood you correctly, would something like coarse tuning flags
>> along with target-specific cost or parameters tables be what you have in
>> mind?
> Yes, the magic tuning flags can be coarse (on/off is good enough). If we can
> agree that these expansions are really only useful for 4x vectorized code and
> not much else then all we need is a function that enables it for those modes.
> Otherwise we would need per-CPU settings that select which expansions are
> enabled for which modes (not just single/double).

Just to be clear, the flags refer to the inner mode, whether scalar or 
vector.
I'm not hopeful that it can be said that this is only useful when 
vectorized though.
Evandro Menezes March 31, 2016, 10:22 p.m. UTC | #5
On 03/18/16 18:00, Evandro Menezes wrote:
> On 03/18/16 17:20, Wilco Dijkstra wrote:
>> Evandro Menezes <e.menezes@samsung.com> wrote:
>>> On 03/18/16 10:21, Wilco Dijkstra wrote:
>>>> Hi Evandro,
>>>>
>>>>> For example, though this approximation improves the performance
>>>>> noticeably for DF on A57, for SF, not so much, if at all.
>>>> I'm still skeptical that you ever can get any gain on scalars. I 
>>>> bet the only gain is on
>>>> 4x vectorized floats.
>>> I created a simple test that loops around an inline asm version of the
>>> Newton series using scalar insns and got these results on A57:
>> That's pure max throughput rather than answering the question whether
>> it speeds up code that does real work. A test that loads an array of 
>> vectors and
>> writes back the unit vectors would be a more realistic scenario.
>>
>> Note our testing showed rsqrt slows down various benchmarks:
>> https://gcc.gnu.org/ml/gcc-patches/2016-01/msg00574.html.
>
> I remember having seen that, but my point is that if A57 enabled this 
> for only DF, it might be an overall improvement.
>
>>> If I understood you correctly, would something like coarse tuning flags
>>> along with target-specific cost or parameters tables be what you 
>>> have in
>>> mind?
>> Yes, the magic tuning flags can be coarse (on/off is good enough). If 
>> we can
>> agree that these expansions are really only useful for 4x vectorized 
>> code and
>> not much else then all we need is a function that enables it for 
>> those modes.
>> Otherwise we would need per-CPU settings that select which expansions 
>> are
>> enabled for which modes (not just single/double).
>
> Just to be clear, the flags refer to the inner mode, whether scalar or 
> vector.
> I'm not hopeful that it can be said that this is only useful when 
> vectorized though.
>

Ping^1
Wilco Dijkstra April 1, 2016, 1:47 p.m. UTC | #6
Evandro Menezes wrote:
>
> Ping^1

I haven't seen a newer version that incorporates my feedback. To recap, what
I'd like to see is a more general way to select approximations based on mode. 
I don't believe that looking at the inner mode works in general, and it doesn't
make sense to add internal tune flags for all possible combinations.

To give an idea of what I mean, it would be easiest to add a single field to the CPU
tuning structure that contains a mask for all the combinations. Then we call a
single function with the approximation kind, i.e. sqrt, rsqrt, div (x/y) or recip (1/x),
and the mode, which uses the CPU tuning field to decide whether it should be inlined.
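A minimal C sketch of that scheme, with illustrative names only (not actual
GCC interfaces): one per-CPU mask covers every (kind, mode) pair, and a single
predicate consults it:

```c
#include <assert.h>

/* Approximation kinds and modes; illustrative, not GCC's enums.  */
enum approx_kind { APPROX_SQRT, APPROX_RSQRT, APPROX_DIV, APPROX_RECIP };
enum approx_mode { MODE_SF, MODE_DF, MODE_V2SF, MODE_V4SF, MODE_V2DF,
		   MODE_NUM };

/* One bit per (kind, mode) combination.  */
#define APPROX_BIT(KIND, MODE) (1u << ((KIND) * MODE_NUM + (MODE)))

struct cpu_tuning
{
  unsigned approx_mask;		/* enabled (kind, mode) pairs */
};

/* A CPU that wants rsqrt approximated only for 4x vectorized floats.  */
static const struct cpu_tuning example_tuning
  = { APPROX_BIT (APPROX_RSQRT, MODE_V4SF) };

/* The single decision function the md file would call.  */
static int
use_approx_p (const struct cpu_tuning *tune,
	      enum approx_kind kind, enum approx_mode mode)
{
  return (tune->approx_mask & APPROX_BIT (kind, mode)) != 0;
}
```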

Cheers,
Wilco
James Greenhalgh April 1, 2016, 2:06 p.m. UTC | #7
On Fri, Apr 01, 2016 at 02:47:05PM +0100, Wilco Dijkstra wrote:
> Evandro Menezes wrote:
> >
> > Ping^1
> 
> I haven't seen a newer version that incorporates my feedback. To recap what
> I'd like to see is a more general way to select approximations based on mode.
> I don't believe that looking at the inner mode works in general, and it
> doesn't make sense to add internal tune flags for all possible combinations.

Agreed. I don't think that a flag for each element of the cartesian product of
{rsqrt,sqrt,div} X {SF,DF,V2SF,V4SF,V2DF} is a scalable solution - that's
at least 15 flags we'll need.

As I said earlier in the discussion, this particular split (between SF and
DF mode) seems strange to me. I'd expect that V4SF vs. SF would also be
interesting, and that a distinction between vector modes and scalar
modes would be more likely to be useful.

> To give an idea what I mean, it would be easiest to add a single field to the
> CPU tuning structure that contains a mask for all the combinations. Then we
> call a single function with approximation kind ie. sqrt, rsqrt, div (x/y),
> recip (1/x) and mode which uses the CPU tuning field to decide whether it
> should be inlined.

I like the idea of a single cost function.

These patches are well and truly on my radar for GCC 7, but as we're still
in bugfixing mode (and there's still plenty to do!), I'm not going to get
round to giving them a more detailed review until after the release. Feel
free to ping them again once GCC 6 has shipped.

Thanks,
James
Evandro Menezes April 1, 2016, 2:51 p.m. UTC | #8
On 04/01/16 08:47, Wilco Dijkstra wrote:
> Evandro Menezes wrote:
>> Ping^1
> I haven't seen a newer version that incorporates my feedback. To recap what
> I'd like to see is a more general way to select approximations based on mode.
> I don't believe that looking at the inner mode works in general, and it doesn't
> make sense to add internal tune flags for all possible combinations.
>
> To give an idea what I mean, it would be easiest to add a single field to the CPU
> tuning structure that contains a mask for all the combinations. Then we call a
> single function with approximation kind ie. sqrt, rsqrt, div (x/y), recip (1/x) and
> mode which uses the CPU tuning field to decide whether it should be inlined.

I think that I have a better idea of what you meant.

Thank you,
Evandro Menezes April 1, 2016, 2:53 p.m. UTC | #9
On 04/01/16 09:06, James Greenhalgh wrote:
> On Fri, Apr 01, 2016 at 02:47:05PM +0100, Wilco Dijkstra wrote:
>> Evandro Menezes wrote:
>>> Ping^1
>> I haven't seen a newer version that incorporates my feedback. To recap what
>> I'd like to see is a more general way to select approximations based on mode.
>> I don't believe that looking at the inner mode works in general, and it
>> doesn't make sense to add internal tune flags for all possible combinations.
> Agreed. I don't think that a flag for each of the cartesian product of
> {rsqrt,sqrt,div} X {SF,DF,V2SF,V4SF,V2DF} is a scalable solution - that's
> at least 15 flags we'll need.
>
> As I said earlier in the discussion, this particular split (between SF and
> DF mode) seems strange to me. I'd expect the V4SF vs. SF would also be
> interesting, and that a distinction between vector modes and scalar
> modes would be more likely to be useful.
>
>> To give an idea what I mean, it would be easiest to add a single field to the
>> CPU tuning structure that contains a mask for all the combinations. Then we
>> call a single function with approximation kind ie. sqrt, rsqrt, div (x/y),
>> recip (1/x) and mode which uses the CPU tuning field to decide whether it
>> should be inlined.
> I like the idea of a single cost function.

I'll go with it.

> These patches are well and truly on my radar for GCC 7, but as we're still
> in bugfixing mode (and there's still plenty to do!), I'm not going to get
> round to giving them a more detailed review until after the release. Feel
> free to ping them again once GCC 6 has shipped.

I've been proposing this change for a few months now and I'd really like 
to have it in 6.  I'd appreciate it if you'd consider this request, all 
things considered.

Thank you,

Patch

From 95581aefcf324233c3603f4d8232ee18c5836f8a Mon Sep 17 00:00:00 2001
From: Evandro Menezes <e.menezes@samsung.com>
Date: Thu, 17 Mar 2016 17:00:03 -0500
Subject: [PATCH] Add precision choices for the reciprocal square root
 approximation

Allow a target to prefer such an operation depending on the FP precision.

gcc/
	* config/aarch64/aarch64-protos.h
	(AARCH64_EXTRA_TUNE_APPROX_RSQRT): New macro.
	* config/aarch64/aarch64-tuning-flags.def
	(AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF): New mask.
	(AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF): Likewise.
	* config/aarch64/aarch64.c
	(use_rsqrt_p): New argument for the mode.
	(aarch64_builtin_reciprocal): Devise mode from builtin.
	(aarch64_optab_supported_p): New argument for the mode.
---
 gcc/config/aarch64/aarch64-protos.h         |  3 +++
 gcc/config/aarch64/aarch64-tuning-flags.def |  3 ++-
 gcc/config/aarch64/aarch64.c                | 23 +++++++++++++++--------
 3 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index dced209..58e5d73 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -263,6 +263,9 @@  enum aarch64_extra_tuning_flags
 };
 #undef AARCH64_EXTRA_TUNING_OPTION
 
+#define AARCH64_EXTRA_TUNE_APPROX_RSQRT \
+  (AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF | AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF)
+
 extern struct tune_params aarch64_tune_params;
 
 HOST_WIDE_INT aarch64_initial_elimination_offset (unsigned, unsigned);
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index 7e45a0c..57d9588 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -29,5 +29,6 @@ 
      AARCH64_TUNE_ to give an enum name. */
 
 AARCH64_EXTRA_TUNING_OPTION ("rename_fma_regs", RENAME_FMA_REGS)
-AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrt", APPROX_RSQRT)
+AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrt", APPROX_RSQRT_DF)
+AARCH64_EXTRA_TUNING_OPTION ("approx_rsqrtf", APPROX_RSQRT_SF)
 
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 12e498d..e651123 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -7440,12 +7440,16 @@  aarch64_memory_move_cost (machine_mode mode ATTRIBUTE_UNUSED,
    to optimize 1.0/sqrt.  */
 
 static bool
-use_rsqrt_p (void)
+use_rsqrt_p (machine_mode mode)
 {
   return (!flag_trapping_math
 	  && flag_unsafe_math_optimizations
-	  && ((aarch64_tune_params.extra_tuning_flags
-	       & AARCH64_EXTRA_TUNE_APPROX_RSQRT)
+	  && ((GET_MODE_INNER (mode) == SFmode
+	       && (aarch64_tune_params.extra_tuning_flags
+		   & AARCH64_EXTRA_TUNE_APPROX_RSQRT_SF))
+	      || (GET_MODE_INNER (mode) == DFmode
+		  && (aarch64_tune_params.extra_tuning_flags
+		      & AARCH64_EXTRA_TUNE_APPROX_RSQRT_DF))
 	      || flag_mrecip_low_precision_sqrt));
 }
 
@@ -7455,9 +7459,12 @@  use_rsqrt_p (void)
 static tree
 aarch64_builtin_reciprocal (tree fndecl)
 {
-  if (!use_rsqrt_p ())
-    return NULL_TREE;
-  return aarch64_builtin_rsqrt (DECL_FUNCTION_CODE (fndecl));
+  machine_mode mode = TYPE_MODE (TREE_TYPE (fndecl));
+
+  if (use_rsqrt_p (mode))
+    return aarch64_builtin_rsqrt (DECL_FUNCTION_CODE (fndecl));
+
+  return NULL_TREE;
 }
 
 typedef rtx (*rsqrte_type) (rtx, rtx);
@@ -13952,13 +13959,13 @@  aarch64_promoted_type (const_tree t)
 /* Implement the TARGET_OPTAB_SUPPORTED_P hook.  */
 
 static bool
-aarch64_optab_supported_p (int op, machine_mode, machine_mode,
+aarch64_optab_supported_p (int op, machine_mode mode1, machine_mode,
 			   optimization_type opt_type)
 {
   switch (op)
     {
     case rsqrt_optab:
-      return opt_type == OPTIMIZE_FOR_SPEED && use_rsqrt_p ();
+      return opt_type == OPTIMIZE_FOR_SPEED && use_rsqrt_p (mode1);
 
     default:
       return true;
-- 
1.9.1