diff mbox

[aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math

Message ID 027701d0ae9c$b8f3eff0$2adbcfd0$@samsung.com
State New
Headers show

Commit Message

Evandro Menezes June 24, 2015, 4:42 p.m. UTC
Benedikt,

You beat me to it! :-)  Do you have the implementation for dividing using
the Newton series as well?

I'm not sure that the series is always for all data types and on all
processors.  It would be useful to allow each AArch64 processor to enable
this or not depending on the data type.  BTW, do you have some tests showing
the speed up?

Thank you,

Comments

Philipp Tomsich June 24, 2015, 4:52 p.m. UTC | #1
Evandro,

We’ve seen a 28% speed-up on gromacs in SPECfp for the (scalar) reciprocal sqrt.

Also, the “reciprocal divide” patches are floating around in various of our git-tree, but 
aren’t ready for public consumption, yet… I’ll leave Benedikt to comment on potential 
timelines for getting that pushed out.

Best,
Philipp.

> On 24 Jun 2015, at 18:42, Evandro Menezes <e.menezes@samsung.com> wrote:
> 
> Benedikt,
> 
> You beat me to it! :-)  Do you have the implementation for dividing using
> the Newton series as well?
> 
> I'm not sure that the series is always for all data types and on all
> processors.  It would be useful to allow each AArch64 processor to enable
> this or not depending on the data type.  BTW, do you have some tests showing
> the speed up?
> 
> Thank you,
> 
> -- 
> Evandro Menezes                              Austin, TX
> 
>> -----Original Message-----
>> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org]
> On
>> Behalf Of Benedikt Huber
>> Sent: Thursday, June 18, 2015 7:04
>> To: gcc-patches@gcc.gnu.org
>> Cc: benedikt.huber@theobroma-systems.com; philipp.tomsich@theobroma-
>> systems.com
>> Subject: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
>> estimation in -ffast-math
>> 
>> arch64 offers the instructions frsqrte and frsqrts, for rsqrt estimation
> and
>> a Newton-Raphson step, respectively.
>> There are ARMv8 implementations where this is faster than using fdiv and
>> rsqrt.
>> It runs three steps for double and two steps for float to achieve the
> needed
>> precision.
>> 
>> There is one caveat and open question.
>> Since -ffast-math enables flush to zero intermediate values between
>> approximation steps will be flushed to zero if they are denormal.
>> E.g. This happens in the case of rsqrt (DBL_MAX) and rsqrtf (FLT_MAX).
>> The test cases pass, but it is unclear to me whether this is expected
>> behavior with -ffast-math.
>> 
>> The patch applies to commit:
>> svn+ssh://gcc.gnu.org/svn/gcc/trunk@224470
>> 
>> Please consider including this patch.
>> Thank you and best regards,
>> Benedikt Huber
>> 
>> Benedikt Huber (1):
>>  2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-systems.com>
>> 
>> gcc/ChangeLog                            |   9 +++
>> gcc/config/aarch64/aarch64-builtins.c    |  60 ++++++++++++++++
>> gcc/config/aarch64/aarch64-protos.h      |   2 +
>> gcc/config/aarch64/aarch64-simd.md       |  27 ++++++++
>> gcc/config/aarch64/aarch64.c             |  63 +++++++++++++++++
>> gcc/config/aarch64/aarch64.md            |   3 +
>> gcc/testsuite/gcc.target/aarch64/rsqrt.c | 113
>> +++++++++++++++++++++++++++++++
>> 7 files changed, 277 insertions(+)
>> create mode 100644 gcc/testsuite/gcc.target/aarch64/rsqrt.c
>> 
>> --
>> 1.9.1
> <Mail Attachment.eml>
Benedikt Huber June 24, 2015, 5:10 p.m. UTC | #2
Evandro,

Yes, we also have the 1/x approximation.
However we do not have the test cases yet, and it also would need some clean up.
I am going to provide a patch for that soon (say next week).
Also, for this optimization we have *not* yet found a benchmark with significant improvements.

Best Regards,
Benedikt


> On 24 Jun 2015, at 18:52, Dr. Philipp Tomsich <philipp.tomsich@theobroma-systems.com> wrote:
> 
> Evandro,
> 
> We’ve seen a 28% speed-up on gromacs in SPECfp for the (scalar) reciprocal sqrt.
> 
> Also, the “reciprocal divide” patches are floating around in various of our git-tree, but
> aren’t ready for public consumption, yet… I’ll leave Benedikt to comment on potential
> timelines for getting that pushed out.
> 
> Best,
> Philipp.
> 
>> On 24 Jun 2015, at 18:42, Evandro Menezes <e.menezes@samsung.com> wrote:
>> 
>> Benedikt,
>> 
>> You beat me to it! :-)  Do you have the implementation for dividing using
>> the Newton series as well?
>> 
>> I'm not sure that the series is always for all data types and on all
>> processors.  It would be useful to allow each AArch64 processor to enable
>> this or not depending on the data type.  BTW, do you have some tests showing
>> the speed up?
>> 
>> Thank you,
>> 
>> --
>> Evandro Menezes                              Austin, TX
>> 
>>> -----Original Message-----
>>> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org]
>> On
>>> Behalf Of Benedikt Huber
>>> Sent: Thursday, June 18, 2015 7:04
>>> To: gcc-patches@gcc.gnu.org
>>> Cc: benedikt.huber@theobroma-systems.com; philipp.tomsich@theobroma-
>>> systems.com
>>> Subject: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
>>> estimation in -ffast-math
>>> 
>>> arch64 offers the instructions frsqrte and frsqrts, for rsqrt estimation
>> and
>>> a Newton-Raphson step, respectively.
>>> There are ARMv8 implementations where this is faster than using fdiv and
>>> rsqrt.
>>> It runs three steps for double and two steps for float to achieve the
>> needed
>>> precision.
>>> 
>>> There is one caveat and open question.
>>> Since -ffast-math enables flush to zero intermediate values between
>>> approximation steps will be flushed to zero if they are denormal.
>>> E.g. This happens in the case of rsqrt (DBL_MAX) and rsqrtf (FLT_MAX).
>>> The test cases pass, but it is unclear to me whether this is expected
>>> behavior with -ffast-math.
>>> 
>>> The patch applies to commit:
>>> svn+ssh://gcc.gnu.org/svn/gcc/trunk@224470
>>> 
>>> Please consider including this patch.
>>> Thank you and best regards,
>>> Benedikt Huber
>>> 
>>> Benedikt Huber (1):
>>> 2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-systems.com>
>>> 
>>> gcc/ChangeLog                            |   9 +++
>>> gcc/config/aarch64/aarch64-builtins.c    |  60 ++++++++++++++++
>>> gcc/config/aarch64/aarch64-protos.h      |   2 +
>>> gcc/config/aarch64/aarch64-simd.md       |  27 ++++++++
>>> gcc/config/aarch64/aarch64.c             |  63 +++++++++++++++++
>>> gcc/config/aarch64/aarch64.md            |   3 +
>>> gcc/testsuite/gcc.target/aarch64/rsqrt.c | 113
>>> +++++++++++++++++++++++++++++++
>>> 7 files changed, 277 insertions(+)
>>> create mode 100644 gcc/testsuite/gcc.target/aarch64/rsqrt.c
>>> 
>>> --
>>> 1.9.1
>> <Mail Attachment.eml>
>
Evandro Menezes June 24, 2015, 6:19 p.m. UTC | #3
Benedikt,

Are you developing the reciprocal approximation just for 1/x proper or for any division, as in x/y = x * 1/y?

Thank you,
Philipp Tomsich June 24, 2015, 8:08 p.m. UTC | #4
Evandro,

Shouldn't ‘execute_cse_reciprocals_1’ take care of this, once the reciprocal-division is implemented?
Do you think there’s additional work needed to catch all cases/opportunities?

Best,
Philipp.

> On 24 Jun 2015, at 20:19, Evandro Menezes <e.menezes@samsung.com> wrote:
> 
> Benedikt,
> 
> Are you developing the reciprocal approximation just for 1/x proper or for any division, as in x/y = x * 1/y?
> 
> Thank you,
> 
> -- 
> Evandro Menezes                              Austin, TX
> 
> 
>> -----Original Message-----
>> From: Benedikt Huber [mailto:benedikt.huber@theobroma-systems.com]
>> Sent: Wednesday, June 24, 2015 12:11
>> To: Dr. Philipp Tomsich
>> Cc: Evandro Menezes; gcc-patches@gcc.gnu.org
>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
>> estimation in -ffast-math
>> 
>> Evandro,
>> 
>> Yes, we also have the 1/x approximation.
>> However we do not have the test cases yet, and it also would need some clean
>> up.
>> I am going to provide a patch for that soon (say next week).
>> Also, for this optimization we have *not* yet found a benchmark with
>> significant improvements.
>> 
>> Best Regards,
>> Benedikt
>> 
>> 
>>> On 24 Jun 2015, at 18:52, Dr. Philipp Tomsich <philipp.tomsich@theobroma-
>> systems.com> wrote:
>>> 
>>> Evandro,
>>> 
>>> We’ve seen a 28% speed-up on gromacs in SPECfp for the (scalar) reciprocal
>> sqrt.
>>> 
>>> Also, the “reciprocal divide” patches are floating around in various
>>> of our git-tree, but aren’t ready for public consumption, yet… I’ll
>>> leave Benedikt to comment on potential timelines for getting that pushed
>> out.
>>> 
>>> Best,
>>> Philipp.
>>> 
>>>> On 24 Jun 2015, at 18:42, Evandro Menezes <e.menezes@samsung.com> wrote:
>>>> 
>>>> Benedikt,
>>>> 
>>>> You beat me to it! :-)  Do you have the implementation for dividing
>>>> using the Newton series as well?
>>>> 
>>>> I'm not sure that the series is always for all data types and on all
>>>> processors.  It would be useful to allow each AArch64 processor to
>>>> enable this or not depending on the data type.  BTW, do you have some
>>>> tests showing the speed up?
>>>> 
>>>> Thank you,
>>>> 
>>>> --
>>>> Evandro Menezes                              Austin, TX
>>>> 
>>>>> -----Original Message-----
>>>>> From: gcc-patches-owner@gcc.gnu.org
>>>>> [mailto:gcc-patches-owner@gcc.gnu.org]
>>>> On
>>>>> Behalf Of Benedikt Huber
>>>>> Sent: Thursday, June 18, 2015 7:04
>>>>> To: gcc-patches@gcc.gnu.org
>>>>> Cc: benedikt.huber@theobroma-systems.com; philipp.tomsich@theobroma-
>>>>> systems.com
>>>>> Subject: [PATCH] [aarch64] Implemented reciprocal square root
>>>>> (rsqrt) estimation in -ffast-math
>>>>> 
>>>>> arch64 offers the instructions frsqrte and frsqrts, for rsqrt
>>>>> estimation
>>>> and
>>>>> a Newton-Raphson step, respectively.
>>>>> There are ARMv8 implementations where this is faster than using fdiv
>>>>> and rsqrt.
>>>>> It runs three steps for double and two steps for float to achieve
>>>>> the
>>>> needed
>>>>> precision.
>>>>> 
>>>>> There is one caveat and open question.
>>>>> Since -ffast-math enables flush to zero intermediate values between
>>>>> approximation steps will be flushed to zero if they are denormal.
>>>>> E.g. This happens in the case of rsqrt (DBL_MAX) and rsqrtf (FLT_MAX).
>>>>> The test cases pass, but it is unclear to me whether this is
>>>>> expected behavior with -ffast-math.
>>>>> 
>>>>> The patch applies to commit:
>>>>> svn+ssh://gcc.gnu.org/svn/gcc/trunk@224470
>>>>> 
>>>>> Please consider including this patch.
>>>>> Thank you and best regards,
>>>>> Benedikt Huber
>>>>> 
>>>>> Benedikt Huber (1):
>>>>> 2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-systems.com>
>>>>> 
>>>>> gcc/ChangeLog                            |   9 +++
>>>>> gcc/config/aarch64/aarch64-builtins.c    |  60 ++++++++++++++++
>>>>> gcc/config/aarch64/aarch64-protos.h      |   2 +
>>>>> gcc/config/aarch64/aarch64-simd.md       |  27 ++++++++
>>>>> gcc/config/aarch64/aarch64.c             |  63 +++++++++++++++++
>>>>> gcc/config/aarch64/aarch64.md            |   3 +
>>>>> gcc/testsuite/gcc.target/aarch64/rsqrt.c | 113
>>>>> +++++++++++++++++++++++++++++++
>>>>> 7 files changed, 277 insertions(+)
>>>>> create mode 100644 gcc/testsuite/gcc.target/aarch64/rsqrt.c
>>>>> 
>>>>> --
>>>>> 1.9.1
>>>> <Mail Attachment.eml>
>>> 
> 
>
Evandro Menezes June 24, 2015, 8:39 p.m. UTC | #5
Philipp,

I think that execute_cse_reciprocals_1() applies only when the denominator is known at compile-time, otherwise the division stays.  It doesn't seem to know whether the target supports the approximate reciprocal or not.

Cheers,
Kumar, Venkataramanan June 25, 2015, 5:14 a.m. UTC | #6
Hi, 

If I understand correct, current implementation replaces 

fdiv 
fsqrt

 by  
 frsqrte
for i=0 to 3
fmul
frsqrts  
fmul

So I think gains depends latency of  frsqrts  insn.

I see patch has patterns for  vector versions of frsqrts, but does not enable them?

Regards,
Venkat.

> -----Original Message-----

> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-

> owner@gcc.gnu.org] On Behalf Of Dr. Philipp Tomsich

> Sent: Wednesday, June 24, 2015 10:22 PM

> To: Evandro Menezes

> Cc: Benedikt Huber; gcc-patches@gcc.gnu.org

> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)

> estimation in -ffast-math

> 

> Evandro,

> 

> We’ve seen a 28% speed-up on gromacs in SPECfp for the (scalar) reciprocal

> sqrt.

> 

> Also, the “reciprocal divide” patches are floating around in various of our git-

> tree, but aren’t ready for public consumption, yet… I’ll leave Benedikt to

> comment on potential timelines for getting that pushed out.

> 

> Best,

> Philipp.

> 

> > On 24 Jun 2015, at 18:42, Evandro Menezes <e.menezes@samsung.com>

> wrote:

> >

> > Benedikt,

> >

> > You beat me to it! :-)  Do you have the implementation for dividing

> > using the Newton series as well?

> >

> > I'm not sure that the series is always for all data types and on all

> > processors.  It would be useful to allow each AArch64 processor to

> > enable this or not depending on the data type.  BTW, do you have some

> > tests showing the speed up?

> >

> > Thank you,

> >

> > --

> > Evandro Menezes                              Austin, TX

> >

> >> -----Original Message-----

> >> From: gcc-patches-owner@gcc.gnu.org

> >> [mailto:gcc-patches-owner@gcc.gnu.org]

> > On

> >> Behalf Of Benedikt Huber

> >> Sent: Thursday, June 18, 2015 7:04

> >> To: gcc-patches@gcc.gnu.org

> >> Cc: benedikt.huber@theobroma-systems.com;

> philipp.tomsich@theobroma-

> >> systems.com

> >> Subject: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)

> >> estimation in -ffast-math

> >>

> >> arch64 offers the instructions frsqrte and frsqrts, for rsqrt

> >> estimation

> > and

> >> a Newton-Raphson step, respectively.

> >> There are ARMv8 implementations where this is faster than using fdiv

> >> and rsqrt.

> >> It runs three steps for double and two steps for float to achieve the

> > needed

> >> precision.

> >>

> >> There is one caveat and open question.

> >> Since -ffast-math enables flush to zero intermediate values between

> >> approximation steps will be flushed to zero if they are denormal.

> >> E.g. This happens in the case of rsqrt (DBL_MAX) and rsqrtf (FLT_MAX).

> >> The test cases pass, but it is unclear to me whether this is expected

> >> behavior with -ffast-math.

> >>

> >> The patch applies to commit:

> >> svn+ssh://gcc.gnu.org/svn/gcc/trunk@224470

> >>

> >> Please consider including this patch.

> >> Thank you and best regards,

> >> Benedikt Huber

> >>

> >> Benedikt Huber (1):

> >>  2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-

> systems.com>

> >>

> >> gcc/ChangeLog                            |   9 +++

> >> gcc/config/aarch64/aarch64-builtins.c    |  60 ++++++++++++++++

> >> gcc/config/aarch64/aarch64-protos.h      |   2 +

> >> gcc/config/aarch64/aarch64-simd.md       |  27 ++++++++

> >> gcc/config/aarch64/aarch64.c             |  63 +++++++++++++++++

> >> gcc/config/aarch64/aarch64.md            |   3 +

> >> gcc/testsuite/gcc.target/aarch64/rsqrt.c | 113

> >> +++++++++++++++++++++++++++++++

> >> 7 files changed, 277 insertions(+)

> >> create mode 100644 gcc/testsuite/gcc.target/aarch64/rsqrt.c

> >>

> >> --

> >> 1.9.1

> > <Mail Attachment.eml>
Benedikt Huber June 25, 2015, 11:47 a.m. UTC | #7
I did not look at execute_cse_reciprocals_1(), yet.

However, with the recip-patch applied:

double recipd (double a, double b)
{
  return a/b;
}

translates to

recipd:
  frecpe  d2, d1
  frecps  d3, d2, d1
  fmul  d2, d2, d3
  frecps  d3, d2, d1
  fmul  d2, d2, d3
  frecps  d1, d2, d1
  fmul  d2, d2, d1
  fmul  d0, d2, d0
  ret

float recipf (float a, float b)
{
  return a/b;
}

translates to

recipf:
  frecpe  s2, s1
  frecps  s3, s2, s1
  fmul  s2, s2, s3
  frecps  s1, s2, s1
  fmul  s2, s2, s1
  fmul  s0, s2, s0
  ret

So it seems, that it works also for a generic division.

Best regards,
Benedikt

> On 24 Jun 2015, at 22:39, Evandro Menezes <e.menezes@samsung.com> wrote:
> 
> Philipp,
> 
> I think that execute_cse_reciprocals_1() applies only when the denominator is known at compile-time, otherwise the division stays.  It doesn't seem to know whether the target supports the approximate reciprocal or not.
> 
> Cheers,
> 
> --
> Evandro Menezes                              Austin, TX
> 
> 
>> -----Original Message-----
>> From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-owner@gcc.gnu.org] On
>> Behalf Of Dr. Philipp Tomsich
>> Sent: Wednesday, June 24, 2015 15:08
>> To: Evandro Menezes
>> Cc: Benedikt Huber; gcc-patches@gcc.gnu.org
>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
>> estimation in -ffast-math
>> 
>> Evandro,
>> 
>> Shouldn't ‘execute_cse_reciprocals_1’ take care of this, once the reciprocal-
>> division is implemented?
>> Do you think there’s additional work needed to catch all cases/opportunities?
>> 
>> Best,
>> Philipp.
>> 
>>> On 24 Jun 2015, at 20:19, Evandro Menezes <e.menezes@samsung.com> wrote:
>>> 
>>> Benedikt,
>>> 
>>> Are you developing the reciprocal approximation just for 1/x proper or for
>> any division, as in x/y = x * 1/y?
>>> 
>>> Thank you,
>>> 
>>> --
>>> Evandro Menezes                              Austin, TX
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: Benedikt Huber [mailto:benedikt.huber@theobroma-systems.com]
>>>> Sent: Wednesday, June 24, 2015 12:11
>>>> To: Dr. Philipp Tomsich
>>>> Cc: Evandro Menezes; gcc-patches@gcc.gnu.org
>>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
>>>> (rsqrt) estimation in -ffast-math
>>>> 
>>>> Evandro,
>>>> 
>>>> Yes, we also have the 1/x approximation.
>>>> However we do not have the test cases yet, and it also would need
>>>> some clean up.
>>>> I am going to provide a patch for that soon (say next week).
>>>> Also, for this optimization we have *not* yet found a benchmark with
>>>> significant improvements.
>>>> 
>>>> Best Regards,
>>>> Benedikt
>>>> 
>>>> 
>>>>> On 24 Jun 2015, at 18:52, Dr. Philipp Tomsich
>>>>> <philipp.tomsich@theobroma-
>>>> systems.com> wrote:
>>>>> 
>>>>> Evandro,
>>>>> 
>>>>> We’ve seen a 28% speed-up on gromacs in SPECfp for the (scalar)
>>>>> reciprocal
>>>> sqrt.
>>>>> 
>>>>> Also, the “reciprocal divide” patches are floating around in various
>>>>> of our git-tree, but aren’t ready for public consumption, yet… I’ll
>>>>> leave Benedikt to comment on potential timelines for getting that
>>>>> pushed
>>>> out.
>>>>> 
>>>>> Best,
>>>>> Philipp.
>>>>> 
>>>>>> On 24 Jun 2015, at 18:42, Evandro Menezes <e.menezes@samsung.com> wrote:
>>>>>> 
>>>>>> Benedikt,
>>>>>> 
>>>>>> You beat me to it! :-)  Do you have the implementation for dividing
>>>>>> using the Newton series as well?
>>>>>> 
>>>>>> I'm not sure that the series is always for all data types and on
>>>>>> all processors.  It would be useful to allow each AArch64 processor
>>>>>> to enable this or not depending on the data type.  BTW, do you have
>>>>>> some tests showing the speed up?
>>>>>> 
>>>>>> Thank you,
>>>>>> 
>>>>>> --
>>>>>> Evandro Menezes                              Austin, TX
>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: gcc-patches-owner@gcc.gnu.org
>>>>>>> [mailto:gcc-patches-owner@gcc.gnu.org]
>>>>>> On
>>>>>>> Behalf Of Benedikt Huber
>>>>>>> Sent: Thursday, June 18, 2015 7:04
>>>>>>> To: gcc-patches@gcc.gnu.org
>>>>>>> Cc: benedikt.huber@theobroma-systems.com;
>>>>>>> philipp.tomsich@theobroma- systems.com
>>>>>>> Subject: [PATCH] [aarch64] Implemented reciprocal square root
>>>>>>> (rsqrt) estimation in -ffast-math
>>>>>>> 
>>>>>>> arch64 offers the instructions frsqrte and frsqrts, for rsqrt
>>>>>>> estimation
>>>>>> and
>>>>>>> a Newton-Raphson step, respectively.
>>>>>>> There are ARMv8 implementations where this is faster than using
>>>>>>> fdiv and rsqrt.
>>>>>>> It runs three steps for double and two steps for float to achieve
>>>>>>> the
>>>>>> needed
>>>>>>> precision.
>>>>>>> 
>>>>>>> There is one caveat and open question.
>>>>>>> Since -ffast-math enables flush to zero intermediate values
>>>>>>> between approximation steps will be flushed to zero if they are
>> denormal.
>>>>>>> E.g. This happens in the case of rsqrt (DBL_MAX) and rsqrtf (FLT_MAX).
>>>>>>> The test cases pass, but it is unclear to me whether this is
>>>>>>> expected behavior with -ffast-math.
>>>>>>> 
>>>>>>> The patch applies to commit:
>>>>>>> svn+ssh://gcc.gnu.org/svn/gcc/trunk@224470
>>>>>>> 
>>>>>>> Please consider including this patch.
>>>>>>> Thank you and best regards,
>>>>>>> Benedikt Huber
>>>>>>> 
>>>>>>> Benedikt Huber (1):
>>>>>>> 2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-systems.com>
>>>>>>> 
>>>>>>> gcc/ChangeLog                            |   9 +++
>>>>>>> gcc/config/aarch64/aarch64-builtins.c    |  60 ++++++++++++++++
>>>>>>> gcc/config/aarch64/aarch64-protos.h      |   2 +
>>>>>>> gcc/config/aarch64/aarch64-simd.md       |  27 ++++++++
>>>>>>> gcc/config/aarch64/aarch64.c             |  63 +++++++++++++++++
>>>>>>> gcc/config/aarch64/aarch64.md            |   3 +
>>>>>>> gcc/testsuite/gcc.target/aarch64/rsqrt.c | 113
>>>>>>> +++++++++++++++++++++++++++++++
>>>>>>> 7 files changed, 277 insertions(+) create mode 100644
>>>>>>> gcc/testsuite/gcc.target/aarch64/rsqrt.c
>>>>>>> 
>>>>>>> --
>>>>>>> 1.9.1
>>>>>> <Mail Attachment.eml>
>>>>> 
>>> 
>>> 
>
diff mbox

Patch

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index c9b156f..690ebba 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,12 @@ 
+2015-06-15  Benedikt Huber  <benedikt.huber@theobroma-systems.com>
+
+	* config/aarch64/aarch64-builtins.c: Builtins for rsqrt and rsqrtf.
+	* config/aarch64/aarch64-protos.h: Declare.
+	* config/aarch64/aarch64-simd.md: Matching expressions for frsqrte
and frsqrts.
+	* config/aarch64/aarch64.c: New functions. Emit rsqrt estimation
code in fast math mode.
+	* config/aarch64/aarch64.md: Added enum entry.
+	* testsuite/gcc.target/aarch64/rsqrt.c: Tests for single and double.
+
 2015-06-14  Richard Sandiford  <richard.sandiford@arm.com>
 
 	* rtl.h (classify_insn): Declare.
diff --git a/gcc/config/aarch64/aarch64-builtins.c
b/gcc/config/aarch64/aarch64-builtins.c
index f7a39ec..484bb84 100644
--- a/gcc/config/aarch64/aarch64-builtins.c
+++ b/gcc/config/aarch64/aarch64-builtins.c
@@ -342,6 +342,8 @@  enum aarch64_builtins
   AARCH64_BUILTIN_GET_FPSR,
   AARCH64_BUILTIN_SET_FPSR,
 
+  AARCH64_BUILTIN_RSQRT,
+  AARCH64_BUILTIN_RSQRTF,
   AARCH64_SIMD_BUILTIN_BASE,
   AARCH64_SIMD_BUILTIN_LANE_CHECK,
 #include "aarch64-simd-builtins.def"
@@ -831,6 +833,32 @@  aarch64_init_crc32_builtins ()
 }
 
 void
+aarch64_add_builtin_rsqrt (void)
+{
+  tree fndecl = NULL;
+  tree ftype = NULL;
+  ftype = build_function_type_list (double_type_node, double_type_node,
NULL_TREE);
+
+  fndecl = add_builtin_function ("__builtin_aarch64_rsqrt",
+                                 ftype,
+                                 AARCH64_BUILTIN_RSQRT,
+                                 BUILT_IN_MD,
+                                 NULL,
+                                 NULL_TREE);
+  aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT] = fndecl;
+
+  tree ftypef = NULL;
+  ftypef = build_function_type_list (float_type_node, float_type_node,
NULL_TREE);
+  fndecl = add_builtin_function ("__builtin_aarch64_rsqrtf",
+                                 ftypef,
+                                 AARCH64_BUILTIN_RSQRTF,
+                                 BUILT_IN_MD,
+                                 NULL,
+                                 NULL_TREE);
+  aarch64_builtin_decls[AARCH64_BUILTIN_RSQRTF] = fndecl;
+}
+
+void
 aarch64_init_builtins (void)
 {
   tree ftype_set_fpr
@@ -855,6 +883,7 @@  aarch64_init_builtins (void)
     aarch64_init_simd_builtins ();
   if (TARGET_CRC32)
     aarch64_init_crc32_builtins ();
+  aarch64_add_builtin_rsqrt ();
 }
 
 tree
@@ -1099,6 +1128,23 @@  aarch64_crc32_expand_builtin (int fcode, tree exp,
rtx target)
   return target;
 }
 
+static rtx
+aarch64_expand_builtin_rsqrt (int fcode, tree exp, rtx target)
+{
+  rtx pat;
+  tree arg0 = CALL_EXPR_ARG (exp, 0);
+  rtx op0 = expand_normal (arg0);
+
+  enum insn_code c = CODE_FOR_rsqrtdf;
+  if (fcode == AARCH64_BUILTIN_RSQRTF)
+    c = CODE_FOR_rsqrtsf;
+
+  pat = GEN_FCN (c) (target, op0);
+  emit_insn (pat);
+
+  return target;
+}
+
 /* Expand an expression EXP that calls a built-in function,
    with result going to TARGET if that's convenient.  */
 rtx
@@ -1146,6 +1192,11 @@  aarch64_expand_builtin (tree exp,
   else if (fcode >= AARCH64_CRC32_BUILTIN_BASE && fcode <=
AARCH64_CRC32_BUILTIN_MAX)
     return aarch64_crc32_expand_builtin (fcode, exp, target);
 
+  if (fcode == AARCH64_BUILTIN_RSQRT ||
+      fcode == AARCH64_BUILTIN_RSQRTF)
+    return aarch64_expand_builtin_rsqrt (fcode, exp, target);
+
+  return NULL_RTX;
   gcc_unreachable ();
 }
 
@@ -1303,6 +1354,15 @@  aarch64_builtin_vectorized_function (tree fndecl,
tree type_out, tree type_in)
   return NULL_TREE;
 }
 
+tree
+aarch64_builtin_rsqrt (bool is_float)
+{
+  if (is_float)
+    return aarch64_builtin_decls[AARCH64_BUILTIN_RSQRTF];
+  else
+    return aarch64_builtin_decls[AARCH64_BUILTIN_RSQRT];
+}
+
 #undef VAR1
 #define VAR1(T, N, MAP, A) \
   case AARCH64_SIMD_BUILTIN_##T##_##N##A:
diff --git a/gcc/config/aarch64/aarch64-protos.h
b/gcc/config/aarch64/aarch64-protos.h
index 965a11b..4f1c8ce 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -270,6 +270,8 @@  void aarch64_print_operand (FILE *, rtx, char);
 void aarch64_print_operand_address (FILE *, rtx);
 void aarch64_emit_call_insn (rtx);
 
+void aarch64_emit_swrsqrt (rtx, rtx);
+
 /* Initialize builtins for SIMD intrinsics.  */
 void init_aarch64_simd_builtins (void);
 
diff --git a/gcc/config/aarch64/aarch64-simd.md
b/gcc/config/aarch64/aarch64-simd.md
index b90f938..266800a 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -353,6 +353,33 @@ 
   [(set_attr "type" "neon_fp_mul_d_scalar_q")]
 )
 
+(define_insn "*rsqrte_simd"
+  [(set (match_operand:VALLF 0 "register_operand" "=w")
+	(unspec:VALLF [(match_operand:VALLF 1 "register_operand" "w")]
+		     UNSPEC_RSQRTE))]
+  "TARGET_SIMD"
+  "frsqrte\\t%<v>0<Vmtype>, %<v>1<Vmtype>"
+  [(set_attr "type" "neon_fp_rsqrte_<Vetype><q>")])
+
+(define_insn "*rsqrts_simd"
+  [(set (match_operand:VALLF 0 "register_operand" "=w")
+	(unspec:VALLF [(match_operand:VALLF 1 "register_operand" "w")
+               (match_operand:VALLF 2 "register_operand" "w")]
+		     UNSPEC_RSQRTS))]
+  "TARGET_SIMD"
+  "frsqrts\\t%<v>0<Vmtype>, %<v>1<Vmtype>, %<v>2<Vmtype>"
+  [(set_attr "type" "neon_fp_rsqrts_<Vetype><q>")])
+
+(define_expand "rsqrt<mode>"
+  [(set (match_operand:GPF 0 "register_operand" "=w")
+	(unspec:GPF [(match_operand:GPF 1 "register_operand" "w")]
+		     UNSPEC_RSQRT))]
+  "TARGET_FLOAT"
+{
+  aarch64_emit_swrsqrt (operands[0], operands[1]);
+  DONE;
+})
+
 (define_insn "*aarch64_mul3_elt_to_64v2df"
   [(set (match_operand:DF 0 "register_operand" "=w")
      (mult:DF
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index c3c2795..281f0b2 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -6816,6 +6816,66 @@  aarch64_memory_move_cost (machine_mode mode
ATTRIBUTE_UNUSED,
   return aarch64_tune_params->memmov_cost;
 }
 
+extern tree aarch64_builtin_rsqrt (bool is_float);
+
+static tree
+aarch64_builtin_reciprocal (unsigned int fn,
+                            bool md_fn ATTRIBUTE_UNUSED,
+                            bool sqrt ATTRIBUTE_UNUSED)
+{
+  if (!fast_math_flags_set_p (&global_options))
+    return NULL_TREE;
+
+  if (fn == BUILT_IN_SQRTF)
+    return aarch64_builtin_rsqrt (true);
+  else if (fn == BUILT_IN_SQRT)
+    return aarch64_builtin_rsqrt (false);
+  else
+    return NULL_TREE;
+}
+
+void
+aarch64_emit_swrsqrt (rtx dst, rtx src)
+{
+  enum machine_mode mode = GET_MODE (src);
+  rtx xsrc = gen_reg_rtx (mode);
+  emit_set_insn (xsrc, src);
+
+  rtx x0 = gen_reg_rtx (mode);
+  emit_insn (gen_rtx_SET (x0,
+                          gen_rtx_UNSPEC (mode, gen_rtvec (1, xsrc),
+                                          UNSPEC_RSQRTE)));
+
+  bool double_mode = (DFmode   == mode ||
+                      V1DFmode == mode ||
+                      V2DFmode == mode ||
+                      V4DFmode == mode ||
+                      V6DFmode == mode ||
+                      V8DFmode == mode);
+
+  int iterations = 2;
+  if (double_mode)
+    iterations = 3;
+
+  for (int i = 0; i < iterations; ++i)
+    {
+      rtx x1 = gen_reg_rtx (mode);
+      rtx x2 = gen_reg_rtx (mode);
+      rtx x3 = gen_reg_rtx (mode);
+      emit_insn (gen_rtx_SET (x2,
+                              gen_rtx_MULT (mode, x0, x0)));
+      emit_insn (gen_rtx_SET (x3,
+                              gen_rtx_UNSPEC (mode, gen_rtvec (2, xsrc,
x2),
+                                              UNSPEC_RSQRTS)));
+      emit_insn (gen_rtx_SET (x1,
+                              gen_rtx_MULT (mode, x0, x3)));
+      x0 = x1;
+    }
+
+  emit_move_insn (dst, x0);
+  return;
+}
+
 /* Return the number of instructions that can be issued per cycle.  */
 static int
 aarch64_sched_issue_rate (void)
@@ -11747,6 +11807,9 @@  aarch64_gen_adjusted_ldpstp (rtx *operands, bool
load,
 #undef TARGET_USE_BLOCKS_FOR_CONSTANT_P
 #define TARGET_USE_BLOCKS_FOR_CONSTANT_P aarch64_use_blocks_for_constant_p
 
+#undef TARGET_BUILTIN_RECIPROCAL
+#define TARGET_BUILTIN_RECIPROCAL aarch64_builtin_reciprocal
+
 #undef TARGET_VECTOR_MODE_SUPPORTED_P
 #define TARGET_VECTOR_MODE_SUPPORTED_P aarch64_vector_mode_supported_p
 
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 11123d6..7272d05 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -120,6 +120,9 @@ 
     UNSPEC_VSTRUCTDUMMY
     UNSPEC_SP_SET
     UNSPEC_SP_TEST
+    UNSPEC_RSQRT
+    UNSPEC_RSQRTE
+    UNSPEC_RSQRTS
 ])
 
 (define_c_enum "unspecv" [
diff --git a/gcc/testsuite/gcc.target/aarch64/rsqrt.c
b/gcc/testsuite/gcc.target/aarch64/rsqrt.c
new file mode 100644
index 0000000..6607483
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/rsqrt.c
@@ -0,0 +1,113 @@ 
+/* { dg-do run } */
+/* { dg-options "-O3 -fno-inline --save-temps -ffast-math" } */
+
+#include <math.h>
+#include <stdio.h>
+
+#include <values.h>
+
+#define PI    3.141592653589793
+#define SQRT2 1.4142135623730951
+
+#define PI_4 0.7853981633974483
+#define SQRT1_2 0.7071067811865475
+
+/* 2^25+1, float has 24 significand bits
+ *       according to Single-precision floating-point format.  */
+#define TESTA8_FLT 33554433
+/* 2^54+1, double has 53 significand bits
+ *       according to Double-precision floating-point format.  */
+#define TESTA8_DBL 18014398509481985
+
+#define SD(a, b) t_double ((#a), (a), (b));
+#define SF(a, b) t_float ((#a), (a), (b));
+
+#define EPSILON_double __DBL_EPSILON__
+#define EPSILON_float __FLT_EPSILON__
+#define ABS_double fabs
+#define ABS_float fabsf
+#define SQRT_double sqrt
+#define SQRT_float sqrtf
+
+extern void abort (void);
+
+#define TESTTYPE(TYPE)
\
+TYPE rsqrt_##TYPE (TYPE a)
\
+{
\
+  return 1.0/SQRT_##TYPE(a);
\
+}
\
+
\
+int equals_##TYPE (TYPE a, TYPE b)
\
+{
\
+  return (a == b ||
\
+          (isnan (a) && isnan (b)) ||
\
+          (ABS_##TYPE (a - b) < EPSILON_##TYPE));
\
+}
\
+
\
+void t_##TYPE (const char *s, TYPE a, TYPE result)
\
+{
\
+  TYPE r = rsqrt_##TYPE (a);
\
+  if (!equals_##TYPE (r, result))
\
+  {
\
+    abort ();
\
+  }
\
+}
\
+
+//  printf ("Problem in %20s: %30.18A should be %30.18A\n", s, r, result);
\
+
+TESTTYPE(double)
+TESTTYPE(float)
+
+/* { dg-final { scan-assembler-times "frsqrte\\td\[0-9\]+, d\[0-9\]+" 1 } }
*/
+/* { dg-final { scan-assembler-times "frsqrts\\td\[0-9\]+, d\[0-9\]+,
d\[0-9\]+" 3 } } */
+
+/* { dg-final { scan-assembler-times "frsqrte\\ts\[0-9\]+, s\[0-9\]+" 1 } }
*/
+/* { dg-final { scan-assembler-times "frsqrts\\ts\[0-9\]+, s\[0-9\]+,
s\[0-9\]+" 2 } } */
+
+int main ()
+{
+  SD(    1.0/256, 0X1.00000000000000P+4  );
+  SD(        1.0, 0X1.00000000000000P+0  );
+  SD(       -1.0,                     NAN);
+  SD(       11.0, 0X1.34BF63D1568260P-2  );
+  SD(        0.0,                INFINITY);
+  SD(   INFINITY, 0X0.00000000000000P+0  );
+  SD(        NAN,                     NAN);
+  SD(       -NAN,                    -NAN);
+  SD(    DBL_MAX, 0X1.00000000000010P-512);
+  SD(    DBL_MIN, 0X1.00000000000000P+511);
+  SD(         PI, 0X1.20DD750429B6D0P-1  );
+  SD(       PI_4, 0X1.20DD750429B6D0P+0  );
+  SD(      SQRT2, 0X1.AE89F995AD3AE0P-1  );
+  SD(    SQRT1_2, 0X1.306FE0A31B7150P+0  );
+  SD(        -PI,                     NAN);
+  SD(     -SQRT2,                     NAN);
+  SD( TESTA8_DBL, 0X1.00000000000000P-27 );
+
+  SF(    1.0/256, 0X1.00000000000000P+4  );