
x86-64: Add sinf with FMA

Message ID 20171204180905.GA31592@gmail.com
State New
Series x86-64: Add sinf with FMA

Commit Message

H.J. Lu Dec. 4, 2017, 6:09 p.m. UTC
On Skylake, bench-sinf reports performance improvement:

            Before        After         Improvement
max        153.996       100.094           54%
min        8.546         6.852             25%
mean       18.1223       14.4616           25%

Any comments?

H.J.
---
	* sysdeps/x86_64/fpu/multiarch/Makefile (libm-sysdep_routines):
	Add s_sinf-sse2 and s_sinf-fma.
	(CFLAGS-s_sinf-fma.c): New.
	* sysdeps/x86_64/fpu/s_sinf.S (sinf): Add alias only if __sinf
	is undefined.
	* sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c: New file.
	* sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S: Likewise.
	* sysdeps/x86_64/fpu/multiarch/s_sinf.c: Likewise.
---
 sysdeps/x86_64/fpu/multiarch/Makefile      |  5 ++++-
 sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c  |  3 +++
 sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S |  3 +++
 sysdeps/x86_64/fpu/multiarch/s_sinf.c      | 28 ++++++++++++++++++++++++++++
 sysdeps/x86_64/fpu/s_sinf.S                |  2 ++
 5 files changed, 40 insertions(+), 1 deletion(-)
 create mode 100644 sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c
 create mode 100644 sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S
 create mode 100644 sysdeps/x86_64/fpu/multiarch/s_sinf.c

Comments

Adhemerval Zanella Netto Dec. 4, 2017, 6:38 p.m. UTC | #1
On 04/12/2017 16:09, H.J. Lu wrote:
> On Skylake, bench-sinf reports performance improvement:
> 
>             Before        After         Improvement
> max        153.996       100.094           54%
> min        8.546         6.852             25%
> mean       18.1223       14.4616           25%
> 
> Any comments?
> 
> H.J.
> ---
> 	* sysdeps/x86_64/fpu/multiarch/Makefile (libm-sysdep_routines):
> 	Add s_sinf-sse2 and s_sinf-fma.
> 	(CFLAGS-s_sinf-fma.c): New.
> 	* sysdeps/x86_64/fpu/s_sinf.S (sinf): Add alias only if __sinf
> 	is undefined.
> 	* sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c: New file.
> 	* sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S: Likewise.
> 	* sysdeps/x86_64/fpu/multiarch/s_sinf.c: Likewise.
> ---
>  sysdeps/x86_64/fpu/multiarch/Makefile      |  5 ++++-
>  sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c  |  3 +++
>  sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S |  3 +++

With the new s_sinf.c generic implementation, does x86_64 still require an 
assembly one?
H.J. Lu Dec. 4, 2017, 6:51 p.m. UTC | #2
On Mon, Dec 4, 2017 at 10:38 AM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
> On 04/12/2017 16:09, H.J. Lu wrote:
>> On Skylake, bench-sinf reports performance improvement:
>>
>>             Before        After         Improvement
>> max        153.996       100.094           54%
>> min        8.546         6.852             25%
>> mean       18.1223       14.4616           25%
>>
>> Any comments?
>>
>> H.J.
>> ---
>>       * sysdeps/x86_64/fpu/multiarch/Makefile (libm-sysdep_routines):
>>       Add s_sinf-sse2 and s_sinf-fma.
>>       (CFLAGS-s_sinf-fma.c): New.
>>       * sysdeps/x86_64/fpu/s_sinf.S (sinf): Add alias only if __sinf
>>       is undefined.
>>       * sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c: New file.
>>       * sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S: Likewise.
>>       * sysdeps/x86_64/fpu/multiarch/s_sinf.c: Likewise.
>> ---
>>  sysdeps/x86_64/fpu/multiarch/Makefile      |  5 ++++-
>>  sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c  |  3 +++
>>  sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S |  3 +++
>
> With the new s_sinf.c generic implementation, does x86_64 still require an
> assembly one?

They are very close.  But the assembly version is a little bit faster.

Assembly:

  "sinf": {
   "": {
    "duration": 3.40466e+10,
    "iterations": 1.88083e+09,
    "max": 137.398,
    "min": 8.546,
    "mean": 18.1019
   }
  }

Generic:

  "sinf": {
   "": {
    "duration": 3.40946e+10,
    "iterations": 1.54362e+09,
    "max": 205.012,
    "min": 7.704,
    "mean": 22.0875
   }
  }

I think we should keep the assembly version.
Joseph Myers Dec. 4, 2017, 6:56 p.m. UTC | #3
On Mon, 4 Dec 2017, H.J. Lu wrote:

> > With the new s_sinf.c generic implementation, does x86_64 still require an
> > assembly one?
> 
> They are very close.  But the assembly version is a little bit faster.

There may well be scope for further performance tuning of the generic 
version.
H.J. Lu Dec. 4, 2017, 7:41 p.m. UTC | #4
On Mon, Dec 4, 2017 at 10:56 AM, Joseph Myers <joseph@codesourcery.com> wrote:
> On Mon, 4 Dec 2017, H.J. Lu wrote:
>
>> > With the new s_sinf.c generic implementation, does x86_64 still require an
>> > assembly one?
>>
>> They are very close.  But the assembly version is a little bit faster.
>
> There may well be scope for further performance tuning of the generic
> version.


There are a couple of differences.  Among them:

Generic:

                 U errno
                 U __floor

Assembly:

                 U __errno_location

Is

(*__errno_location ()) = xxx

faster?  On x86-64, there should be no call to __floor.
Joseph Myers Dec. 4, 2017, 8:59 p.m. UTC | #5
On Mon, 4 Dec 2017, H.J. Lu wrote:

> Is
> 
> (*__errno_location ()) = xxx

If anything I'd expect a direct TLS initial-exec access to errno to be 
faster.

> faster?  On x86-64, there should be no call to __floor.

The x86_64 __floor inline in math_private.h is only when compiling glibc 
for SSE4.1 or later.

The case of inlining floor / __floor and related functions for x86_64 
without SSE4.1 is tricky.  Supposing we had appropriate asm redirects to 
allow libm to call floor / ceil / trunc etc. directly so the compiler 
could inline them but __* are still called if not inlined, the default 
SSE2 inlines would come into play.  But those inlines are slower on SSE4.1 
hardware than an out-of-line call to the floor / ceil / trunc IFUNC, so if 
you're building a generic SSE2 glibc that may well be used on SSE4.1 
hardware, you may wish either to avoid those inlines or, if there is a 
significant performance difference in benchmarks, have an SSE4.1 IFUNC of 
the calling function using floor (or __floor, with the present inline).

The expf etc. set of optimized float functions have several different 
choices of how conversions to integer are handled, which may be configured 
by an architecture.  That may make sense in other cases as well.
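
For reference, the kind of integer-conversion trick such configurable hooks
enable looks roughly like this (a sketch with assumed names, not glibc's
actual math_config.h code):

---
/* Round to the nearest integer without calling floor or lrint:
   adding 0x1.8p52 (3 * 2^51) forces the rounding to happen in the
   double's mantissa, and subtracting it recovers the rounded value.
   Valid only for |x| well below 2^51; real code also needs a barrier
   so the compiler cannot fold the add/subtract pair, which volatile
   provides in this sketch.  */
static inline double
round_to_nearest (double x)
{
  volatile double t = x + 0x1.8p52;
  return t - 0x1.8p52;
}
---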
H.J. Lu Dec. 4, 2017, 10:42 p.m. UTC | #6
On Mon, Dec 4, 2017 at 12:59 PM, Joseph Myers <joseph@codesourcery.com> wrote:
> On Mon, 4 Dec 2017, H.J. Lu wrote:
>
>> Is
>>
>> (*__errno_location ()) = xxx
>
> If anything I'd expect a direct TLS initial-exec access to errno to be
> faster.

I will update x86-64 s_sinf.S to use errno.

>> faster?  On x86-64, there should be no call to __floor.
>
> The x86_64 __floor inline in math_private.h is only when compiling glibc
> for SSE4.1 or later.
>
> The case of inlining floor / __floor and related functions for x86_64
> without SSE4.1 is tricky.  Supposing we had appropriate asm redirects to
> allow libm to call floor / ceil / trunc etc. directly so the compiler
> could inline them but __* are still called if not inlined, the default
> SSE2 inlines would come into play.  But those inlines are slower on SSE4.1
> hardware than an out-of-line call to the floor / ceil / trunc IFUNC, so if
> you're building a generic SSE2 glibc that may well be used on SSE4.1
> hardware, you may wish either to avoid those inlines or, if there is a
> significant performance difference in benchmarks, have an SSE4.1 IFUNC of
> the calling function using floor (or __floor, with the present inline).
>
> The expf etc. set of optimized float functions have several different
> choices of how conversions to integer are handled, which may be configured
> by an architecture.  That may make sense in other cases as well.

x86-64 s_sinf.S avoids floor () with cvttsd2si.  I don't think it can be used
with generic sinf.
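
In C form the trick is just a truncating cast, since the reduced value is
non-negative and truncation and floor coincide there (a sketch, not the
actual s_sinf.S logic; inv_pio4 stands in for 4/pi):

---
#include <math.h>

/* For t >= 0, floor (t) == trunc (t), so a truncating cast (a single
   cvttsd2si on x86-64) replaces the floor call in the range
   reduction.  */
static long int
reduction_index (double x, double inv_pio4)
{
  double t = fabs (x) * inv_pio4;   /* t >= 0 by construction.  */
  return (long int) t;              /* cvttsd2si; equals floor (t).  */
}
---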
Florian Weimer Dec. 5, 2017, 1:44 p.m. UTC | #7
On 12/04/2017 09:59 PM, Joseph Myers wrote:
> On Mon, 4 Dec 2017, H.J. Lu wrote:
> 
>> Is
>>
>> (*__errno_location ()) = xxx
> If anything I'd expect a direct TLS initial-exec access to errno to be
> faster.

On x86-64, direct TLS access usually results in more machine code than 
calling __errno_location:

   Direct access:    7 bytes for the load of the TLS offset
                     7 bytes for storing a constant to the TLS variable
   __errno_location: 5 bytes for calling __errno_location
                     6 bytes for storing the constant

Since errno is typically accessed on error paths only, direct TLS access 
is not a win because it bloats the code.

Even if stack alignment is required, it is typically possible to do this 
with just a push/pop pair, adding two more bytes to the __errno_location 
approach, so it is still one byte shorter than direct access.

Thanks,
Florian
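
The two shapes being compared, as a compilable sketch (errno_tls is a
hypothetical stand-in for glibc's internal TLS errno symbol;
__errno_location is declared by errno.h on glibc):

---
#include <errno.h>

/* Call + store: ~5 bytes for the call plus 6 for the store.  */
void
via_call (void)
{
  *__errno_location () = EDOM;
}

/* Direct initial-exec access: ~7 bytes to load the TLS offset from
   the GOT plus 7 bytes for the %fs-relative store.  */
extern __thread int errno_tls __attribute__ ((tls_model ("initial-exec")));

void
direct_tls (void)
{
  errno_tls = EDOM;
}
---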
Adhemerval Zanella Netto Dec. 5, 2017, 1:47 p.m. UTC | #8
On 04/12/2017 20:42, H.J. Lu wrote:
> On Mon, Dec 4, 2017 at 12:59 PM, Joseph Myers <joseph@codesourcery.com> wrote:
>> On Mon, 4 Dec 2017, H.J. Lu wrote:
>>
>>> Is
>>>
>>> (*__errno_location ()) = xxx
>>
>> If anything I'd expect a direct TLS initial-exec access to errno to be
>> faster.
> 
> I will update x86-64 s_sinf.S to use errno.
> 
>>> faster?  On x86-64, there should be no call to __floor.
>>
>> The x86_64 __floor inline in math_private.h is only when compiling glibc
>> for SSE4.1 or later.
>>
>> The case of inlining floor / __floor and related functions for x86_64
>> without SSE4.1 is tricky.  Supposing we had appropriate asm redirects to
>> allow libm to call floor / ceil / trunc etc. directly so the compiler
>> could inline them but __* are still called if not inlined, the default
>> SSE2 inlines would come into play.  But those inlines are slower on SSE4.1
>> hardware than an out-of-line call to the floor / ceil / trunc IFUNC, so if
>> you're building a generic SSE2 glibc that may well be used on SSE4.1
>> hardware, you may wish either to avoid those inlines or, if there is a
>> significant performance difference in benchmarks, have an SSE4.1 IFUNC of
>> the calling function using floor (or __floor, with the present inline).
>>
>> The expf etc. set of optimized float functions have several different
>> choices of how conversions to integer are handled, which may be configured
>> by an architecture.  That may make sense in other cases as well.
> 
> x86-64 s_sinf.S avoids floor () with cvttsd2si.  I don't think it can be used
> with generic sinf.
> 

I think we can use the compiler builtins in math_private.h as a fallback for
non-SSE4.1 as:

---
@@ -109,11 +116,15 @@ extern __always_inline double
 __floor (double d)
 {
   double res;
+#ifdef __SSE4_1__
 # if defined __AVX__ || defined SSE2AVX
   asm ("vroundsd $1, %1, %0, %0" : "=x" (res) : "xm" (d));
 # else
   asm ("roundsd $1, %1, %0" : "=x" (res) : "xm" (d));
 # endif
+#else
+  res = __builtin_floor (d);
+#endif
   return res;
 }
---

At least for GCC 7, the compiler will expand this to an inline version even
for the generic tune option.  However, what shows up in profiling is not the
code that actually calls __floor, but rather the int-to-fp conversion.

Generic implementation shows on my system (gcc version 7.1.1, i7-4790K):

  "sinf": {
   "": {
    "duration": 4.00221e+10,
    "iterations": 1.4128e+09,
    "max": 587.555,
    "min": 12.747,
    "mean": 28.3281
   }

And with a simple modification to avoid int to fp conversion:

---
diff --git a/sysdeps/ieee754/flt-32/s_sinf.c b/sysdeps/ieee754/flt-32/s_sinf.c
index 40d3d19..a2fd3cf 100644
--- a/sysdeps/ieee754/flt-32/s_sinf.c
+++ b/sysdeps/ieee754/flt-32/s_sinf.c
@@ -75,7 +75,7 @@ static const double invpio4_table[] = {
   0x1.0e4107cp-169
 };

-static const int ones[] = { +1, -1 };
+static const double ones[] = { 1.0, -1.0 };

 /* Compute the sine value using Chebyshev polynomials where
    THETA is the range reduced absolute value of the input
@@ -92,7 +92,7 @@ reduced (const double theta, const unsigned long int n,
   const double theta2 = theta * theta;
   /* We are operating on |x|, so we need to add back the original
      signbit for sinf.  */
-  int sign;
+  double sign;
   /* Determine positive or negative primary interval.  */
   sign = ones[((n >> 2) & 1) ^ signbit];
   /* Are we in the primary interval of sin or cos?  */
---

I get:

  "sinf": {
   "": {
    "duration": 4.0015e+10,
    "iterations": 1.4535e+09,
    "max": 640.456,
    "min": 11.437,
    "mean": 27.5301
   }

Which is roughly 3% on mean and 11.5% on min. I think we can improve it
even more by avoiding the int to fp conversion to get the sign right
and trying to operate with the sign as a double argument.
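
Distilled, the change amounts to the following (the table and index
expression are from the patch; apply_sign is a hypothetical helper):

---
/* An int table forces a cvtsi2sd before the multiply; a double
   table makes applying the sign a load plus a pure FP multiply.  */
static const double ones[] = { 1.0, -1.0 };

static double
apply_sign (double poly, unsigned long int n, unsigned int signbit)
{
  double sign = ones[((n >> 2) & 1) ^ signbit];
  return sign * poly;
}
---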
Joseph Myers Dec. 5, 2017, 1:57 p.m. UTC | #9
On Tue, 5 Dec 2017, Adhemerval Zanella wrote:

> +#else
> +  res = __builtin_floor (d);
> +#endif
>    return res;
>  }
> ---
> 
> At least for GCC 7, the compiler will expand this to an inline version even
> for the generic tune option.  However, what shows up in profiling is not the
> code that actually calls __floor

Which, as per discussions on gcc-patches in August, may not be a good idea 
with current GCC, because it means that your code built for generic SSE2 
is now slower on an SSE4.1 processor than it would have been if it had 
called the __floor IFUNC.  (That issue could be avoided by adding an 
SSE4.1 IFUNC of sinf, of course.  I think we should aim to use floor etc. 
directly anyway - in cases where we can't avoid floor and just convert 
directly to integer - and address any such pessimizations in GCC rather 
than working around them in glibc.)
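
The asm redirect mentioned in #5 would look roughly like this in an internal
header (a sketch, not an actual glibc declaration):

---
/* Let libm code call floor () so the compiler may inline it as a
   builtin (e.g. a single roundsd with -msse4.1), while any call
   that is not inlined still binds to the internal __floor symbol.  */
extern double floor (double) asm ("__floor");
---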
H.J. Lu Dec. 5, 2017, 4:56 p.m. UTC | #10
On Tue, Dec 5, 2017 at 5:47 AM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:

> And with a simple modification to avoid int to fp conversion:
>
> ---
> diff --git a/sysdeps/ieee754/flt-32/s_sinf.c b/sysdeps/ieee754/flt-32/s_sinf.c
> index 40d3d19..a2fd3cf 100644
> --- a/sysdeps/ieee754/flt-32/s_sinf.c
> +++ b/sysdeps/ieee754/flt-32/s_sinf.c
> @@ -75,7 +75,7 @@ static const double invpio4_table[] = {
>    0x1.0e4107cp-169
>  };
>
> -static const int ones[] = { +1, -1 };
> +static const double ones[] = { 1.0, -1.0 };
>
>  /* Compute the sine value using Chebyshev polynomials where
>     THETA is the range reduced absolute value of the input
> @@ -92,7 +92,7 @@ reduced (const double theta, const unsigned long int n,
>    const double theta2 = theta * theta;
>    /* We are operating on |x|, so we need to add back the original
>       signbit for sinf.  */
> -  int sign;
> +  double sign;
>    /* Determine positive or negative primary interval.  */
>    sign = ones[((n >> 2) & 1) ^ signbit];
>    /* Are we in the primary interval of sin or cos?  */
> ---
>
> I get:
>
>   "sinf": {
>    "": {
>     "duration": 4.0015e+10,
>     "iterations": 1.4535e+09,
>     "max": 640.456,
>     "min": 11.437,
>     "mean": 27.5301
>    }
>
> Which is roughly 3% on mean and 11.5% on min. I think we can improve it
> even more by avoiding the int to fp conversion to get the sign right
> and trying to operate with the sign as a double argument.

I tried it on Skylake with the current master.  Before:

  "sinf": {
   "": {
    "duration": 3.4044e+10,
    "iterations": 1.9942e+09,
    "max": 141.106,
    "min": 7.704,
    "mean": 17.0715
   }
  }

After:

  "sinf": {
   "": {
    "duration": 3.40665e+10,
    "iterations": 2.03199e+09,
    "max": 95.994,
    "min": 7.704,
    "mean": 16.765
   }
  }

Generic is faster than asm now:

  "sinf": {
   "": {
    "duration": 3.40417e+10,
    "iterations": 1.87792e+09,
    "max": 138.868,
    "min": 8.546,
    "mean": 18.1273
   }
  }

Can you submit your patch?

Thanks.
Adhemerval Zanella Netto Dec. 5, 2017, 5:06 p.m. UTC | #11
On 05/12/2017 11:57, Joseph Myers wrote:
> On Tue, 5 Dec 2017, Adhemerval Zanella wrote:
> 
>> +#else
>> +  res = __builtin_floor (d);
>> +#endif
>>    return res;
>>  }
>> ---
>>
>> At least for GCC 7, the compiler will expand this to an inline version even
>> for the generic tune option.  However, what shows up in profiling is not the
>> code that actually calls __floor
> 
> Which, as per discussions on gcc-patches in August, may not be a good idea 
> with current GCC, because it means that your code built for generic SSE2 
> is now slower on an SSE4.1 processor than it would have been if it had 
> called the __floor IFUNC.  (That issue could be avoided by adding an 
> SSE4.1 IFUNC of sinf, of course.  I think we should aim to use floor etc. 
> directly anyway - in cases where we can't avoid floor and just convert 
> directly to integer - and address any such pessimizations in GCC rather 
> than working around them in glibc.)
> 

Which indeed seems to be the case here as well (although the difference is
quite small, it is still noticeable).  With H.J. Lu's latest changes to remove
floor, the small change to use double for the sign operation also does not
yield much change.
Adhemerval Zanella Netto Dec. 5, 2017, 5:09 p.m. UTC | #12
On 05/12/2017 14:56, H.J. Lu wrote:
> On Tue, Dec 5, 2017 at 5:47 AM, Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
> 
>> And with a simple modification to avoid int to fp conversion:
>>
>> ---
>> diff --git a/sysdeps/ieee754/flt-32/s_sinf.c b/sysdeps/ieee754/flt-32/s_sinf.c
>> index 40d3d19..a2fd3cf 100644
>> --- a/sysdeps/ieee754/flt-32/s_sinf.c
>> +++ b/sysdeps/ieee754/flt-32/s_sinf.c
>> @@ -75,7 +75,7 @@ static const double invpio4_table[] = {
>>    0x1.0e4107cp-169
>>  };
>>
>> -static const int ones[] = { +1, -1 };
>> +static const double ones[] = { 1.0, -1.0 };
>>
>>  /* Compute the sine value using Chebyshev polynomials where
>>     THETA is the range reduced absolute value of the input
>> @@ -92,7 +92,7 @@ reduced (const double theta, const unsigned long int n,
>>    const double theta2 = theta * theta;
>>    /* We are operating on |x|, so we need to add back the original
>>       signbit for sinf.  */
>> -  int sign;
>> +  double sign;
>>    /* Determine positive or negative primary interval.  */
>>    sign = ones[((n >> 2) & 1) ^ signbit];
>>    /* Are we in the primary interval of sin or cos?  */
>> ---
>>
>> I get:
>>
>>   "sinf": {
>>    "": {
>>     "duration": 4.0015e+10,
>>     "iterations": 1.4535e+09,
>>     "max": 640.456,
>>     "min": 11.437,
>>     "mean": 27.5301
>>    }
>>
>> Which is roughly 3% on mean and 11.5% on min. I think we can improve it
>> even more by avoiding the int to fp conversion to get the sign right
>> and trying to operate with the sign as a double argument.
> 
> I tried it on Skylake with the current master.  Before:
> 
>   "sinf": {
>    "": {
>     "duration": 3.4044e+10,
>     "iterations": 1.9942e+09,
>     "max": 141.106,
>     "min": 7.704,
>     "mean": 17.0715
>    }
>   }
> 
> After:
> 
>   "sinf": {
>    "": {
>     "duration": 3.40665e+10,
>     "iterations": 2.03199e+09,
>     "max": 95.994,
>     "min": 7.704,
>     "mean": 16.765
>    }
>   }
> 
> Generic is faster than asm now:
> 
>   "sinf": {
>    "": {
>     "duration": 3.40417e+10,
>     "iterations": 1.87792e+09,
>     "max": 138.868,
>     "min": 8.546,
>     "mean": 18.1273
>    }
>   }
> 
> Can you submit your patch?

I will do it, thanks for checking this out.
H.J. Lu Dec. 5, 2017, 7:03 p.m. UTC | #13
On Tue, Dec 5, 2017 at 9:09 AM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
> On 05/12/2017 14:56, H.J. Lu wrote:
>> On Tue, Dec 5, 2017 at 5:47 AM, Adhemerval Zanella
>> <adhemerval.zanella@linaro.org> wrote:
>>
>>> And with a simple modification to avoid int to fp conversion:
>>>
>>> ---
>>> diff --git a/sysdeps/ieee754/flt-32/s_sinf.c b/sysdeps/ieee754/flt-32/s_sinf.c
>>> index 40d3d19..a2fd3cf 100644
>>> --- a/sysdeps/ieee754/flt-32/s_sinf.c
>>> +++ b/sysdeps/ieee754/flt-32/s_sinf.c
>>> @@ -75,7 +75,7 @@ static const double invpio4_table[] = {
>>>    0x1.0e4107cp-169
>>>  };
>>>
>>> -static const int ones[] = { +1, -1 };
>>> +static const double ones[] = { 1.0, -1.0 };
>>>
>>>  /* Compute the sine value using Chebyshev polynomials where
>>>     THETA is the range reduced absolute value of the input
>>> @@ -92,7 +92,7 @@ reduced (const double theta, const unsigned long int n,
>>>    const double theta2 = theta * theta;
>>>    /* We are operating on |x|, so we need to add back the original
>>>       signbit for sinf.  */
>>> -  int sign;
>>> +  double sign;
>>>    /* Determine positive or negative primary interval.  */
>>>    sign = ones[((n >> 2) & 1) ^ signbit];
>>>    /* Are we in the primary interval of sin or cos?  */
>>> ---
>>>
>>> I get:
>>>
>>>   "sinf": {
>>>    "": {
>>>     "duration": 4.0015e+10,
>>>     "iterations": 1.4535e+09,
>>>     "max": 640.456,
>>>     "min": 11.437,
>>>     "mean": 27.5301
>>>    }
>>>
>>> Which is roughly 3% on mean and 11.5% on min. I think we can improve it
>>> even more by avoiding the int to fp conversion to get the sign right
>>> and trying to operate with the sign as a double argument.
>>
>> I tried it on Skylake with the current master.  Before:
>>
>>   "sinf": {
>>    "": {
>>     "duration": 3.4044e+10,
>>     "iterations": 1.9942e+09,
>>     "max": 141.106,
>>     "min": 7.704,
>>     "mean": 17.0715
>>    }
>>   }
>>
>> After:
>>
>>   "sinf": {
>>    "": {
>>     "duration": 3.40665e+10,
>>     "iterations": 2.03199e+09,
>>     "max": 95.994,
>>     "min": 7.704,
>>     "mean": 16.765
>>    }
>>   }
>>
>> Generic is faster than asm now:
>>
>>   "sinf": {
>>    "": {
>>     "duration": 3.40417e+10,
>>     "iterations": 1.87792e+09,
>>     "max": 138.868,
>>     "min": 8.546,
>>     "mean": 18.1273
>>    }
>>   }
>>
>> Can you submit your patch?
>
> I will do it, thanks for checking this out.

Here is the patch to add sinf with FMA using s_sinf.c for SSE2.
I posted a separate patch to remove sysdeps/x86_64/fpu/s_sinf.S:

https://sourceware.org/ml/libc-alpha/2017-12/msg00146.html

OK for master?
Adhemerval Zanella Netto Dec. 7, 2017, 12:20 p.m. UTC | #14
On 05/12/2017 17:03, H.J. Lu wrote:
> diff --git a/sysdeps/x86_64/fpu/multiarch/s_sinf.c b/sysdeps/x86_64/fpu/multiarch/s_sinf.c
> new file mode 100644
> index 0000000000..f91f866cdc
> --- /dev/null
> +++ b/sysdeps/x86_64/fpu/multiarch/s_sinf.c
> @@ -0,0 +1,31 @@
> +/* Multiple versions of sinf.
> +   Copyright (C) 2017 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <libm-alias-float.h>
> +
> +extern float __redirect_sinf (float);
> +
> +#define SYMBOL_NAME sinf
> +#include "ifunc-fma.h"
> +
> +libc_ifunc_redirected (__redirect_sinf, __sinf, IFUNC_SELECTOR ());
> +
> +libm_alias_float (__sin, sin)
> +
> +#define SINF __sinf_sse2
> +#include <sysdeps/ieee754/flt-32/s_sinf.c>
> -- 2.14.3

I would prefer that we move the default version to a specific file instead
of incorporating it into the ifunc resolver file itself.  In this case, add a
sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.c file with:

#define SINF __sinf_sse2
#include <sysdeps/ieee754/flt-32/s_sinf.c>

LGTM with this change.
H.J. Lu Dec. 7, 2017, 6:13 p.m. UTC | #15
On Thu, Dec 7, 2017 at 4:20 AM, Adhemerval Zanella
<adhemerval.zanella@linaro.org> wrote:
>
>
> On 05/12/2017 17:03, H.J. Lu wrote:
>> diff --git a/sysdeps/x86_64/fpu/multiarch/s_sinf.c b/sysdeps/x86_64/fpu/multiarch/s_sinf.c
>> new file mode 100644
>> index 0000000000..f91f866cdc
>> --- /dev/null
>> +++ b/sysdeps/x86_64/fpu/multiarch/s_sinf.c
>> @@ -0,0 +1,31 @@
>> +/* Multiple versions of sinf.
>> +   Copyright (C) 2017 Free Software Foundation, Inc.
>> +   This file is part of the GNU C Library.
>> +
>> +   The GNU C Library is free software; you can redistribute it and/or
>> +   modify it under the terms of the GNU Lesser General Public
>> +   License as published by the Free Software Foundation; either
>> +   version 2.1 of the License, or (at your option) any later version.
>> +
>> +   The GNU C Library is distributed in the hope that it will be useful,
>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +   Lesser General Public License for more details.
>> +
>> +   You should have received a copy of the GNU Lesser General Public
>> +   License along with the GNU C Library; if not, see
>> +   <http://www.gnu.org/licenses/>.  */
>> +
>> +#include <libm-alias-float.h>
>> +
>> +extern float __redirect_sinf (float);
>> +
>> +#define SYMBOL_NAME sinf
>> +#include "ifunc-fma.h"
>> +
>> +libc_ifunc_redirected (__redirect_sinf, __sinf, IFUNC_SELECTOR ());
>> +
>> +libm_alias_float (__sin, sin)
>> +
>> +#define SINF __sinf_sse2
>> +#include <sysdeps/ieee754/flt-32/s_sinf.c>
>> -- 2.14.3
>
> I would prefer if we move the default version to specific files instead
> of incorporate them on ifunc resolver itself.  In this case add a
> sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.c file with:
>
> #define SINF __sinf_sse2
> #include <sysdeps/ieee754/flt-32/s_sinf.c>
>
> LGTM with this change.

This is what I am checking in.

Thanks.
Nix Dec. 8, 2017, 4:02 p.m. UTC | #16
On 4 Dec 2017, H. J. Lu uttered the following:

> On Skylake, bench-sinf reports performance improvement:
>
>             Before        After         Improvement
> max        153.996       100.094           54%
> min        8.546         6.852             25%
> mean       18.1223       14.4616           25%
>
> Any comments?

Do we have any benchmark runs on older processors? They're not remotely
obsolete: Intel is still selling Broadwell server parts, and the vast
majority of SSE2-capable parts out there at present are not as new as
Skylake.

Are we penalizing them? (I'd guess not, but it would be nice to know.)
Arjan van de Ven Dec. 8, 2017, 4:04 p.m. UTC | #17
On 12/8/2017 8:02 AM, Nick Alcock wrote:
> On 4 Dec 2017, H. J. Lu uttered the following:
> 
>> On Skylake, bench-sinf reports performance improvement:
>>
>>              Before        After         Improvement
>> max        153.996       100.094           54%
>> min        8.546         6.852             25%
>> mean       18.1223       14.4616           25%
>>
>> Any comments?
> 
> Do we have any benchmark runs on older processors? They're not remotely
> obsolete: Intel is still selling Broadwell server parts, and the vast
> majority of SSE2-capable parts out there at present are not as new as
> Skylake.
> 
> Are we penalizing them? (I'd guess not, but it would be nice to know.)

This is just adding another IFUNC for FMA-capable CPUs; the older CPUs
will keep using the code that they are running today, so no, they're not
being penalized.
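
Concretely, the selector the patch pulls in via ifunc-fma.h does roughly the
following (reconstructed sketch; the actual header may differ slightly):

---
#include <init-arch.h>

extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden;
extern __typeof (REDIRECT_NAME) OPTIMIZE (fma) attribute_hidden;

static inline void *
IFUNC_SELECTOR (void)
{
  const struct cpu_features *cpu_features = __get_cpu_features ();

  /* FMA-capable CPUs (with AVX2) get the FMA build; everything else
     keeps the existing SSE2 code path.  */
  if (CPU_FEATURES_ARCH_P (cpu_features, FMA_Usable)
      && CPU_FEATURES_ARCH_P (cpu_features, AVX2_Usable))
    return OPTIMIZE (fma);

  return OPTIMIZE (sse2);
}
---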
Arjan van de Ven Dec. 8, 2017, 4:07 p.m. UTC | #18
On 12/8/2017 8:02 AM, Nick Alcock wrote:
> On 4 Dec 2017, H. J. Lu uttered the following:
> 
>> On Skylake, bench-sinf reports performance improvement:
>>
>>              Before        After         Improvement
>> max        153.996       100.094           54%
>> min        8.546         6.852             25%
>> mean       18.1223       14.4616           25%
>>
>> Any comments?
> 
> Do we have any benchmark runs on older processors? They're not remotely
> obsolete: Intel is still selling Broadwell server parts, and the vast
> majority of SSE2-capable parts out there at present are not as new as
> Skylake.

(oh and Haswell & Broadwell also support FMA already)
H.J. Lu Dec. 8, 2017, 4:11 p.m. UTC | #19
On Fri, Dec 8, 2017 at 8:07 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
> On 12/8/2017 8:02 AM, Nick Alcock wrote:
>>
>> On 4 Dec 2017, H. J. Lu uttered the following:
>>
>>> On Skylake, bench-sinf reports performance improvement:
>>>
>>>              Before        After         Improvement
>>> max        153.996       100.094           54%
>>> min        8.546         6.852             25%
>>> mean       18.1223       14.4616           25%
>>>
>>> Any comments?
>>
>>
>> Do we have any benchmark runs on older processors? They're not remotely
>> obsolete: Intel is still selling Broadwell server parts, and the vast
>> majority of SSE2-capable parts out there at present are not as new as
>> Skylake.
>
>
> (oh and Haswell & Broadwell also support FMA already)

On Haswell, without FMA:

  "sinf": {
   "": {
    "duration": 3.4905e+10,
    "iterations": 1.91281e+09,
    "max": 450.098,
    "min": 8.091,
    "mean": 18.2481
   }
  }

With FMA:

  "sinf": {
   "": {
    "duration": 3.49046e+10,
    "iterations": 2.44188e+09,
    "max": 173.855,
    "min": 7.253,
    "mean": 14.2942
   }
  }

FMA is faster on HSW.  I expect Broadwell should also be faster with FMA.
Nix Dec. 8, 2017, 4:16 p.m. UTC | #20
On 8 Dec 2017, H. J. Lu spake thusly:

> On Fri, Dec 8, 2017 at 8:07 AM, Arjan van de Ven <arjan@linux.intel.com> wrote:
>> On 12/8/2017 8:02 AM, Nick Alcock wrote:
>>>
>>> On 4 Dec 2017, H. J. Lu uttered the following:
>>>
>>>> On Skylake, bench-sinf reports performance improvement:
>>>>
>>>>              Before        After         Improvement
>>>> max        153.996       100.094           54%
>>>> min        8.546         6.852             25%
>>>> mean       18.1223       14.4616           25%
>>>>
>>>> Any comments?
>>>
>>>
>>> Do we have any benchmark runs on older processors? They're not remotely
>>> obsolete: Intel is still selling Broadwell server parts, and the vast
>>> majority of SSE2-capable parts out there at present are not as new as
>>> Skylake.
>>
>> (oh and Haswell & Broadwell also support FMA already)

Yeah, I guess I was wondering if the speedup was actually due to FMA or
some other microarchitectural variation which might not be present on
older FMA-capable processors.

But my worries were for naught:

>     "max": 450.098,
>     "min": 8.091,
>     "mean": 18.2481

>     "max": 173.855,
>     "min": 7.253,
>     "mean": 14.2942

That's a nice speedup! :)
Arjan van de Ven Dec. 8, 2017, 4:32 p.m. UTC | #21
On 12/8/2017 8:16 AM, Nix wrote:
> Yeah, I guess I was wondering if the speedup was actually due to FMA or
> some other microarchitectural variation which might not be present on
> older FMA-capable processors.

I'm not aware of any system where an FMA would be slower than a discrete
mul + add; even for HSW/BDW they're faster.  On SKL the performance increase
from FMA is *more* (FMA is 4 cycles, but so are the individual add and mul,
so it's really a 2x gain), but that's more upside, not a downside for others.
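
Arjan's point in code form (a sketch; fma () is the C99 function, which GCC
maps to vfmadd with -mfma, and note that with -ffp-contract=fast GCC may
contract the discrete form into an FMA on its own):

---
#include <math.h>

/* Two dependent operations: mul latency followed by add latency.  */
double
discrete (double a, double b, double c)
{
  return a * b + c;     /* mulsd + addsd without contraction.  */
}

/* One fused operation with a single rounding; on SKL one 4-cycle
   instruction instead of a 4 + 4 cycle dependent chain.  */
double
fused (double a, double b, double c)
{
  return fma (a, b, c); /* vfmadd* with -mfma.  */
}
---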

Patch

diff --git a/sysdeps/x86_64/fpu/multiarch/Makefile b/sysdeps/x86_64/fpu/multiarch/Makefile
index c78624b47d..cab84bff3a 100644
--- a/sysdeps/x86_64/fpu/multiarch/Makefile
+++ b/sysdeps/x86_64/fpu/multiarch/Makefile
@@ -37,14 +37,17 @@  CFLAGS-slowpow-fma.c = -mfma -mavx2
 CFLAGS-s_sin-fma.c = -mfma -mavx2
 CFLAGS-s_tan-fma.c = -mfma -mavx2
 
+libm-sysdep_routines += s_sinf-sse2
+
 libm-sysdep_routines += e_exp2f-fma e_expf-fma e_log2f-fma e_logf-fma \
-			e_powf-fma
+			e_powf-fma s_sinf-fma
 
 CFLAGS-e_exp2f-fma.c = -mfma -mavx2
 CFLAGS-e_expf-fma.c = -mfma -mavx2
 CFLAGS-e_log2f-fma.c = -mfma -mavx2
 CFLAGS-e_logf-fma.c = -mfma -mavx2
 CFLAGS-e_powf-fma.c = -mfma -mavx2
+CFLAGS-s_sinf-fma.c = -mfma -mavx2
 
 libm-sysdep_routines += e_exp-fma4 e_log-fma4 e_pow-fma4 s_atan-fma4 \
 			e_asin-fma4 e_atan2-fma4 s_sin-fma4 s_tan-fma4 \
diff --git a/sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c b/sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c
new file mode 100644
index 0000000000..5f6e17fc4d
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/s_sinf-fma.c
@@ -0,0 +1,3 @@ 
+#define SINF __sinf_fma
+
+#include <sysdeps/ieee754/flt-32/s_sinf.c>
diff --git a/sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S b/sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S
new file mode 100644
index 0000000000..523456c66c
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/s_sinf-sse2.S
@@ -0,0 +1,3 @@ 
+#define __sinf __sinf_sse2
+
+#include <sysdeps/x86_64/fpu/s_sinf.S>
diff --git a/sysdeps/x86_64/fpu/multiarch/s_sinf.c b/sysdeps/x86_64/fpu/multiarch/s_sinf.c
new file mode 100644
index 0000000000..831bc6f131
--- /dev/null
+++ b/sysdeps/x86_64/fpu/multiarch/s_sinf.c
@@ -0,0 +1,28 @@ 
+/* Multiple versions of sinf.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <libm-alias-float.h>
+
+extern float __redirect_sinf (float);
+
+#define SYMBOL_NAME sinf
+#include "ifunc-fma.h"
+
+libc_ifunc_redirected (__redirect_sinf, __sinf, IFUNC_SELECTOR ());
+
+libm_alias_float (__sin, sin)
diff --git a/sysdeps/x86_64/fpu/s_sinf.S b/sysdeps/x86_64/fpu/s_sinf.S
index c505d60091..db7dce0c8a 100644
--- a/sysdeps/x86_64/fpu/s_sinf.S
+++ b/sysdeps/x86_64/fpu/s_sinf.S
@@ -556,4 +556,6 @@  L(SP_ABS_MASK): /* Mask for getting SP absolute value */
 	.type L(SP_ABS_MASK), @object
 	ASM_SIZE_DIRECTIVE(L(SP_ABS_MASK))
 
+#ifndef __sinf
 libm_alias_float (__sin, sin)
+#endif