Inline C99 math functions

Message ID	001201d0a75b$921d9860$b658c920$@com
State	New
Headers	show Return-Path: <libc-alpha-return-60033-incoming=patchwork.ozlabs.org@sourceware.org> DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:from:to:subject:date:message-id:mime-version :content-type; q=dns; s=default; b=D/EO5vX35T88zFTIYiXQ/tFbHRGTF 15KnfWUwbC2Prxk2/X/E0ITmKSowJuHV/NyNKWBT/PSEZk4yJwn88JsBPVjM/zsd ofp9NqNTO3d2iScVN1IBfLSm2uqKDgxCAwyp8A5HpSwIZTU4fMGpXCYi3npBap6V wTN5U7CZH/WpO0= Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk Sender: libc-alpha-owner@sourceware.org From: "Wilco Dijkstra" <wdijkstr@arm.com> To: "GNU C Library" <libc-alpha@sourceware.org> Subject: [PATCH] Inline C99 math functions Date: Mon, 15 Jun 2015 12:08:12 +0100 Message-ID: <001201d0a75b$921d9860$b658c920$@com> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="----=_NextPart_000_0013_01D0A763.F3E20060"

Wilco June 15, 2015, 11:08 a.m. UTC

Add inlining of the C99 math functions isinf/isnan/signbit/isfinite/isnormal/fpclassify using GCC
built-ins when available. Since going through the PLT is expensive for these small functions,
inlining results in major speedups (about 7x on Cortex-A57 for isinf). The GCC built-ins are not
correct if signalling NaN support is required, and thus are turned off in that case (see GCC bug
66462). The test-snan.c tests sNaNs and so must be explicitly built with -fsignaling-nans.

As a result of this many target overrides and the various __isnan/__finite inlines in math_private.h
are no longer required. If agreed we could remove all this code and only keep the generic definition
of isinf/etc which will use the builtin.

Tested on AArch64. OK for commit?

ChangeLog: 
2015-06-15  Wilco Dijkstra  <wdijkstr@arm.com>

	* math/Makefile: Build test-snan.c with -fsignaling-nans.	
	* math/math.h (fpclassify): Use __builtin_fpclassify when
	available.  (signbit): Use __builtin_signbit(f/l).
	(isfinite): Use__builtin_isfinite.  (isnormal): Use
	__builtin_isnormal.  (isnan): Use __builtin_isnan.
	(isinf): Use __builtin_isinf_sign.
---
 math/Makefile |  1 +
 math/math.h   | 51 ++++++++++++++++++++++++++++++---------------------
 2 files changed, 31 insertions(+), 21 deletions(-)

Joseph Myers June 15, 2015, 2:44 p.m. UTC | #1

On Mon, 15 Jun 2015, Wilco Dijkstra wrote:

> Add inlining of the C99 math functions 
> isinf/isnan/signbit/isfinite/isnormal/fpclassify using GCC built-ins 
> when available. Since going through the PLT is expensive for these small 
> functions, inlining results in major speedups (about 7x on Cortex-A57 
> for isinf). The GCC built-ins are not correct if signalling NaN support 

Where are the benchmarks for this?  Please put them in benchtests so 
actual reproducible figures can be given.  That's the standard practice 
for any change being justified on the basis of performance.

What are the effects on code size (especially for fpclassify)?  If code 
becomes larger, conditioning on !defined __OPTIMIZE_SIZE__ should be 
considered.

> As a result of this many target overrides and the various 
> __isnan/__finite inlines in math_private.h are no longer required. If 
> agreed we could remove all this code and only keep the generic 
> definition of isinf/etc which will use the builtin.

How do those inlines compare with the code GCC generates?  As I understand 
it, before your patch libm-internal calls to these macros would have 
called the inline versions of functions such as __finite (no function 
call, PLT or otherwise, and using integer arithmetic), and after the patch 
libm-internal calls would have used the GCC inlines (with floating-point 
arithmetic) - is the new state for libm-internal calls better than the old 
state or not?

> 2015-06-15  Wilco Dijkstra  <wdijkstr@arm.com>
> 

	[BZ #15367]
	[BZ #17441]

Wilco June 15, 2015, 4:40 p.m. UTC | #2

> Joseph Myers wrote:
> On Mon, 15 Jun 2015, Wilco Dijkstra wrote:
> 
> > Add inlining of the C99 math functions
> > isinf/isnan/signbit/isfinite/isnormal/fpclassify using GCC built-ins
> > when available. Since going through the PLT is expensive for these small
> > functions, inlining results in major speedups (about 7x on Cortex-A57
> > for isinf). The GCC built-ins are not correct if signalling NaN support
> 
> Where are the benchmarks for this?  Please put them in benchtests so
> actual reproducible figures can be given.  That's the standard practice
> for any change being justified on the basis of performance.

I'll add a benchmark in another patch - it's not trivial as benchtest is not
suitable to accurately time very simple functions, especially when inlined...

> What are the effects on code size (especially for fpclassify)?  If code
> becomes larger, conditioning on !defined __OPTIMIZE_SIZE__ should be
> considered.

Codesize of what? Few applications use these functions... GLIBC mathlib is 
the main example, but it's not typical as it has extremely high usage of these
functions and you'll unlikely want to avoid inlining as the performance gain
and avoidance of explicit calls are significant benefits in the math functions 
(no need to use callee-saves).

I suppose adding a size check for fpclassify would be reasonable as the 
expansion ends up quite large (but that should really be fixed). 
Note fpclassify is used rather inappropriately in most cases in GLIBC, so
the best approach would be rewrite those to use faster isnan/isinf checks.

> > As a result of this many target overrides and the various
> > __isnan/__finite inlines in math_private.h are no longer required. If
> > agreed we could remove all this code and only keep the generic
> > definition of isinf/etc which will use the builtin.
> 
> How do those inlines compare with the code GCC generates?  As I understand
> it, before your patch libm-internal calls to these macros would have
> called the inline versions of functions such as __finite (no function
> call, PLT or otherwise, and using integer arithmetic), and after the patch
> libm-internal calls would have used the GCC inlines (with floating-point
> arithmetic) - is the new state for libm-internal calls better than the old
> state or not?

__isinf(f/l) were never inlined, but are now (__isinf_ns was used in some cases,
but not nearly all). Fpclassify*, isnormal* were never inlined, but now are. These
new inlines provide the gains.

__isnan(f) and __finite(f) used GLIBC internal inlines, and now use GCC built-ins 
instead. We could measure whether this specific change causes any difference, but 
I bet it will be very uarch specific and minor compared to the total time math
functions take.

Also I don't think having special inlines that are only used inside GLIBC is a 
good approach - if the GCC built-ins are not fast enough then we should fix them.

Wilco

Joseph Myers June 15, 2015, 5 p.m. UTC | #3

On Mon, 15 Jun 2015, Wilco Dijkstra wrote:

> > Where are the benchmarks for this?  Please put them in benchtests so
> > actual reproducible figures can be given.  That's the standard practice
> > for any change being justified on the basis of performance.
> 
> I'll add a benchmark in another patch - it's not trivial as benchtest is not
> suitable to accurately time very simple functions, especially when inlined...

Well, the benchmark should come first....

> > What are the effects on code size (especially for fpclassify)?  If code
> > becomes larger, conditioning on !defined __OPTIMIZE_SIZE__ should be
> > considered.
> 
> Codesize of what? Few applications use these functions... GLIBC mathlib is 

Size of any code calling these macros (for nonconstant arguments).

> Also I don't think having special inlines that are only used inside 
> GLIBC is a good approach - if the GCC built-ins are not fast enough then 
> we should fix them.

Yes, we should improve the built-in functions, but first we should 
understand the effects on performance of glibc libm functions (I don't 
know if the existing benchtests cover cases where __finite / __isnan / 
__isinf_ns inlines were used) to see if this cleanup patch indeed doesn't 
significantly harm performance of affected libm functions (and possibly 
improves performance through the changes in cases that wouldn't previously 
have been inlined at all).

Ondřej Bílka June 15, 2015, 6:52 p.m. UTC | #4

On Mon, Jun 15, 2015 at 05:40:11PM +0100, Wilco Dijkstra wrote:
> > Joseph Myers wrote:
> > On Mon, 15 Jun 2015, Wilco Dijkstra wrote:
> > 
> > > Add inlining of the C99 math functions
> > > isinf/isnan/signbit/isfinite/isnormal/fpclassify using GCC built-ins
> > > when available. Since going through the PLT is expensive for these small
> > > functions, inlining results in major speedups (about 7x on Cortex-A57
> > > for isinf). The GCC built-ins are not correct if signalling NaN support
> > 
> > Where are the benchmarks for this?  Please put them in benchtests so
> > actual reproducible figures can be given.  That's the standard practice
> > for any change being justified on the basis of performance.
> 
> I'll add a benchmark in another patch - it's not trivial as benchtest is not
> suitable to accurately time very simple functions, especially when inlined...
>
As I wrote in other thread that gcc builtins have poor performance a
benchmark is tricky. Main problem is that these tests are in branch and
gcc will simplify them. As builtins are branchless it obviously couldn't
simplify them.

You could reuse my example to show problem with builtins as benchtest,
you could make worse mistakes and inlining would still save a day.

You will get different characteristic in following three cases. In case
1 isinf is simplified by surrounding if so we measure time of branch
never taken.

Case 2 is identical except that in case 1 conversion to branchless 
ret += 42 * __builtin_isinf(d[i])
migth make sense where gcc could make isinf return 1 when its infinity.

In case 3 you get raw sum, difference between builtin and optimized
inline is bigger, mainly because of dependency chain.

Then numbers you get are also architecture specific so its hard to guess
exact numbers except that it will help as inlining gives huge boost.

For other functions performance should be identical on optimized inline
as it just checks that exponent is 1023 and fails, what happens after
that check is irrelevant.

On my ivy bridge I get following where bti is optimized inline of case i
and nti is gcc builtin for same case.

That performance becomes different when you decide to isolate builtin
and inline expansion into separately compiled function as loops gain a
lot by not having to initialize constants.

So you first need to tell what are your benchmark characteristic before
writing benchmark. A sample follows


time ./bt1

real	0m0.788s
user	0m0.772s
sys	0m0.000s
19:44:18:~$ time ./nt1

real	0m1.162s
user	0m1.136s
sys	0m0.004s
19:44:23:~$ time ./bt2

real	0m0.777s
user	0m0.760s
sys	0m0.000s
19:44:28:~$ time ./nt2

real	0m1.167s
user	0m1.144s
sys	0m0.000s
19:44:32:~$ time ./bt3

real	0m1.161s
user	0m1.140s
sys	0m0.000s
19:44:37:~$ time ./nt3

real	0m1.774s
user	0m1.740s
sys	0m0.000s






#ifdef BRANCHED
static inline int
isinf (double dx)
{
  union u {
    double d;
    long l;  
  };
  union u u;
  u.d = dx;
  long x = u.l;
  return 2 * x == 0xffe0000000000000 ? (x == 0x7ff0000000000000 ? 1 :
-1) : 0;
}
#endif

int main()
{
  double ret= 0.0;
  int i, j;
  double *d = malloc (800000);
  for (j=0; j<1000000; j++)
#ifdef T1
  for (i=0; i<1000; i++)
    if (__builtin_expect(isinf (d[i]),0))
      ret += 42;                         
#endif
#ifdef T2
  for (i=0; i<1000; i++)
    if (__builtin_expect(isinf (d[i]),0))
      abort();
#endif
#ifdef T3
  for (i=0; i<1000; i++)
    ret += __builtin_expect(isinf (d[i]),0);
#endif

  return ret;
}


 
> > What are the effects on code size (especially for fpclassify)?  If code
> > becomes larger, conditioning on !defined __OPTIMIZE_SIZE__ should be
> > considered.
> 
> Codesize of what? Few applications use these functions... GLIBC mathlib is 
> the main example, but it's not typical as it has extremely high usage of these
> functions and you'll unlikely want to avoid inlining as the performance gain
> and avoidance of explicit calls are significant benefits in the math functions 
> (no need to use callee-saves).
> 
> I suppose adding a size check for fpclassify would be reasonable as the 
> expansion ends up quite large (but that should really be fixed). 
> Note fpclassify is used rather inappropriately in most cases in GLIBC, so
> the best approach would be rewrite those to use faster isnan/isinf checks.
> 
Remember cost of cache, you are really optimizing for performance when
assuming that instruction cache misses are problem. It is worth to
optimize even by increasing size, just be aware of instruction cost.
Rule of thumb is that extra byte should save a cycle.

fpclassify should have around same impact as cache usage as isinf. Just
check of exponent would ever be in cache. gcc should move rest of function to separate 
cache line with cold code that is almost never read and obviously has no impact on performance.

Also my guess is that it would improve size in most cases but it depends
on caller. Also you shouldn't rewrite that to isnan/isinf checks as that would
probably harm performance. 

Again reason is that gcc will delete unused branches so function size
depends on what do you check and is smaller than separate checks. Using
your builtin on x64 you will get following assembly

   int result =  __builtin_fpclassify (FP_NAN, FP_INFINITE, \
     FP_NORMAL, FP_SUBNORMAL, FP_ZERO, d[i]);

followed by

if (result == FP_NORMAL)

  48:	f2 0f 10 02          	movsd  (%rdx),%xmm0
  4c:	66 0f 54 c1          	andpd  %xmm1,%xmm0
  50:	66 0f 2e c0          	ucomisd %xmm0,%xmm0
  54:	7a 10                	jp     66 <main+0x66>
  56:	66 0f 2e c2          	ucomisd %xmm2,%xmm0
  5a:	77 0a                	ja     66 <main+0x66>
  5c:	66 0f 2e c4          	ucomisd %xmm4,%xmm0
  60:	72 04                	jb     66 <main+0x66>
  62:	f2 0f 58 dd          	addsd  %xmm5,%xmm3


or 

if (result == FP_NAN)

  41:	f2 0f 10 02          	movsd  (%rdx),%xmm0
  45:	66 0f 54 c1          	andpd  %xmm1,%xmm0
  49:	66 0f 2e c0          	ucomisd %xmm0,%xmm0
  4d:	7b e9                	jnp    38 <main+0x38>

or

if (result == FP_INFINITE || result == FP_ZERO)

  48:	f2 0f 10 02          	movsd  (%rdx),%xmm0
  4c:	66 0f 54 c2          	andpd  %xmm2,%xmm0
  50:	66 0f 2e c0          	ucomisd %xmm0,%xmm0
  54:	7a 06                	jp     5c <main+0x5c>
  56:	66 0f 2e c3          	ucomisd %xmm3,%xmm0
  5a:	76 04                	jbe    60 <main+0x60>


Note that FP_NORMAL has poor performance and you should replace it by
inline as none of three branches there is taken, while what I submitted
in review thread has one branch not taken.

> > > As a result of this many target overrides and the various
> > > __isnan/__finite inlines in math_private.h are no longer required. If
> > > agreed we could remove all this code and only keep the generic
> > > definition of isinf/etc which will use the builtin.
> > 
> > How do those inlines compare with the code GCC generates?  As I understand
> > it, before your patch libm-internal calls to these macros would have
> > called the inline versions of functions such as __finite (no function
> > call, PLT or otherwise, and using integer arithmetic), and after the patch
> > libm-internal calls would have used the GCC inlines (with floating-point
> > arithmetic) - is the new state for libm-internal calls better than the old
> > state or not?
> 
> __isinf(f/l) were never inlined, but are now (__isinf_ns was used in some cases,
> but not nearly all). Fpclassify*, isnormal* were never inlined, but now are. These
> new inlines provide the gains.
> 
> __isnan(f) and __finite(f) used GLIBC internal inlines, and now use GCC built-ins 
> instead. We could measure whether this specific change causes any difference, but 
> I bet it will be very uarch specific and minor compared to the total time math
> functions take.
> 
> Also I don't think having special inlines that are only used inside GLIBC is a 
> good approach - if the GCC built-ins are not fast enough then we should fix them.
>

Joseph Myers June 15, 2015, 9:35 p.m. UTC | #5

On Mon, 15 Jun 2015, Ondřej Bílka wrote:

> As I wrote in other thread that gcc builtins have poor performance a
> benchmark is tricky. Main problem is that these tests are in branch and
> gcc will simplify them. As builtins are branchless it obviously couldn't
> simplify them.

Even a poor benchmark, checked into the benchtests directory, would be a 
starting point for improved benchmarks as well as for benchmarking any 
future improvements to these functions.  Having a benchmark that everyone 
can readily use with glibc is better than having a performance improvement 
asserted in a patch submission without the benchmark being available at 
all.

It isn't necessary to show that the use of built-in functions here is 
optimal.  Simply provide evidence that (a) it's at least as good as the 
existing out-of-line functions, for calls from user programs, and (b) libm 
functions that previously used glibc-internal inlines, and would use GCC 
built-in functions after the patch, don't suffer any significant 
performance regression from that change.

Ondřej Bílka June 16, 2015, 5 a.m. UTC | #6

On Mon, Jun 15, 2015 at 09:35:22PM +0000, Joseph Myers wrote:
> On Mon, 15 Jun 2015, Ondřej Bílka wrote:
> 
> > As I wrote in other thread that gcc builtins have poor performance a
> > benchmark is tricky. Main problem is that these tests are in branch and
> > gcc will simplify them. As builtins are branchless it obviously couldn't
> > simplify them.
> 
> Even a poor benchmark, checked into the benchtests directory, would be a 
> starting point for improved benchmarks as well as for benchmarking any 
> future improvements to these functions.  Having a benchmark that everyone 
> can readily use with glibc is better than having a performance improvement 
> asserted in a patch submission without the benchmark being available at 
> all.
>
No, a poor benchmark is dangerous and much worse than none at all. With
poor benchmark you could easily check performance regression that looks
like improvement on benchmark and you wouldn't notice until some
developer measures poor performance of his application and finds that
problem is on his side.

I could almost "show" that fpclassify gcc builtin is slower than library
call, in benchmark below I exploit branch misprediction to get close. If
I could use "benchtest" below I could "improve" fpclassify by making
zero check branchless which would improve benchmark numbers to actually
beat call overhead. Or I could play with probability of subnormals to
increase running time of gcc builtin and decrease of library. Moral is
that with poor benchmark your implementation will be poor as it tries to
minimize benchmark.

With d[i]=(rand() % 2) ? 0.0 : (rand() % 2 ? 1.3 : 1.0/0.0);

there is almost no difference

06:28:15:~$  gcc -O3  ft.c -lm 
06:28:19:~$ time ./a.out 

real	0m0.719s
user	0m0.712s
sys	0m0.004s

06:28:26:~$  gcc -O3  ft.c -lm  -DBUILTIN
06:28:39:~$ time ./a.out 

real	0m0.677s
user	0m0.676s
sys	0m0.000s

While when I change that to d[i]=4.2
then there is big improvement.

06:50:55:~$ gcc -O3 ft.c -lm 
06:51:00:~$ time ./a.out 

real	0m0.624s
user	0m0.624s
sys	0m0.000s
06:51:02:~$ gcc -O3 ft.c -lm  -DBUILTIN
06:51:12:~$ time ./a.out 

real	0m0.384s
user	0m0.380s
sys	0m0.000s


> It isn't necessary to show that the use of built-in functions here is 
> optimal.  Simply provide evidence that (a) it's at least as good as the 
> existing out-of-line functions, for calls from user programs, and (b) libm 
> functions that previously used glibc-internal inlines, and would use GCC 
> built-in functions after the patch, don't suffer any significant 
> performance regression from that change.
> 
Which as I write before isn't as due gcc builtin poor implementation.
You would need to fix that or add inlines for optimal performance.

And variant of my "benchmark" is here:


#include <stdlib.h>
#include <stdio.h>
#include <math.h>

#ifdef BUILTIN
int __attribute__((noinline)) nor(double x)
{
  return  __builtin_expect( 
__builtin_fpclassify (FP_NAN, FP_INFINITE,           \
     FP_NORMAL, FP_SUBNORMAL, FP_ZERO, x),0);

}
#else
int __attribute__((noinline)) nor(double x)
{
  return fpclassify (x);
}
#endif


int main()
{
  double ret= 0.0;
  int i, j;
  double *d = malloc (800000);
  for (i=0;i<1000;i++)
    d[i]=(rand() % 2) ? 0.0 : (rand() % 2 ? 1.3 : 1.0/0.0);

  for (j=0; j<100000; j++)
  for (i=0; i<1000; i++){
   int result = nor(d[i]);
    if (result == FP_NORMAL)
      ret += 42;                         }
  
  return ret;
}

Adhemerval Zanella June 16, 2015, 12:31 p.m. UTC | #7

On 16-06-2015 02:00, Ondřej Bílka wrote:
> On Mon, Jun 15, 2015 at 09:35:22PM +0000, Joseph Myers wrote:
>> On Mon, 15 Jun 2015, Ondřej Bílka wrote:
>>
>>> As I wrote in other thread that gcc builtins have poor performance a
>>> benchmark is tricky. Main problem is that these tests are in branch and
>>> gcc will simplify them. As builtins are branchless it obviously couldn't
>>> simplify them.
>>
>> Even a poor benchmark, checked into the benchtests directory, would be a 
>> starting point for improved benchmarks as well as for benchmarking any 
>> future improvements to these functions.  Having a benchmark that everyone 
>> can readily use with glibc is better than having a performance improvement 
>> asserted in a patch submission without the benchmark being available at 
>> all.
>>
> No, a poor benchmark is dangerous and much worse than none at all. With
> poor benchmark you could easily check performance regression that looks
> like improvement on benchmark and you wouldn't notice until some
> developer measures poor performance of his application and finds that
> problem is on his side.
> 
> I could almost "show" that fpclassify gcc builtin is slower than library
> call, in benchmark below I exploit branch misprediction to get close. If
> I could use "benchtest" below I could "improve" fpclassify by making
> zero check branchless which would improve benchmark numbers to actually
> beat call overhead. Or I could play with probability of subnormals to
> increase running time of gcc builtin and decrease of library. Moral is
> that with poor benchmark your implementation will be poor as it tries to
> minimize benchmark.

So to make this proposal to move forward, how exactly do you propose to
create a benchtest for such scenario? I get this is tricky and a lot of
variables may apply, but I do agree with Joseph that we shouldn't quite
aim for optimal performance, imho using compiler builtins with reasonable
performance is a gain in code maintainability.

So from various code pieces you have thrown in maillist, I see that we
may focus on a benchmark that uses a random sample with different 
probability scenarios FP number types:

1. high prob for normal
2. high prob for nan
3. high prob for inf,
4. high prob for subnormal 

And 2, 3, 4 should not be the optimization focus (since they are not the
usual case for mostly of computations and algorithms). Do you propose
something different?

Ondřej Bílka June 16, 2015, 1:43 p.m. UTC | #8

On Tue, Jun 16, 2015 at 09:31:02AM -0300, Adhemerval Zanella wrote:
> 
> 
> On 16-06-2015 02:00, Ondřej Bílka wrote:
> > On Mon, Jun 15, 2015 at 09:35:22PM +0000, Joseph Myers wrote:
> >> On Mon, 15 Jun 2015, Ondřej Bílka wrote:
> >>
> >>> As I wrote in other thread that gcc builtins have poor performance a
> >>> benchmark is tricky. Main problem is that these tests are in branch and
> >>> gcc will simplify them. As builtins are branchless it obviously couldn't
> >>> simplify them.
> >>
> >> Even a poor benchmark, checked into the benchtests directory, would be a 
> >> starting point for improved benchmarks as well as for benchmarking any 
> >> future improvements to these functions.  Having a benchmark that everyone 
> >> can readily use with glibc is better than having a performance improvement 
> >> asserted in a patch submission without the benchmark being available at 
> >> all.
> >>
> > No, a poor benchmark is dangerous and much worse than none at all. With
> > poor benchmark you could easily check performance regression that looks
> > like improvement on benchmark and you wouldn't notice until some
> > developer measures poor performance of his application and finds that
> > problem is on his side.
> > 
> > I could almost "show" that fpclassify gcc builtin is slower than library
> > call, in benchmark below I exploit branch misprediction to get close. If
> > I could use "benchtest" below I could "improve" fpclassify by making
> > zero check branchless which would improve benchmark numbers to actually
> > beat call overhead. Or I could play with probability of subnormals to
> > increase running time of gcc builtin and decrease of library. Moral is
> > that with poor benchmark your implementation will be poor as it tries to
> > minimize benchmark.
> 
> So to make this proposal to move forward, how exactly do you propose to
> create a benchtest for such scenario? I get this is tricky and a lot of
> variables may apply, but I do agree with Joseph that we shouldn't quite
> aim for optimal performance, imho using compiler builtins with reasonable
> performance is a gain in code maintainability.
> 
As I said before about these they are hard to measure and I could
argue also versus my benchmark that its inaccurate as it doesn't measure
effect of cpu pipeline when function does other computation. Answer is
don't do microbenchmark.

Take existing math function with benchtest and see what performance
difference you gain. Most complex math functions start with tests like

if (isfinite(x) && isfinite(y))

so there should be measurable performance improvement, possibly add
these if they aren't covered.


> So from various code pieces you have thrown in maillist, I see that we
> may focus on a benchmark that uses a random sample with different 
> probability scenarios FP number types:
> 
> 1. high prob for normal
> 2. high prob for nan
> 3. high prob for inf,
> 4. high prob for subnormal 
> 
> And 2, 3, 4 should not be the optimization focus (since they are not the
> usual case for mostly of computations and algorithms). Do you propose
> something different?

No, I done that to show that microbenchmarks tend to be wrong, you need
to do benchmarking to rule that out. Depending on distribution of these
cases a different implemenation will become optimal and with some work I
could find distribution where gcc builtin decision tree cost is more
than saved overhead.

That just means to go collect data to avoid problems like this. And you
forget case 5. high prob of zero. That could make fpclassify benchmark
misleading when you just use zeroed array like I did with original
benchmark.

Wilco June 16, 2015, 3:53 p.m. UTC | #9

> -----Original Message-----
> From: Joseph Myers [mailto:joseph@codesourcery.com]
> Sent: 15 June 2015 18:01
> To: Wilco Dijkstra
> Cc: GNU C Library
> Subject: RE: [PATCH] Inline C99 math functions
> 
> On Mon, 15 Jun 2015, Wilco Dijkstra wrote:
> 
> > > Where are the benchmarks for this?  Please put them in benchtests so
> > > actual reproducible figures can be given.  That's the standard practice
> > > for any change being justified on the basis of performance.
> >
> > I'll add a benchmark in another patch - it's not trivial as benchtest is not
> > suitable to accurately time very simple functions, especially when inlined...
> 
> Well, the benchmark should come first....

I added a new math-inlines benchmark based on the string benchmark infrastructure.
I used 2x1024 inputs, one 99% finite FP numbers (20% zeroes) and 1% inf/NaN,
and the 2nd with 50% inf, and 50% Nan. Here are the relative timings for Cortex-A57:

__fpclassify_t:	8.76	7.04
fpclassify_t:	4.91	5.17
__isnormal_inl_t:	8.77	7.16
isnormal_t:		3.16	3.17
__finite_inl_t:	1.91	1.91
__finite_t:		15.29	15.28
isfinite_t:		1.28	1.28
__isinf_inl_t:	1.92	2.99
__isinf_t:		8.9	6.17
isinf_t:		1.28	1.28
__isnan_inl_t:	1.91	1.92
__isnan_t:		15.28	15.28
isnan_t:		1	1.01

The plain isnan_t functions use the GCC built-ins, the _inl variant uses the
existing math_private.h inlines (with __isinf fixed to return the sign too),
and the __isnan variants are the non-inline GLIBC functions.

So this clearly shows the GCC built-ins win by a huge margin, including the
inline versions. It also shows that multiple isinf/isnan calls would be faster
than a single inlined fpclassify...

> > > What are the effects on code size (especially for fpclassify)?  If code
> > > becomes larger, conditioning on !defined __OPTIMIZE_SIZE__ should be
> > > considered.
> >
> > Codesize of what? Few applications use these functions... GLIBC mathlib is
> 
> Size of any code calling these macros (for nonconstant arguments).

Well the size of the __isinf_t function is 160 bytes vs isinf_t 84 bytes
due to the callee-save overhead of the function call. The builtin isinf uses
3 instructions inside the loop plus 3 lifted before it, while the call to
__isinf needs 3 plus a lot of code to save/restore the callee-saves.

> > Also I don't think having special inlines that are only used inside
> > GLIBC is a good approach - if the GCC built-ins are not fast enough then
> > we should fix them.
> 
> Yes, we should improve the built-in functions, but first we should
> understand the effects on performance of glibc libm functions (I don't
> know if the existing benchtests cover cases where __finite / __isnan /
> __isinf_ns inlines were used) to see if this cleanup patch indeed doesn't
> significantly harm performance of affected libm functions (and possibly
> improves performance through the changes in cases that wouldn't previously
> have been inlined at all).

A run of the math tests doesn't show up any obvious differences beyond the
usual variations from run to run. I suspect the difference due to inlining
is in the noise for expensive math functions.

Wilco

Joseph Myers June 16, 2015, 4:13 p.m. UTC | #10

On Tue, 16 Jun 2015, Wilco Dijkstra wrote:

> > Well, the benchmark should come first....
> 
> I added a new math-inlines benchmark based on the string benchmark 
> infrastructure.

Thanks.  I await the patch submission.

> So this clearly shows the GCC built-ins win by a huge margin, including the
> inline versions. It also shows that multiple isinf/isnan calls would be faster

That's interesting information - suggesting that changes in GCC to use 
integer arithmetic should be conditional on -fsignaling-nans, if doing the 
operations by integer arithmetic is slower (at least on this processor).

(It also suggests it's safe to remove the existing glibc-internal inlines 
as part of moving to using the built-in functions when possible.)

> > > Codesize of what? Few applications use these functions... GLIBC mathlib is
> > 
> > Size of any code calling these macros (for nonconstant arguments).
> 
> Well the size of the __isinf_t function is 160 bytes vs isinf_t 84 bytes
> due to the callee-save overhead of the function call. The builtin isinf uses
> 3 instructions inside the loop plus 3 lifted before it, while the call to
> __isinf needs 3 plus a lot of code to save/restore the callee-saves.

One might suppose that most functions using these macros contain other 
function calls as well, and so that the callee-save overhead should not be 
included in the comparison.

When you exclude callee-save overhead, how do things compare for 
fpclassify (the main case where inlining may be questionable when 
optimizing for size)?

Ondřej Bílka June 16, 2015, 4:40 p.m. UTC | #11

On Tue, Jun 16, 2015 at 04:53:11PM +0100, Wilco Dijkstra wrote:
> 
> 
> > -----Original Message-----
> > From: Joseph Myers [mailto:joseph@codesourcery.com]
> > Sent: 15 June 2015 18:01
> > To: Wilco Dijkstra
> > Cc: GNU C Library
> > Subject: RE: [PATCH] Inline C99 math functions
> > 
> > On Mon, 15 Jun 2015, Wilco Dijkstra wrote:
> > 
> > > > Where are the benchmarks for this?  Please put them in benchtests so
> > > > actual reproducible figures can be given.  That's the standard practice
> > > > for any change being justified on the basis of performance.
> > >
> > > I'll add a benchmark in another patch - it's not trivial as benchtest is not
> > > suitable to accurately time very simple functions, especially when inlined...
> > 
> > Well, the benchmark should come first....
> 
> I added a new math-inlines benchmark based on the string benchmark infrastructure.
> I used 2x1024 inputs, one 99% finite FP numbers (20% zeroes) and 1% inf/NaN,
> and the 2nd with 50% inf, and 50% Nan. Here are the relative timings for Cortex-A57:
> 
Where is benchmark, there are several things that could go wrong with it.

> __fpclassify_t:	8.76	7.04
> fpclassify_t:	4.91	5.17

> __isnormal_inl_t:	8.77	7.16
> isnormal_t:		3.16	3.17

Where did you get inline? I couldn't find it anywhere. Also such big
number for inline implementation is suspect

> __finite_inl_t:	1.91	1.91
> __finite_t:		15.29	15.28
> isfinite_t:		1.28	1.28
> __isinf_inl_t:	1.92	2.99
> __isinf_t:		8.9	6.17
> isinf_t:		1.28	1.28
> __isnan_inl_t:	1.91	1.92
> __isnan_t:		15.28	15.28
> isnan_t:		1	1.01
> 
> The plain isnan_t functions use the GCC built-ins, the _inl variant uses the
> existing math_private.h inlines (with __isinf fixed to return the sign too),
> and the __isnan variants are the non-inline GLIBC functions.
> 
> So this clearly shows the GCC built-ins win by a huge margin, including the
> inline versions.
That looks bit suspect, submit a benchmark to see if its correct or not.


>  It also shows that multiple isinf/isnan calls would be faster
> than a single inlined fpclassify...
> 
No, thats completely wrong. When you look on assembly when using __builtin_isnan
then its identical to one of (on x64 but I doubt that arm gcc is worse)

__builtin_fpclassify (FP_NAN, FP_INFINITE, FP_NORMAL, FP_SUBNORMAL, FP_ZERO, x),0) == FP_NAN

So removing fpclassify wouldn't in better case didn't change performance
at all, in worse one it would harm it due to duplicated checks.


> > > > What are the effects on code size (especially for fpclassify)?  If code
> > > > becomes larger, conditioning on !defined __OPTIMIZE_SIZE__ should be
> > > > considered.
> > >
> > > Codesize of what? Few applications use these functions... GLIBC mathlib is
> > 
> > Size of any code calling these macros (for nonconstant arguments).
> 
> Well the size of the __isinf_t function is 160 bytes vs isinf_t 84 bytes
> due to the callee-save overhead of the function call. The builtin isinf uses
> 3 instructions inside the loop plus 3 lifted before it, while the call to
> __isinf needs 3 plus a lot of code to save/restore the callee-saves.
> 
> > > Also I don't think having special inlines that are only used inside
> > > GLIBC is a good approach - if the GCC built-ins are not fast enough then
> > > we should fix them.
> > 
> > Yes, we should improve the built-in functions, but first we should
> > understand the effects on performance of glibc libm functions (I don't
> > know if the existing benchtests cover cases where __finite / __isnan /
> > __isinf_ns inlines were used) to see if this cleanup patch indeed doesn't
> > significantly harm performance of affected libm functions (and possibly
> > improves performance through the changes in cases that wouldn't previously
> > have been inlined at all).
> 
> A run of the math tests doesn't show up any obvious differences beyond the
> usual variations from run to run. I suspect the difference due to inlining
> is in the noise for expensive math functions.
>
Look at complex math, these use it. For real math you need to pick
specific inputs to trigger unlikely path that uses isinf...

Wilco June 16, 2015, 5:53 p.m. UTC | #12

> Ondřej Bílka wrote:
> On Tue, Jun 16, 2015 at 04:53:11PM +0100, Wilco Dijkstra wrote:
> > I added a new math-inlines benchmark based on the string benchmark infrastructure.
> > I used 2x1024 inputs, one 99% finite FP numbers (20% zeroes) and 1% inf/NaN,
> > and the 2nd with 50% inf, and 50% Nan. Here are the relative timings for Cortex-A57:
> >
> Where is benchmark, there are several things that could go wrong with it.

I'll send it when I can (it has to go through review etc).

> > __fpclassify_t:	8.76	7.04
> > fpclassify_t:	4.91	5.17
> 
> > __isnormal_inl_t:	8.77	7.16
> > isnormal_t:		3.16	3.17
> 
> Where did you get inline? I couldn't find it anywhere. Also such big
> number for inline implementation is suspect

It does (__fpclassify (x) == FP_NORMAL) like math.h which is obviously a bad
idea and the reason for the low performance. Although the GCC isnormal builtin
is not particularly fast, it still beats it by more than a factor of 2.

> > __finite_inl_t:	1.91	1.91
> > __finite_t:		15.29	15.28
> > isfinite_t:		1.28	1.28
> > __isinf_inl_t:	1.92	2.99
> > __isinf_t:		8.9	6.17
> > isinf_t:		1.28	1.28
> > __isnan_inl_t:	1.91	1.92
> > __isnan_t:		15.28	15.28
> > isnan_t:		1	1.01
> >
> > The plain isnan_t functions use the GCC built-ins, the _inl variant uses the
> > existing math_private.h inlines (with __isinf fixed to return the sign too),
> > and the __isnan variants are the non-inline GLIBC functions.
> >
> > So this clearly shows the GCC built-ins win by a huge margin, including the
> > inline versions.
> That looks bit suspect, submit a benchmark to see if its correct or not.

It's certainly correct, but obviously different microarchitectures will show
different results. Note the GLIBC private inlines are not particularly good.

> >  It also shows that multiple isinf/isnan calls would be faster
> > than a single inlined fpclassify...
> >
> No, thats completely wrong. When you look on assembly when using __builtin_isnan
> then its identical to one of (on x64 but I doubt that arm gcc is worse)
> 
> __builtin_fpclassify (FP_NAN, FP_INFINITE, FP_NORMAL, FP_SUBNORMAL, FP_ZERO, x),0) == FP_NAN
> 
> So removing fpclassify wouldn't in better case didn't change performance
> at all, in worse one it would harm it due to duplicated checks.

Fpclassify basically does several checks if you save the result in a variable and
executes some branches. So you are far better off using dedicated checks if you
just need 2 or 3 of the 5 possible results. And depending on how the code is
structured you may only ever execute 1 check. That is far cheaper than first
computing the full result for fpclassify and then testing that.

> > A run of the math tests doesn't show up any obvious differences beyond the
> > usual variations from run to run. I suspect the difference due to inlining
> > is in the noise for expensive math functions.
> >
> Look at complex math, these use it. For real math you need to pick
> specific inputs to trigger unlikely path that uses isinf...

Yes this needs a dedicated test. Still if we save a cycle in a 100 cycle function,
it is hard to show it given modern OoO CPUs can have 5% variation from run to run...

Wilco

Carlos O'Donell June 16, 2015, 7:44 p.m. UTC | #13

On 06/15/2015 01:00 PM, Joseph Myers wrote:
> On Mon, 15 Jun 2015, Wilco Dijkstra wrote:
> 
>>> Where are the benchmarks for this?  Please put them in benchtests so
>>> actual reproducible figures can be given.  That's the standard practice
>>> for any change being justified on the basis of performance.
>>
>> I'll add a benchmark in another patch - it's not trivial as benchtest is not
>> suitable to accurately time very simple functions, especially when inlined...
> 
> Well, the benchmark should come first....

Agreed.

If it's not trivial to test the performance... then how did you test it?

How do I test it when reviewing the patch?

Cheers,
Carlos.

Carlos O'Donell June 16, 2015, 7:54 p.m. UTC | #14

On 06/16/2015 09:43 AM, Ondřej Bílka wrote:
>> So to make this proposal to move forward, how exactly do you propose to
>> create a benchtest for such scenario? I get this is tricky and a lot of
>> variables may apply, but I do agree with Joseph that we shouldn't quite
>> aim for optimal performance, imho using compiler builtins with reasonable
>> performance is a gain in code maintainability.
>>
> As I said before about these they are hard to measure and I could
> argue also versus my benchmark that its inaccurate as it doesn't measure
> effect of cpu pipeline when function does other computation. Answer is
> don't do microbenchmark.

That's not an answer, an answer is "Here's a patch to extend the libm testing
to show how isinf/isnan/signbit/isfinite/isnormal/fpclassify impact performance."

I agree that microbenchmarks can be misleading if interpreted by automated
systems, but we aren't talking about that yet, we are talking about experts
using these tools to discuss patches in an objective fashion.

Cheers,
Carlos.

Ondřej Bílka June 17, 2015, 5:35 a.m. UTC | #15

On Tue, Jun 16, 2015 at 06:53:39PM +0100, Wilco Dijkstra wrote:
> > Ondřej Bílka wrote:
> > On Tue, Jun 16, 2015 at 04:53:11PM +0100, Wilco Dijkstra wrote:
> > > I added a new math-inlines benchmark based on the string benchmark infrastructure.
> > > I used 2x1024 inputs, one 99% finite FP numbers (20% zeroes) and 1% inf/NaN,
> > > and the 2nd with 50% inf, and 50% Nan. Here are the relative timings for Cortex-A57:
> > >
> > Where is benchmark, there are several things that could go wrong with it.
> 
> I'll send it when I can (it has to go through review etc).
> 
> > > __fpclassify_t:	8.76	7.04
> > > fpclassify_t:	4.91	5.17
> > 
> > > __isnormal_inl_t:	8.77	7.16
> > > isnormal_t:		3.16	3.17
> > 
> > Where did you get inline? I couldn't find it anywhere. Also such big
> > number for inline implementation is suspect
> 
> It does (__fpclassify (x) == FP_NORMAL) like math.h which is obviously a bad
> idea and the reason for the low performance. Although the GCC isnormal builtin
> is not particularly fast, it still beats it by more than a factor of 2.
>
No, bad idea was not inlining fpclassify, that affects most of performance difference. 
There is also problem that glibcdev/glibc/sysdeps/ieee754/dbl-64/s_fpclassify.c is bit slow as it tests unlikely cases first but that is secondary.

> > > __finite_inl_t:	1.91	1.91
> > > __finite_t:		15.29	15.28
> > > isfinite_t:		1.28	1.28
> > > __isinf_inl_t:	1.92	2.99
> > > __isinf_t:		8.9	6.17
> > > isinf_t:		1.28	1.28
> > > __isnan_inl_t:	1.91	1.92
> > > __isnan_t:		15.28	15.28
> > > isnan_t:		1	1.01
> > >
> > > The plain isnan_t functions use the GCC built-ins, the _inl variant uses the
> > > existing math_private.h inlines (with __isinf fixed to return the sign too),
> > > and the __isnan variants are the non-inline GLIBC functions.
> > >
> > > So this clearly shows the GCC built-ins win by a huge margin, including the
> > > inline versions.
> > That looks bit suspect, submit a benchmark to see if its correct or not.
> 
> It's certainly correct, but obviously different microarchitectures will show
> different results. Note the GLIBC private inlines are not particularly good.
>
No, problem is that different benchmarks show different results on same
architecture. To speed things up run following to test all cases of
environment. Run attached tf script to get results on arm.

I am concerned on libc inlines that they need two big constants. That
sohuldn't be problem with loop but should in initial checks, so I need
benchmark to see what case you did test. Alter that only thing I could
come is that move from fp register has slow latency compared to floating
comparison. That don't have to mean anything on OoO cpu which hide that
latency by executing instructions after that. Second theory is that gcc
decided that its infinity is likely which leads to suboptimal assembly. 
We should also add expect to header to add that information


On x64 I get following results depending how gcc optimizes code which is
controled by inlining

don't inline
conditional add
branched

real	0m1.313s
user	0m1.312s
sys	0m0.000s
builtin

real	0m1.309s
user	0m1.308s
sys	0m0.000s
branch
branched

real	0m1.310s
user	0m1.308s
sys	0m0.000s
builtin

real	0m1.337s
user	0m1.312s
sys	0m0.004s
sum
branched

real	0m1.209s
user	0m1.204s
sys	0m0.000s
builtin

real	0m1.216s
user	0m1.212s
sys	0m0.000s
inline outer call
conditional add
branched

real	0m0.705s
user	0m0.704s
sys	0m0.000s
builtin

real	0m0.916s
user	0m0.916s
sys	0m0.000s
branch
branched

real	0m0.806s
user	0m0.804s
sys	0m0.000s
builtin

real	0m0.721s
user	0m0.716s
sys	0m0.000s
sum
branched

real	0m1.029s
user	0m1.028s
sys	0m0.000s
builtin

real	0m0.911s
user	0m0.908s
sys	0m0.000s
inline inner call
conditional add
branched

real	0m1.038s
user	0m1.032s
sys	0m0.000s
builtin

real	0m1.024s
user	0m1.016s
sys	0m0.000s
branch
branched

real	0m0.614s
user	0m0.608s
sys	0m0.000s
builtin

real	0m0.606s
user	0m0.608s
sys	0m0.000s
sum
branched

real	0m0.662s
user	0m0.660s
sys	0m0.000s
builtin

real	0m0.629s
user	0m0.628s
sys	0m0.000s
tigth loop
conditional add
branched

real	0m0.208s
user	0m0.208s
sys	0m0.000s
builtin

real	0m0.326s
user	0m0.324s
sys	0m0.000s
branch
branched

real	0m0.204s
user	0m0.200s
sys	0m0.000s
builtin

real	0m0.325s
user	0m0.324s
sys	0m0.000s
sum
branched

real	0m0.328s
user	0m0.332s
sys	0m0.000s
builtin

real	0m0.486s
user	0m0.484s
sys	0m0.000s

 
> > >  It also shows that multiple isinf/isnan calls would be faster
> > > than a single inlined fpclassify...
> > >
> > No, thats completely wrong. When you look on assembly when using __builtin_isnan
> > then its identical to one of (on x64 but I doubt that arm gcc is worse)
> > 
> > __builtin_fpclassify (FP_NAN, FP_INFINITE, FP_NORMAL, FP_SUBNORMAL, FP_ZERO, x),0) == FP_NAN
> > 
> > So removing fpclassify wouldn't in better case didn't change performance
> > at all, in worse one it would harm it due to duplicated checks.
> 
> Fpclassify basically does several checks if you save the result in a variable and
> executes some branches. So you are far better off using dedicated checks if you
> just need 2 or 3 of the 5 possible results. And depending on how the code is
> structured you may only ever execute 1 check. That is far cheaper than first
> computing the full result for fpclassify and then testing that.
> 
Which doesn't matter. As gcc optimized unneded checks away you won't do
unneeded checks. As using:

__builtin_fpclassify (FP_NAN, FP_INFINITE,           \
     FP_NORMAL, FP_SUBNORMAL, FP_ZERO, x),0);
  return result == FP_INFINITE || result == FP_NAN;

is slower than:

 return __builtin_isinf (x) ||  __builtin_isnan (x);

Your claim is false, run attached tf2 script to test.

conditional add
isnan+isinf

real	0m0.141s
user	0m0.140s
sys	0m0.000s
classify

real	0m0.136s
user	0m0.132s
sys	0m0.000s
branch
isnan+isinf

real	0m0.158s
user	0m0.156s
sys	0m0.000s
classify

real	0m0.136s
user	0m0.132s
sys	0m0.000s
sum
isnan+isinf

real	0m0.237s
user	0m0.236s
sys	0m0.000s
classify

real	0m0.202s
user	0m0.200s
sys	0m0.000s




> > > A run of the math tests doesn't show up any obvious differences beyond the
> > > usual variations from run to run. I suspect the difference due to inlining
> > > is in the noise for expensive math functions.
> > >
> > Look at complex math, these use it. For real math you need to pick
> > specific inputs to trigger unlikely path that uses isinf...
> 
> Yes this needs a dedicated test. Still if we save a cycle in a 100 cycle function,
> it is hard to show it given modern OoO CPUs can have 5% variation from run to run...
> 
It isn't hard. You just need to run it long enough until variance is
below one cycle. Also with OoO its even more important as while function
could be slower on benchmark it could be faster in practice as it takes
advantage of that and selects instructions that are easy to OoO so most
happens in parallel.
#include <stdlib.h>
#include <stdio.h>
//#include <math.h>
#undef isinf2
#ifdef BRANCHED
int I2
isinf2 (double dx)
{
  union u {
    double d;
    long l;
  };
  union u u;
  u.d = dx;
  long x = u.l;
  return 2 * x == 0xffe0000000000000 ? (x == 0x7ff0000000000000 ? 1 :
-1) : 0;
}
#else
 int I2
isinf2(double x)
{
  return __builtin_isinf (x);
}
#endif

void I1
t1(double x,double *d)
{
  if (isinf2 (x))
   *d+=42;
}
void I1
t2(double x,double *d)
{
  if (isinf2 (x))
    abort();
}
void I1
t3(double x,double *d)
{
  *d += isinf2 (x);
}




int main()
{
  double ret= 0.0;
  int i, j;
  double *d = malloc (800000);
  for (i=0;i<1000;i++)
    d[i]=4.2;//(rand() % 2) ? 0.0 : (rand() % 2 ? 1.3 : 1.0/0.0);

  for (j=0; j<300000; j++)
#ifdef T1
  for (i=0; i<1000; i++)
    t1(d[i], &ret);
#endif
#ifdef T2
  for (i=0; i<1000; i++)
    t2(d[i], &ret);
#endif
#ifdef T3
  for (i=0; i<1000; i++)
    t3(d[i], &ret);
#endif
  
  return ret;
}
#include <stdlib.h>
#include <stdio.h>
#if defined CLAS
#include <math.h>
#undef isinf2
static int isinf2(double x)
{
  int result =  __builtin_expect( 
__builtin_fpclassify (FP_NAN, FP_INFINITE,           \
     FP_NORMAL, FP_SUBNORMAL, FP_ZERO, x),0);
  return result == FP_INFINITE || result == FP_NAN;
}
#else
static int isinf2(double x)
{
  return __builtin_isinf (x) ||  __builtin_isnan (x);
}
#endif


int main()
{
  double ret= 0.0;
  int i, j;
  double *d = malloc (800000);
  for (i=0;i<1000;i++)
    d[i]=4.2;//(rand() % 2) ? 0.0 : (rand() % 2 ? 1.3 : 1.0/0.0);

  for (j=0; j<100000; j++)
#ifdef T1
  for (i=0; i<1000; i++)
    if (__builtin_expect(isinf2 (d[i]),0))
      ret += 42;
#endif
#ifdef T2
  for (i=0; i<1000; i++)
    if (__builtin_expect(isinf2 (d[i]),0))
      abort();
#endif
#ifdef T3
  for (i=0; i<1000; i++)
    ret += __builtin_expect(isinf2 (d[i]),0);
#endif
  
  return ret;
}
test()
{
gcc -O3 -DBRANCHED -DI1=$1 -DI2=$2 -DT$3 ft.c 2>/dev/null
echo branched
time ./a.out
gcc -O3  -DI1=$1 -DI2=$2 -DT$3 ft.c 2>/dev/null
echo builtin
time ./a.out
}
test2()
{
  echo conditional add
  test $1 $2 1
  echo branch
  test $1 $2 2
  echo sum
  test $1 $2 3

}
echo "don't inline"
test2 '__attribute__((noinline))' '__attribute__((noinline))'
echo inline outer call
test2 '__attribute__((noinline))' '__attribute__((always_inline))'
echo inline inner call
test2 '__attribute__((always_inline))' '__attribute__((noinline))'
echo tigth loop
test2 '__attribute__((always_inline))' '__attribute__((always_inline))'
echo conditional add
gcc -O3 ft2.c -DT1
echo isnan+isinf
time ./a.out
gcc -O3 ft2.c -DT1 -DCLAS
echo classify
time ./a.out
echo branch
gcc -O3 ft2.c -DT2
echo isnan+isinf
time ./a.out
gcc -O3 ft2.c -DT2 -DCLAS
echo classify
time ./a.out
echo sum
gcc -O3 ft2.c -DT3
echo isnan+isinf
time ./a.out
gcc -O3 ft2.c -DT3 -DCLAS
echo classify
time ./a.out

Wilco June 17, 2015, 12:49 p.m. UTC | #16

> Carlos O'Donell wrote:
> On 06/15/2015 01:00 PM, Joseph Myers wrote:
> > On Mon, 15 Jun 2015, Wilco Dijkstra wrote:
> >
> >>> Where are the benchmarks for this?  Please put them in benchtests so
> >>> actual reproducible figures can be given.  That's the standard practice
> >>> for any change being justified on the basis of performance.
> >>
> >> I'll add a benchmark in another patch - it's not trivial as benchtest is not
> >> suitable to accurately time very simple functions, especially when inlined...
> >
> > Well, the benchmark should come first....
> 
> Agreed.
> 
> If it's not trivial to test the performance... then how did you test it?
> 
> How do I test it when reviewing the patch?

Besides checking it was correctly inlining as expected, I ran the usual benchmarks
like SPECFP to check there are no performance regressions due to the math libs.

Adding more inlining is simply a no-brainer, especially when we're talking about 3-4 
instruction functions and not just avoiding a call but also a PLT indirection... 
This provides more than a factor 15 speedup in my microbenchmarks.

Wilco

Wilco June 17, 2015, 3:24 p.m. UTC | #17

> Ondřej Bílka wrote:
> On Tue, Jun 16, 2015 at 06:53:39PM +0100, Wilco Dijkstra wrote:
> > > Ondřej Bílka wrote:
> > > On Tue, Jun 16, 2015 at 04:53:11PM +0100, Wilco Dijkstra wrote:
> > > > I added a new math-inlines benchmark based on the string benchmark infrastructure.
> > > > I used 2x1024 inputs, one 99% finite FP numbers (20% zeroes) and 1% inf/NaN,
> > > > and the 2nd with 50% inf, and 50% Nan. Here are the relative timings for Cortex-A57:
> > > >
> > > Where is benchmark, there are several things that could go wrong with it.
> >
> > I'll send it when I can (it has to go through review etc).
> >
> > > > __fpclassify_t:	8.76	7.04
> > > > fpclassify_t:	4.91	5.17
> > >
> > > > __isnormal_inl_t:	8.77	7.16
> > > > isnormal_t:		3.16	3.17
> > >
> > > Where did you get inline? I couldn't find it anywhere. Also such big
> > > number for inline implementation is suspect
> >
> > It does (__fpclassify (x) == FP_NORMAL) like math.h which is obviously a bad
> > idea and the reason for the low performance. Although the GCC isnormal builtin
> > is not particularly fast, it still beats it by more than a factor of 2.
> >
> No, bad idea was not inlining fpclassify, that affects most of performance difference.
> There is also problem that glibcdev/glibc/sysdeps/ieee754/dbl-64/s_fpclassify.c is bit slow as
> it tests unlikely cases first but that is secondary.

Even with the inlined fpclassify (inl2 below), isnormal is slower:

__isnormal_inl2_t:	1.25	3.67
__isnormal_inl_t:	4.59	2.89
isnormal_t:	1	1

So using a dedicated builtin for isnormal is important.

> > It's certainly correct, but obviously different microarchitectures will show
> > different results. Note the GLIBC private inlines are not particularly good.
> >
> No, problem is that different benchmarks show different results on same
> architecture. To speed things up run following to test all cases of
> environment. Run attached tf script to get results on arm.

I tried, but I don't think this is a good benchmark - you're not measuring
the FP->int move for the branched version, and you're comparing the signed
version of isinf vs the builtin which does isinf_ns.

> Which doesn't matter. As gcc optimized unneded checks away you won't do
> unneeded checks. As using:
> 
> __builtin_fpclassify (FP_NAN, FP_INFINITE,           \
>      FP_NORMAL, FP_SUBNORMAL, FP_ZERO, x),0);
>   return result == FP_INFINITE || result == FP_NAN;
> 
> is slower than:
> 
>  return __builtin_isinf (x) ||  __builtin_isnan (x);
> 
> Your claim is false, run attached tf2 script to test.

That's not what I am seeing, using two explicit isinf/isnan calls (test2) is faster
than inlined fpclassify (test1):

__fpclassify_test2_t:	1	4.41
__fpclassify_test1_t:	1.23	4.66

Wilco

Ondřej Bílka June 17, 2015, 4:34 p.m. UTC | #18

On Wed, Jun 17, 2015 at 04:24:46PM +0100, Wilco Dijkstra wrote:
> > >
> > > > > __fpclassify_t:	8.76	7.04
> > > > > fpclassify_t:	4.91	5.17
> > > >
> > > > > __isnormal_inl_t:	8.77	7.16
> > > > > isnormal_t:		3.16	3.17
> > > >
> > > > Where did you get inline? I couldn't find it anywhere. Also such big
> > > > number for inline implementation is suspect
> > >
> > > It does (__fpclassify (x) == FP_NORMAL) like math.h which is obviously a bad
> > > idea and the reason for the low performance. Although the GCC isnormal builtin
> > > is not particularly fast, it still beats it by more than a factor of 2.
> > >
> > No, bad idea was not inlining fpclassify, that affects most of performance difference.
> > There is also problem that glibcdev/glibc/sysdeps/ieee754/dbl-64/s_fpclassify.c is bit slow as
> > it tests unlikely cases first but that is secondary.
> 
> Even with the inlined fpclassify (inl2 below), isnormal is slower:
> 
> __isnormal_inl2_t:	1.25	3.67
> __isnormal_inl_t:	4.59	2.89
> isnormal_t:	1	1
> 
> So using a dedicated builtin for isnormal is important.
>
That makes result identical to one of isnan. That its slower is bug in
fpclassify which should first check for normal, then do unlikely checks.

 
> > > It's certainly correct, but obviously different microarchitectures will show
> > > different results. Note the GLIBC private inlines are not particularly good.
> > >
> > No, problem is that different benchmarks show different results on same
> > architecture. To speed things up run following to test all cases of
> > environment. Run attached tf script to get results on arm.
> 
> I tried, but I don't think this is a good benchmark - you're not measuring
> the FP->int move for the branched version, and you're comparing the signed
> version of isinf vs the builtin which does isinf_ns.
> 
Wilco that isinf is signed is again completely irrelevant. gcc is smart.
It get expanded into 
if (foo ? (bar ? 1 : -1) : 0)
that gcc simplifies to
if (foo)
so it doesn't matter that checking sign would take 100 cycles as its
deleted code.

Also as it could make branched version only slower when it beats builtin
then also nonsigned one would beat builtin.

And do you have assembly to show it doesn't measure move or its just
your guess? On x64 it definitely measures move and I could add that gcc
messes that bit by moving several times. objdump -d 
on gcc   -DT2 -DBRANCHED  -DI1="__attribute((always_inline))" -DI2="__attribute__((always_inline))" ft.c   -c
clearly shows that conversion is done several times.

 24c:	48 8b 45 f0          	mov    -0x10(%rbp),%rax
 250:	48 01 d0             	add    %rdx,%rax
 253:	f2 0f 10 00          	movsd  (%rax),%xmm0
 257:	f2 0f 11 45 e8       	movsd  %xmm0,-0x18(%rbp)
 25c:	48 8d 45 c8          	lea    -0x38(%rbp),%rax
 260:	48 89 45 e0          	mov    %rax,-0x20(%rbp)
 264:	f2 0f 10 45 e8       	movsd  -0x18(%rbp),%xmm0
 269:	f2 0f 11 45 d8       	movsd  %xmm0,-0x28(%rbp)
 26e:	f2 0f 10 45 d8       	movsd  -0x28(%rbp),%xmm0
 273:	f2 0f 11 45 c0       	movsd  %xmm0,-0x40(%rbp)
 278:	48 8b 45 c0          	mov    -0x40(%rbp),%rax
 27c:	48 89 45 d0          	mov    %rax,-0x30(%rbp)
 280:	48 8b 45 d0          	mov    -0x30(%rbp),%rax
 284:	48 8d 14 00          	lea    (%rax,%rax,1),%rdx
 288:	48 b8 00 00 00 00 00 	movabs $0xffe0000000000000,%rax
 28f:	00 e0 ff 
 292:	48 39 c2             	cmp    %rax,%rdx
 295:	75 1e                	jne    2b5 <main+0xe3>
 297:	48 b8 00 00 00 00 00 	movabs $0x7ff0000000000000,%rax
 29e:	00 f0 7f 
 2a1:	48 39 45 d0          	cmp    %rax,-0x30(%rbp)
 2a5:	75 07                	jne    2ae <main+0xdc>
 2a7:	b8 01 00 00 00       	mov    $0x1,%eax
 2ac:	eb 0c                	jmp    2ba <main+0xe8>
 2ae:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
 2b3:	eb 05                	jmp    2ba <main+0xe8>
 2b5:	b8 00 00 00 00       	mov    $0x0,%eax
 2ba:	85 c0                	test   %eax,%eax
 2bc:	74 05                	je     2c3 <main+0xf1>


So will you publish results or not as it would show your builtins in
unfavorable ligth?


> > Which doesn't matter. As gcc optimized unneded checks away you won't do
> > unneeded checks. As using:
> > 
> > __builtin_fpclassify (FP_NAN, FP_INFINITE,           \
> >      FP_NORMAL, FP_SUBNORMAL, FP_ZERO, x),0);
> >   return result == FP_INFINITE || result == FP_NAN;
> > 
> > is slower than:
> > 
> >  return __builtin_isinf (x) ||  __builtin_isnan (x);
> > 
> > Your claim is false, run attached tf2 script to test.
> 
> That's not what I am seeing, using two explicit isinf/isnan calls (test2) is faster
> than inlined fpclassify (test1):
> 
> __fpclassify_test2_t:	1	4.41
> __fpclassify_test1_t:	1.23	4.66
> 
You need to run my benchmark. You get different results if thats inside
if or not so I want to also know my results.

Ondřej Bílka June 17, 2015, 4:37 p.m. UTC | #19

On Wed, Jun 17, 2015 at 06:34:33PM +0200, Ondřej Bílka wrote:
 > 
> > I tried, but I don't think this is a good benchmark - you're not measuring
> > the FP->int move for the branched version, and you're comparing the signed
> > version of isinf vs the builtin which does isinf_ns.
> > 
> Wilco that isinf is signed is again completely irrelevant. gcc is smart.
> It get expanded into 
> if (foo ? (bar ? 1 : -1) : 0)
> that gcc simplifies to
> if (foo)
> so it doesn't matter that checking sign would take 100 cycles as its
> deleted code.
> 
Forgot to add that you should use gcc-4.9 or higher. I know that 4.7
doesn't do that and don't remember if 4.8 does that.

Wilco June 17, 2015, 5:03 p.m. UTC | #20

> Joseph Myers wrote:
> On Tue, 16 Jun 2015, Wilco Dijkstra wrote:
> 
> > > Well, the benchmark should come first....
> >
> > I added a new math-inlines benchmark based on the string benchmark
> > infrastructure.
> 
> Thanks.  I await the patch submission.

See https://sourceware.org/ml/libc-alpha/2015-06/msg00569.html

> > So this clearly shows the GCC built-ins win by a huge margin, including the
> > inline versions. It also shows that multiple isinf/isnan calls would be faster
> 
> That's interesting information - suggesting that changes in GCC to use
> integer arithmetic should be conditional on -fsignaling-nans, if doing the
> operations by integer arithmetic is slower (at least on this processor).
> 
> (It also suggests it's safe to remove the existing glibc-internal inlines
> as part of moving to using the built-in functions when possible.)

Indeed. To check which sequence is better we'd need to write a better benchmark,
maybe base it on a GLIBC function which uses these functions in the hot path.

> > > > Codesize of what? Few applications use these functions... GLIBC mathlib is
> > >
> > > Size of any code calling these macros (for nonconstant arguments).
> >
> > Well the size of the __isinf_t function is 160 bytes vs isinf_t 84 bytes
> > due to the callee-save overhead of the function call. The builtin isinf uses
> > 3 instructions inside the loop plus 3 lifted before it, while the call to
> > __isinf needs 3 plus a lot of code to save/restore the callee-saves.
> 
> One might suppose that most functions using these macros contain other
> function calls as well, and so that the callee-save overhead should not be
> included in the comparison.

That may be true in some cases, but if you can tailcall (which might be possible
in several math veneers) then the callee-save savings would apply.

> When you exclude callee-save overhead, how do things compare for
> fpclassify (the main case where inlining may be questionable when
> optimizing for size)?

Well in the worst-case scenario where you need all 5 tests of fpclassify it 
effectively changes a single-instruction call into 16 instructions plus 2 double 
immediate. So it is best to use OPTIMIZE_SIZE for fpclassify for now and revisit
when the GCC implementation has been improved. I also wonder what the difference
would be once I've optimized the __fpclassify implementation - I can do it in
about 8-9 instructions. 

Wilco

Wilco June 17, 2015, 5:22 p.m. UTC | #21

> Ondřej Bílka wrote:
> On Wed, Jun 17, 2015 at 04:24:46PM +0100, Wilco Dijkstra wrote:
> > Even with the inlined fpclassify (inl2 below), isnormal is slower:
> >
> > __isnormal_inl2_t:	1.25	3.67
> > __isnormal_inl_t:	4.59	2.89
> > isnormal_t:	1	1
> >
> > So using a dedicated builtin for isnormal is important.
> >
> That makes result identical to one of isnan. That its slower is bug in
> fpclassify which should first check for normal, then do unlikely checks.

It's about twice as slow as isnan as the isnormal check isn't done efficiently.
Fpclassify is slower still as it does 3 comparisons before setting FP_NORMAL.

> > > > It's certainly correct, but obviously different microarchitectures will show
> > > > different results. Note the GLIBC private inlines are not particularly good.
> > > >
> > > No, problem is that different benchmarks show different results on same
> > > architecture. To speed things up run following to test all cases of
> > > environment. Run attached tf script to get results on arm.
> >
> > I tried, but I don't think this is a good benchmark - you're not measuring
> > the FP->int move for the branched version, and you're comparing the signed
> > version of isinf vs the builtin which does isinf_ns.
> >
> Wilco that isinf is signed is again completely irrelevant. gcc is smart.
> It get expanded into
> if (foo ? (bar ? 1 : -1) : 0)
> that gcc simplifies to
> if (foo)
> so it doesn't matter that checking sign would take 100 cycles as its
> deleted code.

It matters for the -DT3 test.

> Also as it could make branched version only slower when it beats builtin
> then also nonsigned one would beat builtin.
> 
> And do you have assembly to show it doesn't measure move or its just
> your guess? On x64 it definitely measures move and I could add that gcc
> messes that bit by moving several times. objdump -d
> on gcc   -DT2 -DBRANCHED  -DI1="__attribute((always_inline))" -
> DI2="__attribute__((always_inline))" ft.c   -c
> clearly shows that conversion is done several times.

Look at what the inner loop generates for T1 (T2/T3 do the same):

.L27:
        ldr     x2, [x1]
        cmp     x4, x2, lsl 1
        bne     .L29
        fadd    d0, d0, d1
.L29:
        add     x1, x1, 8
        cmp     x1, x3
        bne     .L27

> So will you publish results or not as it would show your builtins in
> unfavorable ligth?

I don't see the point until your benchmark measures the right thing. 
Note my benchmark carefully avoids this gotcha.

Wilco

Ondřej Bílka June 17, 2015, 6:51 p.m. UTC | #22

On Wed, Jun 17, 2015 at 06:22:24PM +0100, Wilco Dijkstra wrote:
> > Ondřej Bílka wrote:
> > On Wed, Jun 17, 2015 at 04:24:46PM +0100, Wilco Dijkstra wrote:
> > > Even with the inlined fpclassify (inl2 below), isnormal is slower:
> > >
> > > __isnormal_inl2_t:	1.25	3.67
> > > __isnormal_inl_t:	4.59	2.89
> > > isnormal_t:	1	1
> > >
> > > So using a dedicated builtin for isnormal is important.
> > >
> > That makes result identical to one of isnan. That its slower is bug in
> > fpclassify which should first check for normal, then do unlikely checks.
> 
> It's about twice as slow as isnan as the isnormal check isn't done efficiently.
> Fpclassify is slower still as it does 3 comparisons before setting FP_NORMAL.
>
Then how could you explain original data? I couldn't find how you did
determined its twice slower, when it takes 1.25 which is faster than isinf 1.28 
__isnormal_inl_t:       8.77    7.16
isnormal_t:             3.16    3.17
__isinf_inl_t:  1.92    2.99
__isinf_t:              8.9     6.17
isinf_t:                1.28    1.28
__isnan_inl_t:  1.91    1.92
__isnan_t:              15.28   15.28
isnan_t:                1       1.01


 
> > > > > It's certainly correct, but obviously different microarchitectures will show
> > > > > different results. Note the GLIBC private inlines are not particularly good.
> > > > >
> > > > No, problem is that different benchmarks show different results on same
> > > > architecture. To speed things up run following to test all cases of
> > > > environment. Run attached tf script to get results on arm.
> > >
> > > I tried, but I don't think this is a good benchmark - you're not measuring
> > > the FP->int move for the branched version, and you're comparing the signed
> > > version of isinf vs the builtin which does isinf_ns.
> > >
> > Wilco that isinf is signed is again completely irrelevant. gcc is smart.
> > It get expanded into
> > if (foo ? (bar ? 1 : -1) : 0)
> > that gcc simplifies to
> > if (foo)
> > so it doesn't matter that checking sign would take 100 cycles as its
> > deleted code.
> 
> It matters for the -DT3 test.
> 
Only by increasing code size, it should have zero performance impact.
 It still does first check if positive/negative infinity and jump skips bar check.

> > Also as it could make branched version only slower when it beats builtin
> > then also nonsigned one would beat builtin.
> > 
> > And do you have assembly to show it doesn't measure move or its just
> > your guess? On x64 it definitely measures move and I could add that gcc
> > messes that bit by moving several times. objdump -d
> > on gcc   -DT2 -DBRANCHED  -DI1="__attribute((always_inline))" -
> > DI2="__attribute__((always_inline))" ft.c   -c
> > clearly shows that conversion is done several times.
> 
> Look at what the inner loop generates for T1 (T2/T3 do the same):
> 
> .L27:
>         ldr     x2, [x1]
>         cmp     x4, x2, lsl 1
>         bne     .L29
>         fadd    d0, d0, d1
> .L29:
>         add     x1, x1, 8
>         cmp     x1, x3
>         bne     .L27
> 
> > So will you publish results or not as it would show your builtins in
> > unfavorable ligth?
> 
> I don't see the point until your benchmark measures the right thing. 
> Note my benchmark carefully avoids this gotcha.
>
Looks like that arm gcc could optimize better than x64. I did addition
to force value into floating register. Then I needed to add volatile to
force gcc emit better code. Best would be use assembly but that makes
benchmark platform specific. I don't know if dropping volatile would
help and don't want to add arm assembly without testing it so add arm
assembly if gcc would also produce suboptimal code.

That changes make different numbers on x64 which are following:

don't inline
conditional add
branched

real	0m1.080s
user	0m1.079s
sys	0m0.000s
builtin

real	0m1.079s
user	0m1.075s
sys	0m0.003s
branch
branched

real	0m1.031s
user	0m1.031s
sys	0m0.000s
builtin

real	0m0.848s
user	0m0.848s
sys	0m0.000s
sum
branched

real	0m1.155s
user	0m1.154s
sys	0m0.000s
builtin

real	0m1.003s
user	0m1.002s
sys	0m0.000s
inline outer call
conditional add
branched

real	0m0.618s
user	0m0.617s
sys	0m0.000s
builtin

real	0m0.771s
user	0m0.771s
sys	0m0.000s
branch
branched

real	0m0.618s
user	0m0.617s
sys	0m0.000s
builtin

real	0m0.618s
user	0m0.617s
sys	0m0.000s
sum
branched

real	0m0.693s
user	0m0.689s
sys	0m0.003s
builtin

real	0m0.694s
user	0m0.693s
sys	0m0.000s
inline inner call
conditional add
branched

real	0m0.800s
user	0m0.800s
sys	0m0.000s
builtin

real	0m0.694s
user	0m0.694s
sys	0m0.000s
branch
branched

real	0m0.618s
user	0m0.614s
sys	0m0.003s
builtin

real	0m0.618s
user	0m0.617s
sys	0m0.000s
sum
branched

real	0m0.695s
user	0m0.694s
sys	0m0.000s
builtin

real	0m0.747s
user	0m0.747s
sys	0m0.000s
tigth loop
conditional add
branched

real	0m0.227s
user	0m0.226s
sys	0m0.000s
builtin

real	0m0.255s
user	0m0.254s
sys	0m0.000s
branch
branched

real	0m0.225s
user	0m0.224s
sys	0m0.000s
builtin

real	0m0.234s
user	0m0.233s
sys	0m0.000s
sum
branched

real	0m0.270s
user	0m0.269s
sys	0m0.000s
builtin

real	0m0.391s
user	0m0.390s
sys	0m0.000s
#include <stdlib.h>
#include <stdio.h>
//#include <math.h>
#undef isinf2
#ifdef BRANCHED
int I2
isinf2 (double dx)
{
  union u {
    double d;
    long l;
  };
  volatile union u u;
  u.d = dx;
  long x = u.l;
  return 2 * x == 0xffe0000000000000 ? (x == 0x7ff0000000000000 ? 1 :
-1) : 0;
}
#else
 int I2
isinf2(double x)
{
  return __builtin_isinf (x);
}
#endif

void I1
t1(double x,double *d)
{
  if (isinf2 (x))
   *d+=42;
}
void I1
t2(double x,double *d)
{
  if (isinf2 (x))
    abort();
}
void I1
t3(double x,double *d)
{
  *d += isinf2 (x);
}




int main()
{
  double ret= 0.0;
  int i, j;
  double *d = malloc (800000);
  for (i=0;i<1000;i++)
    d[i]=4.2;//(rand() % 2) ? 0.0 : (rand() % 2 ? 1.3 : 1.0/0.0);

  for (j=0; j<300000; j++)
#ifdef T1
  for (i=0; i<1000; i++)
    t1(d[i]+1.3, &ret);
#endif
#ifdef T2
  for (i=0; i<1000; i++)
    t2(d[i]+1.3, &ret);
#endif
#ifdef T3
  for (i=0; i<1000; i++)
    t3(d[i]+1.3, &ret);
#endif
  
  return ret;
}
test()
{
gcc -O3 -DBRANCHED -DI1=$1 -DI2=$2 -DT$3 ft.c 2>/dev/null
echo branched
time ./a.out
gcc -O3  -DI1=$1 -DI2=$2 -DT$3 ft.c 2>/dev/null
echo builtin
time ./a.out
}
test2()
{
  echo conditional add
  test $1 $2 1
  echo branch
  test $1 $2 2
  echo sum
  test $1 $2 3

}
echo "don't inline"
test2 '__attribute__((noinline))' '__attribute__((noinline))'
echo inline outer call
test2 '__attribute__((noinline))' '__attribute__((always_inline))'
echo inline inner call
test2 '__attribute__((always_inline))' '__attribute__((noinline))'
echo tigth loop
test2 '__attribute__((always_inline))' '__attribute__((always_inline))'

Ondřej Bílka July 3, 2015, 8:40 a.m. UTC | #23

On Tue, Jun 16, 2015 at 03:54:35PM -0400, Carlos O'Donell wrote:
> On 06/16/2015 09:43 AM, Ondřej Bílka wrote:
> >> So to make this proposal to move forward, how exactly do you propose to
> >> create a benchtest for such scenario? I get this is tricky and a lot of
> >> variables may apply, but I do agree with Joseph that we shouldn't quite
> >> aim for optimal performance, imho using compiler builtins with reasonable
> >> performance is a gain in code maintainability.
> >>
> > As I said before about these they are hard to measure and I could
> > argue also versus my benchmark that its inaccurate as it doesn't measure
> > effect of cpu pipeline when function does other computation. Answer is
> > don't do microbenchmark.
> 
> That's not an answer, an answer is "Here's a patch to extend the libm testing
> to show how isinf/isnan/signbit/isfinite/isnormal/fpclassify impact performance."
>
No its answer as it isn't my responsibility to provide benchmark to
convince that change is desirable but submitters. As I said before he should 
for example  add catan to benchtest, measure difference and report that.
If necessary increase iteration count to catch difference. Its
neccessary anyway if we want to measure microoptimizations that improve
performance with several cycles.


> I agree that microbenchmarks can be misleading if interpreted by automated
> systems, but we aren't talking about that yet, we are talking about experts
> using these tools to discuss patches in an objective fashion.
> 
No they will tell you exacty same argument as I said to explain why what
you do want is impossible.


Carlos you talk lot about deciding objectively but when I ask you out
its never done. So I will ask you again to decide based on my previous
benchmark. There sometimes builtin is 20% faster and sometimes a current
inline is 20% faster. How do you imagine that experts would decide
solely on that instead of telling you that its inconclusive and you need
to do real world measurements or that benchmark is flawed because X?

don't inline
conditional add
branched

real	0m1.313s
user	0m1.312s
sys	0m0.000s
builtin

real	0m1.309s
user	0m1.308s
sys	0m0.000s
branch
branched

real	0m1.310s
user	0m1.308s
sys	0m0.000s
builtin

real	0m1.337s
user	0m1.312s
sys	0m0.004s
sum
branched

real	0m1.209s
user	0m1.204s
sys	0m0.000s
builtin

real	0m1.216s
user	0m1.212s
sys	0m0.000s
inline outer call
conditional add
branched

real	0m0.705s
user	0m0.704s
sys	0m0.000s
builtin

real	0m0.916s
user	0m0.916s
sys	0m0.000s
branch
branched

real	0m0.806s
user	0m0.804s
sys	0m0.000s
builtin

real	0m0.721s
user	0m0.716s
sys	0m0.000s
sum
branched

real	0m1.029s
user	0m1.028s
sys	0m0.000s
builtin

real	0m0.911s
user	0m0.908s
sys	0m0.000s
inline inner call
conditional add
branched

real	0m1.038s
user	0m1.032s
sys	0m0.000s
builtin

real	0m1.024s
user	0m1.016s
sys	0m0.000s
branch
branched

real	0m0.614s
user	0m0.608s
sys	0m0.000s
builtin

real	0m0.606s
user	0m0.608s
sys	0m0.000s
sum
branched

real	0m0.662s
user	0m0.660s
sys	0m0.000s
builtin

real	0m0.629s
user	0m0.628s
sys	0m0.000s
tigth loop
conditional add
branched

real	0m0.208s
user	0m0.208s
sys	0m0.000s
builtin

real	0m0.326s
user	0m0.324s
sys	0m0.000s
branch
branched

real	0m0.204s
user	0m0.200s
sys	0m0.000s
builtin

real	0m0.325s
user	0m0.324s
sys	0m0.000s
sum
branched

real	0m0.328s
user	0m0.332s
sys	0m0.000s
builtin

real	0m0.486s
user	0m0.484s
sys	0m0.000s

Carlos O'Donell July 3, 2015, 2:28 p.m. UTC | #24

On 07/03/2015 04:40 AM, Ondřej Bílka wrote:
> On Tue, Jun 16, 2015 at 03:54:35PM -0400, Carlos O'Donell wrote:
>> On 06/16/2015 09:43 AM, Ondřej Bílka wrote:
>>>> So to make this proposal to move forward, how exactly do you propose to
>>>> create a benchtest for such scenario? I get this is tricky and a lot of
>>>> variables may apply, but I do agree with Joseph that we shouldn't quite
>>>> aim for optimal performance, imho using compiler builtins with reasonable
>>>> performance is a gain in code maintainability.
>>>>
>>> As I said before about these they are hard to measure and I could
>>> argue also versus my benchmark that its inaccurate as it doesn't measure
>>> effect of cpu pipeline when function does other computation. Answer is
>>> don't do microbenchmark.
>>
>> That's not an answer, an answer is "Here's a patch to extend the libm testing
>> to show how isinf/isnan/signbit/isfinite/isnormal/fpclassify impact performance."
>>
> No its answer as it isn't my responsibility to provide benchmark to
> convince that change is desirable but submitters. As I said before he should 
> for example  add catan to benchtest, measure difference and report that.
> If necessary increase iteration count to catch difference. Its
> neccessary anyway if we want to measure microoptimizations that improve
> performance with several cycles.

I never said it was your responsibility. It is the responsibility of the person
submitting the patch to provide an objective description of how they verified
the performance gains. It has become standard practice to recommend the author
of the patch contribute a microbenchmark, but it need not be a microbenchmark.
The author does need to provide sufficient performance justification including
the methods used to measure the performance gains to ensure the results can
be reproduced.

In this case if you believe catan can be used to test the performance in this
patch, please suggest this to Wilco. However, I will not accept performance
patches without justification for the performance gains.

> Carlos you talk lot about deciding objectively but when I ask you out
> its never done. So I will ask you again to decide based on my previous
> benchmark. There sometimes builtin is 20% faster and sometimes a current
> inline is 20% faster. How do you imagine that experts would decide
> solely on that instead of telling you that its inconclusive and you need
> to do real world measurements or that benchmark is flawed because X?

My apologies if I failed to do something you requested.

I'm not interested in abstract examples, I'm interested in the patch being
submitted by ARM. I will continue this discussion on the downstream thread
that includes the microbenchmark written by Wilco.

In general I expect there are certainly classes of performance problems
that have inconclusive results. In which case I will reject that patch until
you tell me how you measure the performance gains and what you expect the
performance gains to be on average.

Cheers,
Carlos.

Ondřej Bílka July 9, 2015, 3:02 p.m. UTC | #25

On Fri, Jul 03, 2015 at 10:28:56AM -0400, Carlos O'Donell wrote:
> On 07/03/2015 04:40 AM, Ondřej Bílka wrote:
> > On Tue, Jun 16, 2015 at 03:54:35PM -0400, Carlos O'Donell wrote:
> >> On 06/16/2015 09:43 AM, Ondřej Bílka wrote:
> >>>> So to make this proposal to move forward, how exactly do you propose to
> >>>> create a benchtest for such scenario? I get this is tricky and a lot of
> >>>> variables may apply, but I do agree with Joseph that we shouldn't quite
> >>>> aim for optimal performance, imho using compiler builtins with reasonable
> >>>> performance is a gain in code maintainability.
> >>>>
> >>> As I said before about these they are hard to measure and I could
> >>> argue also versus my benchmark that its inaccurate as it doesn't measure
> >>> effect of cpu pipeline when function does other computation. Answer is
> >>> don't do microbenchmark.
> >>
> >> That's not an answer, an answer is "Here's a patch to extend the libm testing
> >> to show how isinf/isnan/signbit/isfinite/isnormal/fpclassify impact performance."
> >>
> > No its answer as it isn't my responsibility to provide benchmark to
> > convince that change is desirable but submitters. As I said before he should 
> > for example  add catan to benchtest, measure difference and report that.
> > If necessary increase iteration count to catch difference. Its
> > neccessary anyway if we want to measure microoptimizations that improve
> > performance with several cycles.
> 
> I never said it was your responsibility. It is the responsibility of the person
> submitting the patch to provide an objective description of how they verified
> the performance gains. It has become standard practice to recommend the author
> of the patch contribute a microbenchmark, but it need not be a microbenchmark.
> The author does need to provide sufficient performance justification including
> the methods used to measure the performance gains to ensure the results can
> be reproduced.
>
Correct. Here there are plenty of things that could go wrong so I need
to point that out.

Only benchmark where I would be certain with result would be take one of
numerical applications where is* is bottleneck that were mentioned
before and measure performance before and after.
 
> In this case if you believe catan can be used to test the performance in this
> patch, please suggest this to Wilco. However, I will not accept performance
> patches without justification for the performance gains.
>
Already wrote about that to get more precise answer.
  
> > Carlos you talk lot about deciding objectively but when I ask you out
> > its never done. So I will ask you again to decide based on my previous
> > benchmark. There sometimes builtin is 20% faster and sometimes a current
> > inline is 20% faster. How do you imagine that experts would decide
> > solely on that instead of telling you that its inconclusive and you need
> > to do real world measurements or that benchmark is flawed because X?
> 
> My apologies if I failed to do something you requested.
>  
> I'm not interested in abstract examples, I'm interested in the patch being
> submitted by ARM. I will continue this discussion on the downstream thread
> that includes the microbenchmark written by Wilco.
> 
> In general I expect there are certainly classes of performance problems
> that have inconclusive results. In which case I will reject that patch until
> you tell me how you measure the performance gains and what you expect the
> performance gains to be on average.
> 
Problem is that most functions have inconclusive results as we use mined
cases where its possible and must use assumptions about input distribution to get speedup.
Simplest example is that in string functions a we assume that 64 bytes
cross page boundary only rarely and list of these assumptions goes on
and on. Then when microbenchmark violates one of these as its simplistic
truth is if that gives real speedup of programs not what microbenchmark
shows.

Carlos O'Donell July 9, 2015, 3:35 p.m. UTC | #26

On 07/09/2015 11:02 AM, Ondřej Bílka wrote:
>> I never said it was your responsibility. It is the responsibility of the person
>> submitting the patch to provide an objective description of how they verified
>> the performance gains. It has become standard practice to recommend the author
>> of the patch contribute a microbenchmark, but it need not be a microbenchmark.
>> The author does need to provide sufficient performance justification including
>> the methods used to measure the performance gains to ensure the results can
>> be reproduced.
>>
> Correct. Here there are plenty of things that could go wrong so I need
> to point that out.

Agreed. I would like to see us create a "Measuring Performance" wiki page
with what the community expects. I think you would have a lot to contribute
to such a document.

> Only benchmark where I would be certain with result would be take one of
> numerical applications where is* is bottleneck that were mentioned
> before and measure performance before and after.

That is a good recommendation. When I worked in HPC the common phrase was
"the only benchmark that matters is your application." :-)

In glibc this is harder because we are looking for average-case performance
across all applications. In some ways the optimization space will get both
harder and easier as Siddhesh works on tunnables. Applications will be able
to adjust runtime behaviour for performance. However, untuned general performance
will still be just as difficult to optimze for. And then we get to talk about
auto-tunning algorithms, and tunnings for given workloads.

>> In this case if you believe catan can be used to test the performance in this
>> patch, please suggest this to Wilco. However, I will not accept performance
>> patches without justification for the performance gains.
>>
> Already wrote about that to get more precise answer.

Thanks.

>>> Carlos you talk lot about deciding objectively but when I ask you out
>>> its never done. So I will ask you again to decide based on my previous
>>> benchmark. There sometimes builtin is 20% faster and sometimes a current
>>> inline is 20% faster. How do you imagine that experts would decide
>>> solely on that instead of telling you that its inconclusive and you need
>>> to do real world measurements or that benchmark is flawed because X?
>>
>> My apologies if I failed to do something you requested.
>>  
>> I'm not interested in abstract examples, I'm interested in the patch being
>> submitted by ARM. I will continue this discussion on the downstream thread
>> that includes the microbenchmark written by Wilco.
>>
>> In general I expect there are certainly classes of performance problems
>> that have inconclusive results. In which case I will reject that patch until
>> you tell me how you measure the performance gains and what you expect the
>> performance gains to be on average.
>>
> Problem is that most functions have inconclusive results as we use mined
> cases where its possible and must use assumptions about input distribution to get speedup.
> Simplest example is that in string functions a we assume that 64 bytes
> cross page boundary only rarely and list of these assumptions goes on
> and on. Then when microbenchmark violates one of these as its simplistic
> truth is if that gives real speedup of programs not what microbenchmark
> shows.

I don't follow. Could you expand on your thought there?

Cheers,
Carlos.

Inline C99 math functions

Commit Message

Comments

Patch