Message ID  20171205130832.GA3696@intel.com 

State  New 
Headers  show 
Series 

Related  show 
On Tue, 5 Dec 2017, H.J. Lu wrote: > diff git a/sysdeps/generic/math_private.h b/sysdeps/generic/math_private.h > index f29898c19c..a2cdce5b6a 100644 >  a/sysdeps/generic/math_private.h > +++ b/sysdeps/generic/math_private.h > @@ 184,6 +184,14 @@ do { \ > } while (0) > #endif > > +#ifndef FLOOR_DOUBLE_TO_INT > +# define FLOOR_DOUBLE_TO_INT(x) ((int) __floor (x)) > +#endif > + > +#ifndef FLOOR_INT_TO_DOUBLE_HALF > +# define FLOOR_INT_TO_DOUBLE_HALF(x) __floor ((x) / 2.0) > +#endif These need comments defining their semantics (argument types and ranges, return values and types). The existing code is using unsigned int. You're using int here for the return type of FLOOR_DOUBLE_TO_INT but still converting to unsigned int in the caller. (The semantics in the caller do in fact include that the argument is positive but in the range of int.) I'd be surprised if the FLOOR_INT_TO_DOUBLE_HALF definition dividing by 2.0 and calling floor, as opposed to dividing by integer 2 and converting the result to double (which is just a right shift, given the argument is unsigned) is optimal anywhere, unless the compiler optimizes it into an integer shift anyway. Thus, it would make sense just to use "n / 2" unconditionally in place of "__floor (n / 2.0)". Similarly, I'd expect a direct conversion to int (without a call to __floor) to be at least as good as the present code for FLOOR_DOUBLE_TO_INT everywhere, given the range constraints that make it OK to do so. So I think the natural change would not need to add any new macros, but would change s_sinf.c unconditionally to do essentially what your patch does in the x86_64 case. As the original patch of Rajalakshmi's was benchmarked on s390, you could ask her to confirm such a change makes performance no worse there (and possibly improves it).
On Tue, Dec 5, 2017 at 5:32 AM, Joseph Myers <joseph@codesourcery.com> wrote: > On Tue, 5 Dec 2017, H.J. Lu wrote: > >> diff git a/sysdeps/generic/math_private.h b/sysdeps/generic/math_private.h >> index f29898c19c..a2cdce5b6a 100644 >>  a/sysdeps/generic/math_private.h >> +++ b/sysdeps/generic/math_private.h >> @@ 184,6 +184,14 @@ do { \ >> } while (0) >> #endif >> >> +#ifndef FLOOR_DOUBLE_TO_INT >> +# define FLOOR_DOUBLE_TO_INT(x) ((int) __floor (x)) >> +#endif >> + >> +#ifndef FLOOR_INT_TO_DOUBLE_HALF >> +# define FLOOR_INT_TO_DOUBLE_HALF(x) __floor ((x) / 2.0) >> +#endif > > These need comments defining their semantics (argument types and ranges, > return values and types). > > The existing code is using unsigned int. You're using int here for the > return type of FLOOR_DOUBLE_TO_INT but still converting to unsigned int in > the caller. (The semantics in the caller do in fact include that the > argument is positive but in the range of int.) > > I'd be surprised if the FLOOR_INT_TO_DOUBLE_HALF definition dividing by > 2.0 and calling floor, as opposed to dividing by integer 2 and converting > the result to double (which is just a right shift, given the argument is > unsigned) is optimal anywhere, unless the compiler optimizes it into an > integer shift anyway. Thus, it would make sense just to use "n / 2" > unconditionally in place of "__floor (n / 2.0)". > > Similarly, I'd expect a direct conversion to int (without a call to > __floor) to be at least as good as the present code for > FLOOR_DOUBLE_TO_INT everywhere, given the range constraints that make it > OK to do so. > > So I think the natural change would not need to add any new macros, but > would change s_sinf.c unconditionally to do essentially what your patch > does in the x86_64 case. As the original patch of Rajalakshmi's was > benchmarked on s390, you could ask her to confirm such a change makes > performance no worse there (and possibly improves it). Here is the patch to replace floor with simple casts. It improves performance of max and mean by 30% on x8664. Rajalakshmi, can you try it on s390? Thanks.
On 12/05/2017 07:29 PM, H.J. Lu wrote: > On Tue, Dec 5, 2017 at 5:32 AM, Joseph Myers<joseph@codesourcery.com> wrote: >> On Tue, 5 Dec 2017, H.J. Lu wrote: >> >>> diff git a/sysdeps/generic/math_private.h b/sysdeps/generic/math_private.h >>> index f29898c19c..a2cdce5b6a 100644 >>>  a/sysdeps/generic/math_private.h >>> +++ b/sysdeps/generic/math_private.h >>> @@ 184,6 +184,14 @@ do { \ >>> } while (0) >>> #endif >>> >>> +#ifndef FLOOR_DOUBLE_TO_INT >>> +# define FLOOR_DOUBLE_TO_INT(x) ((int) __floor (x)) >>> +#endif >>> + >>> +#ifndef FLOOR_INT_TO_DOUBLE_HALF >>> +# define FLOOR_INT_TO_DOUBLE_HALF(x) __floor ((x) / 2.0) >>> +#endif >> These need comments defining their semantics (argument types and ranges, >> return values and types). >> >> The existing code is using unsigned int. You're using int here for the >> return type of FLOOR_DOUBLE_TO_INT but still converting to unsigned int in >> the caller. (The semantics in the caller do in fact include that the >> argument is positive but in the range of int.) >> >> I'd be surprised if the FLOOR_INT_TO_DOUBLE_HALF definition dividing by >> 2.0 and calling floor, as opposed to dividing by integer 2 and converting >> the result to double (which is just a right shift, given the argument is >> unsigned) is optimal anywhere, unless the compiler optimizes it into an >> integer shift anyway. Thus, it would make sense just to use "n / 2" >> unconditionally in place of "__floor (n / 2.0)". >> >> Similarly, I'd expect a direct conversion to int (without a call to >> __floor) to be at least as good as the present code for >> FLOOR_DOUBLE_TO_INT everywhere, given the range constraints that make it >> OK to do so. >> >> So I think the natural change would not need to add any new macros, but >> would change s_sinf.c unconditionally to do essentially what your patch >> does in the x86_64 case. As the original patch of Rajalakshmi's was >> benchmarked on s390, you could ask her to confirm such a change makes >> performance no worse there (and possibly improves it). > Here is the patch to replace floor with simple casts. It improves performance > of max and mean by 30% on x8664. Rajalakshmi, can you try it on s390? Mean value reduced from 22.0 to 20.0 with this patch on s390x.
On Tue, Dec 5, 2017 at 6:29 AM, Rajalakshmi Srinivasaraghavan <raji@linux.vnet.ibm.com> wrote: > > > On 12/05/2017 07:29 PM, H.J. Lu wrote: >> >> On Tue, Dec 5, 2017 at 5:32 AM, Joseph Myers<joseph@codesourcery.com> >> wrote: >>> >>> On Tue, 5 Dec 2017, H.J. Lu wrote: >>> >>>> diff git a/sysdeps/generic/math_private.h >>>> b/sysdeps/generic/math_private.h >>>> index f29898c19c..a2cdce5b6a 100644 >>>>  a/sysdeps/generic/math_private.h >>>> +++ b/sysdeps/generic/math_private.h >>>> @@ 184,6 +184,14 @@ do { >>>> \ >>>> } while (0) >>>> #endif >>>> >>>> +#ifndef FLOOR_DOUBLE_TO_INT >>>> +# define FLOOR_DOUBLE_TO_INT(x) ((int) __floor (x)) >>>> +#endif >>>> + >>>> +#ifndef FLOOR_INT_TO_DOUBLE_HALF >>>> +# define FLOOR_INT_TO_DOUBLE_HALF(x) __floor ((x) / 2.0) >>>> +#endif >>> >>> These need comments defining their semantics (argument types and ranges, >>> return values and types). >>> >>> The existing code is using unsigned int. You're using int here for the >>> return type of FLOOR_DOUBLE_TO_INT but still converting to unsigned int >>> in >>> the caller. (The semantics in the caller do in fact include that the >>> argument is positive but in the range of int.) >>> >>> I'd be surprised if the FLOOR_INT_TO_DOUBLE_HALF definition dividing by >>> 2.0 and calling floor, as opposed to dividing by integer 2 and converting >>> the result to double (which is just a right shift, given the argument is >>> unsigned) is optimal anywhere, unless the compiler optimizes it into an >>> integer shift anyway. Thus, it would make sense just to use "n / 2" >>> unconditionally in place of "__floor (n / 2.0)". >>> >>> Similarly, I'd expect a direct conversion to int (without a call to >>> __floor) to be at least as good as the present code for >>> FLOOR_DOUBLE_TO_INT everywhere, given the range constraints that make it >>> OK to do so. >>> >>> So I think the natural change would not need to add any new macros, but >>> would change s_sinf.c unconditionally to do essentially what your patch >>> does in the x86_64 case. As the original patch of Rajalakshmi's was >>> benchmarked on s390, you could ask her to confirm such a change makes >>> performance no worse there (and possibly improves it). >> >> Here is the patch to replace floor with simple casts. It improves >> performance >> of max and mean by 30% on x8664. Rajalakshmi, can you try it on s390? > > > Mean value reduced from 22.0 to 20.0 with this patch on s390x. > Any objections/comenrts for my patch: https://sourceware.org/ml/libcalpha/201712/msg00121.html to master? Thanks.
On Tue, 5 Dec 2017, Rajalakshmi Srinivasaraghavan wrote: > > Here is the patch to replace floor with simple casts. It improves > > performance > > of max and mean by 30% on x8664. Rajalakshmi, can you try it on s390? > > Mean value reduced from 22.0 to 20.0 with this patch on s390x. Thanks. The patch is OK.
diff git a/sysdeps/generic/math_private.h b/sysdeps/generic/math_private.h index f29898c19c..a2cdce5b6a 100644  a/sysdeps/generic/math_private.h +++ b/sysdeps/generic/math_private.h @@ 184,6 +184,14 @@ do { \ } while (0) #endif +#ifndef FLOOR_DOUBLE_TO_INT +# define FLOOR_DOUBLE_TO_INT(x) ((int) __floor (x)) +#endif + +#ifndef FLOOR_INT_TO_DOUBLE_HALF +# define FLOOR_INT_TO_DOUBLE_HALF(x) __floor ((x) / 2.0) +#endif + /* We need to guarantee an expansion of name when building ldbl128 files as another type (e.g _Float128). */ #define mathx_hidden_def(name) hidden_def(name) diff git a/sysdeps/ieee754/flt32/s_sinf.c b/sysdeps/ieee754/flt32/s_sinf.c index 40d3d197a8..13a49ceb1b 100644  a/sysdeps/ieee754/flt32/s_sinf.c +++ b/sysdeps/ieee754/flt32/s_sinf.c @@ 85,8 +85,8 @@ static const int ones[] = { +1, 1 }; SIGNBIT is used to add the correct sign after the Chebyshev polynomial is computed. */ static inline float reduced (const double theta, const unsigned long int n,  const unsigned long int signbit) +reduced (const double theta, const unsigned int n, + const unsigned int signbit) { double sx; const double theta2 = theta * theta; @@ 162,14 +162,14 @@ SINF_FUNC (float x) } else /* x >= Pi/4. */ {  unsigned long int signbit = (x < 0); + unsigned int signbit = (x < 0); if (abstheta < 9 * M_PI_4) /* x < 9*Pi/4. */ { /* There are cases where FE_UPWARD rounding mode can produce a result of abstheta * inv_PI_4 == 9, where abstheta < 9pi/4, so the domain for pio2_table must go to 5 (9 / 2 + 1). */  unsigned long int n = (abstheta * inv_PI_4) + 1; + unsigned int n = (abstheta * inv_PI_4) + 1; theta = abstheta  pio2_table[n / 2]; return reduced (theta, n, signbit); } @@ 177,8 +177,9 @@ SINF_FUNC (float x) { if (abstheta < 0x1p+23) /* x < 2^23. */ {  unsigned long int n = __floor (abstheta * inv_PI_4) + 1.0;  double x = __floor (n / 2.0); + unsigned int n + = FLOOR_DOUBLE_TO_INT (abstheta * inv_PI_4) + 1.0; + double x = FLOOR_INT_TO_DOUBLE_HALF (n); theta = x * PI_2_lo + (x * PI_2_hi + abstheta); /* Argument reduction needed. */ return reduced (theta, n, signbit); diff git a/sysdeps/x86_64/fpu/math_private.h b/sysdeps/x86_64/fpu/math_private.h index 027a6a3a4d..738897c9c6 100644  a/sysdeps/x86_64/fpu/math_private.h +++ b/sysdeps/x86_64/fpu/math_private.h @@ 45,6 +45,23 @@ f = f__; \ } while (0) +extern inline int +__floor_double_to_int (double x) +{ + return x; +} + +#define FLOOR_DOUBLE_TO_INT(x) __floor_double_to_int (x) + +extern inline double +__floor_int_to_double_half (int x) +{ + x /= 2; + return x; +} + +#define FLOOR_INT_TO_DOUBLE_HALF(x) __floor_int_to_double_half (x) + #include <sysdeps/i386/fpu/fenv_private.h> #include_next <math_private.h>