Patchwork [12/14] VSX Stage 4: Add Scalar SP Fused Multiply-Adds

Submitter: Tom Musta
Date: Nov. 6, 2013, 8:31 p.m.
Message ID: <1383769916-5582-13-git-send-email-tommusta@gmail.com>
Permalink: /patch/289008/
State: New

Comments

Tom Musta - Nov. 6, 2013, 8:31 p.m.
This patch adds the Single Precision VSX Scalar Fused Multiply-Add
instructions: xsmaddasp, xsmaddmsp, xsmsubasp, xsmsubmsp, xsnmaddasp,
xsnmaddmsp, xsnmsubasp, xsnmsubmsp.

The existing VSX_MADD() macro is modified to support rounding of the
intermediate double precision result to single precision.

Signed-off-by: Tom Musta <tommusta@gmail.com>
---
 target-ppc/fpu_helper.c |   70 +++++++++++++++++++++++++++++------------------
 target-ppc/helper.h     |    8 +++++
 target-ppc/translate.c  |   16 +++++++++++
 3 files changed, 67 insertions(+), 27 deletions(-)
Richard Henderson - Nov. 7, 2013, 11:28 p.m.
On 11/07/2013 06:31 AM, Tom Musta wrote:
>          }                                                                     \
> +                                                                              \
> +        if (r2sp) {                                                           \
> +            float32 tmp32 = float64_to_float32(xt_out.fld[i],                 \
> +                                               &env->fp_status);              \
> +            xt_out.fld[i] = float32_to_float64(tmp32, &env->fp_status);       \
> +        }                                                                     \
> +                                                                              \

You can't get correct results for a single-precision fma from a
double-precision fma and merely rounding the results.

See e.g. glibc's sysdeps/ieee754/dbl-64/s_fmaf.c.


r~
Richard Henderson - Nov. 7, 2013, 11:30 p.m.
On 11/08/2013 09:28 AM, Richard Henderson wrote:
> On 11/07/2013 06:31 AM, Tom Musta wrote:
>>          }                                                                     \
>> +                                                                              \
>> +        if (r2sp) {                                                           \
>> +            float32 tmp32 = float64_to_float32(xt_out.fld[i],                 \
>> +                                               &env->fp_status);              \
>> +            xt_out.fld[i] = float32_to_float64(tmp32, &env->fp_status);       \
>> +        }                                                                     \
>> +                                                                              \
> 
> You can't get correct results for a single-precision fma from a
> double-precision fma and merely rounding the results.
> 
> See e.g. glibc's sysdeps/ieee754/dbl-64/s_fmaf.c.

Blah, nevermind.  That would be using separate add+mul in double-precision, not
using a double-precision fma primitive.


r~
Richard Henderson - Nov. 8, 2013, 12:13 a.m.
On 11/08/2013 09:30 AM, Richard Henderson wrote:
> On 11/08/2013 09:28 AM, Richard Henderson wrote:
>> On 11/07/2013 06:31 AM, Tom Musta wrote:
>>>          }                                                                     \
>>> +                                                                              \
>>> +        if (r2sp) {                                                           \
>>> +            float32 tmp32 = float64_to_float32(xt_out.fld[i],                 \
>>> +                                               &env->fp_status);              \
>>> +            xt_out.fld[i] = float32_to_float64(tmp32, &env->fp_status);       \
>>> +        }                                                                     \
>>> +                                                                              \
>>
>> You can't get correct results for a single-precision fma from a
>> double-precision fma and merely rounding the results.
>>
>> See e.g. glibc's sysdeps/ieee754/dbl-64/s_fmaf.c.
> 
> Blah, nevermind.  That would be using separate add+mul in double-precision, not
> using a double-precision fma primitive.

Hmph.  I was right the first time.  See

> http://www.exploringbinary.com/double-rounding-errors-in-floating-point-conversions/

for example inputs that suffer from double-rounding.

What's needed in each of the examples is an infinite-precision value containing
55 bits.  This is easy to accomplish with fma.

Two 23-bit inputs can create a product with 46 significant bits.  One can
append 23 more significant bits by choosing an exponent for the addend that
does not overlap the product.  Thus one can create (almost) every intermediate
result with up to 69 consecutive bits (the exception being products without
factors that can be represented in 23 bits).

I'm too lazy to decompose the examples therein to actual single-precision
inputs, but I'm certain it can be done.

Thus you *do* need the round-to-zero plus inexact solution that glibc uses.
(Or to perform the fma in 128 bits and round once, but I think that would be
way more intrusive wrt the rest of the code, and more expensive than necessary.)


r~
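

(For reference, the round-to-zero plus inexact trick amounts to rounding
the intermediate double result "to odd" before the final narrowing.  Below
is a minimal C sketch of the idea, loosely modeled on glibc's
sysdeps/ieee754/dbl-64/s_fmaf.c -- simplified, with the function name
chosen here for illustration, and ignoring the overflow/underflow corner
cases the real glibc code handles.)

#include <fenv.h>
#include <stdint.h>
#include <string.h>

#pragma STDC FENV_ACCESS ON

/* Sketch of a correctly-rounding fmaf() built from double arithmetic.
   Assumes the compiler does not contract the mul and add into a hardware
   fma (e.g. build with -ffp-contract=off). */
static float fmaf_via_double(float x, float y, float z)
{
    /* The product of two floats is exact in double (24+24 <= 53 bits). */
    double p = (double)x * (double)y;

    int save = fegetround();
    fesetround(FE_TOWARDZERO);
    feclearexcept(FE_INEXACT);

    double s = p + (double)z;                  /* truncated sum */
    int inexact = fetestexcept(FE_INEXACT) != 0;

    fesetround(save);

    if (inexact) {
        /* "Round to odd": set the low significand bit so the discarded
           sticky bits survive the final narrowing rounding. */
        uint64_t bits;
        memcpy(&bits, &s, sizeof(bits));
        bits |= 1;
        memcpy(&s, &bits, sizeof(s));
    }
    /* The only value-changing rounding: double -> float, done once. */
    return (float)s;
}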
Tom Musta - Nov. 13, 2013, 8:49 p.m.
On 11/7/2013 6:13 PM, Richard Henderson wrote:
> On 11/08/2013 09:30 AM, Richard Henderson wrote:
>> On 11/08/2013 09:28 AM, Richard Henderson wrote:
>>> On 11/07/2013 06:31 AM, Tom Musta wrote:
>>>>          }                                                                     \
>>>> +                                                                              \
>>>> +        if (r2sp) {                                                           \
>>>> +            float32 tmp32 = float64_to_float32(xt_out.fld[i],                 \
>>>> +                                               &env->fp_status);              \
>>>> +            xt_out.fld[i] = float32_to_float64(tmp32, &env->fp_status);       \
>>>> +        }                                                                     \
>>>> +                                                                              \
>>>
>>> You can't get correct results for a single-precision fma from a
>>> double-precision fma and merely rounding the results.
>>>
>>> See e.g. glibc's sysdeps/ieee754/dbl-64/s_fmaf.c.
>>
>> Blah, nevermind.  That would be using separate add+mul in double-precision, not
>> using a double-precision fma primitive.
> 
> Hmph.  I was right the first time.  See
> 
>> http://www.exploringbinary.com/double-rounding-errors-in-floating-point-conversions/
> 
> for example inputs that suffer from double-rounding.
> 
> What's needed in each of the examples is an infinite-precision value containing
> 55 bits.  This is easy to accomplish with fma.
> 
> Two 23-bit inputs can create a product with 46 significant bits.  One can
> append 23 more significant bits by choosing an exponent for the addend that
> does not overlap the product.  Thus one can create (almost) every intermediate
> result with up to 69 consecutive bits (the exception being products without
> factors that can be represented in 23 bits).
> 
> I'm too lazy to decompose the examples therein to actual single-precision
> inputs, but I'm certain it can be done.
> 
> Thus you *do* need the round-to-zero plus inexact solution that glibc uses.
> (Or to perform the fma in 128 bits and round once, but I think that would be
> way more intrusive wrt the rest of the code, and more expensive than necessary.)

I have reviewed the code and the spec and I cannot see a flaw.  The sequence is
effectively this:

  - float64_muladd     - performs a proper FMA on 64-bit numbers
  - float64_to_float32 - converts to single precision, including proper rounding
  - float32_to_float64 - converts back to double precision (exact)

The implementation of float64_muladd would seem to provide enough mantissa bits
for proper handling of the case you describe.  The only rounding occurs in the
second step.

I have also done quite a bit of random and targeted random testing using Power
hardware to produce expected results.  The targeted random tests followed your
suggestion above: generate AxB + C where abs(exp(A) - exp(B)) = 23 and
abs(exp(A) - exp(C)) = 46.  Several million test patterns have been generated
and played back through QEMU without any miscompares in the numerical results.
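
(A hypothetical sketch of such a targeted generator -- the helper name,
exponent ranges, and use of C's rand()/ldexpf here are illustrative, not
taken from the actual test harness:)

#include <stdlib.h>
#include <stdint.h>
#include <math.h>

/* Illustrative generator: a random single-precision value with a chosen
   exponent, so that A*B + C can be arranged to carry more significant
   bits than a double can hold, stressing the double-rounding path. */
static float rand_sp(int exp)
{
    /* 23 random fraction bits (12 + 11) */
    uint32_t frac = (((uint32_t)rand() & 0xFFF) << 11) |
                    ((uint32_t)rand() & 0x7FF);
    /* significand in [1, 2), scaled by 2^exp; every step is exact */
    return ldexpf(1.0f + (float)frac * 0x1p-23f, exp);
}

/* e.g.:
       int ea = rand() % 40 - 20;
       float a = rand_sp(ea);
       float b = rand_sp(ea - 23);   // |exp(A) - exp(B)| = 23
       float c = rand_sp(ea - 46);   // |exp(A) - exp(C)| = 46
*/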
Richard Henderson - Nov. 13, 2013, 11:14 p.m.
On 11/14/2013 06:49 AM, Tom Musta wrote:
> I have reviewed the code and the spec and I cannot see a flaw.  The sequence is
> effectively this:
> 
>   - float64_muladd     - performs a proper FMA on 64-bit numbers
>   - float64_to_float32 - converts to single precision, including proper rounding
>   - float32_to_float64 - converts back to double precision (exact)
> 
> The implementation of float64_muladd would seem to provide enough mantissa bits
> for proper handling of the case you describe.  The only rounding occurs in the
> second step.
> 
> I have also done quite a bit of random and targeted random testing using Power
> hardware to produce expected results.  The targeted random tests followed your
> suggestion above: generate AxB + C where abs(exp(A) - exp(B)) = 23 and
> abs(exp(A) - exp(C)) = 46.  Several million test patterns have been generated
> and played back through QEMU without any miscompares in the numerical results.

Here's an example that produces wrong results when rounding to double first.
Replace the portable math.h calls with ppc asm as necessary.


r~


$ cat z.c
#include <stdio.h>
#include <math.h>

float a = 65281;
float b = 257;
float c = 0x1p-29f;

int main()
{
    double dd = fma(a, b, c);
    float d = dd;
    float e = fmaf(a, b, c);

    printf("a = %a\n", a);
    printf("b = %a\n", b);
    printf("c = %a\n", c);
    printf("dd= %a\n", dd);
    printf("d = %a\n", d);
    printf("e = %a\n", e);
    return 0;
}
$ ./a.out
a = 0x1.fe02p+15
b = 0x1.01p+8
c = 0x1p-29
dd= 0x1.000001p+24
d = 0x1p+24
e = 0x1.000002p+24
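
(Working through the numbers: a*b = 65281 * 257 = 16777217 = 2^24 + 1
exactly, so the infinitely precise result is 2^24 + 1 + 2^-29.  A double
ulp at that magnitude is 2^-28, so c is exactly half an ulp: the double
fma ties to even and drops it, giving dd = 0x1.000001p+24.  That value in
turn sits exactly halfway between the representable floats 0x1p+24 and
0x1.000002p+24, and the second rounding again ties to even, down to
d = 0x1p+24.  The true single-precision fma sees the full
2^24 + 1 + 2^-29, which is strictly above that midpoint, and rounds up to
e = 0x1.000002p+24.  The 2^-29 is exactly the sticky information the
first rounding throws away.)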
Tom Musta - Nov. 14, 2013, 8:58 p.m.
On 11/13/2013 5:14 PM, Richard Henderson wrote:
> On 11/14/2013 06:49 AM, Tom Musta wrote:
>> I have also done quite a bit of random and targeted random testing using Power
>> hardware to produce expected results.  The targeted random tests followed your
>> suggestion above: generate AxB + C where abs(exp(A) - exp(B)) = 23 and
>> abs(exp(A) - exp(C)) = 46.  Several million test patterns have been generated
>> and played back through QEMU without any miscompares in the numerical results.
> 
> Here's an example that produces wrong results when rounding to double first.
> Replace the portable math.h calls with ppc asm as necessary.
> 
> <snip>
> r~
> 

Thanks, Richard.  You have convinced me.

Patch

diff --git a/target-ppc/fpu_helper.c b/target-ppc/fpu_helper.c
index e08f317..c86778f 100644
--- a/target-ppc/fpu_helper.c
+++ b/target-ppc/fpu_helper.c
@@ -2198,7 +2198,7 @@  VSX_TSQRT(xvtsqrtsp, 4, float32, f32, -126, 23)
  *   afrm  - A form (1=A, 0=M)
  *   sfprf - set FPRF
  */
-#define VSX_MADD(op, nels, tp, fld, maddflgs, afrm, sfprf)                    \
+#define VSX_MADD(op, nels, tp, fld, maddflgs, afrm, sfprf, r2sp)              \
 void helper_##op(CPUPPCState *env, uint32_t opcode)                           \
 {                                                                             \
     ppc_vsr_t xt_in, xa, xb, xt_out;                                          \
@@ -2248,6 +2248,13 @@  void helper_##op(CPUPPCState *env, uint32_t opcode)                           \
                 fload_invalid_op_excp(env, POWERPC_EXCP_FP_VXISI, sfprf);     \
             }                                                                 \
         }                                                                     \
+                                                                              \
+        if (r2sp) {                                                           \
+            float32 tmp32 = float64_to_float32(xt_out.fld[i],                 \
+                                               &env->fp_status);              \
+            xt_out.fld[i] = float32_to_float64(tmp32, &env->fp_status);       \
+        }                                                                     \
+                                                                              \
         if (sfprf) {                                                          \
             helper_compute_fprf(env, xt_out.fld[i], sfprf);                   \
         }                                                                     \
@@ -2261,32 +2268,41 @@  void helper_##op(CPUPPCState *env, uint32_t opcode)                           \
 #define NMADD_FLGS float_muladd_negate_result
 #define NMSUB_FLGS (float_muladd_negate_c | float_muladd_negate_result)
 
-VSX_MADD(xsmaddadp, 1, float64, f64, MADD_FLGS, 1, 1)
-VSX_MADD(xsmaddmdp, 1, float64, f64, MADD_FLGS, 0, 1)
-VSX_MADD(xsmsubadp, 1, float64, f64, MSUB_FLGS, 1, 1)
-VSX_MADD(xsmsubmdp, 1, float64, f64, MSUB_FLGS, 0, 1)
-VSX_MADD(xsnmaddadp, 1, float64, f64, NMADD_FLGS, 1, 1)
-VSX_MADD(xsnmaddmdp, 1, float64, f64, NMADD_FLGS, 0, 1)
-VSX_MADD(xsnmsubadp, 1, float64, f64, NMSUB_FLGS, 1, 1)
-VSX_MADD(xsnmsubmdp, 1, float64, f64, NMSUB_FLGS, 0, 1)
-
-VSX_MADD(xvmaddadp, 2, float64, f64, MADD_FLGS, 1, 0)
-VSX_MADD(xvmaddmdp, 2, float64, f64, MADD_FLGS, 0, 0)
-VSX_MADD(xvmsubadp, 2, float64, f64, MSUB_FLGS, 1, 0)
-VSX_MADD(xvmsubmdp, 2, float64, f64, MSUB_FLGS, 0, 0)
-VSX_MADD(xvnmaddadp, 2, float64, f64, NMADD_FLGS, 1, 0)
-VSX_MADD(xvnmaddmdp, 2, float64, f64, NMADD_FLGS, 0, 0)
-VSX_MADD(xvnmsubadp, 2, float64, f64, NMSUB_FLGS, 1, 0)
-VSX_MADD(xvnmsubmdp, 2, float64, f64, NMSUB_FLGS, 0, 0)
-
-VSX_MADD(xvmaddasp, 4, float32, f32, MADD_FLGS, 1, 0)
-VSX_MADD(xvmaddmsp, 4, float32, f32, MADD_FLGS, 0, 0)
-VSX_MADD(xvmsubasp, 4, float32, f32, MSUB_FLGS, 1, 0)
-VSX_MADD(xvmsubmsp, 4, float32, f32, MSUB_FLGS, 0, 0)
-VSX_MADD(xvnmaddasp, 4, float32, f32, NMADD_FLGS, 1, 0)
-VSX_MADD(xvnmaddmsp, 4, float32, f32, NMADD_FLGS, 0, 0)
-VSX_MADD(xvnmsubasp, 4, float32, f32, NMSUB_FLGS, 1, 0)
-VSX_MADD(xvnmsubmsp, 4, float32, f32, NMSUB_FLGS, 0, 0)
+VSX_MADD(xsmaddadp, 1, float64, f64, MADD_FLGS, 1, 1, 0)
+VSX_MADD(xsmaddmdp, 1, float64, f64, MADD_FLGS, 0, 1, 0)
+VSX_MADD(xsmsubadp, 1, float64, f64, MSUB_FLGS, 1, 1, 0)
+VSX_MADD(xsmsubmdp, 1, float64, f64, MSUB_FLGS, 0, 1, 0)
+VSX_MADD(xsnmaddadp, 1, float64, f64, NMADD_FLGS, 1, 1, 0)
+VSX_MADD(xsnmaddmdp, 1, float64, f64, NMADD_FLGS, 0, 1, 0)
+VSX_MADD(xsnmsubadp, 1, float64, f64, NMSUB_FLGS, 1, 1, 0)
+VSX_MADD(xsnmsubmdp, 1, float64, f64, NMSUB_FLGS, 0, 1, 0)
+
+VSX_MADD(xsmaddasp, 1, float64, f64, MADD_FLGS, 1, 1, 1)
+VSX_MADD(xsmaddmsp, 1, float64, f64, MADD_FLGS, 0, 1, 1)
+VSX_MADD(xsmsubasp, 1, float64, f64, MSUB_FLGS, 1, 1, 1)
+VSX_MADD(xsmsubmsp, 1, float64, f64, MSUB_FLGS, 0, 1, 1)
+VSX_MADD(xsnmaddasp, 1, float64, f64, NMADD_FLGS, 1, 1, 1)
+VSX_MADD(xsnmaddmsp, 1, float64, f64, NMADD_FLGS, 0, 1, 1)
+VSX_MADD(xsnmsubasp, 1, float64, f64, NMSUB_FLGS, 1, 1, 1)
+VSX_MADD(xsnmsubmsp, 1, float64, f64, NMSUB_FLGS, 0, 1, 1)
+
+VSX_MADD(xvmaddadp, 2, float64, f64, MADD_FLGS, 1, 0, 0)
+VSX_MADD(xvmaddmdp, 2, float64, f64, MADD_FLGS, 0, 0, 0)
+VSX_MADD(xvmsubadp, 2, float64, f64, MSUB_FLGS, 1, 0, 0)
+VSX_MADD(xvmsubmdp, 2, float64, f64, MSUB_FLGS, 0, 0, 0)
+VSX_MADD(xvnmaddadp, 2, float64, f64, NMADD_FLGS, 1, 0, 0)
+VSX_MADD(xvnmaddmdp, 2, float64, f64, NMADD_FLGS, 0, 0, 0)
+VSX_MADD(xvnmsubadp, 2, float64, f64, NMSUB_FLGS, 1, 0, 0)
+VSX_MADD(xvnmsubmdp, 2, float64, f64, NMSUB_FLGS, 0, 0, 0)
+
+VSX_MADD(xvmaddasp, 4, float32, f32, MADD_FLGS, 1, 0, 0)
+VSX_MADD(xvmaddmsp, 4, float32, f32, MADD_FLGS, 0, 0, 0)
+VSX_MADD(xvmsubasp, 4, float32, f32, MSUB_FLGS, 1, 0, 0)
+VSX_MADD(xvmsubmsp, 4, float32, f32, MSUB_FLGS, 0, 0, 0)
+VSX_MADD(xvnmaddasp, 4, float32, f32, NMADD_FLGS, 1, 0, 0)
+VSX_MADD(xvnmaddmsp, 4, float32, f32, NMADD_FLGS, 0, 0, 0)
+VSX_MADD(xvnmsubasp, 4, float32, f32, NMSUB_FLGS, 1, 0, 0)
+VSX_MADD(xvnmsubmsp, 4, float32, f32, NMSUB_FLGS, 0, 0, 0)
 
 #define VSX_SCALAR_CMP(op, ordered)                                      \
 void helper_##op(CPUPPCState *env, uint32_t opcode)                      \
diff --git a/target-ppc/helper.h b/target-ppc/helper.h
index 84c6ee1..655b670 100644
--- a/target-ppc/helper.h
+++ b/target-ppc/helper.h
@@ -293,6 +293,14 @@  DEF_HELPER_2(xsdivsp, void, env, i32)
 DEF_HELPER_2(xsresp, void, env, i32)
 DEF_HELPER_2(xssqrtsp, void, env, i32)
 DEF_HELPER_2(xsrsqrtesp, void, env, i32)
+DEF_HELPER_2(xsmaddasp, void, env, i32)
+DEF_HELPER_2(xsmaddmsp, void, env, i32)
+DEF_HELPER_2(xsmsubasp, void, env, i32)
+DEF_HELPER_2(xsmsubmsp, void, env, i32)
+DEF_HELPER_2(xsnmaddasp, void, env, i32)
+DEF_HELPER_2(xsnmaddmsp, void, env, i32)
+DEF_HELPER_2(xsnmsubasp, void, env, i32)
+DEF_HELPER_2(xsnmsubmsp, void, env, i32)
 
 DEF_HELPER_2(xvadddp, void, env, i32)
 DEF_HELPER_2(xvsubdp, void, env, i32)
diff --git a/target-ppc/translate.c b/target-ppc/translate.c
index ae80289..672cf0a 100644
--- a/target-ppc/translate.c
+++ b/target-ppc/translate.c
@@ -7348,6 +7348,14 @@  GEN_VSX_HELPER_2(xsdivsp, 0x00, 0x03, 0, PPC2_VSX207)
 GEN_VSX_HELPER_2(xsresp, 0x14, 0x01, 0, PPC2_VSX207)
 GEN_VSX_HELPER_2(xssqrtsp, 0x16, 0x00, 0, PPC2_VSX207)
 GEN_VSX_HELPER_2(xsrsqrtesp, 0x14, 0x00, 0, PPC2_VSX207)
+GEN_VSX_HELPER_2(xsmaddasp, 0x04, 0x00, 0, PPC2_VSX207)
+GEN_VSX_HELPER_2(xsmaddmsp, 0x04, 0x01, 0, PPC2_VSX207)
+GEN_VSX_HELPER_2(xsmsubasp, 0x04, 0x02, 0, PPC2_VSX207)
+GEN_VSX_HELPER_2(xsmsubmsp, 0x04, 0x03, 0, PPC2_VSX207)
+GEN_VSX_HELPER_2(xsnmaddasp, 0x04, 0x10, 0, PPC2_VSX207)
+GEN_VSX_HELPER_2(xsnmaddmsp, 0x04, 0x11, 0, PPC2_VSX207)
+GEN_VSX_HELPER_2(xsnmsubasp, 0x04, 0x12, 0, PPC2_VSX207)
+GEN_VSX_HELPER_2(xsnmsubmsp, 0x04, 0x13, 0, PPC2_VSX207)
 
 GEN_VSX_HELPER_2(xvadddp, 0x00, 0x0C, 0, PPC2_VSX)
 GEN_VSX_HELPER_2(xvsubdp, 0x00, 0x0D, 0, PPC2_VSX)
@@ -10163,6 +10171,14 @@  GEN_XX3FORM(xsdivsp, 0x00, 0x03, PPC2_VSX207),
 GEN_XX2FORM(xsresp,  0x14, 0x01, PPC2_VSX207),
 GEN_XX2FORM(xssqrtsp,  0x16, 0x00, PPC2_VSX207),
 GEN_XX2FORM(xsrsqrtesp,  0x14, 0x00, PPC2_VSX207),
+GEN_XX3FORM(xsmaddasp, 0x04, 0x00, PPC2_VSX207),
+GEN_XX3FORM(xsmaddmsp, 0x04, 0x01, PPC2_VSX207),
+GEN_XX3FORM(xsmsubasp, 0x04, 0x02, PPC2_VSX207),
+GEN_XX3FORM(xsmsubmsp, 0x04, 0x03, PPC2_VSX207),
+GEN_XX3FORM(xsnmaddasp, 0x04, 0x10, PPC2_VSX207),
+GEN_XX3FORM(xsnmaddmsp, 0x04, 0x11, PPC2_VSX207),
+GEN_XX3FORM(xsnmsubasp, 0x04, 0x12, PPC2_VSX207),
+GEN_XX3FORM(xsnmsubmsp, 0x04, 0x13, PPC2_VSX207),
 
 GEN_XX3FORM(xvadddp, 0x00, 0x0C, PPC2_VSX),
 GEN_XX3FORM(xvsubdp, 0x00, 0x0D, PPC2_VSX),