diff mbox

[RFC] target-ppc/fpu_helper.c: Use C99 code to speed up floating point unit

Message ID E8B64B53-09F0-4B1E-9F11-48CC3957C4E1@gmail.com
State New
Headers show

Commit Message

Programmingkid Dec. 3, 2016, 5:59 a.m. UTC
The floating point code used in fpu_helper.c can be sped up by using the IEEE 754 support added to the C99 standard. To test this code out simply set and unset the I_NEED_SPEED macro. The program to test out each version of the helper_fmadd() function is below the patch. It needs to be ran in the guest. The emulator to use is qemu-system-ppc. I used a Mac OS X guest, but the test program would compile on a Linux guest.

This patch does make the fused multiply-add instruction fmadd work faster and still give a correct result.

This documentation might be of help to those who want to learn more about C99's IEEE 754 support:
http://grouper.ieee.org/groups/754/meeting-materials/2001-07-18-c99.pdf

---
 target-ppc/fpu_helper.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 102 insertions(+), 3 deletions(-)

Comments

Paolo Bonzini Dec. 3, 2016, 8:44 a.m. UTC | #1
On 03/12/2016 06:59, Programmingkid wrote:
> The floating point code used in fpu_helper.c can be sped up by using
> the IEEE 754 support added to the C99 standard. To test this code out
> simply set and unset the I_NEED_SPEED macro. The program to test out
> each version of the helper_fmadd() function is below the patch. It
> needs to be ran in the guest. The emulator to use is qemu-system-ppc.
> I used a Mac OS X guest, but the test program would compile on a
> Linux guest.
> 
> This patch does make the fused multiply-add instruction fmadd work
> faster and still give a correct result.
> 
> This documentation might be of help to those who want to learn more
> about C99's IEEE 754 support: 
> http://grouper.ieee.org/groups/754/meeting-materials/2001-07-18-c99.pdf

You're undoing what was done in 2007:

commit 76a66253e5e48f1744f689041c1c21cedcaff630
Author: j_mayer <j_mayer@c046a42c-6fe2-441c-8c8c-71466251a162>
Date:   Wed Mar 7 08:32:30 2007 +0000

    Great PowerPC emulation code resynchronisation and improvments:
...
    - Micro-operation fixes:
...
      * use softfloat routines for all floating-point operations

Paolo
Programmingkid Dec. 3, 2016, 5:07 p.m. UTC | #2
On Dec 3, 2016, at 3:44 AM, Paolo Bonzini wrote:

> 
> 
> On 03/12/2016 06:59, Programmingkid wrote:
>> The floating point code used in fpu_helper.c can be sped up by using
>> the IEEE 754 support added to the C99 standard. To test this code out
>> simply set and unset the I_NEED_SPEED macro. The program to test out
>> each version of the helper_fmadd() function is below the patch. It
>> needs to be ran in the guest. The emulator to use is qemu-system-ppc.
>> I used a Mac OS X guest, but the test program would compile on a
>> Linux guest.
>> 
>> This patch does make the fused multiply-add instruction fmadd work
>> faster and still give a correct result.
>> 
>> This documentation might be of help to those who want to learn more
>> about C99's IEEE 754 support: 
>> http://grouper.ieee.org/groups/754/meeting-materials/2001-07-18-c99.pdf
> 
> You're undoing what was done in 2007:
> 
> commit 76a66253e5e48f1744f689041c1c21cedcaff630
> Author: j_mayer <j_mayer@c046a42c-6fe2-441c-8c8c-71466251a162>
> Date:   Wed Mar 7 08:32:30 2007 +0000
> 
>    Great PowerPC emulation code resynchronisation and improvments:
> ...
>    - Micro-operation fixes:
> ...
>      * use softfloat routines for all floating-point operations
> 
> Paolo

Yes it would be. The commit message never stated why he wanted to switch
to floating point softfloat routines. These are my guesses:

- Different versions of gcc might have produced different results?
- Different hosts produced different results and instead wanted consistency?
- Author did not know about the C99 numerics support?

The C99 standard will keep results accurate on both different versions of 
gcc and different host architecture. It will also give us speed over
the softfloat routines.
Richard Henderson Dec. 4, 2016, 1:32 a.m. UTC | #3
On 12/03/2016 09:07 AM, Programmingkid wrote:
> Yes it would be. The commit message never stated why he wanted to switch
> to floating point softfloat routines. These are my guesses:
...
> - Different hosts produced different results and instead wanted consistency?

This is the correct answer.  There are quite a lot of differences between 
floating point across different cpus.  Which means that it's actually rare that 
the host and guest floating point are exactly alike.


r~
no-reply@patchew.org Dec. 5, 2016, 10:40 a.m. UTC | #4
Hi,

Your series seems to have some coding style problems. See output below for
more information:

Subject: [Qemu-devel] [RFC] target-ppc/fpu_helper.c: Use C99 code to speed up floating point unit
Type: series
Message-id: E8B64B53-09F0-4B1E-9F11-48CC3957C4E1@gmail.com

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

# Useful git options
git config --local diff.renamelimit 0
git config --local diff.renames True

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 - [tag update]      patchew/20161202194854.18737-1-eblake@redhat.com -> patchew/20161202194854.18737-1-eblake@redhat.com
Switched to a new branch 'test'
6f17682 target-ppc/fpu_helper.c: Use C99 code to speed up floating point unit

=== OUTPUT BEGIN ===
Checking PATCH 1/1: target-ppc/fpu_helper.c: Use C99 code to speed up floating point unit...
WARNING: line over 80 characters
#145: FILE: target-ppc/fpu_helper.c:811:
+        else if (isinf(farg1.d * farg2.d) && isinf(farg3.d) && (signbit(farg3.d) != 0)) {

ERROR: Missing Signed-off-by: line(s)

total: 1 errors, 1 warnings, 172 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@freelists.org
Eric Blake Dec. 5, 2016, 4:42 p.m. UTC | #5
On 12/03/2016 07:32 PM, Richard Henderson wrote:
> On 12/03/2016 09:07 AM, Programmingkid wrote:
>> Yes it would be. The commit message never stated why he wanted to switch
>> to floating point softfloat routines. These are my guesses:
> ...
>> - Different hosts produced different results and instead wanted
>> consistency?
> 
> This is the correct answer.  There are quite a lot of differences
> between floating point across different cpus.  Which means that it's
> actually rare that the host and guest floating point are exactly alike.

Or to frame it in different words:

qemu's goal is to be an architecturally-accurate emulator.  fma() has
specific C99 semantics (mapping to what IEEE floating point requires),
but USUALLY those semantics are not fully implemented in hardware;
rather, the hardware implements 90% and has specific flags or traps that
let the C library recognize when it has to manually perform the 10% of
special cases. But the special cases differ by hardware.

If all hardware implemented C99 semantics 100% of the time, then using
C99 to emulate hardware would be correct.  But since that is not true,
emulating C99 semantics instead of hardware semantics will be WRONG in
the special cases, and cause programs to misbehave under qemu where they
would behave correctly on bare metal. So the decision was that accurate
emulation (guaranteed by using soft-float operations that are tailored
to each hardware's specific quirks on the special cases) is more
important that speed (where the behavior can be different from hardware
on the special cases, even if that behavior is accurate to IEEE and C99
semantics).

If you can completely describe ALL special cases that ANY particular
platform has quirks for, and kick to softmmu for those cases while still
sticking to C99 for the common cases, you may have success. But at this
point, I doubt you have properly identified all the quirks where native
fma instructions differ from C99 fma() requirements, let alone the
differences in emulation that will be required to properly quirk all
possible floating point hardware.  And it's not even guaranteed that
identifying all the inputs that need special casing for various guest
hardware, before calling out to fma() in the remainder of cases, will
not slow things down (since fma() will have its own set of additional
quirking conditionals based on host hardware).
diff mbox

Patch

diff --git a/target-ppc/fpu_helper.c b/target-ppc/fpu_helper.c
index b0760f0..05eb26e 100644
--- a/target-ppc/fpu_helper.c
+++ b/target-ppc/fpu_helper.c
@@ -20,10 +20,22 @@ 
 #include "cpu.h"
 #include "exec/helper-proto.h"
 #include "exec/exec-all.h"
+#include <inttypes.h>
+#include <math.h>
+#include <fenv.h>
 
 #define float64_snan_to_qnan(x) ((x) | 0x0008000000000000ULL)
 #define float32_snan_to_qnan(x) ((x) | 0x00400000)
 
+#define DEBUG_FPU 0
+
+#define DPRINTF(fmt, ...) do {                       \
+    if (DEBUG_FPU) {                                 \
+        printf("FPU: " fmt , ## __VA_ARGS__);        \
+    }                                                \
+} while (0);
+
+
 /*****************************************************************************/
 /* Floating point operations helpers */
 uint64_t helper_float32_to_float64(CPUPPCState *env, uint32_t arg)
@@ -281,29 +293,36 @@  static inline void float_inexact_excp(CPUPPCState *env)
 
 static inline void fpscr_set_rounding_mode(CPUPPCState *env)
 {
-    int rnd_type;
+    int rnd_type, result = 0;
 
     /* Set rounding mode */
     switch (fpscr_rn) {
     case 0:
         /* Best approximation (round to nearest) */
         rnd_type = float_round_nearest_even;
+        result = fesetround(FE_TONEAREST);
         break;
     case 1:
         /* Smaller magnitude (round toward zero) */
         rnd_type = float_round_to_zero;
+        result = fesetround(FE_TOWARDZERO);
         break;
     case 2:
         /* Round toward +infinite */
         rnd_type = float_round_up;
+        result = fesetround(FE_UPWARD);
         break;
     default:
     case 3:
         /* Round toward -infinite */
         rnd_type = float_round_down;
+        result = fesetround(FE_DOWNWARD);
         break;
     }
     set_float_rounding_mode(rnd_type, &env->fp_status);
+    if (result != 0) {
+        printf("Error: rounding mode was not set\n");
+    }
 }
 
 void helper_fpscr_clrbit(CPUPPCState *env, uint32_t bit)
@@ -534,6 +553,7 @@  void helper_float_check_status(CPUPPCState *env)
 void helper_reset_fpstatus(CPUPPCState *env)
 {
     set_float_exception_flags(0, &env->fp_status);
+    feclearexcept(FE_ALL_EXCEPT);
 }
 
 /* fadd - fadd. */
@@ -737,16 +757,94 @@  uint64_t helper_frim(CPUPPCState *env, uint64_t arg)
     return do_fri(env, arg, float_round_down);
 }
 
-/* fmadd - fmadd. */
+#define I_NEED_SPEED 1
+#ifdef I_NEED_SPEED
+
+union Converter {
+    uint64_t i;
+    double d;
+};
+
+typedef union Converter Converter;
+
+/* fmadd - fmadd. - fast */
 uint64_t helper_fmadd(CPUPPCState *env, uint64_t arg1, uint64_t arg2,
                       uint64_t arg3)
 {
+    DPRINTF("Fast helper_fmadd() called\n");
+    Converter farg1, farg2, farg3, result;
+
+    farg1.i = arg1;
+    farg2.i = arg2;
+    farg3.i = arg3;
+
+    DPRINTF("farg1.d = %f\n", farg1.d);
+    DPRINTF("farg2.d = %f\n", farg2.d);
+    DPRINTF("farg3.d = %f\n", farg3.d);
+
+    /* if signalling NaN operation */
+    if (unlikely(float64_is_signaling_nan(farg1.d, &env->fp_status) ||
+                 float64_is_signaling_nan(farg2.d, &env->fp_status) ||
+                 float64_is_signaling_nan(farg3.d, &env->fp_status))) {
+        float_invalid_op_excp(env, POWERPC_EXCP_FP_VXSNAN, 1);
+    }
+
+    result.d = fma(farg1.d, farg2.d, farg3.d); /* fused multiply-add function */
+    if (fetestexcept(FE_INEXACT)) {
+        DPRINTF("FE_INEXACT\n");
+        float_inexact_excp(env);
+    }
+    if (fetestexcept(FE_INVALID)) {
+        DPRINTF("FE_INVALID\n");
+
+        /* 0 * infinity */
+        if ((fpclassify(farg1.d) == FP_ZERO) && isinf(farg2.d)) {
+            result.i = float_invalid_op_excp(env, POWERPC_EXCP_FP_VXIMZ, 1);
+        }
+
+        /* infinity * 0 */
+        else if (isinf(farg1.d) && (fpclassify(farg2.d) == FP_ZERO)) {
+            result.i = float_invalid_op_excp(env, POWERPC_EXCP_FP_VXIMZ, 1);
+        }
+
+        /* infinity - infinity */
+        else if (isinf(farg1.d * farg2.d) && isinf(farg3.d) && (signbit(farg3.d) != 0)) {
+            result.i = float_invalid_op_excp(env, POWERPC_EXCP_FP_VXISI, 1);
+        }
+    }
+    if (fetestexcept(FE_OVERFLOW)) {
+        DPRINTF("FE_OVERFLOW\n");
+        float_overflow_excp(env);
+    }
+    if (fetestexcept(FE_UNDERFLOW)) {
+        DPRINTF("FE_UNDERFLOW\n");
+        float_underflow_excp(env);
+    }
+
+    DPRINTF("result.d = %f\n", result.d);
+    DPRINTF("result.i = 0x%" PRIx64 "\n", result.i);
+
+    return result.i;
+}
+
+#else
+
+/* fmadd - fmadd. - original */
+uint64_t helper_fmadd(CPUPPCState *env, uint64_t arg1, uint64_t arg2,
+                      uint64_t arg3)
+{
+    DPRINTF("old helper_fmadd() called\n");
+
     CPU_DoubleU farg1, farg2, farg3;
 
     farg1.ll = arg1;
     farg2.ll = arg2;
     farg3.ll = arg3;
 
+    DPRINTF("farg1.d = %f\n", farg1.d);
+    DPRINTF("farg2.d = %f\n", farg2.d);
+    DPRINTF("farg3.d = %f\n", farg3.d);
+
     if (unlikely((float64_is_infinity(farg1.d) && float64_is_zero(farg2.d)) ||
                  (float64_is_zero(farg1.d) && float64_is_infinity(farg2.d)))) {
         /* Multiplication of zero by infinity */
@@ -775,9 +873,10 @@  uint64_t helper_fmadd(CPUPPCState *env, uint64_t arg1, uint64_t arg2,
             farg1.d = float128_to_float64(ft0_128, &env->fp_status);
         }
     }
-
+    DPRINTF("farg1.ll = 0x%" PRIx64 "\n", farg1.ll);
     return farg1.ll;
 }
+#endif
 
 /* fmsub - fmsub. */
 uint64_t helper_fmsub(CPUPPCState *env, uint64_t arg1, uint64_t arg2,