Patchwork target-i386: implement FPREM and FPREM1 using softfloat only

login
register
mail settings
Submitter Catalin Patulea
Date June 27, 2012, 11 p.m.
Message ID <1340838036-10467-1-git-send-email-catalinp@google.com>
Download mbox | patch
Permalink /patch/167789/
State New
Headers show

Comments

Catalin Patulea - June 27, 2012, 11 p.m.
Hey guys,

I've been hacking up the FPREM and FPREM1 i386 instructions (without KVM) to be implemented using SoftFloat only. This was motivated by some correctness issues
that we noticed with the current implementation which follows the AMD datasheet.I believe the datasheet explains the operation as if intermediate results had "infinite" precision, but in this case intermediate rounding causes a loss of precision at the final output.

My reference was TestFloat [1], specifically the floatx80_rem suite, which tests the FPREM instruction quite thoroughly (FPREM1 is technically untested but very similar). The current code fails about half the suite; with the patch, all tests pass.

There are still lots of dark corners - the code originates from Bochs' copy of SoftFloat and I tried to port the Bochs exception handling logic as much as I could. Many details still elude me though (see comments in the code). TestFloat does test some of the exception logic but not as thoroughly as I would have liked. If anyone can offer some guidance here, I am happy to fix up the patch.

That's about it.. let me know what you think.

Catalin

[1] http://www.jhauser.us/arithmetic/TestFloat.html

---
 fpu/softfloat.c         |  182 +++++++++++++++++++++++++++++++++++++++++++++++
 fpu/softfloat.h         |    3 +
 target-i386/op_helper.c |  163 +++++++++++++++++-------------------------
 3 files changed, 249 insertions(+), 99 deletions(-)
Stefan Weil - June 28, 2012, 5:32 a.m.
Am 28.06.2012 01:00, schrieb Catalin Patulea:
> Hey guys,
>
> I've been hacking up the FPREM and FPREM1 i386 instructions (without KVM) to be implemented using SoftFloat only. This was motivated by some correctness issues
> that we noticed with the current implementation which follows the AMD datasheet.I believe the datasheet explains the operation as if intermediate results had "infinite" precision, but in this case intermediate rounding causes a loss of precision at the final output.
>
> My reference was TestFloat [1], specifically the floatx80_rem suite, which tests the FPREM instruction quite thoroughly (FPREM1 is technically untested but very similar). The current code fails about half the suite; with the patch, all tests pass.
>
> There are still lots of dark corners - the code originates from Bochs' copy of SoftFloat and I tried to port the Bochs exception handling logic as much as I could. Many details still elude me though (see comments in the code). TestFloat does test some of the exception logic but not as thoroughly as I would have liked. If anyone can offer some guidance here, I am happy to fix up the patch.
>
> That's about it.. let me know what you think.
>
> Catalin
>
> [1] http://www.jhauser.us/arithmetic/TestFloat.html
>
>

Hi Catalin,

could you please check your patch using the latest scripts/checkpatch.pl?
There are several style issues which should be fixed. I suggest to fix them
even in code which was copied from Bochs.

Thanks,
Stefan W.
Peter Maydell - June 28, 2012, 6:25 p.m.
On 28 June 2012 00:00, Catalin Patulea <catalinp@google.com> wrote:
> Hey guys,
>
> I've been hacking up the FPREM and FPREM1 i386 instructions (without KVM) to be implemented using SoftFloat only. This was motivated by some correctness issues
> that we noticed with the current implementation which follows the AMD datasheet.I believe the datasheet explains the operation as if intermediate results had "infinite" precision, but in this case intermediate rounding causes a loss of precision at the final output.
>
> My reference was TestFloat [1], specifically the floatx80_rem suite, which tests the FPREM instruction quite thoroughly (FPREM1 is technically untested but very similar). The current code fails about half the suite; with the patch, all tests pass.
>
> There are still lots of dark corners - the code originates from Bochs' copy of SoftFloat and I tried to port the Bochs exception handling logic as much as I could. Many details still elude me though (see comments in the code). TestFloat does test some of the exception logic but not as thoroughly as I would have liked. If anyone can offer some guidance here, I am happy to fix up the patch.
>
> That's about it.. let me know what you think.
>
> Catalin
>
> [1] http://www.jhauser.us/arithmetic/TestFloat.html
>
> ---
>  fpu/softfloat.c         |  182 +++++++++++++++++++++++++++++++++++++++++++++++
>  fpu/softfloat.h         |    3 +
>  target-i386/op_helper.c |  163 +++++++++++++++++-------------------------
>  3 files changed, 249 insertions(+), 99 deletions(-)
>
> diff --git a/fpu/softfloat.c b/fpu/softfloat.c
> index b29256a..9c6e4a3 100644
> --- a/fpu/softfloat.c
> +++ b/fpu/softfloat.c
> @@ -5234,6 +5234,16 @@ int floatx80_unordered_quiet( floatx80 a, floatx80 b STATUS_PARAM )
>  }
>
>  /*----------------------------------------------------------------------------
> +| Returns 1 if the extended double-precision floating-point value `a' is an
> +| unsupported; otherwise returns 0.

an unsupported what?

> +*----------------------------------------------------------------------------*/
> +int floatx80_is_unsupported(floatx80 a)
> +{
> +    return (extractFloatx80Exp(a) &&
> +            !(extractFloatx80Frac(a) & LIT64(0x8000000000000000)));
> +}
> +
> +/*----------------------------------------------------------------------------
>  | Returns the result of converting the quadruple-precision floating-point
>  | value `a' to the 32-bit two's complement integer format.  The conversion
>  | is performed according to the IEC/IEEE Standard for Binary Floating-Point
> @@ -6828,6 +6838,178 @@ floatx80 floatx80_scalbn( floatx80 a, int n STATUS_PARAM )
>                                           aSign, aExp, aSig, 0 STATUS_VAR );
>  }
>
> +/*
> + * BEGIN from Bochs rev 11224 dated 2012-06-21 10:33:37 -0700
> + *
> + * Converted to use Qemu type aliases, use C features only, etc.
> + */

I'm not convinced we want these markers in the source code
(though they should probably be in the commit message).

Can you confirm that the Bochs licence is compatible with
the QEMU one? In particular, is it compatible with/the same
as the license as stated at the top of softfloat.c?

> +
> +/* executes single exponent reduction cycle */
> +static uint64 remainder_kernel(uint64 aSig0, uint64 bSig, int expDiff, uint64 *zSig0, uint64 *zSig1)
> +{
> +    uint64 term0, term1;
> +    uint64 aSig1 = 0;

No new code should be using the uint64 &c types (which are
"at least NN bits wide") -- uint64_t or uint_fast64_t please.

> +
> +    shortShift128Left(aSig1, aSig0, expDiff, &aSig1, &aSig0);
> +    uint64 q = estimateDiv128To64(aSig1, aSig0, bSig);
> +    mul64To128(bSig, q, &term0, &term1);
> +    sub128(aSig1, aSig0, term0, term1, zSig1, zSig0);
> +    while ((int64)(*zSig1) < 0) {
> +        --q;
> +        add128(*zSig1, *zSig0, 0, bSig, zSig1, zSig0);
> +    }
> +    return q;
> +}
> +
> +static int do_fprem(floatx80 a, floatx80 b, floatx80 *r, uint64 *q, int rounding_mode STATUS_PARAM )
> +{
> +    int32 aExp, bExp, zExp, expDiff;
> +    uint64 aSig0, aSig1, bSig;
> +    int aSign;
> +    *q = 0;
> +
> +    // handle unsupported extended double-precision floating encodings
> +    if (floatx80_is_unsupported(a) || floatx80_is_unsupported(b))
> +    {
> +        float_raise(float_flag_invalid, status);
> +        *r = floatx80_default_nan;
> +        return -1;
> +    }

So are we mishandling unsupported 80bit floats in other operations
(eg addition), or do those functions just opencode things?

> +    aSig0 = extractFloatx80Frac(a);
> +    aExp = extractFloatx80Exp(a);
> +    aSign = extractFloatx80Sign(a);
> +    bSig = extractFloatx80Frac(b);
> +    bExp = extractFloatx80Exp(b);
> +
> +    if (aExp == 0x7FFF) {
> +        if ((uint64) (aSig0<<1) || ((bExp == 0x7FFF) && (uint64) (bSig<<1))) {
> +            *r = propagateFloatx80NaN(a, b, status);
> +            return -1;
> +        }
> +        float_raise(float_flag_invalid, status);
> +        *r = floatx80_default_nan;
> +        return -1;
> +    }
> +    if (bExp == 0x7FFF) {
> +        if ((uint64) (bSig<<1)) {
> +            *r = propagateFloatx80NaN(a, b, status);
> +            return -1;
> +        }
> +        if (aExp == 0 && aSig0) {
> +            float_raise(float_flag_input_denormal, status);
> +            normalizeFloatx80Subnormal(aSig0, &aExp, &aSig0);
> +            *r = (extractFloatx80Frac(a) & LIT64(0x8000000000000000)) ?
> +                    packFloatx80(aSign, aExp, aSig0) : a;
> +            return 0;
> +        }
> +        *r = a;
> +        return 0;
> +
> +    }
> +    if (bExp == 0) {
> +        if (bSig == 0) {
> +            float_raise(float_flag_invalid, status);
> +            *r = floatx80_default_nan;
> +            return -1;
> +        }
> +        float_raise(float_flag_input_denormal, status);
> +        normalizeFloatx80Subnormal(bSig, &bExp, &bSig);
> +    }
> +    if (aExp == 0) {
> +        if (aSig0 == 0) {
> +            *r = a;
> +            return 0;
> +        }
> +        float_raise(float_flag_input_denormal, status);
> +        normalizeFloatx80Subnormal(aSig0, &aExp, &aSig0);
> +    }
> +    expDiff = aExp - bExp;
> +    aSig1 = 0;
> +
> +    uint32 overflow = 0;
> +
> +    if (expDiff >= 64) {
> +        int n = (expDiff & 0x1f) | 0x20;
> +        remainder_kernel(aSig0, bSig, n, &aSig0, &aSig1);
> +        zExp = aExp - n;
> +        overflow = 1;
> +    }
> +    else {
> +        zExp = bExp;
> +
> +        if (expDiff < 0) {
> +            if (expDiff < -1) {
> +               *r = (extractFloatx80Frac(a) & LIT64(0x8000000000000000)) ?
> +                    packFloatx80(aSign, aExp, aSig0) : a;
> +               return 0;
> +            }
> +            shift128Right(aSig0, 0, 1, &aSig0, &aSig1);
> +            expDiff = 0;
> +        }
> +
> +        if (expDiff > 0) {
> +            *q = remainder_kernel(aSig0, bSig, expDiff, &aSig0, &aSig1);
> +        }
> +        else {
> +            if (bSig <= aSig0) {
> +               aSig0 -= bSig;
> +               *q = 1;
> +            }
> +        }
> +
> +        if (rounding_mode == float_round_nearest_even)
> +        {
> +            uint64 term0, term1;
> +            shift128Right(bSig, 0, 1, &term0, &term1);
> +
> +            if (! lt128(aSig0, aSig1, term0, term1))
> +            {
> +               int lt = lt128(term0, term1, aSig0, aSig1);
> +               int eq = eq128(aSig0, aSig1, term0, term1);
> +
> +               if ((eq && (*q & 1)) || lt) {
> +                  aSign = !aSign;
> +                  ++*q;
> +               }
> +               if (lt) sub128(bSig, 0, aSig0, aSig1, &aSig0, &aSig1);
> +            }
> +        }
> +    }
> +
> +    *r = normalizeRoundAndPackFloatx80(80, aSign, zExp, aSig0, aSig1, status);
> +    return overflow;
> +}
> +
> +/*----------------------------------------------------------------------------
> +| Returns the remainder of the extended double-precision floating-point value
> +| `a' with respect to the corresponding value `b'.  The operation is performed
> +| according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic.
> +*----------------------------------------------------------------------------*/
> +
> +int floatx80_ieee754_remainder(floatx80 a, floatx80 b, floatx80 *r, uint64 *q STATUS_PARAM )
> +{
> +    return do_fprem(a, b, r, q, float_round_nearest_even STATUS_VAR );
> +}

This claims to return the remainder, but do_fprem only returns
0, 1 or -1...

> +
> +/*----------------------------------------------------------------------------
> +| Returns the remainder of the extended double-precision floating-point value
> +| `a' with  respect to  the corresponding value `b'. Unlike previous function
> +| the  function  does not compute  the remainder  specified  in  the IEC/IEEE
> +| Standard  for Binary  Floating-Point  Arithmetic.  This  function  operates
> +| differently  from the  previous  function in  the way  that it  rounds  the
> +| quotient of 'a' divided by 'b' to an integer.
> +*----------------------------------------------------------------------------*/
> +
> +int floatx80_remainder(floatx80 a, floatx80 b, floatx80 *r, uint64 *q STATUS_PARAM )
> +{
> +    return do_fprem(a, b, r, q, float_round_to_zero STATUS_VAR );
> +}
> +
> +/*
> + * END from Bochs rev 11224 dated 2012-06-21 10:33:37 -0700
> + */
> +
>  float128 float128_scalbn( float128 a, int n STATUS_PARAM )
>  {
>     flag aSign;
> diff --git a/fpu/softfloat.h b/fpu/softfloat.h
> index feec3a1..53d7827 100644
> --- a/fpu/softfloat.h
> +++ b/fpu/softfloat.h
> @@ -497,10 +497,13 @@ int floatx80_lt_quiet( floatx80, floatx80 STATUS_PARAM );
>  int floatx80_unordered_quiet( floatx80, floatx80 STATUS_PARAM );
>  int floatx80_compare( floatx80, floatx80 STATUS_PARAM );
>  int floatx80_compare_quiet( floatx80, floatx80 STATUS_PARAM );
> +int floatx80_is_unsupported( floatx80 );
>  int floatx80_is_quiet_nan( floatx80 );
>  int floatx80_is_signaling_nan( floatx80 );
>  floatx80 floatx80_maybe_silence_nan( floatx80 );
>  floatx80 floatx80_scalbn( floatx80, int STATUS_PARAM );
> +int floatx80_ieee754_remainder( floatx80, floatx80, floatx80 *, uint64 * STATUS_PARAM );
> +int floatx80_remainder( floatx80, floatx80, floatx80 *, uint64 * STATUS_PARAM );
>
>  INLINE floatx80 floatx80_abs(floatx80 a)
>  {
> diff --git a/target-i386/op_helper.c b/target-i386/op_helper.c
> index 2862ea4..ae0c877 100644
> --- a/target-i386/op_helper.c
> +++ b/target-i386/op_helper.c
> @@ -3620,6 +3620,23 @@ static void fpu_set_exception(int mask)
>         env->fpus |= FPUS_SE | FPUS_B;
>  }
>
> +static int fpu_exception_from_fp_status(void)
> +{
> +    int mask;
> +
> +    const int float_flag_denormals = float_flag_input_denormal |
> +                                     float_flag_output_denormal;
> +
> +    // Most flags have the same value in the enum and 387 status word, except
> +    // the denormal flags.
> +    mask = env->fp_status.float_exception_flags & ~float_flag_denormals;
> +    if (env->fp_status.float_exception_flags & float_flag_denormals) {
> +        mask |= FPUS_DE;
> +    }
> +
> +    return mask;
> +}
> +
>  static inline floatx80 helper_fdiv(floatx80 a, floatx80 b)
>  {
>     if (floatx80_is_zero(b)) {
> @@ -4210,119 +4227,67 @@ void helper_fxtract(void)
>
>  void helper_fprem1(void)
>  {
> -    double st0, st1, dblq, fpsrcop, fptemp;
> -    CPU_LDoubleU fpsrcop1, fptemp1;
> -    int expdif;
> -    signed long long int q;
> -
> -    st0 = floatx80_to_double(ST0);
> -    st1 = floatx80_to_double(ST1);
> -
> -    if (isinf(st0) || isnan(st0) || isnan(st1) || (st1 == 0.0)) {
> -        ST0 = double_to_floatx80(0.0 / 0.0); /* NaN */
> -        env->fpus &= (~0x4700); /* (C3,C2,C1,C0) <-- 0000 */
> -        return;
> -    }
> -
> -    fpsrcop = st0;
> -    fptemp = st1;
> -    fpsrcop1.d = ST0;
> -    fptemp1.d = ST1;
> -    expdif = EXPD(fpsrcop1) - EXPD(fptemp1);
> -
> -    if (expdif < 0) {
> -        /* optimisation? taken from the AMD docs */
> +    floatx80 result;
> +    uint64_t quotient;
> +
> +    /* TODO(catalinp):
> +     *
> +     * - What is the defined purpose of fp_status in Qemu? Seems many things
> +     *   write into it, but few if any translate into env->fpus.

fp_status is a float_status structure which has two purposes:
 * tells the softfloat code how to behave in certain circumstances
   (how to handle denormals, rounding modes, etc)
 * tracks the IEEE "cumulative exception flags" (inexact, overflow, etc)
Every softfloat routine which needs to set exception flags or behave
differently according to the flags in fp_status will take a float_status*
argument. (This is obscured slightly by the silly STATUS_PARAM #define.)

(Note that for some CPU architectures there may be more than one float_status
struct; eg ARM has both an fp_status and a standard_fp_status, because Neon
operations behave differently from VFP ones; see comments in target-arm/cpu.h.
The Neon operations pass the softfloat code the standard_fp_status, and
the VFP ops pass in the fp_status.)

The way I would recommend using it is:
 * the "master copy" of this information is in the float_status
 * when we do a read of whatever CPU register(s) expose this info
   to the guest, construct that register value on the fly from the
   float_status struct
 * when we do a write of those registers (or a CPU reset), change
   the float_status struct accordingly
 * for most floating point arithmetic operations, no messing with
   float_status should then be necessary (we just pass a pointer to
   fp_status into the softfloat function)
The idea here is that fp operations are common but reading or writing
the registers which change rounding mode or reveal the exception
flags are rare, so we do the conversion only when we have to.

target-i386 is a bit of a latecomer to using softfloat and not
particularly maintained either, so it is quite possible it is
buggy in this regard. (Very few programs actually care about
the cumulative exception flags so it is quite easy for bugs in
this area to go unnoticed.) target-arm gets it right.

Also the functions which are still doing native arithmetic
(like the ones you're trying to convert) have to do their own
detection and handling of exception conditions. In theory at
least some of that code just vanishes into the generic softfloat
handling.

> +     *
> +     * - Bochs' FPU_pre_exception_handling resets a few more fields in fp_status
> +     *   before doing the operation. What is the purpose of that and is this
> +     *   necessary here?
> +     */
> +    env->fp_status.float_exception_flags = 0;

This will trash all the cumulative exception flags and is very unlikely
to be correct.

> +    int flags = floatx80_ieee754_remainder(ST0, ST1, &result, &quotient,
> +                                           &env->fp_status);
> +
> +    /* TODO(catalinp): Bochs also checks for unmasked exceptions before storing
> +     * these flags. Should we also do this?
> +     */
> +    if (flags >= 0) {
>         env->fpus &= (~0x4700); /* (C3,C2,C1,C0) <-- 0000 */
> -        /* ST0 is unchanged */
> -        return;
> +        if (flags) {
> +            env->fpus |= (1 << 10);
> +        } else {            /* (C0,C3,C1) <-- (q2,q1,q0) */
> +            if (quotient & 1) env->fpus |= (1 << 9);
> +            if (quotient & 2) env->fpus |= (1 << 14);
> +            if (quotient & 4) env->fpus |= (1 << 8);
> +        }
>     }
>
> -    if (expdif < 53) {
> -        dblq = fpsrcop / fptemp;
> -        /* round dblq towards nearest integer */
> -        dblq = rint(dblq);
> -        st0 = fpsrcop - fptemp * dblq;
> -
> -        /* convert dblq to q by truncating towards zero */
> -        if (dblq < 0.0)
> -           q = (signed long long int)(-dblq);
> -        else
> -           q = (signed long long int)dblq;
> +    ST0 = result;
>
> -        env->fpus &= (~0x4700); /* (C3,C2,C1,C0) <-- 0000 */
> -                                /* (C0,C3,C1) <-- (q2,q1,q0) */
> -        env->fpus |= (q & 0x4) << (8 - 2);  /* (C0) <-- q2 */
> -        env->fpus |= (q & 0x2) << (14 - 1); /* (C3) <-- q1 */
> -        env->fpus |= (q & 0x1) << (9 - 0);  /* (C1) <-- q0 */
> -    } else {
> -        env->fpus |= 0x400;  /* C2 <-- 1 */
> -        fptemp = pow(2.0, expdif - 50);
> -        fpsrcop = (st0 / st1) / fptemp;
> -        /* fpsrcop = integer obtained by chopping */
> -        fpsrcop = (fpsrcop < 0.0) ?
> -                  -(floor(fabs(fpsrcop))) : floor(fpsrcop);
> -        st0 -= (st1 * fpsrcop * fptemp);
> -    }
> -    ST0 = double_to_floatx80(st0);
> +    /* TODO(catalinp): Set only unmasked exceptions? */
> +    fpu_set_exception(fpu_exception_from_fp_status());

Check the implementation of fpu_set_exception -- it handles
the check for whether the exceptions are masked or not.

>  }
>
>  void helper_fprem(void)
>  {
> -    double st0, st1, dblq, fpsrcop, fptemp;
> -    CPU_LDoubleU fpsrcop1, fptemp1;
> -    int expdif;
> -    signed long long int q;
> -
> -    st0 = floatx80_to_double(ST0);
> -    st1 = floatx80_to_double(ST1);
> +    floatx80 result;
> +    uint64_t quotient;
>
> -    if (isinf(st0) || isnan(st0) || isnan(st1) || (st1 == 0.0)) {
> -       ST0 = double_to_floatx80(0.0 / 0.0); /* NaN */
> -       env->fpus &= (~0x4700); /* (C3,C2,C1,C0) <-- 0000 */
> -       return;
> -    }
> -
> -    fpsrcop = st0;
> -    fptemp = st1;
> -    fpsrcop1.d = ST0;
> -    fptemp1.d = ST1;
> -    expdif = EXPD(fpsrcop1) - EXPD(fptemp1);
> +    /* TODO(catalinp): See comments in helper_fprem1() about lots of potential
> +     * semantic ambiguities/differences between this and Bochs' implementation.
> +     */
> +    env->fp_status.float_exception_flags = 0;
> +    int flags = floatx80_remainder(ST0, ST1, &result, &quotient, &env->fp_status);
>
> -    if (expdif < 0) {
> -        /* optimisation? taken from the AMD docs */
> +    if (flags >= 0) {
>         env->fpus &= (~0x4700); /* (C3,C2,C1,C0) <-- 0000 */
> -        /* ST0 is unchanged */
> -        return;
> +        if (flags) {
> +            env->fpus |= (1 << 10);
> +        } else {            /* (C0,C3,C1) <-- (q2,q1,q0) */
> +            if (quotient & 1) env->fpus |= (1 << 9);
> +            if (quotient & 2) env->fpus |= (1 << 14);
> +            if (quotient & 4) env->fpus |= (1 << 8);
> +        }
>     }
>
> -    if ( expdif < 53 ) {
> -        dblq = fpsrcop/*ST0*/ / fptemp/*ST1*/;
> -        /* round dblq towards zero */
> -        dblq = (dblq < 0.0) ? ceil(dblq) : floor(dblq);
> -        st0 = fpsrcop/*ST0*/ - fptemp * dblq;
> +    ST0 = result;
>
> -        /* convert dblq to q by truncating towards zero */
> -        if (dblq < 0.0)
> -           q = (signed long long int)(-dblq);
> -        else
> -           q = (signed long long int)dblq;
> -
> -        env->fpus &= (~0x4700); /* (C3,C2,C1,C0) <-- 0000 */
> -                                /* (C0,C3,C1) <-- (q2,q1,q0) */
> -        env->fpus |= (q & 0x4) << (8 - 2);  /* (C0) <-- q2 */
> -        env->fpus |= (q & 0x2) << (14 - 1); /* (C3) <-- q1 */
> -        env->fpus |= (q & 0x1) << (9 - 0);  /* (C1) <-- q0 */
> -    } else {
> -        int N = 32 + (expdif % 32); /* as per AMD docs */
> -        env->fpus |= 0x400;  /* C2 <-- 1 */
> -        fptemp = pow(2.0, (double)(expdif - N));
> -        fpsrcop = (st0 / st1) / fptemp;
> -        /* fpsrcop = integer obtained by chopping */
> -        fpsrcop = (fpsrcop < 0.0) ?
> -                  -(floor(fabs(fpsrcop))) : floor(fpsrcop);
> -        st0 -= (st1 * fpsrcop * fptemp);
> -    }
> -    ST0 = double_to_floatx80(st0);
> +    fpu_set_exception(fpu_exception_from_fp_status());
>  }
>
>  void helper_fyl2xp1(void)
> --
> 1.7.7.3
>


-- PMM
Catalin Patulea - June 29, 2012, 1:50 a.m.
On Thu, Jun 28, 2012 at 2:25 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
> >  /*----------------------------------------------------------------------------
> > +| Returns 1 if the extended double-precision floating-point value `a' is an
> > +| unsupported; otherwise returns 0.
>
> an unsupported what?
I think it should be "unsupported value" - fixed.

> > +/*
> > + * BEGIN from Bochs rev 11224 dated 2012-06-21 10:33:37 -0700
> > + *
> > + * Converted to use Qemu type aliases, use C features only, etc.
> > + */
>
> I'm not convinced we want these markers in the source code
> (though they should probably be in the commit message).
Ok, moved to the commit message.

> Can you confirm that the Bochs licence is compatible with
> the QEMU one? In particular, is it compatible with/the same
> as the license as stated at the top of softfloat.c?
Bochs is licensed under LGPL 2.1, and QEMU is GPL 2. According to the
GNU license compatibility matrix [1], it appears to be allowed to copy
code from Bochs to QEMU.

However, each copy of SoftFloat is under its own license, which has a
warranty disclaimer and the following:

Derivative works are acceptable, even for commercial purposes, so long as
(1) the source code for the derivative work includes prominent notice that
the work is derivative, and (2) the source code includes prominent notice with
these four paragraphs for those parts of this code that are retained.

I'm not sure which license takes precedence. Perhaps because each
project has modified SoftFloat, each copy is now licensed under the
broader project license. In either case, I don't think there are any
issues (though IANAL).

> > +static uint64 remainder_kernel(uint64 aSig0, uint64 bSig, int expDiff, uint64 *zSig0, uint64 *zSig1)
> > +{
> > +    uint64 term0, term1;
> > +    uint64 aSig1 = 0;
>
> No new code should be using the uint64 &c types (which are
> "at least NN bits wide") -- uint64_t or uint_fast64_t please.
Ok, changed some {int -> flag} and {uint64 -> uint64_t}.

There are some int32s left, but these seem to be tied to
extractFloatx80Exp's return type, which is int32 despite the
underlying floatx80.high being an int16_t. Is this intentional? Fixing
this would imply many other changes so I would punt on this for now.

> > +    // handle unsupported extended double-precision floating encodings
> > +    if (floatx80_is_unsupported(a) || floatx80_is_unsupported(b))
> > +    {
> > +        float_raise(float_flag_invalid, status);
> > +        *r = floatx80_default_nan;
> > +        return -1;
> > +    }
>
> So are we mishandling unsupported 80bit floats in other operations
> (eg addition), or do those functions just opencode things?
Yes, I believe addition (and likely others) mishandles these. I put
together a small C program that clears FP exceptions, adds two long
doubles, and checks FP exceptions. (I do these things using glibc
functions like feclearexcept - I don't think this is any different
from doing it at the machine level, but just to be sure - are you
aware of any issues with this approach?)

The operands were (decimal notation, then hexadecimal exponent.significand):
st0: 1.000000000000 3FFF.8000000000000000
+
st1: 0.000000000000 3FFF.0000000000000000
// Intel IA-32 [2] defines the latter as a "positive unnormal" (Table
8-3), one of those "unsupported" values.

The result on bare metal was:
result: -nan FFFF.C000000000000000  (with FE_INVALID)

qemu i386-linux-user kvm-less:
result: 3.000000000000 4000.C000000000000000 (no exceptions)

I'm not sure what you mean by "opencode".

> > +/*----------------------------------------------------------------------------
> > +| Returns the remainder of the extended double-precision floating-point value
> > +| `a' with respect to the corresponding value `b'.  The operation is performed
> > +| according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic.
> > +*----------------------------------------------------------------------------*/
> This claims to return the remainder, but do_fprem only returns
> 0, 1 or -1...
Ok, fixed the comments.

> > +    /* TODO(catalinp):
> > +     *
> > +     * - What is the defined purpose of fp_status in Qemu? Seems many things
> > +     *   write into it, but few if any translate into env->fpus.
>
> fp_status is a float_status structure which has two purposes:
>  * tells the softfloat code how to behave in certain circumstances
>   (how to handle denormals, rounding modes, etc)
>  * tracks the IEEE "cumulative exception flags" (inexact, overflow, etc)
> Every softfloat routine which needs to set exception flags or behave
> differently according to the flags in fp_status will take a float_status*
> argument. (This is obscured slightly by the silly STATUS_PARAM #define.)
>
> (Note that for some CPU architectures there may be more than one float_status
> struct; eg ARM has both an fp_status and a standard_fp_status, because Neon
> operations behave differently from VFP ones; see comments in target-arm/cpu.h.
> The Neon operations pass the softfloat code the standard_fp_status, and
> the VFP ops pass in the fp_status.)
>
> The way I would recommend using it is:
>  * the "master copy" of this information is in the float_status
>  * when we do a read of whatever CPU register(s) expose this info
>   to the guest, construct that register value on the fly from the
>   float_status struct
>  * when we do a write of those registers (or a CPU reset), change
>   the float_status struct accordingly
>  * for most floating point arithmetic operations, no messing with
>   float_status should then be necessary (we just pass a pointer to
>   fp_status into the softfloat function)
> The idea here is that fp operations are common but reading or writing
> the registers which change rounding mode or reveal the exception
> flags are rare, so we do the conversion only when we have to.
>
> target-i386 is a bit of a latecomer to using softfloat and not
> particularly maintained either, so it is quite possible it is
> buggy in this regard. (Very few programs actually care about
> the cumulative exception flags so it is quite easy for bugs in
> this area to go unnoticed.) target-arm gets it right.
>
> Also the functions which are still doing native arithmetic
> (like the ones you're trying to convert) have to do their own
> detection and handling of exception conditions. In theory at
> least some of that code just vanishes into the generic softfloat
> handling.
Ok, this all makes sense, thanks for the clarification.

> > +     * - Bochs' FPU_pre_exception_handling resets a few more fields in fp_status
> > +     *   before doing the operation. What is the purpose of that and is this
> > +     *   necessary here?
> > +     */
> > +    env->fp_status.float_exception_flags = 0;
>
> This will trash all the cumulative exception flags and is very unlikely
> to be correct.
I've changed exception handling around slightly. The helpers now make
a copy of fp_status into local_status, and reset exception flags only
there. This way I can detect newly-raised exceptions without modifying
the global state. New exceptions are reported through
fpu_set_exception (even though I guess that's going away eventually).
Once env->fp_status becomes the authority for FPU state, local_status
in these helpers can simply go away.

> > +    /* TODO(catalinp): Set only unmasked exceptions? */
> > +    fpu_set_exception(fpu_exception_from_fp_status());
>
> Check the implementation of fpu_set_exception -- it handles
> the check for whether the exceptions are masked or not.
Yes, you are right. I was confusing the role of the mask bit, thought
it should also prevent reporting into the status word. Comment
removed.

As a sidenote, I did a quick check just to test my understanding of
FPREM exceptions on real hardware. I set st0 to 1.0 (arbitrary
positive value) and st1 to 0.0. I expected to get a div-by-zero
exception, which seems to be what's defined in IA-32 Table 3-41.
(ST0=+F and ST1=+0 gives "**" which means div-by-zero). This is the
snippet I used, wrapped with some variable initialization and printfs:

 asm("fnclex\n"
     "fprem\n"
     "fnstsw %1"
     : "=t"(result), "=m"(sw)
     : "f"(st0), "f"(st1)
     : "memory"
     );

Instead of div-by-zero, I got only invalid-operand. This is on an i5.
If I replace the FPREM with a (result = st0 / st1) I get div-by-zero.
Any idea what might be going on here?

I will wait to hear feedback on this mail before sending a patch v2.
Stefan, I've done a run of checkpatch.pl and I think I've addressed
nearly all warnings. When v2 goes out, please let me know if you still
see any style issues.

Catalin

[1] http://www.gnu.org/licenses/gpl-faq.html#AllCompatibility
[2] http://download.intel.com/products/processor/manual/325462.pdf
Peter Maydell - June 29, 2012, 7:30 a.m.
On 29 June 2012 02:50, Catalin Patulea <catalinp@google.com> wrote:
> On Thu, Jun 28, 2012 at 2:25 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
>> No new code should be using the uint64 &c types (which are
>> "at least NN bits wide") -- uint64_t or uint_fast64_t please.
> Ok, changed some {int -> flag} and {uint64 -> uint64_t}.
>
> There are some int32s left, but these seem to be tied to
> extractFloatx80Exp's return type, which is int32 despite the
> underlying floatx80.high being an int16_t. Is this intentional? Fixing
> this would imply many other changes so I would punt on this for now.

We haven't yet managed to do the full conversion away from the
softfloat custom int32 types to the POSIX ones...

>> > +    // handle unsupported extended double-precision floating encodings
>> > +    if (floatx80_is_unsupported(a) || floatx80_is_unsupported(b))
>> > +    {
>> > +        float_raise(float_flag_invalid, status);
>> > +        *r = floatx80_default_nan;
>> > +        return -1;
>> > +    }
>>
>> So are we mishandling unsupported 80bit floats in other operations
>> (eg addition), or do those functions just opencode things?
> Yes, I believe addition (and likely others) mishandles these. I put
> together a small C program that clears FP exceptions, adds two long
> doubles, and checks FP exceptions. (I do these things using glibc
> functions like feclearexcept - I don't think this is any different
> from doing it at the machine level, but just to be sure - are you
> aware of any issues with this approach?)

Well, you're relying on the compiler and libc not to insert
any other FPU operations into your code sequences, but it's
probably OK. For ARM testing I use
https://wiki.linaro.org/PeterMaydell/Risu , but getting that
to properly support x86 is probably at least a few days work
so if you're happy with your current testing approach that's fine.

> I'm not sure what you mean by "opencode".

do they just do the relevant checks as a bit of inline C
(rather than calling out to a function to do the checking)?

-- PMM
Andreas Färber - June 29, 2012, 1:13 p.m.
Am 28.06.2012 01:00, schrieb Catalin Patulea:
> Hey guys,
> 
> I've been hacking up the FPREM and FPREM1 i386 instructions (without KVM) to be implemented using SoftFloat only. This was motivated by some correctness issues
> that we noticed with the current implementation which follows the AMD datasheet.I believe the datasheet explains the operation as if intermediate results had "infinite" precision, but in this case intermediate rounding causes a loss of precision at the final output.
> 
> My reference was TestFloat [1], specifically the floatx80_rem suite, which tests the FPREM instruction quite thoroughly (FPREM1 is technically untested but very similar). The current code fails about half the suite; with the patch, all tests pass.
> 
> There are still lots of dark corners - the code originates from Bochs' copy of SoftFloat and I tried to port the Bochs exception handling logic as much as I could. Many details still elude me though (see comments in the code). TestFloat does test some of the exception logic but not as thoroughly as I would have liked. If anyone can offer some guidance here, I am happy to fix up the patch.
> 
> That's about it.. let me know what you think.
> 
> Catalin
> 
> [1] http://www.jhauser.us/arithmetic/TestFloat.html
> 
> ---
>  fpu/softfloat.c         |  182 +++++++++++++++++++++++++++++++++++++++++++++++
>  fpu/softfloat.h         |    3 +
>  target-i386/op_helper.c |  163 +++++++++++++++++-------------------------
>  3 files changed, 249 insertions(+), 99 deletions(-)
> 
> diff --git a/fpu/softfloat.c b/fpu/softfloat.c
> index b29256a..9c6e4a3 100644
> --- a/fpu/softfloat.c
> +++ b/fpu/softfloat.c
> @@ -5234,6 +5234,16 @@ int floatx80_unordered_quiet( floatx80 a, floatx80 b STATUS_PARAM )
>  }
>  
>  /*----------------------------------------------------------------------------
> +| Returns 1 if the extended double-precision floating-point value `a' is an
> +| unsupported; otherwise returns 0.
> +*----------------------------------------------------------------------------*/
> +int floatx80_is_unsupported(floatx80 a)
> +{
> +    return (extractFloatx80Exp(a) &&
> +            !(extractFloatx80Frac(a) & LIT64(0x8000000000000000)));
> +}

Disclaimer: It's been a while that I had to learn IEEE 754.

Unsupported for some i386 instruction? Or unsupported in the SoftFloat
library due to some limitation?

Please run scripts/checkpatch.pl for CODING_STYLE issues, I spotted one
slightly misplaced if brace, and one empty line at the bottom seemed to
have indentation (git-am complains about that, too).

The patch description will need to be cleaned up (not be letter-style).

I'm not too thrilled to introduce more uses of int32 (we started
converting int16 to int_fast16_t) but I won't veto. It would be nice if
you could review for any potential int32 vs. int32_t breakages though,
to not make things worse than they are.

Regards,
Andreas
Peter Maydell - June 29, 2012, 1:43 p.m.
On 29 June 2012 14:13, Andreas Färber <afaerber@suse.de> wrote:
> Am 28.06.2012 01:00, schrieb Catalin Patulea:
>>  /*----------------------------------------------------------------------------
>> +| Returns 1 if the extended double-precision floating-point value `a' is an
>> +| unsupported; otherwise returns 0.
>> +*----------------------------------------------------------------------------*/
>> +int floatx80_is_unsupported(floatx80 a)
>> +{
>> +    return (extractFloatx80Exp(a) &&
>> +            !(extractFloatx80Frac(a) & LIT64(0x8000000000000000)));
>> +}
>
> Disclaimer: It's been a while that I had to learn IEEE 754.
>
> Unsupported for some i386 instruction? Or unsupported in the SoftFloat
> library due to some limitation?

It means 'unsupported' in the sense of "Unsupported Double Extended-Precision
Floating-Point Encodings", ie there are some bit patterns in a floatx80 which
are not valid floating point numbers of any kind, and the Intel documentation
calls these "unsupported". See section 8.2.2 in "Intel 64 and IA-32
Architectures
Software Developer’s Manual Volume 1" 253665-034US.

-- PMM
Catalin Patulea - July 2, 2012, 2:51 p.m.
On Fri, Jun 29, 2012 at 9:13 AM, Andreas Färber <afaerber@suse.de> wrote:
> Please run scripts/checkpatch.pl for CODING_STYLE issues, I spotted one
> slightly misplaced if brace, and one empty line at the bottom seemed to
> have indentation (git-am complains about that, too).
Done. The only issue left is a false positive:

ERROR: "foo * bar" should be "foo *bar"
#248: FILE: fpu/softfloat.h:507:
+int floatx80_remainder(floatx80, floatx80, floatx80 *, uint64_t *
STATUS_PARAM);

(STATUS_PARAM is not the name of the param, it's a macro.)

> The patch description will need to be cleaned up (not be letter-style).
Done.

> I'm not too thrilled to introduce more uses of int32 (we started
> converting int16 to int_fast16_t) but I won't veto. It would be nice if
> you could review for any potential int32 vs. int32_t breakages though,
> to not make things worse than they are.
I did notice one place where int32 was used where int should be used
(the return variable of do_fprem), and that's now fixed. But for the
other uses of int32, as I was saying to Peter, they are tied to the
return type of extractFloatx80Exp. I wouldn't give the variable a
different type than the function, and especially not in this case
because the function should become int16_t, not int32_t.

I'll be sending a PATCHv2 momentarily.

Patch

diff --git a/fpu/softfloat.c b/fpu/softfloat.c
index b29256a..9c6e4a3 100644
--- a/fpu/softfloat.c
+++ b/fpu/softfloat.c
@@ -5234,6 +5234,16 @@  int floatx80_unordered_quiet( floatx80 a, floatx80 b STATUS_PARAM )
 }
 
 /*----------------------------------------------------------------------------
+| Returns 1 if the extended double-precision floating-point value `a' is an
+| unsupported; otherwise returns 0.
+*----------------------------------------------------------------------------*/
+int floatx80_is_unsupported(floatx80 a)
+{
+    return (extractFloatx80Exp(a) &&
+            !(extractFloatx80Frac(a) & LIT64(0x8000000000000000)));
+}
+
+/*----------------------------------------------------------------------------
 | Returns the result of converting the quadruple-precision floating-point
 | value `a' to the 32-bit two's complement integer format.  The conversion
 | is performed according to the IEC/IEEE Standard for Binary Floating-Point
@@ -6828,6 +6838,178 @@  floatx80 floatx80_scalbn( floatx80 a, int n STATUS_PARAM )
                                           aSign, aExp, aSig, 0 STATUS_VAR );
 }
 
+/*
+ * BEGIN from Bochs rev 11224 dated 2012-06-21 10:33:37 -0700
+ *
+ * Converted to use Qemu type aliases, use C features only, etc.
+ */
+
+/* executes single exponent reduction cycle */
+static uint64 remainder_kernel(uint64 aSig0, uint64 bSig, int expDiff, uint64 *zSig0, uint64 *zSig1)
+{
+    uint64 term0, term1;
+    uint64 aSig1 = 0;
+
+    shortShift128Left(aSig1, aSig0, expDiff, &aSig1, &aSig0);
+    uint64 q = estimateDiv128To64(aSig1, aSig0, bSig);
+    mul64To128(bSig, q, &term0, &term1);
+    sub128(aSig1, aSig0, term0, term1, zSig1, zSig0);
+    while ((int64)(*zSig1) < 0) {
+        --q;
+        add128(*zSig1, *zSig0, 0, bSig, zSig1, zSig0);
+    }
+    return q;
+}
+
+static int do_fprem(floatx80 a, floatx80 b, floatx80 *r, uint64 *q, int rounding_mode STATUS_PARAM )
+{
+    int32 aExp, bExp, zExp, expDiff;
+    uint64 aSig0, aSig1, bSig;
+    int aSign;
+    *q = 0;
+
+    // handle unsupported extended double-precision floating encodings
+    if (floatx80_is_unsupported(a) || floatx80_is_unsupported(b))
+    {
+        float_raise(float_flag_invalid, status);
+        *r = floatx80_default_nan;
+        return -1;
+    }
+
+    aSig0 = extractFloatx80Frac(a);
+    aExp = extractFloatx80Exp(a);
+    aSign = extractFloatx80Sign(a);
+    bSig = extractFloatx80Frac(b);
+    bExp = extractFloatx80Exp(b);
+
+    if (aExp == 0x7FFF) {
+        if ((uint64) (aSig0<<1) || ((bExp == 0x7FFF) && (uint64) (bSig<<1))) {
+            *r = propagateFloatx80NaN(a, b, status);
+            return -1;
+        }
+        float_raise(float_flag_invalid, status);
+        *r = floatx80_default_nan;
+        return -1;
+    }
+    if (bExp == 0x7FFF) {
+        if ((uint64) (bSig<<1)) {
+            *r = propagateFloatx80NaN(a, b, status);
+            return -1;
+        }
+        if (aExp == 0 && aSig0) {
+            float_raise(float_flag_input_denormal, status);
+            normalizeFloatx80Subnormal(aSig0, &aExp, &aSig0);
+            *r = (extractFloatx80Frac(a) & LIT64(0x8000000000000000)) ?
+                    packFloatx80(aSign, aExp, aSig0) : a;
+            return 0;
+        }
+        *r = a;
+        return 0;
+
+    }
+    if (bExp == 0) {
+        if (bSig == 0) {
+            float_raise(float_flag_invalid, status);
+            *r = floatx80_default_nan;
+            return -1;
+        }
+        float_raise(float_flag_input_denormal, status);
+        normalizeFloatx80Subnormal(bSig, &bExp, &bSig);
+    }
+    if (aExp == 0) {
+        if (aSig0 == 0) {
+            *r = a;
+            return 0;
+        }
+        float_raise(float_flag_input_denormal, status);
+        normalizeFloatx80Subnormal(aSig0, &aExp, &aSig0);
+    }
+    expDiff = aExp - bExp;
+    aSig1 = 0;
+
+    uint32 overflow = 0;
+
+    if (expDiff >= 64) {
+        int n = (expDiff & 0x1f) | 0x20;
+        remainder_kernel(aSig0, bSig, n, &aSig0, &aSig1);
+        zExp = aExp - n;
+        overflow = 1;
+    }
+    else {
+        zExp = bExp;
+
+        if (expDiff < 0) {
+            if (expDiff < -1) {
+               *r = (extractFloatx80Frac(a) & LIT64(0x8000000000000000)) ?
+                    packFloatx80(aSign, aExp, aSig0) : a;
+               return 0;
+            }
+            shift128Right(aSig0, 0, 1, &aSig0, &aSig1);
+            expDiff = 0;
+        }
+
+        if (expDiff > 0) {
+            *q = remainder_kernel(aSig0, bSig, expDiff, &aSig0, &aSig1);
+        }
+        else {
+            if (bSig <= aSig0) {
+               aSig0 -= bSig;
+               *q = 1;
+            }
+        }
+
+        if (rounding_mode == float_round_nearest_even)
+        {
+            uint64 term0, term1;
+            shift128Right(bSig, 0, 1, &term0, &term1);
+
+            if (! lt128(aSig0, aSig1, term0, term1))
+            {
+               int lt = lt128(term0, term1, aSig0, aSig1);
+               int eq = eq128(aSig0, aSig1, term0, term1);
+
+               if ((eq && (*q & 1)) || lt) {
+                  aSign = !aSign;
+                  ++*q;
+               }
+               if (lt) sub128(bSig, 0, aSig0, aSig1, &aSig0, &aSig1);
+            }
+        }
+    }
+
+    *r = normalizeRoundAndPackFloatx80(80, aSign, zExp, aSig0, aSig1, status);
+    return overflow;
+}
+
+/*----------------------------------------------------------------------------
+| Returns the remainder of the extended double-precision floating-point value
+| `a' with respect to the corresponding value `b'.  The operation is performed
+| according to the IEC/IEEE Standard for Binary Floating-Point Arithmetic.
+*----------------------------------------------------------------------------*/
+
+int floatx80_ieee754_remainder(floatx80 a, floatx80 b, floatx80 *r, uint64 *q STATUS_PARAM )
+{
+    return do_fprem(a, b, r, q, float_round_nearest_even STATUS_VAR );
+}
+
+/*----------------------------------------------------------------------------
+| Returns the remainder of the extended double-precision floating-point value
+| `a' with  respect to  the corresponding value `b'. Unlike previous function
+| the  function  does not compute  the remainder  specified  in  the IEC/IEEE
+| Standard  for Binary  Floating-Point  Arithmetic.  This  function  operates
+| differently  from the  previous  function in  the way  that it  rounds  the
+| quotient of 'a' divided by 'b' to an integer.
+*----------------------------------------------------------------------------*/
+
+int floatx80_remainder(floatx80 a, floatx80 b, floatx80 *r, uint64 *q STATUS_PARAM )
+{
+    return do_fprem(a, b, r, q, float_round_to_zero STATUS_VAR );
+}
+
+/*
+ * END from Bochs rev 11224 dated 2012-06-21 10:33:37 -0700
+ */
+
 float128 float128_scalbn( float128 a, int n STATUS_PARAM )
 {
     flag aSign;
diff --git a/fpu/softfloat.h b/fpu/softfloat.h
index feec3a1..53d7827 100644
--- a/fpu/softfloat.h
+++ b/fpu/softfloat.h
@@ -497,10 +497,13 @@  int floatx80_lt_quiet( floatx80, floatx80 STATUS_PARAM );
 int floatx80_unordered_quiet( floatx80, floatx80 STATUS_PARAM );
 int floatx80_compare( floatx80, floatx80 STATUS_PARAM );
 int floatx80_compare_quiet( floatx80, floatx80 STATUS_PARAM );
+int floatx80_is_unsupported( floatx80 );
 int floatx80_is_quiet_nan( floatx80 );
 int floatx80_is_signaling_nan( floatx80 );
 floatx80 floatx80_maybe_silence_nan( floatx80 );
 floatx80 floatx80_scalbn( floatx80, int STATUS_PARAM );
+int floatx80_ieee754_remainder( floatx80, floatx80, floatx80 *, uint64 * STATUS_PARAM );
+int floatx80_remainder( floatx80, floatx80, floatx80 *, uint64 * STATUS_PARAM );
 
 INLINE floatx80 floatx80_abs(floatx80 a)
 {
diff --git a/target-i386/op_helper.c b/target-i386/op_helper.c
index 2862ea4..ae0c877 100644
--- a/target-i386/op_helper.c
+++ b/target-i386/op_helper.c
@@ -3620,6 +3620,23 @@  static void fpu_set_exception(int mask)
         env->fpus |= FPUS_SE | FPUS_B;
 }
 
+static int fpu_exception_from_fp_status(void)
+{
+    int mask;
+
+    const int float_flag_denormals = float_flag_input_denormal |
+                                     float_flag_output_denormal;
+
+    // Most flags have the same value in the enum and 387 status word, except
+    // the denormal flags.
+    mask = env->fp_status.float_exception_flags & ~float_flag_denormals;
+    if (env->fp_status.float_exception_flags & float_flag_denormals) {
+        mask |= FPUS_DE;
+    }
+
+    return mask;
+}
+
 static inline floatx80 helper_fdiv(floatx80 a, floatx80 b)
 {
     if (floatx80_is_zero(b)) {
@@ -4210,119 +4227,67 @@  void helper_fxtract(void)
 
 void helper_fprem1(void)
 {
-    double st0, st1, dblq, fpsrcop, fptemp;
-    CPU_LDoubleU fpsrcop1, fptemp1;
-    int expdif;
-    signed long long int q;
-
-    st0 = floatx80_to_double(ST0);
-    st1 = floatx80_to_double(ST1);
-
-    if (isinf(st0) || isnan(st0) || isnan(st1) || (st1 == 0.0)) {
-        ST0 = double_to_floatx80(0.0 / 0.0); /* NaN */
-        env->fpus &= (~0x4700); /* (C3,C2,C1,C0) <-- 0000 */
-        return;
-    }
-
-    fpsrcop = st0;
-    fptemp = st1;
-    fpsrcop1.d = ST0;
-    fptemp1.d = ST1;
-    expdif = EXPD(fpsrcop1) - EXPD(fptemp1);
-
-    if (expdif < 0) {
-        /* optimisation? taken from the AMD docs */
+    floatx80 result;
+    uint64_t quotient;
+
+    /* TODO(catalinp):
+     *
+     * - What is the defined purpose of fp_status in Qemu? Seems many things
+     *   write into it, but few if any translate into env->fpus.
+     *
+     * - Bochs' FPU_pre_exception_handling resets a few more fields in fp_status
+     *   before doing the operation. What is the purpose of that and is this
+     *   necessary here?
+     */
+    env->fp_status.float_exception_flags = 0;
+    int flags = floatx80_ieee754_remainder(ST0, ST1, &result, &quotient,
+                                           &env->fp_status);
+
+    /* TODO(catalinp): Bochs also checks for unmasked exceptions before storing
+     * these flags. Should we also do this?
+     */
+    if (flags >= 0) {
         env->fpus &= (~0x4700); /* (C3,C2,C1,C0) <-- 0000 */
-        /* ST0 is unchanged */
-        return;
+        if (flags) {
+            env->fpus |= (1 << 10);
+        } else {            /* (C0,C3,C1) <-- (q2,q1,q0) */
+            if (quotient & 1) env->fpus |= (1 << 9);
+            if (quotient & 2) env->fpus |= (1 << 14);
+            if (quotient & 4) env->fpus |= (1 << 8);
+        }
     }
 
-    if (expdif < 53) {
-        dblq = fpsrcop / fptemp;
-        /* round dblq towards nearest integer */
-        dblq = rint(dblq);
-        st0 = fpsrcop - fptemp * dblq;
-
-        /* convert dblq to q by truncating towards zero */
-        if (dblq < 0.0)
-           q = (signed long long int)(-dblq);
-        else
-           q = (signed long long int)dblq;
+    ST0 = result;
 
-        env->fpus &= (~0x4700); /* (C3,C2,C1,C0) <-- 0000 */
-                                /* (C0,C3,C1) <-- (q2,q1,q0) */
-        env->fpus |= (q & 0x4) << (8 - 2);  /* (C0) <-- q2 */
-        env->fpus |= (q & 0x2) << (14 - 1); /* (C3) <-- q1 */
-        env->fpus |= (q & 0x1) << (9 - 0);  /* (C1) <-- q0 */
-    } else {
-        env->fpus |= 0x400;  /* C2 <-- 1 */
-        fptemp = pow(2.0, expdif - 50);
-        fpsrcop = (st0 / st1) / fptemp;
-        /* fpsrcop = integer obtained by chopping */
-        fpsrcop = (fpsrcop < 0.0) ?
-                  -(floor(fabs(fpsrcop))) : floor(fpsrcop);
-        st0 -= (st1 * fpsrcop * fptemp);
-    }
-    ST0 = double_to_floatx80(st0);
+    /* TODO(catalinp): Set only unmasked exceptions? */
+    fpu_set_exception(fpu_exception_from_fp_status());
 }
 
 void helper_fprem(void)
 {
-    double st0, st1, dblq, fpsrcop, fptemp;
-    CPU_LDoubleU fpsrcop1, fptemp1;
-    int expdif;
-    signed long long int q;
-
-    st0 = floatx80_to_double(ST0);
-    st1 = floatx80_to_double(ST1);
+    floatx80 result;
+    uint64_t quotient;
 
-    if (isinf(st0) || isnan(st0) || isnan(st1) || (st1 == 0.0)) {
-       ST0 = double_to_floatx80(0.0 / 0.0); /* NaN */
-       env->fpus &= (~0x4700); /* (C3,C2,C1,C0) <-- 0000 */
-       return;
-    }
-
-    fpsrcop = st0;
-    fptemp = st1;
-    fpsrcop1.d = ST0;
-    fptemp1.d = ST1;
-    expdif = EXPD(fpsrcop1) - EXPD(fptemp1);
+    /* TODO(catalinp): See comments in helper_fprem1() about lots of potential
+     * semantic ambiguities/differences between this and Bochs' implementation.
+     */
+    env->fp_status.float_exception_flags = 0;
+    int flags = floatx80_remainder(ST0, ST1, &result, &quotient, &env->fp_status);
 
-    if (expdif < 0) {
-        /* optimisation? taken from the AMD docs */
+    if (flags >= 0) {
         env->fpus &= (~0x4700); /* (C3,C2,C1,C0) <-- 0000 */
-        /* ST0 is unchanged */
-        return;
+        if (flags) {
+            env->fpus |= (1 << 10);
+        } else {            /* (C0,C3,C1) <-- (q2,q1,q0) */
+            if (quotient & 1) env->fpus |= (1 << 9);
+            if (quotient & 2) env->fpus |= (1 << 14);
+            if (quotient & 4) env->fpus |= (1 << 8);
+        }
     }
 
-    if ( expdif < 53 ) {
-        dblq = fpsrcop/*ST0*/ / fptemp/*ST1*/;
-        /* round dblq towards zero */
-        dblq = (dblq < 0.0) ? ceil(dblq) : floor(dblq);
-        st0 = fpsrcop/*ST0*/ - fptemp * dblq;
+    ST0 = result;
 
-        /* convert dblq to q by truncating towards zero */
-        if (dblq < 0.0)
-           q = (signed long long int)(-dblq);
-        else
-           q = (signed long long int)dblq;
-
-        env->fpus &= (~0x4700); /* (C3,C2,C1,C0) <-- 0000 */
-                                /* (C0,C3,C1) <-- (q2,q1,q0) */
-        env->fpus |= (q & 0x4) << (8 - 2);  /* (C0) <-- q2 */
-        env->fpus |= (q & 0x2) << (14 - 1); /* (C3) <-- q1 */
-        env->fpus |= (q & 0x1) << (9 - 0);  /* (C1) <-- q0 */
-    } else {
-        int N = 32 + (expdif % 32); /* as per AMD docs */
-        env->fpus |= 0x400;  /* C2 <-- 1 */
-        fptemp = pow(2.0, (double)(expdif - N));
-        fpsrcop = (st0 / st1) / fptemp;
-        /* fpsrcop = integer obtained by chopping */
-        fpsrcop = (fpsrcop < 0.0) ?
-                  -(floor(fabs(fpsrcop))) : floor(fpsrcop);
-        st0 -= (st1 * fpsrcop * fptemp);
-    }
-    ST0 = double_to_floatx80(st0);
+    fpu_set_exception(fpu_exception_from_fp_status());
 }
 
 void helper_fyl2xp1(void)