
[RFC,v2] target/ppc: Enable hardfloat for PPC

Message ID 20200218171702.979F074637D@zero.eik.bme.hu
State New
Series [RFC,v2] target/ppc: Enable hardfloat for PPC

Commit Message

BALATON Zoltan Feb. 18, 2020, 5:10 p.m. UTC
While other targets take advantage of using host FPU to do floating
point computations, this was disabled for PPC target because always
clearing exception flags before every FP op made it slightly slower
than emulating everything with softfloat. To emulate some FPSCR bits,
clearing of fp_status may be necessary (unless these could be handled
e.g. using FP exceptions on host but there's no API for that in QEMU
yet) but preserving at least the inexact flag makes hardfloat usable
and faster than softfloat. Since most clients don't actually care
about this flag, we can gain some speed trading some emulation
accuracy.

This patch implements a simple way to keep the inexact flag set for
hardfloat while still allowing to revert to softfloat for workloads
that need more accurate albeit slower emulation. (Set hardfloat
property of CPU, i.e. -cpu name,hardfloat=false for that.) There may
still be room for further improvement but this seems to increase
floating point performance. Unfortunately the softfloat case is slower
than before this patch so this patch only makes sense if the default
is also set to enable hardfloat.

Because of the above this patch at the moment is mainly for testing
different workloads to evaluate how viable this would be in practice.
Thus, RFC and not ready for merge yet.

Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
---
v2: use different approach to avoid needing if () in
helper_reset_fpstatus() but this does not seem to change overhead
much, also make it a single patch as adding the hardfloat option is
only a few lines; with this we can use the same value at other places where
float_status is reset and maybe enable hardfloat for a few more places
for a little more performance but not too much. With this I got:

lame: 3:13, lame_vmx: 1:55 (this is probably within jitter though and
still far from the results on real hardware). I also tried the mplayer
test and got results between 144-146s (this test is more VMX bound).

I've also done some profiling for the hardfloat=true and false cases with
this patch to see what the hot functions are. Results are:

-cpu G4,hardfloat=false, lame:
Overhead  Command          Symbol
   9.82%  qemu-system-ppc  [.] round_canonical
   8.35%  qemu-system-ppc  [.] soft_f64_muladd
   7.16%  qemu-system-ppc  [.] soft_f64_addsub
   5.27%  qemu-system-ppc  [.] float32_to_float64
   5.20%  qemu-system-ppc  [.] helper_compute_fprf_float64
   4.61%  qemu-system-ppc  [.] helper_frsp
   4.59%  qemu-system-ppc  [.] soft_f64_mul
   4.01%  qemu-system-ppc  [.] float_to_float.isra.26
   3.84%  qemu-system-ppc  [.] float64_classify
   2.97%  qemu-system-ppc  [.] do_float_check_status

-cpu G4,hardfloat=false, lame_vmx:
Overhead  Command          Symbol
  10.04%  qemu-system-ppc  [.] float32_muladd
   9.49%  qemu-system-ppc  [.] helper_vperm
   6.10%  qemu-system-ppc  [.] round_canonical
   4.13%  qemu-system-ppc  [.] soft_f64_addsub
   3.23%  qemu-system-ppc  [.] helper_frsp
   3.13%  qemu-system-ppc  [.] soft_f64_muladd
   2.88%  qemu-system-ppc  [.] helper_vmaddfp
   2.69%  qemu-system-ppc  [.] float32_add
   2.60%  qemu-system-ppc  [.] float32_to_float64
   2.52%  qemu-system-ppc  [.] helper_compute_fprf_float64
    
-cpu G4,hardfloat=true, lame:
  11.59%  qemu-system-ppc  [.] round_canonical
   6.18%  qemu-system-ppc  [.] helper_compute_fprf_float64
   6.01%  qemu-system-ppc  [.] float32_to_float64
   4.58%  qemu-system-ppc  [.] float64_classify
   3.87%  qemu-system-ppc  [.] helper_frsp
   3.75%  qemu-system-ppc  [.] float_to_float.isra.26
   3.48%  qemu-system-ppc  [.] helper_todouble
   3.31%  qemu-system-ppc  [.] float64_muladd
   3.21%  qemu-system-ppc  [.] do_float_check_status
   3.01%  qemu-system-ppc  [.] float64_mul

-cpu G4,hardfloat=true, lame_vmx:
   9.34%  qemu-system-ppc  [.] float32_muladd
   8.83%  qemu-system-ppc  [.] helper_vperm
   5.41%  qemu-system-ppc  [.] round_canonical
   4.51%  qemu-system-ppc  [.] page_collection_lock
   3.58%  qemu-system-ppc  [.] page_trylock_add.isra.17
   2.71%  qemu-system-ppc  [.] helper_vmaddfp
   2.53%  qemu-system-ppc  [.] float32_add
   2.30%  qemu-system-ppc  [.] helper_compute_fprf_float64
   2.21%  qemu-system-ppc  [.] float32_to_float64
   2.06%  qemu-system-ppc  [.] helper_frsp

round_canonical seems to come up frequently in these profiles with a large overhead.

Could those with better test cases or benchmarks please give it a try
on different CPUs to see what else this would break?

---
 fpu/softfloat.c                 | 14 +++++++-------
 target/ppc/cpu.h                |  2 ++
 target/ppc/fpu_helper.c         | 32 ++++++++++++++++----------------
 target/ppc/translate_init.inc.c |  3 +++
 4 files changed, 28 insertions(+), 23 deletions(-)

Comments

BALATON Zoltan Feb. 18, 2020, 5:38 p.m. UTC | #1
On Tue, 18 Feb 2020, BALATON Zoltan wrote:
> While other targets take advantage of using host FPU to do floating
> point computations, this was disabled for PPC target because always
> clearing exception flags before every FP op made it slightly slower
> than emulating everything with softfloat. To emulate some FPSCR bits,
> clearing of fp_status may be necessary (unless these could be handled
> e.g. using FP exceptions on host but there's no API for that in QEMU
> yet) but preserving at least the inexact flag makes hardfloat usable
> and faster than softfloat. Since most clients don't actually care
> about this flag, we can gain some speed trading some emulation
> accuracy.
>
> This patch implements a simple way to keep the inexact flag set for
> hardfloat while still allowing to revert to softfloat for workloads
> that need more accurate albeit slower emulation. (Set hardfloat
> property of CPU, i.e. -cpu name,hardfloat=false for that.) There may
> still be room for further improvement but this seems to increase
> floating point performance. Unfortunately the softfloat case is slower
> than before this patch so this patch only makes sense if the default
> is also set to enable hardfloat.
>
> Because of the above this patch at the moment is mainly for testing
> different workloads to evaluate how viable this would be in practice.
> Thus, RFC and not ready for merge yet.
>
> Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
> ---

Forgot to say, in case that's not clear: this should currently be applied 
on David's ppc-for-5.0 branch, because there are some clean-up patches from 
me queued there, so until those reach master this patch may not apply 
cleanly. Just in case someone wants to try it and gets errors.

Regards,
BALATON Zoltan
Programmingkid Feb. 19, 2020, 2:27 a.m. UTC | #2
> On Feb 18, 2020, at 12:10 PM, BALATON Zoltan <balaton@eik.bme.hu> wrote:
> 
> While other targets take advantage of using host FPU to do floating
> point computations, this was disabled for PPC target because always
> clearing exception flags before every FP op made it slightly slower
> than emulating everything with softfloat. To emulate some FPSCR bits,
> clearing of fp_status may be necessary (unless these could be handled
> e.g. using FP exceptions on host but there's no API for that in QEMU
> yet) but preserving at least the inexact flag makes hardfloat usable
> and faster than softfloat. Since most clients don't actually care
> about this flag, we can gain some speed trading some emulation
> accuracy.
> 
> This patch implements a simple way to keep the inexact flag set for
> hardfloat while still allowing to revert to softfloat for workloads
> that need more accurate albeit slower emulation. (Set hardfloat
> property of CPU, i.e. -cpu name,hardfloat=false for that.) There may
> still be room for further improvement but this seems to increase
> floating point performance. Unfortunately the softfloat case is slower
> than before this patch so this patch only makes sense if the default
> is also set to enable hardfloat.
> 
> Because of the above this patch at the moment is mainly for testing
> different workloads to evaluate how viable this would be in practice.
> Thus, RFC and not ready for merge yet.
> 
> Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
> ---
> v2: use different approach to avoid needing if () in
> helper_reset_fpstatus() but this does not seem to change overhead
> much, also make it a single patch as adding the hardfloat option is
> only a few lines; with this we can use the same value at other places where
> float_status is reset and maybe enable hardfloat for a few more places
> for a little more performance but not too much. With this I got:

<snip>

Thank you for working on this. It is about time we have a better FPU. 

I applied your patch over David Gibson's ppc-for-5.0 branch. It applied cleanly and compiled easily.

Tests were done on a Mac OS 10.4.3 VM. The CPU was set to G3. 

I did several tests and here are my results:

With hard float:
- The USB audio device does not produce any sound. 
- Converting a MIDI file to AAC in iTunes happens at 0.4x (faster than soft float :) ).
- For my FPSCR test program, 21 tests failed. The high number is because the inexact exception is being set in situations where it should not be (a minimal example of such a check is sketched at the end of this message).

With soft float:
- Some sound can be heard from the USB audio device. It doesn't sound good. I had to force quit QuickTime Player because it stopped working.
- Converting a MIDI file to AAC in iTunes happens at 0.3x (slower than hard float).
- 13 tests failed with my FPSCR test program.

This patch is a good start. I'm not worried about the Floating Point Status and Control Register flags being wrong since hardly any software bothers to check them. I think more optimizations can happen by simplifying the FPU. As it is now it makes a lot of calls per operation.
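
For reference, a minimal, hypothetical check in the spirit of such an FPSCR test program (the actual test suite is not shown in this thread, so this is only a sketch) could look like the following guest-side C, using the standard fenv interface instead of reading FPSCR directly; link with -lm:

    #include <fenv.h>
    #include <stdio.h>

    int main(void)
    {
        volatile double a = 1.0, b = 2.0, c;

        /* An exact operation must leave the inexact flag clear; forcing the
         * flag on, as the patch does, is exactly what makes this case fail. */
        feclearexcept(FE_ALL_EXCEPT);
        c = a + b;                                    /* 1.0 + 2.0 is exact */
        printf("exact add:   inexact %s\n",
               fetestexcept(FE_INEXACT) ? "SET (unexpected)" : "clear (ok)");

        /* A result that cannot be represented exactly must set the flag. */
        feclearexcept(FE_ALL_EXCEPT);
        c = a / 3.0;                                  /* 1/3 is inexact */
        printf("inexact div: inexact %s\n",
               fetestexcept(FE_INEXACT) ? "set (ok)" : "CLEAR (unexpected)");

        (void)c;
        return 0;
    }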
BALATON Zoltan Feb. 19, 2020, 3:35 p.m. UTC | #3
Hello,

On Tue, 18 Feb 2020, Programmingkid wrote:
>> On Feb 18, 2020, at 12:10 PM, BALATON Zoltan <balaton@eik.bme.hu> wrote:
>> While other targets take advantage of using host FPU to do floating
>> point computations, this was disabled for PPC target because always
>> clearing exception flags before every FP op made it slightly slower
>> than emulating everything with softfloat. To emulate some FPSCR bits,
>> clearing of fp_status may be necessary (unless these could be handled
>> e.g. using FP exceptions on host but there's no API for that in QEMU
>> yet) but preserving at least the inexact flag makes hardfloat usable
>> and faster than softfloat. Since most clients don't actually care
>> about this flag, we can gain some speed trading some emulation
>> accuracy.
>>
>> This patch implements a simple way to keep the inexact flag set for
>> hardfloat while still allowing to revert to softfloat for workloads
>> that need more accurate albeit slower emulation. (Set hardfloat
>> property of CPU, i.e. -cpu name,hardfloat=false for that.) There may
>> still be room for further improvement but this seems to increase
>> floating point performance. Unfortunately the softfloat case is slower
>> than before this patch so this patch only makes sense if the default
>> is also set to enable hardfloat.
>>
>> Because of the above this patch at the moment is mainly for testing
>> different workloads to evaluate how viable this would be in practice.
>> Thus, RFC and not ready for merge yet.
>>
>> Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
>> ---
>> v2: use different approach to avoid needing if () in
>> helper_reset_fpstatus() but this does not seem to change overhead
>> much, also make it a single patch as adding the hardfloat option is
>> only a few lines; with this we can use the same value at other places where
>> float_status is reset and maybe enable hardfloat for a few more places
>> for a little more performance but not too much. With this I got:
>
> <snip>
>
> Thank you for working on this. It is about time we have a better FPU.

Thank you for testing it. I think it would be great if we could come up 
with some viable approach to improve this before the next freeze.

> I applied your patch over David Gibson's ppc-for-5.0 branch. It applied cleanly and compiled easily.

I've heard some preliminary results from others that there's also a 
performance difference between v1 and v2 of the patch, where v1 may be 
faster in some cases, so if you (or someone else) want to and have time 
you could experiment with different versions and combinations as well to 
find the one that's best on all CPUs. Basically we have these parts:

1. Change target/ppc/fpu_helper.c::helper_reset_fpstatus() to force 
float_flag_inexact on in case hardfloat is enabled; I've tried two 
approaches for this (both variants are sketched after point 2 below):

a. In v1 added an if () in the function
b. In v2 used a variable from env set earlier (I hoped this may be 
faster but maybe it's not, testing and explanation is welcome)

2. Also change places where env->fp_status is copied to a local tstat and 
then reset (I think this is done to accumulate flags from multiple FP ops 
that would individually reset env->fp_status, or for some other reason; 
maybe this could be avoided if we reset fp_status less often, but that 
would need more understanding of the FP emulation than I have, so I did 
not try to clean that up yet).
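
For reference, a rough sketch of the two variants of part 1 (not the exact 
patch text; the env field tested in the v1 variant is only an assumed 
stand-in for however the hardfloat property ends up being looked up, while 
default_fp_excpt_flags is the field the v2 diff actually adds):

    /* 1a. (v1) decide at reset time whether to keep the inexact flag: */
    void helper_reset_fpstatus(CPUPPCState *env)
    {
        int flags = 0;

        if (env->use_hardfloat) {        /* assumed flag mirroring the property */
            flags = float_flag_inexact;  /* keep inexact so hardfloat stays usable */
        }
        set_float_exception_flags(flags, &env->fp_status);
    }

    /* 1b. (v2) precompute the default once when the property is processed
     * and only store it here, avoiding the branch in the per-op fast path: */
    void helper_reset_fpstatus(CPUPPCState *env)
    {
        set_float_exception_flags(env->default_fp_excpt_flags, &env->fp_status);
    }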

If v2 is really slower than v1 then I'm not sure whether it's because of 
also changing the places with tstat or because of the different approach 
in helper_reset_fpstatus(), so you could try combinations of these as well.

> Tests were done on a Mac OS 10.4.3 VM. The CPU was set to G3.

What was the host CPU and OS this was tested on? Please always share CPU 
info and host OS when sharing benchmark results so they are somewhat 
comparable. It also depends on CPU features, for vector instructions at 
least, so without CPU info the results cannot be understood.

I think G3 does not have AltiVec/VMX so maybe testing with G4 would be 
better to also test those ops, unless there's a reason to only test G3. 
I've tested with G4 both FPU-only and FPU+VMX code on a Linux host with an 
i7-9700K CPU @ 3.60GHz, as was noted in the original cover letter, but 
maybe I've also forgotten some details so I list it here again.

> I did several tests and here are my results:
>
> With hard float:
> - The USB audio device does not produce any sound.

I've heard this could also be due to some other problem not directly 
related to the FPU; maybe there's a problem with USB/OHCI emulation as 
well, because problems with that were reported, but it's interesting why 
you get different results when changing FPU-related stuff. I think OSX 
uses float samples so it probably does use the FPU for processing sound 
and may rely on some peculiarity of the hardware, as it was probably 
optimised for Apple hardware. It would be interesting to find out how the 
FPU stuff is related to this, but since it's broken anyway it's probably 
not a show stopper at the moment.

> - Converting a MIDI file to AAC in iTunes happens at 0.4x (faster than soft float :) ).

Does the resulting file match? As a simple test I've verified the md5sum 
of the resulting mp3 with the lame benchmark I've tried, just to find any 
big errors. Even if it does not prove that nothing broke, it should detect 
if something breaks badly. However, that was WAV->MP3 where the results 
were the same, although the VMX build did produce a different result than 
the FPU-only one, but did so consistently for multiple runs. With MIDI 
there might be slight timing differences that could cause different audio 
results, so you should first verify whether doing the conversion multiple 
times produces the same result at all without any patch.

> For my FPSCR test program, 21 tests failed. The high number is because 
> the inexact exception is being set for situations it should not be set 
> for.

Since we force the inexact flag to be set to enable hardfloat this is 
expected. More interesting is whether, apart from this, there are any 
differences in the results compared to the softfloat case (that may also 
be host CPU dependent, I think). Do you have more detailed info on the 
errors and differences found?

Some of the problems with inexact may be fixed by not always forcing the 
flag on but just not clearing it. As I understood, other targets do that, 
so it starts with softfloat but the first time the inexact flag is set it 
will start using hardfloat as long as the guest does not clear this flag. 
Probably this is done to automatically detect code that needs the flag and 
assume it's not used when it's not touched. Since PPC also has an inexact 
flag just for the previous FP op (the FI bit) apart from the usual 
cumulative flag, the client could read that instead of clearing the 
cumulative flag, so we can't detect guest usage this way; therefore we 
might as well break inexact completely to always use hardfloat and need to 
manually enable it for guests that we know need it. I'm not sure however 
if forcing the inexact flag would lead to unwanted FP exceptions as well, 
so this may also need to be made conditional on the enabled/disabled 
status of inexact FP exceptions. Does anyone have more info on this?
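
A minimal sketch of that preserve-instead-of-force idea, assuming the same 
helper in target/ppc/fpu_helper.c (illustration only, not tested code):

    void helper_reset_fpstatus(CPUPPCState *env)
    {
        /* Keep only the sticky inexact flag across the reset instead of
         * forcing it on: hardfloat stays usable once inexact has been
         * raised, but the flag is never set artificially. */
        int flags = get_float_exception_flags(&env->fp_status);

        set_float_exception_flags(flags & float_flag_inexact, &env->fp_status);
    }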

> With soft float: - Some sound can be heard from the USB audio device. It 
> isn't good sounding. I had to force quit Quicktime player because it 
> stopped working.
> - Converting a MIDI file to AAC in iTunes happens at 0.3x (slower than hard float).
> - 13 tests failed with my FPSCR test program.
>
> This patch is a good start. I'm not worried about the Floating Point 
> Status and Control Register flags being wrong since hardly any software 
> bothers to check them. I think more optimizations can happen by

I don't know if guest code checks FPSCR and what flags it cares about. I 
also don't know if it's a fact that these are not used, but maybe if we 
test with more guest code we can find out. That's why I'd like to at least 
have an option to test with hardfloat. Unfortunately enabling hardfloat 
without also making it the default would make it slower, so if we go this 
way we should make sure we can also enable hardfloat as the default.

> simplifying the FPU. As it is now it makes a lot of calls per operation.

The question is whether those calls are really needed to emulate the PPC 
FPU, and if not, why would they be there? If the FPU is really that much 
different that all these calls are needed then there's not much to 
simplify (but maybe some optimisations are possible). This would need 
someone to understand the current code in full first, which probably none 
of us does currently (at least I don't for sure, so I can't really make 
changes either). Another, more viable approach is to pick a small part, 
follow through with that and try to clean up and optimise that small part 
only. The exception and FPSCR handling is one such part; another could be 
round_canonical(), which seems to be high on the profiles I've taken. 
Maybe this could be done by reading and understanding docs only on the 
small part picked, which may be easier than getting everything first. I 
wonder if such smaller tasks could be defined and given out as GSoC or 
other volunteer projects?

Regards,
BALATON Zoltan
Howard Spoelstra Feb. 19, 2020, 6:28 p.m. UTC | #4
Hi,

I tested with the current ppc-for-5.0 branch and with v1 of the hardfloat
patches applied on top of that. There is a noticeable speed improvement in
Linux and OSX hosts. Windows 10 host doesn't seem to be impressed at all. I
saw no obvious glitches so far. The fpu performance on OSX hosts seems very
slow. This was not always the case in the past, when it was on par with
Linux performance.

Below are my results.

Best,
Howard

Host Linux (Fedora 31):
Mac OS tests: 9.2 with MacBench 5.0
Baseline(100%): G3 300Mhz
5.0 branch + hardfloat patches: cpu 193%, fpu 86%
5.0 branch: cpu 188%, fpu 57%
Mac OSX tests: 10.5 with Skidmarks 4.0 test
Baseline(100%): G4 1.0Ghz.
5.0 branch + hardfloat patches: Int:131 FP:11 Vec:15
5.0 branch: Int:131 FP:9 Vec:11

Host OSX Sierra:
Mac OS tests: 9.2 with MacBench 5.0
Baseline(100%): G3 300Mhz
5.0 branch + hardfloat patches: cpu 199%, fpu 66%
5.0 branch: cpu 199%, fpu 40%
Mac OSX tests: 10.5 with Skidmarks 4.0 test
Baseline(100%): G4 1.0Ghz.
5.0 branch + hardfloat patches: Int:129 FP:11 Vec:14
5.0 branch: Int:129 FP:8 Vec:9

Host Windows 10:
Mac OS tests: 9.2 with MacBench 5.0
Baseline(100%): G3 300Mhz
5.0 branch + hardfloat patches: cpu 180%, fpu 54%
5.0 branch: cpu 199%, fpu 40%
Mac OSX tests: 10.5 with Skidmarks 4.0 test
Baseline(100%): G4 1.0Ghz.
5.0 branch + hardfloat patches: Int:130 FP:9 Vec:10
5.0 branch: Int:130 FP:10 Vec:11

All tests done on the same host with v1 of the hardfloat patches
Intel i7-4770K at 3.50Ghz. 32Gb memory
All guests set to 1024x768 and "thousands" of colors.
Linux and OSX (with brew) use default compilers.
Windows build cross-compiled from Fedora with x86_64-win64-mingw32

Skidmarks test is a combination of tests:
ParseVid (INT)
MPEG (INT)
PixBlend (INT)
Ellipticrypt (INT)
Rijndael (INT)
Quicksort (INT)
MDCT (FP)
IntToFloat (FP)
Q3 (FP)
FFT (FP)
VolInt (FP)
Quant (VMX)
Galaxy (VMX)
IDCT (VMX)
BigMult (VMX)
BALATON Zoltan Feb. 19, 2020, 7:28 p.m. UTC | #5
On Wed, 19 Feb 2020, Howard Spoelstra wrote:
> I tested with the current ppc-for-5.0 branch and with v1 of the hardfloat
> patches applied on top of that. There is a noticeable speed improvement in
> Linux and OSX hosts. Windows 10 host doesn't seem to be impressed at all. I
> saw no obvious glitches so far. The fpu performance on OSX hosts seems very
> slow. This was not always the case in the past, when it was on par with
> Linux performance.

Interesting, thanks for the measurements.

> Below are my results.
>
> Best,
> Howard
>
> Host Linux (Fedora 31):
> Mac OS tests: 9.2 with MacBench 5.0
> Baseline(100%): G3 300Mhz
> 5.0 branch + hardfloat patches: cpu 193%, fpu 86%
> 5.0 branch: cpu 188%, fpu 57%

Here there's a difference in the cpu value before and after the patch 
which I can't explain (I only changed FPU stuff so it should not change 
other results), but it's also not seen in other measurements, so this 
could be some external influence, such as something else using the CPU 
while running the test? Unless this happens consistently I'd put it down 
to measurement error.

> Mac OSX tests: 10.5 with Skidmarks 4.0 test
> Baseline(100%): G4 1.0Ghz.
> 5.0 branch + hardfloat patches: Int:131 FP:11 Vec:15
> 5.0 branch: Int:131 FP:9 Vec:11
>
> Host OSX Sierra:
> Mac OS tests: 9.2 with MacBench 5.0
> Baseline(100%): G3 300Mhz
> 5.0 branch + hardfloat patches: cpu 199%, fpu 66%
> 5.0 branch: cpu 199%, fpu 40%
> Mac OSX tests: 10.5 with Skidmarks 4.0 test
> Baseline(100%): G4 1.0Ghz.
> 5.0 branch + hardfloat patches: Int:129 FP:11 Vec:14

These values seem to match the Linux measurements above so don't seem 
slower, although MacOS 9 seems to be slower (66 vs. 86), so either this 
depends on the ops used or on something else.

> 5.0 branch: Int:129 FP:8 Vec:9
>
> Host Windows 10:
> Mac OS tests: 9.2 with MacBench 5.0
> Baseline(100%): G3 300Mhz
> 5.0 branch + hardfloat patches: cpu 180%, fpu 54%
> 5.0 branch: cpu 199%, fpu 40%

Here there's again a difference in the cpu value, but the other way, so 
if the cause is external CPU usage then this may again be an outlying 
measurement? You could retake these two to verify if you get the same 
numbers again. The fpu value does seem to improve, just not as much as 
the others, and it's also lower to start with. I wonder why.

> Mac OSX tests: 10.5 with Skidmarks 4.0 test
> Baseline(100%): G4 1.0Ghz.
> 5.0 branch + hardfloat patches: Int:130 FP:9 Vec:10
> 5.0 branch: Int:130 FP:10 Vec:11
>
> All tests done on the same host with v1 of the hardfloat patches
> Intel i7-4770K at 3.50Ghz. 32Gb memory
> All guests set to 1024x768 and "thousands" of colors.

Does it mean this host machine was rebooted into these OSes, or were 
these run in a VM? In case of using a VM, were all three running in a VM 
or was one on the host (I'd guess OSX host with Linux and Windows VMs)?

> Linux and OSX (with brew) use default compilers.
> Windows build cross-compiled from Fedora with x86_64-win64-mingw32

I assume Linux and OSX were 64-bit builds; is the Windows exe 32-bit or 
64-bit?

Regards,
BALATON Zoltan
Howard Spoelstra Feb. 20, 2020, 5:43 a.m. UTC | #6
On Wed, Feb 19, 2020 at 8:28 PM BALATON Zoltan <balaton@eik.bme.hu> wrote:

> On Wed, 19 Feb 2020, Howard Spoelstra wrote:
> > I tested with the current ppc-for-5.0 branch and with v1 of the hardfloat
> > patches applied on top of that. There is a noticeable speed improvement
> in
> > Linux and OSX hosts. Windows 10 host doesn't seem to be impressed at
> all. I
> > saw no obvious glitches so far. The fpu performance on OSX hosts seems
> very
> > slow. This was not always the case in the past, when it was on par with
> > Linux performance.
>
> Interesting, thanks for the measurements.
>
> > Below are my results.
> >
> > Best,
> > Howard
> >
> > Host Linux (Fedora 31):
> > Mac OS tests: 9.2 with MacBench 5.0
> > Baseline(100%): G3 300Mhz
> > 5.0 branch + hardfloat patches: cpu 193%, fpu 86%
> > 5.0 branch: cpu 188%, fpu 57%
>
> Here there's a difference in cpu value before and after patch which I
> can't explain (only changed FPU stuff so it should not change others) but
> also not seen in other measurements so this could be some external
> influence such as something else using CPU while running test? Unless this
> happens consistently I'd put it down to measurement error.
>

  Yes, I would put that cpu value down to some fluctuation in the test

>
> > Mac OSX tests: 10.5 with Skidmarks 4.0 test
> > Baseline(100%): G4 1.0Ghz.
> > 5.0 branch + hardfloat patches: Int:131 FP:11 Vec:15
> > 5.0 branch: Int:131 FP:9 Vec:11
> >
> > Host OSX Sierra:
> > Mac OS tests: 9.2 with MacBench 5.0
> > Baseline(100%): G3 300Mhz
> > 5.0 branch + hardfloat patches: cpu 199%, fpu 66%
> > 5.0 branch: cpu 199%, fpu 40%
> > Mac OSX tests: 10.5 with Skidmarks 4.0 test
> > Baseline(100%): G4 1.0Ghz.
> > 5.0 branch + hardfloat patches: Int:129 FP:11 Vec:14
>
> These values seem to match Linux measurement above so don't seem slower
> although MacOS9 seems to be slower (66 vs. 86) so either this depends on
> the ops used or something else.
>

 Yes, the baseline speed for the fpu in Mac OS 9.2 is relatively low.

>
> > 5.0 branch: Int:129 FP:8 Vec:9
> >
> > Host Windows 10:
> > Mac OS tests: 9.2 with MacBench 5.0
> > Baseline(100%): G3 300Mhz
> > 5.0 branch + hardfloat patches: cpu 180%, fpu 54%
>

 new run 5.0 branch + hardfloat patches: cpu 184%, fpu 54%

> 5.0 branch: cpu 199%, fpu 40%
>

 new run 5.0 branch: cpu 184%, fpu 56%

It seems I misreported (copy/paste without changing the values) the earlier
Windows-based results with the Mac OS 9.2 guest. As said above (and this now
seems to confirm it) Windows is not impressed at all and is perhaps even a
bit slower.
Windows builds are particularly sensitive to any other activity on the
system. Moving the qemu window drops performance considerably. Perhaps due
to SDL not running in its own thread?

>
> Here there's again difference in cpu value but the other way so maybe if
> the cause is external CPU usage then this again may be an outlying
> measurement? You could retake these two to verify if you get same numbers
> again. The fpu value does seem to improve just not as much as the others
> and it's also lower to start with. I wonder why.
>


> > Mac OSX tests: 10.5 with Skidmarks 4.0 test
> > Baseline(100%): G4 1.0Ghz.
> > 5.0 branch + hardfloat patches: Int:130 FP:9 Vec:10
> > 5.0 branch: Int:130 FP:10 Vec:11
> >
> > All tests done on the same host with v1 of the hardfloat patches
> > Intel i7-4770K at 3.50Ghz. 32Gb memory
> > All guests set to 1024x768 and "thousands" of colors.
>
> Does it mean this host machine were rebooted into these OSes or these were
> run in a VM. In case using VM, were all three running in VM or one was on
> host (I'd guess OSX host with Linux and Windows VMs).
>
> > Linux and OSX (with brew) use default compilers.
> > Windows build cross-compiled from Fedora with x86_64-win64-mingw32
>
> I assume Linux and OSX were 64 bit builds, is Windows 32 bit or 64 bit
> exe?
>

No virtualisation. I run all on the same bare metal, so booted into these
three separately from separate SSDs. You might guess OSX Sierra is running
on less "official" hardware and you would be right. All qemu builds were 64
bit.

Best,
Howard
Richard Henderson Feb. 20, 2020, 8:13 p.m. UTC | #7
On 2/18/20 9:10 AM, BALATON Zoltan wrote:
>  void helper_reset_fpstatus(CPUPPCState *env)
>  {
> -    set_float_exception_flags(0, &env->fp_status);
> +    set_float_exception_flags(env->default_fp_excpt_flags, &env->fp_status);
>  }

What I don't like is the forced setting of inexact.  I don't mind leaving it
set if it is already set, which corresponds to the normal accumulation of
exceptions.

In addition, if the inexact exception is unmasked, I would expect a signal to
be delivered only when an inexact exception happens.  Whereas this patch would
deliver a signal for every fp operation.

It should be just as easy to do

    flags = get_float_exception_flags(status);
    flags &= env->save_fp_exception_flags;
    set_float_exception_flags(flags, status);


> +    DEFINE_PROP_BOOL("hardfloat", PowerPCCPU, hardfloat, true),

I would also prefer a different name here -- perhaps x-no-fp-fi.


r~
BALATON Zoltan Feb. 21, 2020, 4:04 p.m. UTC | #8
On Thu, 20 Feb 2020, Richard Henderson wrote:
> On 2/18/20 9:10 AM, BALATON Zoltan wrote:
>>  void helper_reset_fpstatus(CPUPPCState *env)
>>  {
>> -    set_float_exception_flags(0, &env->fp_status);
>> +    set_float_exception_flags(env->default_fp_excpt_flags, &env->fp_status);
>>  }
>
> What I don't like is the forced setting of inexact.  I don't mind leaving it
> set if it is already set, which corresponds to the normal accumulation of
> exceptions.

In a previous reply I've explained this (apart from just trying to get it 
to work the simplest way for testing first):

On Wed, 19 Feb 2020, BALATON Zoltan wrote:
> Some of the problems with inexact may be fixed by not always forcing the 
> flag on but just not clearing it. As I understood other targets do that 
> so it starts with softfloat but the first time the inexact flag is set 
> it will start using hardfloat as long as the guest does not clear this 
> flag. Probably this is done to automatically detect code that needs the 
> flag and assume it's not used when it's not touched. Since PPC also has 
> an inexact flag just for previous FP op (the FI bit) apart from the 
> usual cumulative flag, the client could read that instead of clearing 
> the cumulative flag so we can't detect guest usage this way, therefore 
> we might as well break inexact completely to always use hardfloat and 
> need to manually enable it for guests that we know need it. I'm not sure 
> however if forcing the inexact flag would lead to unwanted FP exceptions 
> as well so this may also need to be made conditional on the 
> enabled/disabled status of inexact FP exceptions. Does anyone have more 
> info on this?

So I know this should be improved, but I'm not sure how. I would not worry 
about setting the bit from the start, due to the above, but exceptions are 
likely wrong this way, as you say.

> In addition, if the inexact exception is unmasked, I would expect a signal to
> be delivered only when an inexact exception happens.  Whereas this patch would
> deliver a signal for every fp operation.
>
> It should be just as easy to do
>
>    flags = get_float_exception_flags(status);
>    flags &= env->save_fp_exception_flags;
>    set_float_exception_flags(flags, status);

I'm not sure I get this. Can you please explain a bit more? Where would 
save_flags be set and where would the above masking with it be done? 
Currently we reset the flags before every FP op, as far as I understand, 
and detection of exceptions relies on this and tests the fp_status bits. 
Probably we should not reset the flags and change this to keep one 
fp_status accumulating flags and only reset it if softfloat is forced, but 
that would need rewriting the exception handling and I'm not sure yet how 
to do that. Any idea? Maybe if we could just unbreak inexact exceptions 
when enabled with the current patch, then this rewrite of exception 
handling could be done later.
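
For what it's worth, one possible reading of the suggestion (purely a 
guess at the intent; only the three quoted lines above are Richard's, the 
helper name and where it would be called from are assumptions):

    /* Called once, e.g. from the CPU reset or realize path, after the
     * "hardfloat" property is known.  With hardfloat enabled the sticky
     * inexact flag survives helper_reset_fpstatus(); with it disabled
     * nothing does, which is the pre-patch behaviour. */
    static void ppc_init_fp_exception_mask(PowerPCCPU *cpu)
    {
        CPUPPCState *env = &cpu->env;

        env->save_fp_exception_flags = cpu->hardfloat ? float_flag_inexact : 0;
    }

helper_reset_fpstatus() would then be exactly the three quoted lines, so 
flags already raised are preserved but never forced on.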

>> +    DEFINE_PROP_BOOL("hardfloat", PowerPCCPU, hardfloat, true),
>
> I would also prefer a different name here -- perhaps x-no-fp-fi.

What's wrong with hardfloat? That's how the code refers to this, so anyone 
searching for what it does would turn up some meaningful results. Your 
proposed property name is very cryptic and nobody would be able to guess 
what it is for. If you insist on not using hardfloat maybe we could invert 
it to softfloat or try x-use-fpu. I specifically don't want to tie it to 
the FI bit as there's at least one other non-sticky bit in FPSCR (I don't 
remember which now) that might also need to be broken for best hardfloat 
results.

Regards,
BALATON Zoltan
Peter Maydell Feb. 21, 2020, 4:11 p.m. UTC | #9
On Fri, 21 Feb 2020 at 16:05, BALATON Zoltan <balaton@eik.bme.hu> wrote:
>
> On Thu, 20 Feb 2020, Richard Henderson wrote:
> > On 2/18/20 9:10 AM, BALATON Zoltan wrote:
> >> +    DEFINE_PROP_BOOL("hardfloat", PowerPCCPU, hardfloat, true),
> >
> > I would also prefer a different name here -- perhaps x-no-fp-fi.
>
> What's wrong with hardfloat? That's how the code refers to this so if
> anyone searches what it does would turn up some meaningful results.

This prompted me to check what you're using the property for.
The cover letter says:
> This patch implements a simple way to keep the inexact flag set for
> hardfloat while still allowing to revert to softfloat for workloads
> that need more accurate albeit slower emulation. (Set hardfloat
> property of CPU, i.e. -cpu name,hardfloat=false for that.)

I think that is the wrong approach. Enabling use of the host
FPU should not affect the accuracy of the emulation, which
should remain bitwise-correct. We should only be using the
host FPU to the extent that we can do that without discarding
accuracy. As far as I'm aware that's how the hardfloat support
for other guest CPUs that use it works.

thanks
-- PMM
Aleksandar Markovic Feb. 21, 2020, 4:51 p.m. UTC | #10
On Fri, Feb 21, 2020 at 5:11 PM Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Fri, 21 Feb 2020 at 16:05, BALATON Zoltan <balaton@eik.bme.hu> wrote:
> >
> > On Thu, 20 Feb 2020, Richard Henderson wrote:
> > > On 2/18/20 9:10 AM, BALATON Zoltan wrote:
> > >> +    DEFINE_PROP_BOOL("hardfloat", PowerPCCPU, hardfloat, true),
> > >
> > > I would also prefer a different name here -- perhaps x-no-fp-fi.
> >
> > What's wrong with hardfloat? That's how the code refers to this so if
> > anyone searches what it does would turn up some meaningful results.
>
> This prompted me to check what you're using the property for.
> The cover letter says:
> > This patch implements a simple way to keep the inexact flag set for
> > hardfloat while still allowing to revert to softfloat for workloads
> > that need more accurate albeit slower emulation. (Set hardfloat
> > property of CPU, i.e. -cpu name,hardfloat=false for that.)
>
> I think that is the wrong approach. Enabling use of the host
> FPU should not affect the accuracy of the emulation, which
> should remain bitwise-correct. We should only be using the
> host FPU to the extent that we can do that without discarding
> accuracy. As far as I'm aware that's how the hardfloat support
> for other guest CPUs that use it works.
>

Right, that is my understanding as well. There shouldn't be a
"hardfloat" switch at all. If there is, it is either a mistake, or for
experimental or other similar purposes. In my understanding, the
hardfloat feature should work entirely transparently, making
the decision whether to use host math functions (instead of the
softfloat library) by itself, based on the particular execution
circumstances.

In other words, the accuracy of FP emulation should not be
compromised in any case (there should not be an option saying
"ok, we are happy with approximate calculations").

On top of that, artificially generating the "inexact" flag really
seems problematic (or, speaking bluntly, looks like a hack).

Perhaps the hardfloat feature needs a little more work,
but not in this way.

Yours,
Aleksandar

> thanks
> -- PMM
>
BALATON Zoltan Feb. 21, 2020, 6:04 p.m. UTC | #11
On Fri, 21 Feb 2020, Peter Maydell wrote:
> On Fri, 21 Feb 2020 at 16:05, BALATON Zoltan <balaton@eik.bme.hu> wrote:
>> On Thu, 20 Feb 2020, Richard Henderson wrote:
>>> On 2/18/20 9:10 AM, BALATON Zoltan wrote:
>>>> +    DEFINE_PROP_BOOL("hardfloat", PowerPCCPU, hardfloat, true),
>>>
>>> I would also prefer a different name here -- perhaps x-no-fp-fi.
>>
>> What's wrong with hardfloat? That's how the code refers to this so if
>> anyone searches what it does would turn up some meaningful results.
>
> This prompted me to check what you're using the property for.
> The cover letter says:
>> This patch implements a simple way to keep the inexact flag set for
>> hardfloat while still allowing to revert to softfloat for workloads
>> that need more accurate albeit slower emulation. (Set hardfloat
>> property of CPU, i.e. -cpu name,hardfloat=false for that.)
>
> I think that is the wrong approach. Enabling use of the host
> FPU should not affect the accuracy of the emulation, which
> should remain bitwise-correct. We should only be using the
> host FPU to the extent that we can do that without discarding
> accuracy. As far as I'm aware that's how the hardfloat support
> for other guest CPUs that use it works.

I don't know of a better approach. Please see section 4.2.2 Floating-Point 
Status and Control Register on page 124 in this document:

https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0

especially the definition of the FR and FI bits, and tell me how we can 
emulate these accurately and use the host FPU. Not using the FPU even when 
these bits are not needed (which seems to be the case for all workloads 
we've tested so far) seriously limits the emulation speed, so spending time 
to emulate an obscure and unused part of an architecture when not actually 
needed, just to keep emulation accurate but unusably slow, does not seem to 
be the right approach. In an ideal world of course this should be both 
fast and accurate, but we don't seem to have anyone who could achieve that 
in the past two years, so maybe we could give up some accuracy now to get 
usable speed and worry about emulating obscure features when we come 
across some workload that actually needs it (but we have the option to 
revert to accurate but slow emulation for that until a better way can be 
devised that's both fast and accurate). Insisting on accuracy without any 
solution to the current state just hinders making any progress with this.

Other PowerPC emulators also seem to either not bother or have similar 
optimisations. I've quickly checked three that I know about:

https://github.com/mamedev/mame/blob/master/src/devices/cpu/powerpc/ppcdrc.cpp#L1893
https://github.com/mamedev/mame/blob/master/src/devices/cpu/powerpc/ppcdrc.cpp#L3503
there's also something here but no mention of the FI bit that I could notice:
https://github.com/mamedev/mame/blob/master/src/devices/cpu/powerpc/ppccom.cpp#L2023

https://github.com/xenia-project/xenia/blob/master/src/xenia/cpu/ppc/ppc_hir_builder.cc#L428

https://github.com/dolphin-emu/dolphin/blob/master/Source/Core/Core/PowerPC/Jit64/Jit_FloatingPoint.cpp

But I'm not sure I understand all of the above, so I hope this makes more 
sense to someone who can advise.

Regards,
BALATON Zoltan
Peter Maydell Feb. 21, 2020, 6:26 p.m. UTC | #12
On Fri, 21 Feb 2020 at 18:04, BALATON Zoltan <balaton@eik.bme.hu> wrote:
> On Fri, 21 Feb 2020, Peter Maydell wrote:
> > I think that is the wrong approach. Enabling use of the host
> > FPU should not affect the accuracy of the emulation, which
> > should remain bitwise-correct. We should only be using the
> > host FPU to the extent that we can do that without discarding
> > accuracy. As far as I'm aware that's how the hardfloat support
> > for other guest CPUs that use it works.
>
> I don't know of a better approach. Please see section 4.2.2 Floating-Point
> Status and Control Register on page 124 in this document:
>
> https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0
>
> especially the definition of the FR and FI bits and tell me how can we
> emulate these accurately and use host FPU.

I don't know much about PPC, but if you can't emulate the
guest architecture accurately with the host FPU, then
don't use the host FPU. We used to have a kind of 'hardfloat'
support that was fast but inaccurate, but it was a mess
because it meant that most guest code sort of worked but
some guest code would confusingly misbehave. Deliberately
not correctly emulating the guest CPU/FPU behaviour is not
something I want us to return to.

You're right that sometimes you can't get both speed
and accuracy; other emulators (and especially ones
which are trying to emulate games consoles) may choose
to prefer speed over accuracy. For QEMU we prefer to
choose accuracy over speed in this area.

thanks
-- PMM
BALATON Zoltan Feb. 21, 2020, 7:52 p.m. UTC | #13
On Fri, 21 Feb 2020, Peter Maydell wrote:
> On Fri, 21 Feb 2020 at 18:04, BALATON Zoltan <balaton@eik.bme.hu> wrote:
>> On Fri, 21 Feb 2020, Peter Maydell wrote:
>>> I think that is the wrong approach. Enabling use of the host
>>> FPU should not affect the accuracy of the emulation, which
>>> should remain bitwise-correct. We should only be using the
>>> host FPU to the extent that we can do that without discarding
>>> accuracy. As far as I'm aware that's how the hardfloat support
>>> for other guest CPUs that use it works.
>>
>> I don't know of a better approach. Please see section 4.2.2 Floating-Point
>> Status and Control Register on page 124 in this document:
>>
>> https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0
>>
>> especially the definition of the FR and FI bits and tell me how can we
>> emulate these accurately and use host FPU.
>
> I don't know much about PPC, but if you can't emulate the
> guest architecture accurately with the host FPU, then
> don't use the host FPU. We used to have a kind of 'hardfloat'

I don't know if it's possible or not to emulate these accurately and use 
the FPU, but nobody has done it for QEMU so far. But if someone knows a way 
please speak up and then we can try to implement it. Unfortunately this 
would require more detailed knowledge about different FPU implementations 
(at least x86_64, ARM and PPC, which are the most used platforms) than what 
I have or am willing to spend time to learn.

> support that was fast but inaccurate, but it was a mess
> because it meant that most guest code sort of worked but
> some guest code would confusingly misbehave. Deliberately
> not correctly emulating the guest CPU/FPU behaviour is not
> something I want us to return to.
>
> You're right that sometimes you can't get both speed
> and accuracy; other emulators (and especially ones
> which are trying to emulate games consoles) may choose
> to prefer speed over accuracy. For QEMU we prefer to
> choose accuracy over speed in this area.

OK, then how about keeping the default accurate but allowing opting in to 
using the FPU, even if it's known to break some bits, for workloads where 
users need speed over accuracy and would be happy to live with the 
limitation? Note that I've found that just removing the define that 
disables hardfloat for the PPC target makes VMX vector instructions faster 
while the normal FPU is a little slower, without any other changes, so 
disabling hardfloat already limits performance for guests using VMX even 
when not using the FPU for cases where it would cause inaccuracy. If you 
say we want accuracy and don't care about speed, then just don't disable 
hardfloat as it helps at least VMX, and then we can add an option to allow 
the user to say we can use hardfloat even if it's inaccurate, so they can 
test their workload and decide for themselves.

Regards,
BALATON Zoltan
Programmingkid Feb. 25, 2020, 3:07 a.m. UTC | #14
> On Feb 19, 2020, at 10:35 AM, BALATON Zoltan <balaton@eik.bme.hu> wrote:
> 
> Hello,
> 
> On Tue, 18 Feb 2020, Programmingkid wrote:
>>> On Feb 18, 2020, at 12:10 PM, BALATON Zoltan <balaton@eik.bme.hu> wrote:
>>> While other targets take advantage of using host FPU to do floating
>>> point computations, this was disabled for PPC target because always
>>> clearing exception flags before every FP op made it slightly slower
>>> than emulating everything with softfloat. To emulate some FPSCR bits,
>>> clearing of fp_status may be necessary (unless these could be handled
>>> e.g. using FP exceptions on host but there's no API for that in QEMU
>>> yet) but preserving at least the inexact flag makes hardfloat usable
>>> and faster than softfloat. Since most clients don't actually care
>>> about this flag, we can gain some speed trading some emulation
>>> accuracy.
>>> 
>>> This patch implements a simple way to keep the inexact flag set for
>>> hardfloat while still allowing to revert to softfloat for workloads
>>> that need more accurate albeit slower emulation. (Set hardfloat
>>> property of CPU, i.e. -cpu name,hardfloat=false for that.) There may
>>> still be room for further improvement but this seems to increase
>>> floating point performance. Unfortunately the softfloat case is slower
>>> than before this patch so this patch only makes sense if the default
>>> is also set to enable hardfloat.
>>> 
>>> Because of the above this patch at the moment is mainly for testing
>>> different workloads to evaluate how viable this would be in practice.
>>> Thus, RFC and not ready for merge yet.
>>> 
>>> Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
>>> ---
>>> v2: use different approach to avoid needing if () in
>>> helper_reset_fpstatus() but this does not seem to change overhead
>>> much, also make it a single patch as adding the hardfloat option is
>>> only a few lines; with this we can use the same value at other places where
>>> float_status is reset and maybe enable hardfloat for a few more places
>>> for a little more performance but not too much. With this I got:
>> 
>> <snip>
>> 
>> Thank you for working on this. It is about time we have a better FPU.
> 
> Thank you for testing it. I think it would be great if we could come up with some viable approach to improve this before the next freeze.
> 
>> I applied your patch over David Gibson's ppc-for-5.0 branch. It applied cleanly and compiled easily.
> 
> I've heard some preliminary results from others that there's also a difference between v1 and v2 of the patch in performance where v1 may be faster for same cases so if you (or someone else) want and have time you could experiment with different versions and combinations as well to find the one that's best on all CPUs. Basically we have these parts:
> 
> 1. Change target/ppc/fpu_helper.c::helper_reset_fpstatus() to force float_flag_inexact on in case hardfloat is enabled, I've tried two approaches for this:
> 
> a. In v1 added an if () in the function
> b. In v2 used a variable from env set earlier (I've hoped this may be faster but maybe it's not, testing and explanation is welcome)
> 
> 2. Also change places where env->fp_status is copied to a local tstat and then reset (I think this is done to accumulate flags from multiple FP ops that would individually reset env->fp_status or some other reason, maybe this could be avoided if we reset fp_status less often but that would need more understanding of the FP emulation that I don't have so I did not try to clean that up yet).
> 
> If v2 is really slower than v1 then I'm not sure is it because also changing places with tstat or because of the different approach in helper_reset_fpstatus() so you could try combinations of these as well.
> 
>> Tests were done on a Mac OS 10.4.3 VM. The CPU was set to G3.
> 
> What was the host CPU and OS this was tested on? Please always share CPU info and host OS when sharing benchmark results so they are somewhat comparable. It also depends on CPU features for vector instructions at least so without CPU info the results could not be understood.

Intel Core i5-2500S CPU @ 2.70GHz.

> I think G3 does not have AltiVec/VMX so maybe testing with G4 would be better to also test those ops unless there's a reason to only test G3. I've tested with G4 both FPU only and FPU+VMX code on Linux host with i7-9700K CPU @ 3.60GHz as was noted in the original cover letter but maybe I've also forgotten some details so I list it here again.

Ok, I did test on the G4, here are my results:

Git commit: c1e667d2598b9b3ce62b8e89ed22dd38dfe9f57f
Mac OS 10.4.3 VM
-cpu G4
-USB audio device

Hardfloat=false
Audio sounds bad when playing midi file.
Extraction rate: 1.5x
Converting rate: 0.7x
Total time: 7:24

Hardfloat=true
Midi audio sounded perfect for about 30 seconds, then it went silent!
Extraction rate: 1.4x (slower with hard float)
Converting rate: 0.7x (same as without hardfloat)
Total time: 7:16 (faster time with hardfloat)

> 
>> I did several tests and here are my results:
>> 
>> With hard float:
>> - The USB audio device does not produce any sound.
> 
> I've heard this could also be due to some other problem not directly related to FPU, maybe there's a problem with USB/OHCI emulation as well because problems with that were reported but it's interesting why you get different results changing FPU related stuff. I think OSX uses float samples so probably does use FPU for processing sound and may rely on some peculiarity of the hardware as it was probably optimised for Apple hardware. It would be interesting to find out how FPU stuff is related to this but since it's broken anyway probably not a show stopper at the moment.

When I played sound this second time I heard the same broken audio I usually hear with the USB audio device with hardfloat set to false. When playing the same midi file with hardfloat set to true, the audio played perfectly! It only played for 30 seconds before it went silent.

> 
>> - Converting a MIDI file to AAC in iTunes happens at 0.4x (faster than soft float :) ).
> 
> Does resulting file match? As a simple test I've verified md5sum of the resulting mp3 with the lame benchmark I've tried just to find any big errors. Even if it does not prove that nothing broke, it should detect if something breaks badly. However that was WAV->MP3 where results were same, although the VMX build did produce different result than FPU only but did so consistently for multiple runs. With MIDI there might be slight timing difference that could cause different audio results so you should first verify if doing the conversion multiple times does produce the same result at all without any patch first.

The md5 sum for each file does match. Note this time I converted a midi file to MP3.

> 
>> For my FPSCR test program, 21 tests failed. The high number is because the inexact exception is being set for situations it should not be set for.
> 
> Since we force the inexact flag to be set to enable hardfloat this is expected. More interesting is if apart from this are there any difference in the results compared to the softfloat case (that may also be host CPU dependent I think). Do you have more detailed info on the errors and differences found?

I can give you the full testing suite if you like. I run it on Mac OS 10.4 but it should compile with gcc on Linux. I will send it to you in a separate email because it is big.

<snip>

>> With soft float: - Some sound can be heard from the USB audio device. It isn't good sounding. I had to force quit Quicktime player because it stopped working.
>> - Converting a MIDI file to AAC in iTunes happens at 0.3x (slower than hard float).
>> - 13 tests failed with my FPSCR test program.
>> 
>> This patch is a good start. I'm not worried about the Floating Point Status and Control Register flags being wrong since hardly any software bothers to check them. I think more optimizations can happen by
> 
> I don't know if guest code checks fpscr and what flags it cares about. Also don't know if it's a fact that these are not used but maybe if we test with more guest codes we can find out. That's why I'd like to at least have an option to test with hardfloat. Unfortunately enabling hardfloat without also making it default would make it slower so if we go this way we should make sure we can also enable hardfloat as default.
> 
>> simplifying the FPU. As it is now it makes a lot of calls per operation.
> 
> Question is if those calls are really needed to emulate PPC FPU or if not why would they be there? If the FPU is really that much different so all these calls are needed then there's not much to simplify (but maybe there could be some optimisations possible). This would need someone to understand the current code in full first that probably we don't currently (ar least I don't for sure so can't really make changes either). Another more viable approach is to pick a small part and follow through with that and try to clean up and optimise that small part only. The exception and fpscr handling is one such part, another could be round_canonical() that seems to be high on profiles I've taken. Maybe this could be done by reading and understading docs only on the small part picked that may be easier than getting everything first. I wonder if such smaller tasks could be defined and given out as GSoC or other volunteer projects?

I have another idea on how to improve QEMU's performance. What if you enabled more CPUs for the PowerPC target? Mac OS 9, Mac OS X, and Linux support multiple CPUs. It might actually be easier to do this than to improve the FPU. I imagine the performance increase with multiple emulated CPUs would be much more noticeable.
BALATON Zoltan Feb. 25, 2020, 12:09 p.m. UTC | #15
On Mon, 24 Feb 2020, Programmingkid wrote:
> Intel Core i5-2500S CPU @ 2.70GHz.
[...]
> Ok, I did test on the G4, here are my results:
>
> Git commit: c1e667d2598b9b3ce62b8e89ed22dd38dfe9f57f
> Mac OS 10.4.3 VM
> -cpu G4
> -USB audio device
>
> Hardfloat=false
> Audio sounds bad when playing midi file.
> Extraction rate: 1.5x
> Converting rate: 0.7x
> Total time: 7:24
>
> Hardfloat=true
> Midi audio sounded perfect for about 30 seconds, then it went silent!
> Extraction rate: 1.4x (slower with hard float)
> Converting rate: 0.7x (same as without hardfloat)
> Total time: 7:16 (faster time with hardfloat)

How is it that the extraction rate is slower but the total time is less 
than without hardfloat? There must be other factors here than just FP ops. 
Maybe a better test is to not play the audio, just save it to a file, so 
other issues with USB are not influencing the test.

> When I played sound this second time I heard the same broken audio I 
> usually hear with the USB audio device with hardfloat set to false. When 
> playing the same midi file with hardfloat set to true, the audio played 
> perfectly! It only played for 30 seconds before it went silent.

So probably there are at least two problems: FPU emulation is not fast 
enough to decode audio to fill the buffer, and then there's also something 
with usb-audio that jams it after a while? I don't think all of this is 
FPU related.

> I can give you the full testing suite if you like. I run it on Mac OS 
> 10.4 but it should compile with gcc on Linux. I will send it to you in a 
> separate email because it is big.

Thanks, I'll have a look and see if I can make sense of it, but I'm not 
sure when I will find time.

> I have another idea on how to improve QEMU's performance. What if you 
> enabled more CPUs for the PowerPC target? Mac OS 9, Mac OS X, and Linux 
> support multiple CPUs. It might actually be easier to do this than to

Have you tried whether it works? I think MTTCG is enabled for PPC64 but 
I'm not sure about 32-bit PPC. The mac99 machine seems to init multiple 
CPUs but I'm not sure if they'll use MTTCG. But you could test it to see 
if it makes any difference.
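
For reference, where the build allows it MTTCG can be forced from the 
command line (QEMU warns if the target is not flagged as MTTCG-safe), 
something along the lines of:

    qemu-system-ppc -machine mac99 -smp 2 -accel tcg,thread=multi ...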

> improve the FPU. I imagine the performance increase with multiple 
> emulated CPUs would be much more noticeable.

The Amiga-like OSes I'm interested in don't use multiple cores, so I'm 
mainly interested in improving single-core performance. Also, if (part of) 
your problem is the slow FPU preventing fast enough audio decoding, I'm 
not sure that having multiple CPUs with a slow FPU would help, as this may 
use a single thread anyway.

Regards,
BALATON Zoltan
Programmingkid Feb. 26, 2020, 10:46 a.m. UTC | #16
> On Feb 25, 2020, at 7:09 AM, BALATON Zoltan <balaton@eik.bme.hu> wrote:
> 
> On Mon, 24 Feb 2020, Programmingkid wrote:
>> Intel Core i5-2500S CPU @ 2.70GHz.
> [...]
>> Ok, I did test on the G4, here are my results:
>> 
>> Git commit: c1e667d2598b9b3ce62b8e89ed22dd38dfe9f57f
>> Mac OS 10.4.3 VM
>> -cpu G4
>> -USB audio device
>> 
>> Hardfloat=false
>> Audio sounds bad when playing midi file.
>> Extraction rate: 1.5x
>> Converting rate: 0.7x
>> Total time: 7:24
>> 
>> Hardfloat=true
>> Midi audio sounded perfect for about 30 seconds, then it went silent!
>> Extraction rate: 1.4x (slower with hard float)
>> Converting rate: 0.7x (same as without hardfloat)
>> Total time: 7:16 (faster time with hardfloat)
> 
> How is that extraction rate is slower but total time is less than without hardfloat? There must be other factors here than just FP ops. Maybe a better test is to not play the audio just save it to a file so other issues with USB is not influencing the test.

I does seem odd to me also. 

>> When I played sound this second time I hard the same broken audio I usually hear with the USB audio device with hardfloat set to false. When playing the same midi file with hardfloat set to true, the audio played perfectly! It only played for 30 seconds before it went silent.
> 
> So probably there are at least two problems: FPU emulation is not fast enough to decode audio to fill buffer then there's also something with usb-audio that jams it after a while? I don't think all of this is FPU related.

I think a timeout takes place and that is why audio stops playing. It is probably an USB OHCI issue. The other USB controller seems to work better. 

> 
>> I can give you the full testing suite if you like. I run it on Mac OS 10.4 but it should compile with gcc on Linux. I will send it to you in a separate email because it is big.
> 
> Thanks, I'll have a look and see if I can make sense of it but not sure when will I find time.

Please let me know if you have any questions with it.

> 
>> I have another idea on how to improve QEMU's performance. What if you enabled more CPUs for the PowerPC target? Mac OS 9, Mac OS X, and Linux support multiple CPUs. It might actually be easier to do this than to
> 
> Have you tried if it works? I think MTTCG is enabled for PPC64 but not sure about 32 bit PPC. The mac99 machine seems to init multiple CPUs but not sure if they'll use MTTCG. But you could test it to see if it makes any difference.

I had completely forgot about MTTCG. I think Howard once did some performance testing with it and came back with favorable results. Maybe this is another avenue we should look at.

> 
>> improve the FPU. I imagine the performance increase with multiple emulated CPUs would be much more noticeable.
> 
> The Amiga like OSes I'm interested in don't use multiple cores so I'm mainly interested in improving single core performance. Also I'm not sure if (part of) your problem is slow FPU preventing fast enough audio decoding then having multiple CPUs with slow FPU would help as this may use a single thread anyway.

Good point. MTTCG might be the option that really helps with speed improvements.
BALATON Zoltan Feb. 26, 2020, 11:28 a.m. UTC | #17
On Wed, 26 Feb 2020, Programmingkid wrote:
> I think a timeout takes place and that is why audio stops playing. It is 
> probably an USB OHCI issue. The other USB controller seems to work 
> better.

Which other USB controller? Maybe you could try enabling some usb_ohci* 
traces and see if they reveal anything.

>> The Amiga like OSes I'm interested in don't use multiple cores so I'm 
>> mainly interested in improving single core performance. Also I'm not 
>> sure if (part of) your problem is slow FPU preventing fast enough audio 
>> decoding then having multiple CPUs with slow FPU would help as this may 
>> use a single thread anyway.
>
> Good point. MTTCG might be the option that really helps with speed improvements.

Only if you have multithreaded workload in the guest because AFAIK MTTCG 
only runs different vcpus in parallel, it won't make single emulated CPU 
faster in any way. OSX probably can benefit from having multiple cores 
emulated but I don't think MacOS would use it apart from some apps maybe.

Regards,
BALATON Zoltan
Alex Bennée Feb. 26, 2020, 12:28 p.m. UTC | #18
BALATON Zoltan <balaton@eik.bme.hu> writes:

> On Fri, 21 Feb 2020, Peter Maydell wrote:
>> On Fri, 21 Feb 2020 at 18:04, BALATON Zoltan <balaton@eik.bme.hu> wrote:
>>> On Fri, 21 Feb 2020, Peter Maydell wrote:
>>>> I think that is the wrong approach. Enabling use of the host
>>>> FPU should not affect the accuracy of the emulation, which
>>>> should remain bitwise-correct. We should only be using the
>>>> host FPU to the extent that we can do that without discarding
>>>> accuracy. As far as I'm aware that's how the hardfloat support
>>>> for other guest CPUs that use it works.

Correct - we only use hardfloat when we know it will give the same
result which is broadly when the inexact flag is already set and we are
dealing with normal floating point numbers.

We added a whole bunch of testing to ensure we maintain accuracy when
the code went in.

<snip>
>>
>> I don't know much about PPC, but if you can't emulate the
>> guest architecture accurately with the host FPU, then
>> don't use the host FPU. We used to have a kind of 'hardfloat'
>
> I don't know if it's possible or not to emulate these accurately and
> use the FPU but nobody did it for QEMU so far. But if someone knows a
> way please speak up then we can try to implement it. Unfortunately
> this would require more detailed knowledge about different FPU
> implementations (at least X86_64, ARM and PPC that are the mostly used
> platforms) than what I have or willing to spend time to learn.
>
>> support that was fast but inaccurate, but it was a mess
>> because it meant that most guest code sort of worked but
>> some guest code would confusingly misbehave. Deliberately
>> not correctly emulating the guest CPU/FPU behaviour is not
>> something I want us to return to.
>>
>> You're right that sometimes you can't get both speed
>> and accuracy; other emulators (and especially ones
>> which are trying to emulate games consoles) may choose
>> to prefer speed over accuracy. For QEMU we prefer to
>> choose accuracy over speed in this area.
>
> OK, then how about keeping the default accurate but allow to opt in to
> use FPU even if it's known to break some bits for workloads where
> users would need speed over accuracy and would be happy to live with
> the limitation.

About the only comparison I can think of is the thread=single:multi
flags for TCG which is mostly there to help developers eliminate causes
of bugs. The default for MTTCG is it is enabled when it's safe. If you
enable it via the command line where QEMU hasn't defaulted it on you
will get lots of loud warnings about potential instability. The most
commonly used case is thread=single when you want to check it's not a
MTTCG bug.

I'm as cautious as Peter here about adding a "faster but broken" command
line flag because users will invariably read up to the "faster" and then
spend a lot of time scratching their heads when things break.

> Note that i've found that just removing the define
> that disables hardfloat for PPC target makes VMX vector instructions
> faster while normal FPU is a little slower without any other changes
> so disabling hardfloat already limits performance for guests using VMX
> even when not using the FPU for cases when it would cause inaccuracy.
> If you say we want accuracy and don't care about speed, then just
> don't disable hardfloat as it helps at least VMX and then we can add
> option to allow the user to say we can use hardfloat even if it's
> inaccurate then they can test their workload and decide for
> themselves.
>
> Regards,
> BALATON Zoltan
luigi burdo Feb. 26, 2020, 1 p.m. UTC | #19
Hi Zoltan,
i can say MacOs Leopard use multiple cores on PowerMac G5 Quad the most of the apps did for  Panter/Tiger/leopard use for sure 2 Core in smtp only apps did for Tiger/leopard use more than 2 Cores.
Ciao and thenks
 Luigi
BALATON Zoltan Feb. 26, 2020, 1:07 p.m. UTC | #20
On Wed, 26 Feb 2020, Alex Bennée wrote:
> BALATON Zoltan <balaton@eik.bme.hu> writes:
>> OK, then how about keeping the default accurate but allow to opt in to
>> use FPU even if it's known to break some bits for workloads where
>> users would need speed over accuracy and would be happy to live with
>> the limitation.
>
> About the only comparison I can think of is the thread=single:multi
> flags for TCG which is mostly there to help developers eliminate causes
> of bugs. The default for MTTCG is it is enabled when it's safe. If you
> enable it via the command line where QEMU hasn't defaulted it on you
> will get lots of loud warnings about potential instability. The most
> commonly used case is thread=single when you want to check it's not a
> MTTCG bug.
>
> I'm as cautious as Peter here about adding a "faster but broken" command
> line flag because users will invariably read up to the "faster" and then
> spend a lot of time scratching their heads when things break.

OK understood. However this is specifically about PPC emulation where some 
bits exists in FPSCR that's hard to emulate with hardfloat but until 
recently these weren't correctly emulated in my understanding and things 
still ran OK so it's believed that these bits are seldom used. It was only 
fixed due to some tests revealing it not because of actual negative 
effects seen.

Unfortunately enabling hardfloat without any other changes would make the 
default softfloat case a bit slower so that's why I was asking about also 
making hardfloat case the default.

I can try to try some other approach to avoid always setting inexact bit 
and only keep it set as other archs do but not sure when will I have time 
to get there. Some help in this would be appreciated by those who know PPC 
FPU and interested in running it faster in QEMU.

Regards,
BALATON Zoltan
Dino Papararo Feb. 26, 2020, 1:08 p.m. UTC | #21
Please let's go with hardfloat pps support, it's really a good feature to implement.
Even if in a first step it could lead to inaccuracy results, later it could solved with other patches.

I think it's important for qemu to as global as possible and don't target only recent hardware.

Regards,
Dino Papararo

Da: Qemu-ppc <qemu-ppc-bounces+skizzato73=msn.com@nongnu.org> Per conto di luigi burdo
Inviato: mercoledì 26 febbraio 2020 14:01
A: BALATON Zoltan <balaton@eik.bme.hu>; Programmingkid <programmingkidx@gmail.com>
Cc: David Gibson <david@gibson.dropbear.id.au>; qemu-ppc@nongnu.org; qemu-devel qemu-devel <qemu-devel@nongnu.org>; Howard Spoelstra <hsp.cat7@gmail.com>
Oggetto: R: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC

Hi Zoltan,
i can say MacOs Leopard use multiple cores on PowerMac G5 Quad the most of the apps did for  Panter/Tiger/leopard use for sure 2 Core in smtp only apps did for Tiger/leopard use more than 2 Cores.
Ciao and thenks
 Luigi
Alex Bennée Feb. 26, 2020, 2:28 p.m. UTC | #22
Dino Papararo <skizzato73@msn.com> writes:

> Please let's go with hardfloat pps support, it's really a good feature to implement.
> Even if in a first step it could lead to inaccuracy results, later it
> could solved with other patches.

That's the wrong way around. We have regression tests for a reason. I'll
happily accept patches to turn on hardfloat for PPC if:

 a) they don't cause regressions in our fairly extensive floating point
 tests
 b) the PPC maintainers are happy with the new performance profile

The way forward would be to:

 1. patch to drop #if defined(TARGET_PPC) || defined(__FAST_MATH__)
 2. audit target/ppc/fpu_helper.c w.r.t chip manual and fix any unneeded
 splatting of flags (if any)
 3. measure the before/after performance effect and decide if on balance
 it's worth keeping

> I think it's important for qemu to as global as possible and don't
> target only recent hardware.

Are you referring to guests or hosts? For guests we will always favour
accuracy of speed of emulation. For hosts we need to have IEEE compliant
FPU HW to even stand a chance of using hardfloat.
Aleksandar Markovic Feb. 26, 2020, 3:50 p.m. UTC | #23
On Wed, Feb 26, 2020 at 3:29 PM Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Dino Papararo <skizzato73@msn.com> writes:
>
> > Please let's go with hardfloat pps support, it's really a good feature
to implement.
> > Even if in a first step it could lead to inaccuracy results, later it
> > could solved with other patches.
>
> That's the wrong way around. We have regression tests for a reason.

I tend to agree with Alex here, and additionally want to expand more on
this topic.

In my view: (that I think is at least very close to the community consensus)

This is *not* a ppc-specific issue. There exist a principle across all
targets
that QEMU FPU calculation must be accurate - exactly as specified in any
applicable particular ISA document. Any discrepancy is an outright bug.

We even recently had several patches for FPU in ppc target that handled
some fairly obscure cases of inaccuracies, I believe they were authored
by Paul Clarke, so there are people in ppc community that care about
FPU accuracy (as I guess is the case for any target).

There shouldn't be a target that decides by itself and within itself
"ok, we don't need accuracy, let's trade it for speed". This violates
the architecture of QEMU. Please allow that for any given software
project, there is an architecture that should be respected.

This doesn't mean that anybody's experimentation is discouraged. No-one
can stop anybody from forking from QEMU upstream tree and do whatever
is wanted.

But, this doesn't mean such experimentation will be upstreamed. QEMU
upstream should be collecting place for the best ideas and implementations,
not for arbitrary experimentations.

Best regards,
Aleksandar


> I'll happily accept patches to turn on hardfloat for PPC if:
>
>  a) they don't cause regressions in our fairly extensive floating point
>  tests
>  b) the PPC maintainers are happy with the new performance profile
>
> The way forward would be to:
>
>  1. patch to drop #if defined(TARGET_PPC) || defined(__FAST_MATH__)
>  2. audit target/ppc/fpu_helper.c w.r.t chip manual and fix any unneeded
>  splatting of flags (if any)
>  3. measure the before/after performance effect and decide if on balance
>  it's worth keeping
>
> > I think it's important for qemu to as global as possible and don't
> > target only recent hardware.
>
> Are you referring to guests or hosts? For guests we will always favour
> accuracy of speed of emulation. For hosts we need to have IEEE compliant
> FPU HW to even stand a chance of using hardfloat.
>
> --
> Alex Bennée
>
Programmingkid Feb. 26, 2020, 5:04 p.m. UTC | #24
Accuracy is an important part of the IEEE 754 floating point standard. The
whole purpose of this standard is to ensure floating point calculations are
consistent across multiple CPUs. I believe referring to this patch as
inaccurate is itself inaccurate. That gives the impression that this patch
produces calculations that are not inline with established standards. This
is not true. The only part of this patch that will produce incorrect values
are the flags. There *may* be a program or two out there that depend on
these flags, but for the majority of programs that only care about basic
floating point arithmetic this patch will produce correct values. Currently
the emulated PowerPC's FPU already produces wrong values for the flags.
This patch does set the Inexact flag (which I don't like), but since I have
never encountered any source code that cares for this flag, I can let it
go. I think giving the user the ability to decide which option to use is
the best thing to do.

On Wed, Feb 26, 2020 at 10:51 AM Aleksandar Markovic <
aleksandar.m.mail@gmail.com> wrote:

>
>
> On Wed, Feb 26, 2020 at 3:29 PM Alex Bennée <alex.bennee@linaro.org>
> wrote:
> >
> >
> > Dino Papararo <skizzato73@msn.com> writes:
> >
> > > Please let's go with hardfloat pps support, it's really a good feature
> to implement.
> > > Even if in a first step it could lead to inaccuracy results, later it
> > > could solved with other patches.
> >
> > That's the wrong way around. We have regression tests for a reason.
>
> I tend to agree with Alex here, and additionally want to expand more on
> this topic.
>
> In my view: (that I think is at least very close to the community
> consensus)
>
> This is *not* a ppc-specific issue. There exist a principle across all
> targets
> that QEMU FPU calculation must be accurate - exactly as specified in any
> applicable particular ISA document. Any discrepancy is an outright bug.
>
> We even recently had several patches for FPU in ppc target that handled
> some fairly obscure cases of inaccuracies, I believe they were authored
> by Paul Clarke, so there are people in ppc community that care about
> FPU accuracy (as I guess is the case for any target).
>
> There shouldn't be a target that decides by itself and within itself
> "ok, we don't need accuracy, let's trade it for speed". This violates
> the architecture of QEMU. Please allow that for any given software
> project, there is an architecture that should be respected.
>
> This doesn't mean that anybody's experimentation is discouraged. No-one
> can stop anybody from forking from QEMU upstream tree and do whatever
> is wanted.
>
> But, this doesn't mean such experimentation will be upstreamed. QEMU
> upstream should be collecting place for the best ideas and implementations,
> not for arbitrary experimentations.
>
> Best regards,
> Aleksandar
>
>
> > I'll happily accept patches to turn on hardfloat for PPC if:
> >
> >  a) they don't cause regressions in our fairly extensive floating point
> >  tests
> >  b) the PPC maintainers are happy with the new performance profile
> >
> > The way forward would be to:
> >
> >  1. patch to drop #if defined(TARGET_PPC) || defined(__FAST_MATH__)
> >  2. audit target/ppc/fpu_helper.c w.r.t chip manual and fix any unneeded
> >  splatting of flags (if any)
> >  3. measure the before/after performance effect and decide if on balance
> >  it's worth keeping
> >
> > > I think it's important for qemu to as global as possible and don't
> > > target only recent hardware.
> >
> > Are you referring to guests or hosts? For guests we will always favour
> > accuracy of speed of emulation. For hosts we need to have IEEE compliant
> > FPU HW to even stand a chance of using hardfloat.
> >
> > --
> > Alex Bennée
> >
>
Aleksandar Markovic Feb. 26, 2020, 5:27 p.m. UTC | #25
On Wed, Feb 26, 2020 at 6:04 PM G 3 <programmingkidx@gmail.com> wrote:
>
> Accuracy is an important part of the IEEE 754 floating point standard. The whole purpose of this standard is to ensure floating point calculations are consistent across multiple CPUs. I believe referring to this patch as inaccurate is itself inaccurate. That gives the impression that this patch produces calculations that are not inline with established standards. This is not true. The only part of this patch that will produce incorrect values are the flags. There *may* be a program or two out there that depend on these flags, but for the majority of programs that only care about basic floating point arithmetic this patch will produce correct values. Currently the emulated PowerPC's FPU already produces wrong values for the flags. This patch does set the Inexact flag (which I don't like), but since I have never encountered any source code that cares for this flag, I can let it go. I think giving the user the ability to decide which option to use is the best thing to do.
>

From the experiments described above, the patch in question changes the behavior
of applications (for example, sound is different with and without the
patch), which is
in contradiction with your claim that you "never encountered any
source code that
cares for this flag" and that "the only part of this patch that will
produce incorrect
values are the flags".

In other words, and playing further with them:

The claim that "referring to this patch as inaccurate is itself
inaccurate" is itself inaccurate.

Best regards,
Aleksandar


> On Wed, Feb 26, 2020 at 10:51 AM Aleksandar Markovic <aleksandar.m.mail@gmail.com> wrote:
>>
>>
>>
>> On Wed, Feb 26, 2020 at 3:29 PM Alex Bennée <alex.bennee@linaro.org> wrote:
>> >
>> >
>> > Dino Papararo <skizzato73@msn.com> writes:
>> >
>> > > Please let's go with hardfloat pps support, it's really a good feature to implement.
>> > > Even if in a first step it could lead to inaccuracy results, later it
>> > > could solved with other patches.
>> >
>> > That's the wrong way around. We have regression tests for a reason.
>>
>> I tend to agree with Alex here, and additionally want to expand more on
>> this topic.
>>
>> In my view: (that I think is at least very close to the community consensus)
>>
>> This is *not* a ppc-specific issue. There exist a principle across all targets
>> that QEMU FPU calculation must be accurate - exactly as specified in any
>> applicable particular ISA document. Any discrepancy is an outright bug.
>>
>> We even recently had several patches for FPU in ppc target that handled
>> some fairly obscure cases of inaccuracies, I believe they were authored
>> by Paul Clarke, so there are people in ppc community that care about
>> FPU accuracy (as I guess is the case for any target).
>>
>> There shouldn't be a target that decides by itself and within itself
>> "ok, we don't need accuracy, let's trade it for speed". This violates
>> the architecture of QEMU. Please allow that for any given software
>> project, there is an architecture that should be respected.
>>
>> This doesn't mean that anybody's experimentation is discouraged. No-one
>> can stop anybody from forking from QEMU upstream tree and do whatever
>> is wanted.
>>
>> But, this doesn't mean such experimentation will be upstreamed. QEMU
>> upstream should be collecting place for the best ideas and implementations,
>> not for arbitrary experimentations.
>>
>> Best regards,
>> Aleksandar
>>
>>
>> > I'll happily accept patches to turn on hardfloat for PPC if:
>> >
>> >  a) they don't cause regressions in our fairly extensive floating point
>> >  tests
>> >  b) the PPC maintainers are happy with the new performance profile
>> >
>> > The way forward would be to:
>> >
>> >  1. patch to drop #if defined(TARGET_PPC) || defined(__FAST_MATH__)
>> >  2. audit target/ppc/fpu_helper.c w.r.t chip manual and fix any unneeded
>> >  splatting of flags (if any)
>> >  3. measure the before/after performance effect and decide if on balance
>> >  it's worth keeping
>> >
>> > > I think it's important for qemu to as global as possible and don't
>> > > target only recent hardware.
>> >
>> > Are you referring to guests or hosts? For guests we will always favour
>> > accuracy of speed of emulation. For hosts we need to have IEEE compliant
>> > FPU HW to even stand a chance of using hardfloat.
>> >
>> > --
>> > Alex Bennée
>> >
Alex Bennée Feb. 26, 2020, 6:09 p.m. UTC | #26
G 3 <programmingkidx@gmail.com> writes:

> Accuracy is an important part of the IEEE 754 floating point standard. The
> whole purpose of this standard is to ensure floating point calculations are
> consistent across multiple CPUs. I believe referring to this patch as
> inaccurate is itself inaccurate. That gives the impression that this patch
> produces calculations that are not inline with established standards. This
> is not true. The only part of this patch that will produce incorrect values
> are the flags.

As I stated further up the thread I'd be happy to take a patch that
turns this on without the messing about with the FPU flags which AFAICT
breaks the architectural compliance of those instructions. The ability
to detect inexact results is part of the IEEE spec even if it is
considerably looser about when the flag should be reset.

> There *may* be a program or two out there that depend on
> these flags, but for the majority of programs that only care about basic
> floating point arithmetic this patch will produce correct values. Currently
> the emulated PowerPC's FPU already produces wrong values for the
> flags.

Those are bugs that should be fixed. The state of the flags after a
calculation should be considered part of the "values" generated by the
FPU.

> This patch does set the Inexact flag (which I don't like), but since I have
> never encountered any source code that cares for this flag, I can let it
> go. I think giving the user the ability to decide which option to use is
> the best thing to do.

Giving the user the option to break things is a poor flag, most of QEMUs
configurable knobs are about trading of size/speed without affecting
correctness.

If the PPC maintainers are happy that hardfloat's speed trade-offs are
worth it for usual workloads (whatever they may be) then I have no
objection to making defaulting hardfloat to on - perhaps* even with an
option to force softfloat if it faster for some workloads.

>
> On Wed, Feb 26, 2020 at 10:51 AM Aleksandar Markovic <
> aleksandar.m.mail@gmail.com> wrote:
>
>>
>>
>> On Wed, Feb 26, 2020 at 3:29 PM Alex Bennée <alex.bennee@linaro.org>
>> wrote:
>> >
>> >
>> > Dino Papararo <skizzato73@msn.com> writes:
>> >
>> > > Please let's go with hardfloat pps support, it's really a good feature
>> to implement.
>> > > Even if in a first step it could lead to inaccuracy results, later it
>> > > could solved with other patches.
>> >
>> > That's the wrong way around. We have regression tests for a reason.
>>
>> I tend to agree with Alex here, and additionally want to expand more on
>> this topic.
>>
>> In my view: (that I think is at least very close to the community
>> consensus)
>>
>> This is *not* a ppc-specific issue. There exist a principle across all
>> targets
>> that QEMU FPU calculation must be accurate - exactly as specified in any
>> applicable particular ISA document. Any discrepancy is an outright bug.
>>
>> We even recently had several patches for FPU in ppc target that handled
>> some fairly obscure cases of inaccuracies, I believe they were authored
>> by Paul Clarke, so there are people in ppc community that care about
>> FPU accuracy (as I guess is the case for any target).
>>
>> There shouldn't be a target that decides by itself and within itself
>> "ok, we don't need accuracy, let's trade it for speed". This violates
>> the architecture of QEMU. Please allow that for any given software
>> project, there is an architecture that should be respected.
>>
>> This doesn't mean that anybody's experimentation is discouraged. No-one
>> can stop anybody from forking from QEMU upstream tree and do whatever
>> is wanted.
>>
>> But, this doesn't mean such experimentation will be upstreamed. QEMU
>> upstream should be collecting place for the best ideas and implementations,
>> not for arbitrary experimentations.
>>
>> Best regards,
>> Aleksandar
>>
>>
>> > I'll happily accept patches to turn on hardfloat for PPC if:
>> >
>> >  a) they don't cause regressions in our fairly extensive floating point
>> >  tests
>> >  b) the PPC maintainers are happy with the new performance profile
>> >
>> > The way forward would be to:
>> >
>> >  1. patch to drop #if defined(TARGET_PPC) || defined(__FAST_MATH__)
>> >  2. audit target/ppc/fpu_helper.c w.r.t chip manual and fix any unneeded
>> >  splatting of flags (if any)
>> >  3. measure the before/after performance effect and decide if on balance
>> >  it's worth keeping
>> >
>> > > I think it's important for qemu to as global as possible and don't
>> > > target only recent hardware.
>> >
>> > Are you referring to guests or hosts? For guests we will always favour
>> > accuracy of speed of emulation. For hosts we need to have IEEE compliant
>> > FPU HW to even stand a chance of using hardfloat.
>> >
>> > --
>> > Alex Bennée
>> >
>>
Dino Papararo Feb. 26, 2020, 6:14 p.m. UTC | #27
I think we all agree the best solution is to resolve powerpc issues about hardfloat current implementation.
I think also powerpc is an important branch of qemu, for hystorical, present and (why not?) future reasons, and it must NOT be left behind.
So I would invite best Qemu community's skilled programmers to work on this and solve the issue maybe in few days.
The same group who worked on recent altivec optimizations is able to make a good patch even for this.

In a subordinate way I'd like to implement anyway hardfloat support for powerpc, advising users about inaccurancy of results/flags and letting them choose.
Of course I understand, and in part agree, on all your objections. 
Simply I prefer have always a choice.

Best Regards,
Dino Papararo

-----Messaggio originale-----
Da: Aleksandar Markovic <aleksandar.m.mail@gmail.com> 
Inviato: mercoledì 26 febbraio 2020 18:27
A: G 3 <programmingkidx@gmail.com>
Cc: Alex Bennée <alex.bennee@linaro.org>; Dino Papararo <skizzato73@msn.com>; QEMU Developers <qemu-devel@nongnu.org>; qemu-ppc@nongnu.org; Howard Spoelstra <hsp.cat7@gmail.com>; luigi burdo <intermediadc@hotmail.com>; David Gibson <david@gibson.dropbear.id.au>
Oggetto: Re: R: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC

On Wed, Feb 26, 2020 at 6:04 PM G 3 <programmingkidx@gmail.com> wrote:
>
> Accuracy is an important part of the IEEE 754 floating point standard. The whole purpose of this standard is to ensure floating point calculations are consistent across multiple CPUs. I believe referring to this patch as inaccurate is itself inaccurate. That gives the impression that this patch produces calculations that are not inline with established standards. This is not true. The only part of this patch that will produce incorrect values are the flags. There *may* be a program or two out there that depend on these flags, but for the majority of programs that only care about basic floating point arithmetic this patch will produce correct values. Currently the emulated PowerPC's FPU already produces wrong values for the flags. This patch does set the Inexact flag (which I don't like), but since I have never encountered any source code that cares for this flag, I can let it go. I think giving the user the ability to decide which option to use is the best thing to do.
>

From the experiments described above, the patch in question changes the behavior of applications (for example, sound is different with and without the patch), which is in contradiction with your claim that you "never encountered any source code that cares for this flag" and that "the only part of this patch that will produce incorrect values are the flags".

In other words, and playing further with them:

The claim that "referring to this patch as inaccurate is itself inaccurate" is itself inaccurate.

Best regards,
Aleksandar


> On Wed, Feb 26, 2020 at 10:51 AM Aleksandar Markovic <aleksandar.m.mail@gmail.com> wrote:
>>
>>
>>
>> On Wed, Feb 26, 2020 at 3:29 PM Alex Bennée <alex.bennee@linaro.org> wrote:
>> >
>> >
>> > Dino Papararo <skizzato73@msn.com> writes:
>> >
>> > > Please let's go with hardfloat pps support, it's really a good feature to implement.
>> > > Even if in a first step it could lead to inaccuracy results, 
>> > > later it could solved with other patches.
>> >
>> > That's the wrong way around. We have regression tests for a reason.
>>
>> I tend to agree with Alex here, and additionally want to expand more 
>> on this topic.
>>
>> In my view: (that I think is at least very close to the community 
>> consensus)
>>
>> This is *not* a ppc-specific issue. There exist a principle across 
>> all targets that QEMU FPU calculation must be accurate - exactly as 
>> specified in any applicable particular ISA document. Any discrepancy is an outright bug.
>>
>> We even recently had several patches for FPU in ppc target that 
>> handled some fairly obscure cases of inaccuracies, I believe they 
>> were authored by Paul Clarke, so there are people in ppc community 
>> that care about FPU accuracy (as I guess is the case for any target).
>>
>> There shouldn't be a target that decides by itself and within itself 
>> "ok, we don't need accuracy, let's trade it for speed". This violates 
>> the architecture of QEMU. Please allow that for any given software 
>> project, there is an architecture that should be respected.
>>
>> This doesn't mean that anybody's experimentation is discouraged. 
>> No-one can stop anybody from forking from QEMU upstream tree and do 
>> whatever is wanted.
>>
>> But, this doesn't mean such experimentation will be upstreamed. QEMU 
>> upstream should be collecting place for the best ideas and 
>> implementations, not for arbitrary experimentations.
>>
>> Best regards,
>> Aleksandar
>>
>>
>> > I'll happily accept patches to turn on hardfloat for PPC if:
>> >
>> >  a) they don't cause regressions in our fairly extensive floating 
>> > point  tests
>> >  b) the PPC maintainers are happy with the new performance profile
>> >
>> > The way forward would be to:
>> >
>> >  1. patch to drop #if defined(TARGET_PPC) || defined(__FAST_MATH__)  
>> > 2. audit target/ppc/fpu_helper.c w.r.t chip manual and fix any 
>> > unneeded  splatting of flags (if any)  3. measure the before/after 
>> > performance effect and decide if on balance  it's worth keeping
>> >
>> > > I think it's important for qemu to as global as possible and 
>> > > don't target only recent hardware.
>> >
>> > Are you referring to guests or hosts? For guests we will always 
>> > favour accuracy of speed of emulation. For hosts we need to have 
>> > IEEE compliant FPU HW to even stand a chance of using hardfloat.
>> >
>> > --
>> > Alex Bennée
>> >
Aleksandar Markovic Feb. 26, 2020, 6:51 p.m. UTC | #28
On Wed, Feb 26, 2020 at 7:14 PM Dino Papararo <skizzato73@msn.com> wrote:
>
> I think we all agree the best solution is to resolve powerpc issues about hardfloat current implementation.
> I think also powerpc is an important branch of qemu, for hystorical, present and (why not?) future reasons, and it must NOT be left behind.
> So I would invite best Qemu community's skilled programmers to work on this and solve the issue maybe in few days.
> The same group who worked on recent altivec optimizations is able to make a good patch even for this.
>
> In a subordinate way I'd like to implement anyway hardfloat support for powerpc, advising users about inaccurancy of results/flags and letting them choose.

Just to be clear, concluding from your surrounding sentences, you don't
mean here to implement "hardfloat support for powerpc", but something
like "improve performance of ppc FPU emulation by misusing hardfloat
support and giving up accuracy". I think we have to be honest between
ourselves.

Regards,
Aleksandar

> Of course I understand, and in part agree, on all your objections.
> Simply I prefer have always a choice.
>
> Best Regards,
> Dino Papararo
>
> -----Messaggio originale-----
> Da: Aleksandar Markovic <aleksandar.m.mail@gmail.com>
> Inviato: mercoledì 26 febbraio 2020 18:27
> A: G 3 <programmingkidx@gmail.com>
> Cc: Alex Bennée <alex.bennee@linaro.org>; Dino Papararo <skizzato73@msn.com>; QEMU Developers <qemu-devel@nongnu.org>; qemu-ppc@nongnu.org; Howard Spoelstra <hsp.cat7@gmail.com>; luigi burdo <intermediadc@hotmail.com>; David Gibson <david@gibson.dropbear.id.au>
> Oggetto: Re: R: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
>
> On Wed, Feb 26, 2020 at 6:04 PM G 3 <programmingkidx@gmail.com> wrote:
> >
> > Accuracy is an important part of the IEEE 754 floating point standard. The whole purpose of this standard is to ensure floating point calculations are consistent across multiple CPUs. I believe referring to this patch as inaccurate is itself inaccurate. That gives the impression that this patch produces calculations that are not inline with established standards. This is not true. The only part of this patch that will produce incorrect values are the flags. There *may* be a program or two out there that depend on these flags, but for the majority of programs that only care about basic floating point arithmetic this patch will produce correct values. Currently the emulated PowerPC's FPU already produces wrong values for the flags. This patch does set the Inexact flag (which I don't like), but since I have never encountered any source code that cares for this flag, I can let it go. I think giving the user the ability to decide which option to use is the best thing to do.
> >
>
> From the experiments described above, the patch in question changes the behavior of applications (for example, sound is different with and without the patch), which is in contradiction with your claim that you "never encountered any source code that cares for this flag" and that "the only part of this patch that will produce incorrect values are the flags".
>
> In other words, and playing further with them:
>
> The claim that "referring to this patch as inaccurate is itself inaccurate" is itself inaccurate.
>
> Best regards,
> Aleksandar
>
>
> > On Wed, Feb 26, 2020 at 10:51 AM Aleksandar Markovic <aleksandar.m.mail@gmail.com> wrote:
> >>
> >>
> >>
> >> On Wed, Feb 26, 2020 at 3:29 PM Alex Bennée <alex.bennee@linaro.org> wrote:
> >> >
> >> >
> >> > Dino Papararo <skizzato73@msn.com> writes:
> >> >
> >> > > Please let's go with hardfloat pps support, it's really a good feature to implement.
> >> > > Even if in a first step it could lead to inaccuracy results,
> >> > > later it could solved with other patches.
> >> >
> >> > That's the wrong way around. We have regression tests for a reason.
> >>
> >> I tend to agree with Alex here, and additionally want to expand more
> >> on this topic.
> >>
> >> In my view: (that I think is at least very close to the community
> >> consensus)
> >>
> >> This is *not* a ppc-specific issue. There exist a principle across
> >> all targets that QEMU FPU calculation must be accurate - exactly as
> >> specified in any applicable particular ISA document. Any discrepancy is an outright bug.
> >>
> >> We even recently had several patches for FPU in ppc target that
> >> handled some fairly obscure cases of inaccuracies, I believe they
> >> were authored by Paul Clarke, so there are people in ppc community
> >> that care about FPU accuracy (as I guess is the case for any target).
> >>
> >> There shouldn't be a target that decides by itself and within itself
> >> "ok, we don't need accuracy, let's trade it for speed". This violates
> >> the architecture of QEMU. Please allow that for any given software
> >> project, there is an architecture that should be respected.
> >>
> >> This doesn't mean that anybody's experimentation is discouraged.
> >> No-one can stop anybody from forking from QEMU upstream tree and do
> >> whatever is wanted.
> >>
> >> But, this doesn't mean such experimentation will be upstreamed. QEMU
> >> upstream should be collecting place for the best ideas and
> >> implementations, not for arbitrary experimentations.
> >>
> >> Best regards,
> >> Aleksandar
> >>
> >>
> >> > I'll happily accept patches to turn on hardfloat for PPC if:
> >> >
> >> >  a) they don't cause regressions in our fairly extensive floating
> >> > point  tests
> >> >  b) the PPC maintainers are happy with the new performance profile
> >> >
> >> > The way forward would be to:
> >> >
> >> >  1. patch to drop #if defined(TARGET_PPC) || defined(__FAST_MATH__)
> >> > 2. audit target/ppc/fpu_helper.c w.r.t chip manual and fix any
> >> > unneeded  splatting of flags (if any)  3. measure the before/after
> >> > performance effect and decide if on balance  it's worth keeping
> >> >
> >> > > I think it's important for qemu to as global as possible and
> >> > > don't target only recent hardware.
> >> >
> >> > Are you referring to guests or hosts? For guests we will always
> >> > favour accuracy of speed of emulation. For hosts we need to have
> >> > IEEE compliant FPU HW to even stand a chance of using hardfloat.
> >> >
> >> > --
> >> > Alex Bennée
> >> >
BALATON Zoltan Feb. 26, 2020, 10:51 p.m. UTC | #29
On Wed, 26 Feb 2020, Alex Bennée wrote:
> That's the wrong way around. We have regression tests for a reason. I'll
> happily accept patches to turn on hardfloat for PPC if:
>
> a) they don't cause regressions in our fairly extensive floating point
> tests

Where are these tests and how to run them? I'm not aware of such tests so 
I've only tried running simple guest code to test changes but if there are 
more extensive FP tests I'd better use those.

> b) the PPC maintainers are happy with the new performance profile
>
> The way forward would be to:
>
> 1. patch to drop #if defined(TARGET_PPC) || defined(__FAST_MATH__)

This is simple but I've found that while it seems to make some vector 
instructions faster it also makes most simple FP ops slower because it 
will go thorugh checking if it can use hardfloat but never can because the 
fp_status is cleared before every FP op. That's why I've set inexact bit 
to let it use hardfloat and be able to test if it would work at all. 
That's all my RFC patch did, I've had a 2nd version trying to avoid slow 
down with above #if defined() dropped but hardfloat=false so it only uses 
softfloat as before but it did not worked out too well, some tests said v2 
was even slower. Maybe to avoid overhead we should add a variable instead 
of the QEMU_NO_HARDFLOAT define that can be set during runtime but 
probably that won't be faster either. Thus it seems there's no way to 
enable hardfloat for PPC and not have slower performance for most FP ops 
without also doing some of the other points below (even if it's beneficial 
for vector ops).

> 2. audit target/ppc/fpu_helper.c w.r.t chip manual and fix any unneeded
> splatting of flags (if any)

This would either need someone who knows PPC FPU or someone who can take 
the time to learn and go through the code. Not sure I want to volunteer 
for that. But I think the clearing of the flags is mainly to emulate FI 
bit which is an non-sticky inexact bit that should show the inexact status 
of last FP op. (There's another simliar bit for fraction rounded as well 
but that does not disable hardfloat.) Question is if we really want to 
accurately emulate these bits? Are there any software we care about 
relying on these? If we can live with not having correct FI bit emulation 
(which was the case for a long time until these were fixed) then we could 
have an easy way to enable hardfloat without more extensive changes. If we 
want to accurately emulate also these bits then we probably will need 
changes to softfloat to allow registering FP exception handlers so we 
don't have to clear and check bits but can get an exception from FPU and 
then can set those bits but I have no idea how to do that.

Regards,
BALATON Zoltan
Programmingkid Feb. 27, 2020, 2:43 a.m. UTC | #30
> On Feb 26, 2020, at 12:27 PM, Aleksandar Markovic <aleksandar.m.mail@gmail.com> wrote:
> 
> On Wed, Feb 26, 2020 at 6:04 PM G 3 <programmingkidx@gmail.com> wrote:
>> 
>> Accuracy is an important part of the IEEE 754 floating point standard. The whole purpose of this standard is to ensure floating point calculations are consistent across multiple CPUs. I believe referring to this patch as inaccurate is itself inaccurate. That gives the impression that this patch produces calculations that are not inline with established standards. This is not true. The only part of this patch that will produce incorrect values are the flags. There *may* be a program or two out there that depend on these flags, but for the majority of programs that only care about basic floating point arithmetic this patch will produce correct values. Currently the emulated PowerPC's FPU already produces wrong values for the flags. This patch does set the Inexact flag (which I don't like), but since I have never encountered any source code that cares for this flag, I can let it go. I think giving the user the ability to decide which option to use is the best thing to do.
>> 
> 
> From the experiments described above, the patch in question changes the behavior
> of applications (for example, sound is different with and without the
> patch), which is
> in contradiction with your claim that you "never encountered any
> source code that
> cares for this flag" and that "the only part of this patch that will
> produce incorrect
> values are the flags".
> 
> In other words, and playing further with them:
> 
> The claim that "referring to this patch as inaccurate is itself
> inaccurate" is itself inaccurate.
> 
> Best regards,
> Aleksandar

It is inaccurate to state that just because the USB audio device seems to play better with the hardfloat feature enabled that this changes the fact that I have yet to see any source code that actually reviews the flags. I have reviewed both the USB audio device and Apple's AppleUSBAudio class code and have not seen any mention of the exception flags.
Aleksandar Markovic Feb. 27, 2020, 7:16 a.m. UTC | #31
On Thursday, February 27, 2020, Programmingkid <programmingkidx@gmail.com>
wrote:

>
> > On Feb 26, 2020, at 12:27 PM, Aleksandar Markovic <
> aleksandar.m.mail@gmail.com> wrote:
> >
> > On Wed, Feb 26, 2020 at 6:04 PM G 3 <programmingkidx@gmail.com> wrote:
> >>
> >> Accuracy is an important part of the IEEE 754 floating point standard.
> The whole purpose of this standard is to ensure floating point calculations
> are consistent across multiple CPUs. I believe referring to this patch as
> inaccurate is itself inaccurate. That gives the impression that this patch
> produces calculations that are not inline with established standards. This
> is not true. The only part of this patch that will produce incorrect values
> are the flags. There *may* be a program or two out there that depend on
> these flags, but for the majority of programs that only care about basic
> floating point arithmetic this patch will produce correct values. Currently
> the emulated PowerPC's FPU already produces wrong values for the flags.
> This patch does set the Inexact flag (which I don't like), but since I have
> never encountered any source code that cares for this flag, I can let it
> go. I think giving the user the ability to decide which option to use is
> the best thing to do.
> >>
> >
> > From the experiments described above, the patch in question changes the
> behavior
> > of applications (for example, sound is different with and without the
> > patch), which is
> > in contradiction with your claim that you "never encountered any
> > source code that
> > cares for this flag" and that "the only part of this patch that will
> > produce incorrect
> > values are the flags".
> >
> > In other words, and playing further with them:
> >
> > The claim that "referring to this patch as inaccurate is itself
> > inaccurate" is itself inaccurate.
> >
> > Best regards,
> > Aleksandar
>
> It is inaccurate to state that just because the USB audio device seems to
> play better with the hardfloat feature enabled that this changes the fact
> that I have yet to see any source code that actually reviews the flags. I
> have reviewed both the USB audio device and Apple's AppleUSBAudio class
> code and have not seen any mention of the exception flags.
>
>
I totally disagree with your using the term "hardfloat feature enabled" in
this context, speaking about this particulat patch. This may be just
wishful thinking. The right wording would be "hardfloat feature hacked", or
"hardfloat feature fooled".

The patch itself has the wrong, intentionally misleading and confusing
title from the outset. It should be something like  "target/ppc: Cheat
hardfloat feature into beleiving that inexact flag is always set"

Best regards,
Aleksabdar
BALATON Zoltan Feb. 27, 2020, 11:54 a.m. UTC | #32
On Thu, 27 Feb 2020, Aleksandar Markovic wrote:
> I totally disagree with your using the term "hardfloat feature enabled" in
> this context, speaking about this particulat patch. This may be just
> wishful thinking. The right wording would be "hardfloat feature hacked", or
> "hardfloat feature fooled".
>
> The patch itself has the wrong, intentionally misleading and confusing
> title from the outset. It should be something like  "target/ppc: Cheat
> hardfloat feature into beleiving that inexact flag is always set"

May I point out that the patch is RFC, meaning it's not meant to be merged 
only to test it and provide feedback. Also the limitations were stated in 
the commit message. There's no other easy way that I know to test if 
hardfloat would work with PPC than forcing inexact bit to have it run with 
hardfloat most of the time. Once it's tested what regression this would 
cause (other than the expected inexact bit) then we can see if there are 
any other problem with hardfloat and PPC or only this bit. Then we can 
either change it to only not clear inexact bit like it's done on other 
archs or do something else as even not clearing sticky inexact bit would 
break the non-sticky counterpart PPC has. Breakage would be limited to the 
non-sticky version and discussion was about if even that's unacceptable.

Regards,
BALATON Zoltan
Programmingkid March 2, 2020, 12:13 a.m. UTC | #33
> On Feb 26, 2020, at 1:09 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
> 
> 
> G 3 <programmingkidx@gmail.com> writes:
> 
>> Accuracy is an important part of the IEEE 754 floating point standard. The
>> whole purpose of this standard is to ensure floating point calculations are
>> consistent across multiple CPUs. I believe referring to this patch as
>> inaccurate is itself inaccurate. That gives the impression that this patch
>> produces calculations that are not inline with established standards. This
>> is not true. The only part of this patch that will produce incorrect values
>> are the flags.
> 
> As I stated further up the thread I'd be happy to take a patch that
> turns this on without the messing about with the FPU flags which AFAICT
> breaks the architectural compliance of those instructions. The ability
> to detect inexact results is part of the IEEE spec even if it is
> considerably looser about when the flag should be reset.
> 
>> There *may* be a program or two out there that depend on
>> these flags, but for the majority of programs that only care about basic
>> floating point arithmetic this patch will produce correct values. Currently
>> the emulated PowerPC's FPU already produces wrong values for the
>> flags.
> 
> Those are bugs that should be fixed. The state of the flags after a
> calculation should be considered part of the "values" generated by the
> FPU.
> 
>> This patch does set the Inexact flag (which I don't like), but since I have
>> never encountered any source code that cares for this flag, I can let it
>> go. I think giving the user the ability to decide which option to use is
>> the best thing to do.
> 
> Giving the user the option to break things is a poor flag, most of QEMUs
> configurable knobs are about trading of size/speed without affecting
> correctness.
> 
> If the PPC maintainers are happy that hardfloat's speed trade-offs are
> worth it for usual workloads (whatever they may be) then I have no
> objection to making defaulting hardfloat to on - perhaps* even with an
> option to force softfloat if it faster for some workloads.
> 
>> 
>> On Wed, Feb 26, 2020 at 10:51 AM Aleksandar Markovic <
>> aleksandar.m.mail@gmail.com> wrote:
>> 
>>> 
>>> 
>>> On Wed, Feb 26, 2020 at 3:29 PM Alex Bennée <alex.bennee@linaro.org>
>>> wrote:
>>>> 
>>>> 
>>>> Dino Papararo <skizzato73@msn.com> writes:
>>>> 
>>>>> Please let's go with hardfloat pps support, it's really a good feature
>>> to implement.
>>>>> Even if in a first step it could lead to inaccuracy results, later it
>>>>> could solved with other patches.
>>>> 
>>>> That's the wrong way around. We have regression tests for a reason.
>>> 
>>> I tend to agree with Alex here, and additionally want to expand more on
>>> this topic.
>>> 
>>> In my view: (that I think is at least very close to the community
>>> consensus)
>>> 
>>> This is *not* a ppc-specific issue. There exist a principle across all
>>> targets
>>> that QEMU FPU calculation must be accurate - exactly as specified in any
>>> applicable particular ISA document. Any discrepancy is an outright bug.
>>> 
>>> We even recently had several patches for FPU in ppc target that handled
>>> some fairly obscure cases of inaccuracies, I believe they were authored
>>> by Paul Clarke, so there are people in ppc community that care about
>>> FPU accuracy (as I guess is the case for any target).
>>> 
>>> There shouldn't be a target that decides by itself and within itself
>>> "ok, we don't need accuracy, let's trade it for speed". This violates
>>> the architecture of QEMU. Please allow that for any given software
>>> project, there is an architecture that should be respected.
>>> 
>>> This doesn't mean that anybody's experimentation is discouraged. No-one
>>> can stop anybody from forking from QEMU upstream tree and do whatever
>>> is wanted.
>>> 
>>> But, this doesn't mean such experimentation will be upstreamed. QEMU
>>> upstream should be collecting place for the best ideas and implementations,
>>> not for arbitrary experimentations.
>>> 
>>> Best regards,
>>> Aleksandar
>>> 
>>> 
>>>> I'll happily accept patches to turn on hardfloat for PPC if:
>>>> 
>>>> a) they don't cause regressions in our fairly extensive floating point
>>>> tests
>>>> b) the PPC maintainers are happy with the new performance profile
>>>> 
>>>> The way forward would be to:
>>>> 
>>>> 1. patch to drop #if defined(TARGET_PPC) || defined(__FAST_MATH__)
>>>> 2. audit target/ppc/fpu_helper.c w.r.t chip manual and fix any unneeded
>>>> splatting of flags (if any)
>>>> 3. measure the before/after performance effect and decide if on balance
>>>> it's worth keeping
>>>> 
>>>>> I think it's important for qemu to as global as possible and don't
>>>>> target only recent hardware.
>>>> 
>>>> Are you referring to guests or hosts? For guests we will always favour
>>>> accuracy of speed of emulation. For hosts we need to have IEEE compliant
>>>> FPU HW to even stand a chance of using hardfloat.
>>>> 
>>>> --
>>>> Alex Bennée
>>>> 
>>> 
> 
> 
> -- 
> Alex Bennée

Ok, I was just looking at Intel's x87 chip documentation. It supports IEEE 754 floating point operations and exception flags. This leads me to this question. Would simply taking the host exception flags and using them to set the PowerPC's FPU's flag be an acceptable solution to this problem? 

These are the flags that all CPU's that support the IEEE 754 standard implement: 
   - division by zero
   - inexact
   - overflow
   - underflow
   - invalid operation

This could be an API that is used to retrieve the host flags' value:
 - get_host_div_zero_flag()
 - get_host_inexact_flag()
 - get_host_overflow_flag()
 - get_host_underflow_flag()
 - get_host_invalid_oper_flag()

We could then use this API to set the PowerPC FPU exception flags. 

Does this sound like a good solution?
Richard Henderson March 2, 2020, 4:28 a.m. UTC | #34
On 3/1/20 4:13 PM, Programmingkid wrote:
> Ok, I was just looking at Intel's x87 chip documentation. It supports IEEE 754 floating point operations and exception flags. This leads me to this question. Would simply taking the host exception flags and using them to set the PowerPC's FPU's flag be an acceptable solution to this problem? 

No.

The primary issue is the FPSCR.FI flag.  This is not an accumulative bit, per
ieee754, but per operation.

The "hardfloat" option works (with other targets) only with ieee745
accumulative exceptions, when the most common of those exceptions, inexact, has
already been raised.  And thus need not be raised a second time.

Per the PowerPC architecture, inexact must be recognized afresh for every
operation.  Which is cheap in hardware but expensive in software.

And once you're done with FI, FR has been and continues to be emulated incorrectly.


r~
BALATON Zoltan March 2, 2020, 11:42 a.m. UTC | #35
On Sun, 1 Mar 2020, Richard Henderson wrote:
> On 3/1/20 4:13 PM, Programmingkid wrote:
>> Ok, I was just looking at Intel's x87 chip documentation. It supports 
>> IEEE 754 floating point operations and exception flags. This leads me 
>> to this question. Would simply taking the host exception flags and 
>> using them to set the PowerPC's FPU's flag be an acceptable solution to 
>> this problem?

In my understanding that's what is currently done, the problem with PPC as 
Richard said is the non-sticky versions of some of these bits which need 
clearing FP exception status before every FPU op which seems to be 
expensive and slower than using softfloat. So to use hardfloat we either 
accept that we can't emulate these bits with hardfloat or we need to do 
something else than clearing flags and checking after every FPU op.

While not emulating these bits don't seem to matter for most clients and 
other PPC emulations got away with it, QEMU prefers accuracy over speed 
even for rarely used features.

> No.
>
> The primary issue is the FPSCR.FI flag.  This is not an accumulative bit, per
> ieee754, but per operation.
>
> The "hardfloat" option works (with other targets) only with ieee745
> accumulative exceptions, when the most common of those exceptions, inexact, has
> already been raised.  And thus need not be raised a second time.

Why exactly is it done that way? What are the differences between IEEE FP 
implementations that prevent using hardfloat most of the time, instead of 
only using it in some (although supposedly common) special cases?

> Per the PowerPC architecture, inexact must be recognized afresh for every
> operation.  Which is cheap in hardware but expensive in software.
>
> And once you're done with FI, FR has been and continues to be emulated incorrectly.

I think CPUs can also raise exceptions when they detect the condition in 
hardware, so maybe we should install our own FPU exception handler and set 
guest flags from that; then we don't need to check and won't have a problem 
with these bits either. Why is that not possible, or why isn't it done? The 
softfloat code has a comment that working with exceptions is not pleasant, 
but why? Isn't setting flags from a handler easier than checking separately 
for each op? If this is because of differences in how flags are handled by 
different targets, we don't have to do that from the host FPU exception 
handler. That handler could just set a global flag on each exception, which 
targets can then check and use to handle their differences. This global flag 
could even include non-sticky versions if needed, because clearing a global 
should be less expensive than clearing the FPU status register. But I don't 
really know, I'm just guessing; someone who knows more about FPUs probably 
knows a better way.
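
A very rough, glibc-specific sketch of that idea, just to make it concrete
(hypothetical and untested; real code would also have to repair the host FPU
state so the trap does not simply re-fire, and worry about signal safety):

    #define _GNU_SOURCE
    #include <fenv.h>
    #include <signal.h>
    #include <string.h>

    static volatile sig_atomic_t host_fp_events;

    static void fpe_handler(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        switch (si->si_code) {
        case FPE_FLTRES: host_fp_events |= 1 << 0; break; /* inexact        */
        case FPE_FLTDIV: host_fp_events |= 1 << 1; break; /* divide by zero */
        case FPE_FLTOVF: host_fp_events |= 1 << 2; break; /* overflow       */
        case FPE_FLTUND: host_fp_events |= 1 << 3; break; /* underflow      */
        case FPE_FLTINV: host_fp_events |= 1 << 4; break; /* invalid        */
        }
        feclearexcept(FE_ALL_EXCEPT);  /* crude: stop the same trap repeating */
    }

    static void install_fpe_handler(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = fpe_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGFPE, &sa, NULL);
        feenableexcept(FE_ALL_EXCEPT); /* glibc extension: unmask FP traps */
    }

The guest-visible FPSCR bits, including the non-sticky ones, would then be
derived from host_fp_events instead of reading the host status word after
every operation.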

Regards,
BALATON Zoltan
Richard Henderson March 2, 2020, 4:55 p.m. UTC | #36
On 3/2/20 3:42 AM, BALATON Zoltan wrote:
>> The "hardfloat" option works (with other targets) only with ieee745
>> accumulative exceptions, when the most common of those exceptions, inexact, has
>> already been raised.  And thus need not be raised a second time.
> 
> Why exactly it's done that way? What are the differences between IEEE FP
> implementations that prevents using hardfloat most of the time instead of only
> using it in some (although supposedly common) special cases?

While it is possible to read the host's ieee exception word after the hardfloat
operation, there are two reasons that is undesirable:

(1) It is *slow*.  So slow that it's faster to run the softfloat code instead.
 I thought it would be easier to find the benchmark numbers that Emilio
generated when this was tested, but I can't find them.

(2) IEEE has a number of implementation choices for corner cases, and we need
to implement the target's choices, not the host's choices.

> I think CPUs can also raise exceptions when they detect the condition in
> hardware so maybe we should install our FPU exception handler and set guest
> flags from that then we don't need to check and won't have problem with these
> bits either. Why is that not possible or isn't done?

If we have to enable and disable host fpu exceptions going in and out of
softfloat routines, we're back to modifying the host fpu control word, which as
described above, is *slow*.

> That handler could only
> set a global flag on each exception that targets can be checked by targets and
> handle differences. This global flag then can include non-sticky versions if
> needed because clearing a global should be less expensive than clearing FPU
> status reg. But I don't really know, just guessing, somone who knows more about
> FPUs probably knows a better way.

I don't know if anyone has tried that variant, where we simply leave the
exceptions enabled, leave the signal handler enabled, and use a global.

Feel free to try it and benchmark it.


r~
Alex Bennée March 2, 2020, 5:10 p.m. UTC | #37
BALATON Zoltan <balaton@eik.bme.hu> writes:

> On Sun, 1 Mar 2020, Richard Henderson wrote:
>> On 3/1/20 4:13 PM, Programmingkid wrote:
>>> Ok, I was just looking at Intel's x87 chip documentation. It
>>> supports IEEE 754 floating point operations and exception flags.
>>> This leads me to this question. Would simply taking the host
>>> exception flags and using them to set the PowerPC's FPU's flag be
>>> an acceptable solution to this problem?
>
> In my understanding that's what is currently done, the problem with
> PPC as Richard said is the non-sticky versions of some of these bits
> which need clearing FP exception status before every FPU op which
> seems to be expensive and slower than using softfloat. So to use
> hardfloat we either accept that we can't emulate these bits with
> hardfloat or we need to do something else than clearing flags and
> checking after every FPU op.
>
> While not emulating these bits don't seem to matter for most clients
> and other PPC emulations got away with it, QEMU prefers accuracy over
> speed even for rarely used features.
>
>> No.
>>
>> The primary issue is the FPSCR.FI flag.  This is not an accumulative bit, per
>> ieee754, but per operation.
>>
>> The "hardfloat" option works (with other targets) only with ieee745
>> accumulative exceptions, when the most common of those exceptions, inexact, has
>> already been raised.  And thus need not be raised a second time.
>
> Why exactly it's done that way? What are the differences between IEEE
> FP implementations that prevents using hardfloat most of the time
> instead of only using it in some (although supposedly common) special
> cases?

There are a couple of wrinkles. As far as NaN and denormal behaviour
goes, we have enough slack in the spec that different guests have
slightly different behaviour. See pickNaN and friends in the soft float
specialisation code. As a result we never try to hand off to hardfloat
for NaNs, Infs and Zeros. Luckily, testing for those cases is a fairly
small part of the cost of the calculation.

Also, things tend to come unstuck on changes to rounding modes.
Fortunately that doesn't seem to be super common.
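
A minimal sketch of that hand-off pattern, loosely modelled on the hardfloat
wrappers in fpu/softfloat.c (simplified here, and soft_f64_add stands in for
whatever the slow-path softfloat entry point is actually called):

    typedef union { float64 s; double h; } f64u;

    static float64 hard_f64_add(float64 a, float64 b, float_status *s)
    {
        f64u ua = { .s = a }, ub = { .s = b }, ur;

        /* Host fast path only when inexact is already sticky, we are in
         * the default rounding mode, and both inputs are ordinary finite
         * normals, i.e. no NaN/Inf/zero/denormal corner cases to emulate. */
        if ((s->float_exception_flags & float_flag_inexact) &&
            s->float_rounding_mode == float_round_nearest_even &&
            float64_is_normal(a) && float64_is_normal(b)) {
            ur.h = ua.h + ub.h;        /* let the host FPU do the work */
            return ur.s;
        }
        return soft_f64_add(a, b, s);  /* softfloat slow path */
    }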

You can read even more detail in the paper that originally prompted
Emilio's work:

  "supporting the neon and VFP instruction sets in an LLVM-based
   binary translator"
   https://www.thinkmind.org/download.php?articleid=icas_2015_5_20_20033

>> Per the PowerPC architecture, inexact must be recognized afresh for every
>> operation.  Which is cheap in hardware but expensive in software.
>>
>> And once you're done with FI, FR has been and continues to be emulated incorrectly.
>
> I think CPUs can also raise exceptions when they detect the condition
> in hardware so maybe we should install our FPU exception handler and
> set guest flags from that then we don't need to check and won't have
> problem with these bits either. Why is that not possible or isn't
> done?

One of my original patches did just this:

  Subject: [PATCH] fpu/softfloat: use hardware sqrt if we can (EXPERIMENT!)
  Date: Tue, 20 Feb 2018 21:01:37 +0000
  Message-Id: <20180220210137.18018-1-alex.bennee@linaro.org>

The two problems you run into are:

 - relying on a trap for inexact will be slow if you keep hitting it
 - reading host FPU flag registers turns out to be pretty expensive

> The softfloat code has a comment that working with exceptions is
> not pleasent but why? Isn't setting flags from a handler easier than
> checking separately for each op? If this is because of differences in
> how flags are handled by different targets we don't have to do that
> from the host FPU exception handler. That handler could only set a
> global flag on each exception that targets can be checked by targets
> and handle differences. This global flag then can include non-sticky
> versions if needed because clearing a global should be less expensive
> than clearing FPU status reg. But I don't really know, just guessing,
> somone who knows more about FPUs probably knows a better way.
>
> Regards,
> BALATON Zoltan
BALATON Zoltan March 2, 2020, 11:01 p.m. UTC | #38
On Mon, 2 Mar 2020, Alex Bennée wrote:
> BALATON Zoltan <balaton@eik.bme.hu> writes:
>> On Sun, 1 Mar 2020, Richard Henderson wrote:
>>> On 3/1/20 4:13 PM, Programmingkid wrote:
>>>> Ok, I was just looking at Intel's x87 chip documentation. It
>>>> supports IEEE 754 floating point operations and exception flags.
>>>> This leads me to this question. Would simply taking the host
>>>> exception flags and using them to set the PowerPC's FPU's flag be
>>>> an acceptable solution to this problem?
>>
>> In my understanding that's what is currently done, the problem with
>> PPC as Richard said is the non-sticky versions of some of these bits
>> which need clearing FP exception status before every FPU op which
>> seems to be expensive and slower than using softfloat. So to use
>> hardfloat we either accept that we can't emulate these bits with
>> hardfloat or we need to do something else than clearing flags and
>> checking after every FPU op.
>>
>> While not emulating these bits don't seem to matter for most clients
>> and other PPC emulations got away with it, QEMU prefers accuracy over
>> speed even for rarely used features.
>>
>>> No.
>>>
>>> The primary issue is the FPSCR.FI flag.  This is not an accumulative bit, per
>>> ieee754, but per operation.
>>>
>>> The "hardfloat" option works (with other targets) only with ieee745
>>> accumulative exceptions, when the most common of those exceptions, inexact, has
>>> already been raised.  And thus need not be raised a second time.
>>
>> Why exactly it's done that way? What are the differences between IEEE
>> FP implementations that prevents using hardfloat most of the time
>> instead of only using it in some (although supposedly common) special
>> cases?
>
> There are a couple of wrinkles. As far as NaN and denormal behaviour
> goes we have enough slack in the spec that different guests have
> slightly different behaviour. See pickNaN and friends in the soft float
> specialisation code. As a result we never try and hand off to hardfloat
> for NaNs, Infs and Zeros. Luckily testing for those cases if a fairly
> small part of the cost of the calculation.
>
> Also things tend to get unstuck on changes to rounding modes.
> Fortunately it doesn't seem to be supper common.

OK, but how do these relate to the inexact flag, and why is that the one 
that's checked for using hardfloat? Also, the rounding mode is checked, but 
why can't we set the same mode on the host, and why only use hardfloat in 
one specific rounding mode? These two checks seem to further limit hardfloat 
use beyond the above cases, or are they the same thing?

> You can read even more detail in the paper that originally prompted
> Emilio's work:
>
>  "supporting the neon and VFP instruction sets in an LLVM-based
>   binary translator"
>   https://www.thinkmind.org/download.php?articleid=icas_2015_5_20_20033

I've only had a quick look at it, but it seems not to discuss all the 
details. They say the ARM instructions they wanted to emulate have some 
non-standard flush-to-zero behaviour where exceptions (including inexact) 
are handled differently. Is this related to the check above, and if yes, 
shouldn't it only apply to the ARM target? Other standard-compliant targets 
probably should not be limited by this.

They've also found that clearing and reading host FP flags is "slower than 
QEMU", which is what we have for PPC currently. They say the solution is not 
to use host exceptions at all but to calculate the exception flags in 
software by looking at the inputs and the result, maybe using additional FP 
ops that test for the exception cases. Unfortunately the paper does not 
describe how exactly that's done, it just says it may be described later. It 
seems like a kind of softfloat, but one that uses the FPU for the actual 
calculation and deduces the exceptions without access to the intermediate 
results that softfloat may rely on. So they can use hardware for the 
calculation, which should be the largest part, and compute the flags in 
software. This way they claim a 1.24 to 3.36 times speedup compared to the 
QEMU of that time (using only softfloat, I guess, which is still what we 
have for PPC today).
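
The paper doesn't spell the method out, but as one example of the kind of
trick it alludes to (a sketch only, assuming round-to-nearest and not
necessarily what the authors actually do): for addition, the inexact flag
can be recovered in software from the rounded result with the classic TwoSum
error term, and overflow/invalid can be deduced by classifying the inputs
and the result:

    #include <math.h>
    #include <stdbool.h>

    /* Knuth's TwoSum: e is the exact rounding error of r = a + b (in
     * round-to-nearest, without intermediate overflow), so the host
     * addition was inexact iff e != 0. */
    static bool add_was_inexact(double a, double b, double r)
    {
        double bb = r - a;
        double e  = (a - (r - bb)) + (b - bb);
        return e != 0.0;
    }

    static bool add_overflowed(double a, double b, double r)
    {
        /* Result became infinite although both inputs were finite. */
        return isinf(r) && !isinf(a) && !isinf(b);
    }

    static bool add_was_invalid(double a, double b, double r)
    {
        /* A NaN appeared from non-NaN inputs, e.g. (+Inf) + (-Inf). */
        return isnan(r) && !isnan(a) && !isnan(b);
    }

The host FPU still performs the addition itself; only the flag reconstruction
runs in software, which is the split the paper describes.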

>>> Per the PowerPC architecture, inexact must be recognized afresh for every
>>> operation.  Which is cheap in hardware but expensive in software.
>>>
>>> And once you're done with FI, FR has been and continues to be emulated incorrectly.
>>
>> I think CPUs can also raise exceptions when they detect the condition
>> in hardware so maybe we should install our FPU exception handler and
>> set guest flags from that then we don't need to check and won't have
>> problem with these bits either. Why is that not possible or isn't
>> done?
>
> One of my original patches did just this:
>
>  Subject: [PATCH] fpu/softfloat: use hardware sqrt if we can (EXPERIMENT!)
>  Date: Tue, 20 Feb 2018 21:01:37 +0000
>  Message-Id: <20180220210137.18018-1-alex.bennee@linaro.org>

It's this patch:
http://patchwork.ozlabs.org/patch/875764/

This at least shows where to hook in FP exception handling, but based on 
the above paper maybe that's not the best solution after all. It may be 
worth a try anyway, in case it's simpler than what they did.

> The two problems you run into are:
>
> - relying on a trap for inexact will be slow if you keep hitting it

Which is slower? Clearing exception flags before every op and reading them 
again, or trapping for exceptions? I'd expect that even if exceptions are 
common they should be less frequent than every op (otherwise they would not 
be "exceptional").

> - reading host FPU flag registers turns out to be pretty expensive

That's what using exceptions should avoid. If we only need to read and 
clear flags when an exception happens, that should be less frequent than 
doing it for every FP op, hopefully even with the additional overhead of 
calling the handler, provided all the handler has to do is set a 
corresponding flag in a global.
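
One way to get a concrete answer would be a crude standalone micro-benchmark
along these lines (hypothetical test code, not part of QEMU; the trap-based
variant would additionally need a SIGFPE handler like the one sketched
earlier in the thread):

    #include <fenv.h>
    #include <stdio.h>
    #include <time.h>

    static double plain(long n)
    {
        double acc = 0.0;
        for (long i = 1; i <= n; i++) {
            acc += 1.0 / (double)i;
        }
        return acc;
    }

    static double per_op_flags(long n)
    {
        double acc = 0.0;
        for (long i = 1; i <= n; i++) {
            feclearexcept(FE_ALL_EXCEPT);    /* like helper_reset_fpstatus */
            acc += 1.0 / (double)i;
            (void)fetestexcept(FE_INEXACT);  /* like do_float_check_status */
        }
        return acc;
    }

    static void timed(double (*fn)(long), long n, const char *name)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        volatile double keep = fn(n);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        (void)keep;
        printf("%-24s %.3fs\n", name,
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    }

    int main(void)
    {
        timed(plain, 100000000L, "plain");
        timed(per_op_flags, 100000000L, "per-op clear/read flags");
        return 0;
    }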

Regards,
BALATON Zoltan
BALATON Zoltan March 2, 2020, 11:16 p.m. UTC | #39
On Mon, 2 Mar 2020, Richard Henderson wrote:
> On 3/2/20 3:42 AM, BALATON Zoltan wrote:
>>> The "hardfloat" option works (with other targets) only with ieee745
>>> accumulative exceptions, when the most common of those exceptions, inexact, has
>>> already been raised.  And thus need not be raised a second time.
>>
>> Why exactly it's done that way? What are the differences between IEEE FP
>> implementations that prevents using hardfloat most of the time instead of only
>> using it in some (although supposedly common) special cases?
>
> While it is possible to read the host's ieee exception word after the hardfloat
> operation, there are two reasons that is undesirable:
>
> (1) It is *slow*.  So slow that it's faster to run the softfloat code instead.
> I thought it would be easier to find the benchmark numbers that Emilio
> generated when this was tested, but I can't find it.

I remember those benchmarks too, and this is also what the paper Alex 
referred to confirmed. Also, I've found that enabling hardfloat for PPC 
without doing anything else is slightly slower (on a recent CPU; on older 
CPUs it could be even slower). Interestingly, however, it does give a 
speedup for vector instructions (maybe because they don't clear flags 
between each sub-operation). Does that mean these vector instruction 
helpers are also buggy regarding exceptions?

> (2) IEEE has a number of implementation choices for corner cases, and we need
> to implement the target's choices, not the host's choices.

But how is that related to the inexact flag and the float_round_nearest_even 
rounding mode, which are the only two things the can_use_fpu() function 
checks for?

>> I think CPUs can also raise exceptions when they detect the condition in
>> hardware so maybe we should install our FPU exception handler and set guest
>> flags from that then we don't need to check and won't have problem with these
>> bits either. Why is that not possible or isn't done?
>
> If we have to enable and disable host fpu exceptions going in and out of
> softfloat routines, we're back to modifying the host fpu control word, which as
> described above, is *slow*.
>
>> That handler could only
>> set a global flag on each exception that targets can be checked by targets and
>> handle differences. This global flag then can include non-sticky versions if
>> needed because clearing a global should be less expensive than clearing FPU
>> status reg. But I don't really know, just guessing, somone who knows more about
>> FPUs probably knows a better way.
>
> I don't know if anyone has tried that variant, where we simply leave the
> exceptions enabled, leave the signal handler enabled, and use a global.
>
> Feel free to try it and benchmark it.

I probably won't try it any time soon. I have several other half-finished 
things to hack on and don't want to take up another one I likely can't 
finish, but I hope this discussion inspires someone to try it. I'm also 
interested in the results. If nobody tries it in the next two years, maybe 
I'll get there eventually.

Regards,
BALATON Zoltan
Richard Henderson March 3, 2020, 12:11 a.m. UTC | #40
On 3/2/20 3:16 PM, BALATON Zoltan wrote:
>> (2) IEEE has a number of implementation choices for corner cases, and we need
>> to implement the target's choices, not the host's choices.
> 
> But how is that related to inexact flag and float_round_nearest_even rounding
> mode which are the only two things can_use_fpu() function checks for?

float_round_nearest_even is the default rounding mode, and is what the host fpu
has been set for.  In order to use hardfloat for a different rounding mode, we
would need to change the host fpu control register, which is slow.

Inexact is the only flag that we cannot compute easily from the inputs and
output.  We check for NaN and Inf as inputs and outputs,
which is the clue that we may have to raise Invalid or Overflow.
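
For reference, the precondition check in fpu/softfloat.c is roughly the
following (a close paraphrase, so treat it as a sketch rather than the exact
source):

    static inline bool can_use_fpu(const float_status *s)
    {
        if (QEMU_NO_HARDFLOAT) {
            return false;
        }
        return likely(s->float_exception_flags & float_flag_inexact &&
                      s->float_rounding_mode == float_round_nearest_even);
    }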


r~
Programmingkid March 4, 2020, 6:43 p.m. UTC | #41
---------- Forwarded message ---------
From: G 3 <programmingkidx@gmail.com>
Date: Wed, Mar 4, 2020 at 1:35 PM
Subject: Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC
To: BALATON Zoltan <balaton@eik.bme.hu>




On Mon, Mar 2, 2020 at 6:16 PM BALATON Zoltan <balaton@eik.bme.hu> wrote:

> On Mon, 2 Mar 2020, Richard Henderson wrote:
> > On 3/2/20 3:42 AM, BALATON Zoltan wrote:
> >>> The "hardfloat" option works (with other targets) only with ieee745
> >>> accumulative exceptions, when the most common of those exceptions,
> inexact, has
> >>> already been raised.  And thus need not be raised a second time.
> >>
> >> Why exactly it's done that way? What are the differences between IEEE FP
> >> implementations that prevents using hardfloat most of the time instead
> of only
> >> using it in some (although supposedly common) special cases?
> >
> > While it is possible to read the host's ieee exception word after the
> hardfloat
> > operation, there are two reasons that is undesirable:
> >
> > (1) It is *slow*.  So slow that it's faster to run the softfloat code
> instead.
> > I thought it would be easier to find the benchmark numbers that Emilio
> > generated when this was tested, but I can't find it.
>
> I remember those benchmarks too and this is also what the paper Alex
> referred to also confirmed. Also I've found that enabling hardfloat for
> PPC without doing anything else is slightly slower (on a recent CPU, on
> older CPUs could be even slower). Interetingly however it does give a
> speedup for vector instructions (maybe because they don't clear flags
> between each sub operation). Does that mean these vector instruction
> helpers are also buggy regarding exceptions?
>

I am intrigued by these vector instructions. Apple was really big on
using them back in the day, so programs like QuickTime and iTunes definitely
use them. I'm not sure if the PowerPC's AltiVec vector instructions map to
host vector instructions already, but if they don't, mapping them would
give us a huge speedup in certain places. Would anyone know if this was
already done in QEMU?
Richard Henderson March 5, 2020, 7:25 p.m. UTC | #42
On 3/4/20 10:43 AM, G 3 wrote:
> I am all intrigued by these vector instructions. Apple was really big on using
> them back in the day so programs like Quicktime and iTunes definitely use them.
> I'm not sure if the PowerPC's altivec vector instructions map to host vector
> instructions already, but if they don't, mapping them would give us a huge
> speedup in certain places. Would anyone know if this was already done in QEMU?

They are, provided that your x86 host supports AVX.  Which should be everything
manufactured after about 2011.


r~
Yonggang Luo April 10, 2020, 1:50 p.m. UTC | #43
Is this stable now? I'd like to see hardfloat land. :)

On Wed, Feb 19, 2020 at 1:19 AM BALATON Zoltan <balaton@eik.bme.hu> wrote:

> While other targets take advantage of using host FPU to do floating
> point computations, this was disabled for PPC target because always
> clearing exception flags before every FP op made it slightly slower
> than emulating everyting with softfloat. To emulate some FPSCR bits,
> clearing of fp_status may be necessary (unless these could be handled
> e.g. using FP exceptions on host but there's no API for that in QEMU
> yet) but preserving at least the inexact flag makes hardfloat usable
> and faster than softfloat. Since most clients don't actually care
> about this flag, we can gain some speed trading some emulation
> accuracy.
>
> This patch implements a simple way to keep the inexact flag set for
> hardfloat while still allowing to revert to softfloat for workloads
> that need more accurate albeit slower emulation. (Set hardfloat
> property of CPU, i.e. -cpu name,hardfloat=false for that.) There may
> still be room for further improvement but this seems to increase
> floating point performance. Unfortunately the softfloat case is slower
> than before this patch so this patch only makes sense if the default
> is also set to enable hardfloat.
>
> Because of the above this patch at the moment is mainly for testing
> different workloads to evaluate how viable would this be in practice.
> Thus, RFC and not ready for merge yet.
>
> Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
> ---
> v2: use different approach to avoid needing if () in
> helper_reset_fpstatus() but this does not seem to change overhead
> much, also make it a single patch as adding the hardfloat option is
> only a few lines; with this we can use same value at other places where
> float_status is reset and maybe enable hardfloat for a few more places
> for a little more performance but not too much. With this I got:
>
> lame: 3:13, lame_vmx: 1:55 (this is probably within jitter though and
> still far from the results on real hardware) also tried mplayer test
> and got results between 144-146s (this test is more VMX bound).
>
> I've also done some profiling for hardfloat=true and false cases with
> this patch to see what are the hot functions. Results are:
>
> Overhead  Command          Symbol
> -cpu G4,hardfloat=false, lame:
>    9.82%  qemu-system-ppc  [.] round_canonical
>    8.35%  qemu-system-ppc  [.] soft_f64_muladd
>    7.16%  qemu-system-ppc  [.] soft_f64_addsub
>    5.27%  qemu-system-ppc  [.] float32_to_float64
>    5.20%  qemu-system-ppc  [.] helper_compute_fprf_float64
>    4.61%  qemu-system-ppc  [.] helper_frsp
>    4.59%  qemu-system-ppc  [.] soft_f64_mul
>    4.01%  qemu-system-ppc  [.] float_to_float.isra.26
>    3.84%  qemu-system-ppc  [.] float64_classify
>    2.97%  qemu-system-ppc  [.] do_float_check_status
>
> -cpu G4,hardfloat=false, lame_vmx:
> Overhead  Command          Symbol
>   10.04%  qemu-system-ppc  [.] float32_muladd
>    9.49%  qemu-system-ppc  [.] helper_vperm
>    6.10%  qemu-system-ppc  [.] round_canonical
>    4.13%  qemu-system-ppc  [.] soft_f64_addsub
>    3.23%  qemu-system-ppc  [.] helper_frsp
>    3.13%  qemu-system-ppc  [.] soft_f64_muladd
>    2.88%  qemu-system-ppc  [.] helper_vmaddfp
>    2.69%  qemu-system-ppc  [.] float32_add
>    2.60%  qemu-system-ppc  [.] float32_to_float64
>    2.52%  qemu-system-ppc  [.] helper_compute_fprf_float64
>
> -cpu G4,hardfloat=true, lame:
>   11.59%  qemu-system-ppc  [.] round_canonical
>    6.18%  qemu-system-ppc  [.] helper_compute_fprf_float64
>    6.01%  qemu-system-ppc  [.] float32_to_float64
>    4.58%  qemu-system-ppc  [.] float64_classify
>    3.87%  qemu-system-ppc  [.] helper_frsp
>    3.75%  qemu-system-ppc  [.] float_to_float.isra.26
>    3.48%  qemu-system-ppc  [.] helper_todouble
>    3.31%  qemu-system-ppc  [.] float64_muladd
>    3.21%  qemu-system-ppc  [.] do_float_check_status
>    3.01%  qemu-system-ppc  [.] float64_mul
>
> -cpu G4,hardfloat=true, lame_vmx:
>    9.34%  qemu-system-ppc  [.] float32_muladd
>    8.83%  qemu-system-ppc  [.] helper_vperm
>    5.41%  qemu-system-ppc  [.] round_canonical
>    4.51%  qemu-system-ppc  [.] page_collection_lock
>    3.58%  qemu-system-ppc  [.] page_trylock_add.isra.17
>    2.71%  qemu-system-ppc  [.] helper_vmaddfp
>    2.53%  qemu-system-ppc  [.] float32_add
>    2.30%  qemu-system-ppc  [.] helper_compute_fprf_float64
>    2.21%  qemu-system-ppc  [.] float32_to_float64
>    2.06%  qemu-system-ppc  [.] helper_frsp
>
> round_canonical seems to come up frequently in this with large overhead.
>
> Could those with better test cases or benchmarks give it a test please
> on different CPUs to see what else this would break?
>
> ---
> fpu/softfloat.c                 | 14 +++++++-------
>  target/ppc/cpu.h                |  2 ++
>  target/ppc/fpu_helper.c         | 32 ++++++++++++++++----------------
>  target/ppc/translate_init.inc.c |  3 +++
>  4 files changed, 28 insertions(+), 23 deletions(-)
>
> [patch snipped]
>
BALATON Zoltan April 10, 2020, 6:04 p.m. UTC | #44
On Fri, 10 Apr 2020, 罗勇刚(Yonggang Luo) wrote:
> Are this stable now? I'd like to see hard float to be landed:)

If you want to see hardfloat for PPC then you should read the replies to 
this patch, which can be found here:

http://patchwork.ozlabs.org/patch/1240235/

to understand what's needed, and then try to implement the solution with FP 
exceptions cached in a global, which could maybe work. I won't be able to do 
that, as I said here:

https://lists.nongnu.org/archive/html/qemu-ppc/2020-03/msg00006.html

because I don't have time to learn all the details needed. I think others 
are in the same situation, so unless somebody puts in the necessary effort 
this won't change.

Regards,
BALATON Zoltan
diff mbox series

Patch

diff --git a/fpu/softfloat.c b/fpu/softfloat.c
index 301ce3b537..6d3f4af72a 100644
--- a/fpu/softfloat.c
+++ b/fpu/softfloat.c
@@ -216,15 +216,15 @@  GEN_INPUT_FLUSH3(float64_input_flush3, float64)
 #endif
 
 /*
- * Some targets clear the FP flags before most FP operations. This prevents
- * the use of hardfloat, since hardfloat relies on the inexact flag being
- * already set.
+ * Disable hardfloat for known problem cases.
+ * Additionally, some targets clear the FP flags before most FP operations.
+ * This prevents the use of hardfloat, since it relies on the inexact flag
+ * being already set and clearing it often may result in slower computations.
+ * Those targets could also be listed here.
  */
-#if defined(TARGET_PPC) || defined(__FAST_MATH__)
-# if defined(__FAST_MATH__)
-#  warning disabling hardfloat due to -ffast-math: hardfloat requires an exact \
+#if defined(__FAST_MATH__)
+# warning disabling hardfloat due to -ffast-math: hardfloat requires an exact \
     IEEE implementation
-# endif
 # define QEMU_NO_HARDFLOAT 1
 # define QEMU_SOFTFLOAT_ATTR QEMU_FLATTEN
 #else
diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index b283042515..5f412f9fba 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -1033,6 +1033,7 @@  struct CPUPPCState {
     float_status vec_status;
     float_status fp_status; /* Floating point execution context */
     target_ulong fpscr;     /* Floating point status and control register */
+    int default_fp_excpt_flags;
 
     /* Internal devices resources */
     ppc_tb_t *tb_env;      /* Time base and decrementer */
@@ -1163,6 +1164,7 @@  struct PowerPCCPU {
     void *machine_data;
     int32_t node_id; /* NUMA node this CPU belongs to */
     PPCHash64Options *hash64_opts;
+    bool hardfloat;  /* use hardfloat (this breaks FPSCR[FI] bit emulation) */
 
     /* Those resources are used only during code translation */
     /* opcode handlers */
diff --git a/target/ppc/fpu_helper.c b/target/ppc/fpu_helper.c
index ae43b08eb5..bbbd1cb987 100644
--- a/target/ppc/fpu_helper.c
+++ b/target/ppc/fpu_helper.c
@@ -659,7 +659,7 @@  void helper_float_check_status(CPUPPCState *env)
 
 void helper_reset_fpstatus(CPUPPCState *env)
 {
-    set_float_exception_flags(0, &env->fp_status);
+    set_float_exception_flags(env->default_fp_excpt_flags, &env->fp_status);
 }
 
 static void float_invalid_op_addsub(CPUPPCState *env, bool set_fpcc,
@@ -1823,7 +1823,7 @@  void helper_##name(CPUPPCState *env, ppc_vsr_t *xt,                          \
                                                                              \
     for (i = 0; i < nels; i++) {                                             \
         float_status tstat = env->fp_status;                                 \
-        set_float_exception_flags(0, &tstat);                                \
+        set_float_exception_flags(env->default_fp_excpt_flags, &tstat);      \
         t.fld = tp##_##op(xa->fld, xb->fld, &tstat);                         \
         env->fp_status.float_exception_flags |= tstat.float_exception_flags; \
                                                                              \
@@ -1867,7 +1867,7 @@  void helper_xsaddqp(CPUPPCState *env, uint32_t opcode,
         tstat.float_rounding_mode = float_round_to_odd;
     }
 
-    set_float_exception_flags(0, &tstat);
+    set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
     t.f128 = float128_add(xa->f128, xb->f128, &tstat);
     env->fp_status.float_exception_flags |= tstat.float_exception_flags;
 
@@ -1902,7 +1902,7 @@  void helper_##op(CPUPPCState *env, ppc_vsr_t *xt,                            \
                                                                              \
     for (i = 0; i < nels; i++) {                                             \
         float_status tstat = env->fp_status;                                 \
-        set_float_exception_flags(0, &tstat);                                \
+        set_float_exception_flags(env->default_fp_excpt_flags, &tstat);      \
         t.fld = tp##_mul(xa->fld, xb->fld, &tstat);                          \
         env->fp_status.float_exception_flags |= tstat.float_exception_flags; \
                                                                              \
@@ -1942,7 +1942,7 @@  void helper_xsmulqp(CPUPPCState *env, uint32_t opcode,
         tstat.float_rounding_mode = float_round_to_odd;
     }
 
-    set_float_exception_flags(0, &tstat);
+    set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
     t.f128 = float128_mul(xa->f128, xb->f128, &tstat);
     env->fp_status.float_exception_flags |= tstat.float_exception_flags;
 
@@ -1976,7 +1976,7 @@  void helper_##op(CPUPPCState *env, ppc_vsr_t *xt,                             \
                                                                               \
     for (i = 0; i < nels; i++) {                                              \
         float_status tstat = env->fp_status;                                  \
-        set_float_exception_flags(0, &tstat);                                 \
+        set_float_exception_flags(env->default_fp_excpt_flags, &tstat);       \
         t.fld = tp##_div(xa->fld, xb->fld, &tstat);                           \
         env->fp_status.float_exception_flags |= tstat.float_exception_flags;  \
                                                                               \
@@ -2019,7 +2019,7 @@  void helper_xsdivqp(CPUPPCState *env, uint32_t opcode,
         tstat.float_rounding_mode = float_round_to_odd;
     }
 
-    set_float_exception_flags(0, &tstat);
+    set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
     t.f128 = float128_div(xa->f128, xb->f128, &tstat);
     env->fp_status.float_exception_flags |= tstat.float_exception_flags;
 
@@ -2095,7 +2095,7 @@  void helper_##op(CPUPPCState *env, ppc_vsr_t *xt, ppc_vsr_t *xb)             \
                                                                              \
     for (i = 0; i < nels; i++) {                                             \
         float_status tstat = env->fp_status;                                 \
-        set_float_exception_flags(0, &tstat);                                \
+        set_float_exception_flags(env->default_fp_excpt_flags, &tstat);      \
         t.fld = tp##_sqrt(xb->fld, &tstat);                                  \
         env->fp_status.float_exception_flags |= tstat.float_exception_flags; \
                                                                              \
@@ -2143,7 +2143,7 @@  void helper_##op(CPUPPCState *env, ppc_vsr_t *xt, ppc_vsr_t *xb)             \
                                                                              \
     for (i = 0; i < nels; i++) {                                             \
         float_status tstat = env->fp_status;                                 \
-        set_float_exception_flags(0, &tstat);                                \
+        set_float_exception_flags(env->default_fp_excpt_flags, &tstat);      \
         t.fld = tp##_sqrt(xb->fld, &tstat);                                  \
         t.fld = tp##_div(tp##_one, t.fld, &tstat);                           \
         env->fp_status.float_exception_flags |= tstat.float_exception_flags; \
@@ -2305,7 +2305,7 @@  void helper_##op(CPUPPCState *env, ppc_vsr_t *xt,                             \
                                                                               \
     for (i = 0; i < nels; i++) {                                              \
         float_status tstat = env->fp_status;                                  \
-        set_float_exception_flags(0, &tstat);                                 \
+        set_float_exception_flags(env->default_fp_excpt_flags, &tstat);       \
         if (r2sp && (tstat.float_rounding_mode == float_round_nearest_even)) {\
             /*                                                                \
              * Avoid double rounding errors by rounding the intermediate      \
@@ -2886,7 +2886,7 @@  uint64_t helper_xscvdpspn(CPUPPCState *env, uint64_t xb)
     uint64_t result, sign, exp, frac;
 
     float_status tstat = env->fp_status;
-    set_float_exception_flags(0, &tstat);
+    set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
 
     sign = extract64(xb, 63,  1);
     exp  = extract64(xb, 52, 11);
@@ -2924,7 +2924,7 @@  uint64_t helper_xscvdpspn(CPUPPCState *env, uint64_t xb)
 uint64_t helper_xscvspdpn(CPUPPCState *env, uint64_t xb)
 {
     float_status tstat = env->fp_status;
-    set_float_exception_flags(0, &tstat);
+    set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
 
     return float32_to_float64(xb >> 32, &tstat);
 }
@@ -3327,7 +3327,7 @@  void helper_xsrqpi(CPUPPCState *env, uint32_t opcode,
     }
 
     tstat = env->fp_status;
-    set_float_exception_flags(0, &tstat);
+    set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
     set_float_rounding_mode(rmode, &tstat);
     t.f128 = float128_round_to_int(xb->f128, &tstat);
     env->fp_status.float_exception_flags |= tstat.float_exception_flags;
@@ -3384,7 +3384,7 @@  void helper_xsrqpxp(CPUPPCState *env, uint32_t opcode,
     }
 
     tstat = env->fp_status;
-    set_float_exception_flags(0, &tstat);
+    set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
     set_float_rounding_mode(rmode, &tstat);
     round_res = float128_to_floatx80(xb->f128, &tstat);
     t.f128 = floatx80_to_float128(round_res, &tstat);
@@ -3415,7 +3415,7 @@  void helper_xssqrtqp(CPUPPCState *env, uint32_t opcode,
         tstat.float_rounding_mode = float_round_to_odd;
     }
 
-    set_float_exception_flags(0, &tstat);
+    set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
     t.f128 = float128_sqrt(xb->f128, &tstat);
     env->fp_status.float_exception_flags |= tstat.float_exception_flags;
 
@@ -3449,7 +3449,7 @@  void helper_xssubqp(CPUPPCState *env, uint32_t opcode,
         tstat.float_rounding_mode = float_round_to_odd;
     }
 
-    set_float_exception_flags(0, &tstat);
+    set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
     t.f128 = float128_sub(xa->f128, xb->f128, &tstat);
     env->fp_status.float_exception_flags |= tstat.float_exception_flags;
 
diff --git a/target/ppc/translate_init.inc.c b/target/ppc/translate_init.inc.c
index 53995f62ea..ab1a6db4f1 100644
--- a/target/ppc/translate_init.inc.c
+++ b/target/ppc/translate_init.inc.c
@@ -10736,6 +10736,8 @@  static void ppc_cpu_reset(CPUState *s)
     /* tininess for underflow is detected before rounding */
     set_float_detect_tininess(float_tininess_before_rounding,
                               &env->fp_status);
+    /* hardfloat needs inexact flag already set */
+    env->default_fp_excpt_flags = (cpu->hardfloat ? float_flag_inexact : 0);
 
     for (i = 0; i < ARRAY_SIZE(env->spr_cb); i++) {
         ppc_spr_t *spr = &env->spr_cb[i];
@@ -10868,6 +10870,7 @@  static Property ppc_cpu_properties[] = {
                      false),
     DEFINE_PROP_BOOL("pre-3.0-migration", PowerPCCPU, pre_3_0_migration,
                      false),
+    DEFINE_PROP_BOOL("hardfloat", PowerPCCPU, hardfloat, true),
     DEFINE_PROP_END_OF_LIST(),
 };