Message ID: 20200218171702.979F074637D@zero.eik.bme.hu
State: New
Series: [RFC,v2] target/ppc: Enable hardfloat for PPC
On Tue, 18 Feb 2020, BALATON Zoltan wrote:
> While other targets take advantage of using the host FPU to do floating
> point computations, this was disabled for the PPC target because always
> clearing exception flags before every FP op made it slightly slower
> than emulating everything with softfloat. To emulate some FPSCR bits,
> clearing of fp_status may be necessary (unless these could be handled
> e.g. using FP exceptions on the host, but there's no API for that in
> QEMU yet), but preserving at least the inexact flag makes hardfloat
> usable and faster than softfloat. Since most clients don't actually
> care about this flag, we can gain some speed by trading away some
> emulation accuracy.
>
> This patch implements a simple way to keep the inexact flag set for
> hardfloat while still allowing reverting to softfloat for workloads
> that need more accurate albeit slower emulation. (Set the hardfloat
> property of the CPU, i.e. -cpu name,hardfloat=false, for that.) There
> may still be room for further improvement, but this seems to increase
> floating point performance. Unfortunately the softfloat case is slower
> than before this patch, so this patch only makes sense if the default
> is also set to enable hardfloat.
>
> Because of the above, this patch at the moment is mainly for testing
> different workloads to evaluate how viable this would be in practice.
> Thus, RFC and not ready for merge yet.
>
> Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
> ---

Forgot to say, in case it's not clear: this should currently be applied on top of David's ppc-for-5.0 branch, because some clean-up patches from me are queued there, so until those reach master this patch may not apply cleanly. Just in case someone wants to try it and gets errors.

Regards,
BALATON Zoltan
> On Feb 18, 2020, at 12:10 PM, BALATON Zoltan <balaton@eik.bme.hu> wrote:
>
> [full patch description snipped]
>
> Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
> ---
> v2: use a different approach to avoid needing an if () in
> helper_reset_fpstatus(), but this does not seem to change the overhead
> much; also make it a single patch, as adding the hardfloat option is
> only a few lines; with this we can use the same value at other places
> where float_status is reset, and maybe enable hardfloat in a few more
> places for a little more performance, but not too much. With this I got:

<snip>

Thank you for working on this. It is about time we have a better FPU.

I applied your patch over David Gibson's ppc-for-5.0 branch. It applied cleanly and compiled easily. Tests were done on a Mac OS 10.4.3 VM. The CPU was set to G3. I did several tests and here are my results:

With hard float:
- The USB audio device does not produce any sound.
- Converting a MIDI file to AAC in iTunes happens at 0.4x (faster than soft float :) ).
- For my FPSCR test program, 21 tests failed. The high number is because the inexact exception is being set for situations it should not be set for.

With soft float:
- Some sound can be heard from the USB audio device. It isn't good sounding. I had to force quit Quicktime Player because it stopped working.
- Converting a MIDI file to AAC in iTunes happens at 0.3x (slower than hard float).
- 13 tests failed with my FPSCR test program.

This patch is a good start. I'm not worried about the Floating Point Status and Control Register flags being wrong, since hardly any software bothers to check them. I think more optimizations can happen by simplifying the FPU. As it is now, it makes a lot of calls per operation.
Hello,

On Tue, 18 Feb 2020, Programmingkid wrote:
>> On Feb 18, 2020, at 12:10 PM, BALATON Zoltan <balaton@eik.bme.hu> wrote:
>> [full patch description and v2 changelog snipped]
>> With this I got:
>
> <snip>
>
> Thank you for working on this. It is about time we have a better FPU.

Thank you for testing it. I think it would be great if we could come up with some viable approach to improve this before the next freeze.

> I applied your patch over David Gibson's ppc-for-5.0 branch. It applied cleanly and compiled easily.

I've heard some preliminary results from others that there's also a performance difference between v1 and v2 of the patch, where v1 may be faster in some cases, so if you (or someone else) have time, you could experiment with different versions and combinations as well to find the one that's best on all CPUs. Basically we have these parts:

1. Change target/ppc/fpu_helper.c::helper_reset_fpstatus() to force float_flag_inexact on when hardfloat is enabled. I've tried two approaches for this:
   a. In v1, an if () added in the function.
   b. In v2, a variable from env set earlier (I hoped this might be faster, but maybe it's not; testing and explanation are welcome).
2. Also change the places where env->fp_status is copied to a local tstat and then reset (I think this is done to accumulate flags from multiple FP ops that would individually reset env->fp_status, or for some other reason; maybe it could be avoided if we reset fp_status less often, but that would need more understanding of the FP emulation than I have, so I did not try to clean it up yet).

If v2 is really slower than v1, then I'm not sure whether that's because of also changing the places with tstat or because of the different approach in helper_reset_fpstatus(), so you could try combinations of these as well.

> Tests were done on a Mac OS 10.4.3 VM. The CPU was set to G3.

What was the host CPU and OS this was tested on? Please always share CPU info and host OS when sharing benchmark results so they are somewhat comparable. It also depends on CPU features, for vector instructions at least, so without CPU info the results cannot be fully understood.
I think the G3 does not have AltiVec/VMX, so maybe testing with a G4 would be better to also cover those ops, unless there's a reason to only test G3. I've tested with a G4, both FPU-only and FPU+VMX code, on a Linux host with an i7-9700K CPU @ 3.60GHz, as was noted in the original cover letter, but maybe I've also forgotten some details, so I list it here again.

> I did several tests and here are my results:
>
> With hard float:
> - The USB audio device does not produce any sound.

I've heard this could also be due to some other problem not directly related to the FPU; maybe there's a problem with USB/OHCI emulation as well, because problems with that were reported, but it's interesting that you get different results from changing FPU-related stuff. I think OSX uses float samples, so it probably does use the FPU for processing sound, and may rely on some peculiarity of the hardware, as it was probably optimised for Apple hardware. It would be interesting to find out how the FPU stuff is related to this, but since it's broken anyway, it's probably not a show stopper at the moment.

> - Converting a MIDI file to AAC in iTunes happens at 0.4x (faster than soft float :) ).

Does the resulting file match? As a simple test I've verified the md5sum of the resulting mp3 with the lame benchmark I tried, just to find any big errors. Even if it does not prove that nothing broke, it should detect if something breaks badly. However, that was WAV->MP3, where the results were the same, although the VMX build did produce a different result than the FPU-only one — but it did so consistently over multiple runs. With MIDI there might be slight timing differences that could cause different audio results, so you should first verify whether doing the conversion multiple times produces the same result at all, without any patch.

> For my FPSCR test program, 21 tests failed. The high number is because
> the inexact exception is being set for situations it should not be set
> for.

Since we force the inexact flag to be set to enable hardfloat, this is expected.
More interesting is whether, apart from this, there are any differences in the results compared to the softfloat case (which may also be host CPU dependent, I think). Do you have more detailed info on the errors and differences found?

Some of the problems with inexact may be fixed by not always forcing the flag on, but just not clearing it. As I understood it, other targets do that: emulation starts with softfloat, but the first time the inexact flag is set it starts using hardfloat, as long as the guest does not clear this flag. Probably this is done to automatically detect code that needs the flag and assume it's not used when it's not touched. Since PPC also has an inexact flag just for the previous FP op (the FI bit), apart from the usual cumulative flag, the client could read that instead of clearing the cumulative flag, so we can't detect guest usage this way; therefore we might as well break inexact completely to always use hardfloat, and manually enable accurate emulation for guests that we know need it. I'm not sure, however, whether forcing the inexact flag would lead to unwanted FP exceptions as well, so this may also need to be made conditional on the enabled/disabled status of inexact FP exceptions. Does anyone have more info on this?

> With soft float:
> - Some sound can be heard from the USB audio device. It isn't good
> sounding. I had to force quit Quicktime Player because it stopped
> working.
> - Converting a MIDI file to AAC in iTunes happens at 0.3x (slower than hard float).
> - 13 tests failed with my FPSCR test program.
>
> This patch is a good start. I'm not worried about the Floating Point
> Status and Control Register flags being wrong since hardly any software
> bothers to check them. I think more optimizations can happen by

I don't know which guest code checks the FPSCR and what flags it cares about. I also don't know whether it's a fact that these are unused, but maybe if we test with more guest code we can find out. That's why I'd like to at least have an option to test with hardfloat.
Unfortunately, enabling hardfloat without also making it the default would make it slower, so if we go this way, we should make sure we can also enable hardfloat as the default.

> simplifying the FPU. As it is now it makes a lot of calls per operation.

The question is whether those calls are really needed to emulate the PPC FPU, and if not, why are they there? If the FPU is really so different that all these calls are needed, then there's not much to simplify (though some optimisations may be possible). This would need someone to understand the current code in full first, which probably none of us currently does (at least I don't, for sure, so I can't really make changes either). Another, more viable approach is to pick a small part, follow it through, and try to clean up and optimise that small part only. The exception and FPSCR handling is one such part; another could be round_canonical(), which seems to be high in profiles I've taken. Maybe this could be done by reading and understanding the docs for only the small part picked, which may be easier than understanding everything first. I wonder if such smaller tasks could be defined and given out as GSoC or other volunteer projects?

Regards,
BALATON Zoltan
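For reference, the v1/v2 alternatives described in point 1 above, plus the preserve-instead-of-force variant also discussed, could be sketched as follows. The types and field names are simplified stand-ins, not QEMU's actual float_status/CPUPPCState:

```c
#include <stdint.h>

enum { FLOAT_FLAG_INEXACT = 1 };

typedef struct {
    uint32_t exception_flags;   /* stands in for env->fp_status flags */
    uint32_t default_fp_flags;  /* precomputed once, v2 style */
    int hardfloat;              /* the proposed CPU property */
} EnvSketch;

/* v1: test the property on every reset */
static void reset_fpstatus_v1(EnvSketch *env)
{
    env->exception_flags = env->hardfloat ? FLOAT_FLAG_INEXACT : 0;
}

/* v2: branch-free, reuses a value set up when the CPU is created */
static void reset_fpstatus_v2(EnvSketch *env)
{
    env->exception_flags = env->default_fp_flags;
}

/* The variant discussed above: don't force inexact on, just avoid
 * clearing it once a previous op has set it (lazy accumulation). */
static void reset_fpstatus_preserving(EnvSketch *env)
{
    env->exception_flags &= FLOAT_FLAG_INEXACT;
}
```

The preserving variant only reports inexact after some operation actually raised it, which is the behaviour Richard asks for further down the thread.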
Hi,

I tested with the current ppc-for-5.0 branch and with v1 of the hardfloat patches applied on top of that. There is a noticeable speed improvement on Linux and OSX hosts. The Windows 10 host doesn't seem to be impressed at all. I saw no obvious glitches so far. The fpu performance on OSX hosts seems very slow. This was not always the case in the past, when it was on par with Linux performance. Below are my results.

Best,
Howard

Host Linux (Fedora 31):
Mac OS tests: 9.2 with MacBench 5.0
Baseline (100%): G3 300MHz
5.0 branch + hardfloat patches: cpu 193%, fpu 86%
5.0 branch: cpu 188%, fpu 57%
Mac OSX tests: 10.5 with Skidmarks 4.0 test
Baseline (100%): G4 1.0GHz
5.0 branch + hardfloat patches: Int:131 FP:11 Vec:15
5.0 branch: Int:131 FP:9 Vec:11

Host OSX Sierra:
Mac OS tests: 9.2 with MacBench 5.0
Baseline (100%): G3 300MHz
5.0 branch + hardfloat patches: cpu 199%, fpu 66%
5.0 branch: cpu 199%, fpu 40%
Mac OSX tests: 10.5 with Skidmarks 4.0 test
Baseline (100%): G4 1.0GHz
5.0 branch + hardfloat patches: Int:129 FP:11 Vec:14
5.0 branch: Int:129 FP:8 Vec:9

Host Windows 10:
Mac OS tests: 9.2 with MacBench 5.0
Baseline (100%): G3 300MHz
5.0 branch + hardfloat patches: cpu 180%, fpu 54%
5.0 branch: cpu 199%, fpu 40%
Mac OSX tests: 10.5 with Skidmarks 4.0 test
Baseline (100%): G4 1.0GHz
5.0 branch + hardfloat patches: Int:130 FP:9 Vec:10
5.0 branch: Int:130 FP:10 Vec:11

All tests were done on the same host with v1 of the hardfloat patches:
Intel i7-4770K at 3.50GHz, 32GB memory.
All guests set to 1024x768 and "thousands" of colors.
Linux and OSX (with brew) use default compilers.
Windows build cross-compiled from Fedora with x86_64-win64-mingw32.

Skidmarks is a combination of tests: ParseVid (INT), MPEG (INT), PixBlend (INT), Ellipticrypt (INT), Rijndael (INT), Quicksort (INT), MDCT (FP), IntToFloat (FP), Q3 (FP), FFT (FP), VolInt (FP), Quant (VMX), Galaxy (VMX), IDCT (VMX), BigMult (VMX)
On Wed, 19 Feb 2020, Howard Spoelstra wrote:
> I tested with the current ppc-for-5.0 branch and with v1 of the hardfloat
> patches applied on top of that. There is a noticeable speed improvement on
> Linux and OSX hosts. Windows 10 host doesn't seem to be impressed at all. I
> saw no obvious glitches so far. The fpu performance on OSX hosts seems very
> slow. This was not always the case in the past, when it was on par with
> Linux performance.

Interesting, thanks for the measurements.

> Host Linux (Fedora 31):
> Mac OS tests: 9.2 with MacBench 5.0
> Baseline (100%): G3 300MHz
> 5.0 branch + hardfloat patches: cpu 193%, fpu 86%
> 5.0 branch: cpu 188%, fpu 57%

Here there's a difference in the cpu value before and after the patch which I can't explain (I only changed FPU stuff, so it should not affect the other results), but it's also not seen in the other measurements, so this could be some external influence, such as something else using the CPU while running the test? Unless this happens consistently I'd put it down to measurement error.

> Host OSX Sierra:
> Mac OS tests: 9.2 with MacBench 5.0
> Baseline (100%): G3 300MHz
> 5.0 branch + hardfloat patches: cpu 199%, fpu 66%
> 5.0 branch: cpu 199%, fpu 40%
> Mac OSX tests: 10.5 with Skidmarks 4.0 test
> Baseline (100%): G4 1.0GHz
> 5.0 branch + hardfloat patches: Int:129 FP:11 Vec:14

These values seem to match the Linux measurement above, so they don't seem slower, although MacOS 9 seems to be slower (66 vs. 86), so either this depends on the ops used or something else.
> 5.0 branch: Int:129 FP:8 Vec:9
>
> Host Windows 10:
> Mac OS tests: 9.2 with MacBench 5.0
> Baseline (100%): G3 300MHz
> 5.0 branch + hardfloat patches: cpu 180%, fpu 54%
> 5.0 branch: cpu 199%, fpu 40%

Here there's again a difference in the cpu value, but the other way, so if the cause is external CPU usage, then maybe this is again an outlying measurement? You could retake these two to verify whether you get the same numbers again. The fpu value does seem to improve, just not as much as the others, and it's also lower to start with. I wonder why.

> Mac OSX tests: 10.5 with Skidmarks 4.0 test
> Baseline (100%): G4 1.0GHz
> 5.0 branch + hardfloat patches: Int:130 FP:9 Vec:10
> 5.0 branch: Int:130 FP:10 Vec:11
>
> All tests done on the same host with v1 of the hardfloat patches
> Intel i7-4770K at 3.50GHz. 32GB memory
> All guests set to 1024x768 and "thousands" of colors.

Does this mean the host machine was rebooted into these OSes, or were they run in a VM? If using a VM, were all three running in a VM, or was one on the host (I'd guess OSX host with Linux and Windows VMs)?

> Linux and OSX (with brew) use default compilers.
> Windows build cross-compiled from Fedora with x86_64-win64-mingw32

I assume the Linux and OSX builds were 64-bit; is the Windows exe 32-bit or 64-bit?

Regards,
BALATON Zoltan
On Wed, Feb 19, 2020 at 8:28 PM BALATON Zoltan <balaton@eik.bme.hu> wrote:
> On Wed, 19 Feb 2020, Howard Spoelstra wrote:
>> [...]
>
> Here there's a difference in the cpu value before and after the patch which
> I can't explain (only changed FPU stuff so it should not change others) but
> it's also not seen in other measurements so this could be some external
> influence such as something else using the CPU while running the test?
> Unless this happens consistently I'd put it down to measurement error.

Yes, I would put that cpu value down to some fluctuation in the test.

> These values seem to match the Linux measurement above so they don't seem
> slower although MacOS 9 seems to be slower (66 vs. 86) so either this
> depends on the ops used or something else.

Yes, the baseline speed for the fpu in Mac OS 9.2 is relatively low.

>> Host Windows 10:
>> Mac OS tests: 9.2 with MacBench 5.0
>> Baseline (100%): G3 300MHz
>> 5.0 branch + hardfloat patches: cpu 180%, fpu 54%

New run, 5.0 branch + hardfloat patches: cpu 184%, fpu 54%

>> 5.0 branch: cpu 199%, fpu 40%

New run, 5.0 branch: cpu 184%, fpu 56%

It seems I misreported (copy/paste without changing the values) the earlier Windows-based results with the Mac OS 9.2 guest. As said above (and this now seems to confirm it), Windows is not impressed at all, and is perhaps even a bit slower. Windows builds are particularly sensitive to any other activity on the system: moving the qemu window drops performance considerably. Perhaps due to SDL not running in its own thread?

> Does this mean the host machine was rebooted into these OSes or were they
> run in a VM? [...]
>
> I assume the Linux and OSX builds were 64-bit; is the Windows exe 32-bit
> or 64-bit?

No virtualisation. I run all on the same bare metal, so I booted into these three separately from separate SSDs. You might guess OSX Sierra is running on less "official" hardware, and you would be right. All qemu builds were 64-bit.

Best,
Howard
On 2/18/20 9:10 AM, BALATON Zoltan wrote:
> void helper_reset_fpstatus(CPUPPCState *env)
> {
> -    set_float_exception_flags(0, &env->fp_status);
> +    set_float_exception_flags(env->default_fp_excpt_flags, &env->fp_status);
> }

What I don't like is the forced setting of inexact. I don't mind leaving it set if it is already set, which corresponds to the normal accumulation of exceptions.

In addition, if the inexact exception is unmasked, I would expect a signal to be delivered only when an inexact exception happens. Whereas this patch would deliver a signal for every fp operation.

It should be just as easy to do

    flags = get_float_exception_flags(status);
    flags &= env->save_fp_exception_flags;
    set_float_exception_flags(flags, status);

> +    DEFINE_PROP_BOOL("hardfloat", PowerPCCPU, hardfloat, true),

I would also prefer a different name here -- perhaps x-no-fp-fi.

r~
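Richard's masking suggestion could look like this as a standalone sketch (the save_fp_exception_flags field and the flag values are hypothetical stand-ins): a reset becomes "clear everything except the flags we chose to keep accumulating", so a flag is only ever observed set if some operation actually raised it:

```c
#include <stdint.h>

enum {
    FLAG_INEXACT   = 1 << 0,
    FLAG_UNDERFLOW = 1 << 1,
    FLAG_OVERFLOW  = 1 << 2,
};

typedef struct {
    uint32_t flags;                    /* stands in for float_status flags */
    uint32_t save_fp_exception_flags;  /* mask of flags to preserve across resets */
} StatusSketch;

/* Equivalent to the get/&=/set sequence in Richard's snippet:
 * keep only the accumulated flags the mask says to preserve. */
static void reset_fpstatus_masked(StatusSketch *s)
{
    s->flags &= s->save_fp_exception_flags;
}
```

Unlike forcing inexact on, this never reports a flag that no operation raised, so an unmasked inexact exception would only fire when an inexact result actually occurred.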
On Thu, 20 Feb 2020, Richard Henderson wrote:
> On 2/18/20 9:10 AM, BALATON Zoltan wrote:
>> void helper_reset_fpstatus(CPUPPCState *env)
>> {
>> -    set_float_exception_flags(0, &env->fp_status);
>> +    set_float_exception_flags(env->default_fp_excpt_flags, &env->fp_status);
>> }
>
> What I don't like is the forced setting of inexact. I don't mind leaving it
> set if it is already set, which corresponds to the normal accumulation of
> exceptions.

In a previous reply I explained this (apart from just trying to get it to work the simplest way for testing first):

On Wed, 19 Feb 2020, BALATON Zoltan wrote:
> Some of the problems with inexact may be fixed by not always forcing the
> flag on but just not clearing it. As I understood it, other targets do
> that, so emulation starts with softfloat, but the first time the inexact
> flag is set it will start using hardfloat as long as the guest does not
> clear this flag. Probably this is done to automatically detect code that
> needs the flag and assume it's not used when it's not touched. Since PPC
> also has an inexact flag just for the previous FP op (the FI bit), apart
> from the usual cumulative flag, the client could read that instead of
> clearing the cumulative flag, so we can't detect guest usage this way;
> therefore we might as well break inexact completely to always use
> hardfloat and manually enable it for guests that we know need it. I'm
> not sure, however, whether forcing the inexact flag would lead to
> unwanted FP exceptions as well, so this may also need to be made
> conditional on the enabled/disabled status of inexact FP exceptions.
> Does anyone have more info on this?

So I know this should be improved, but I'm not sure how. Because of the above I would not worry about setting the bit from the start, but exceptions are likely wrong this way, as you say.

> In addition, if the inexact exception is unmasked, I would expect a signal to
> be delivered only when an inexact exception happens. Whereas this patch would
> deliver a signal for every fp operation.
>
> It should be just as easy to do
>
>     flags = get_float_exception_flags(status);
>     flags &= env->save_fp_exception_flags;
>     set_float_exception_flags(flags, status);

I'm not sure I get this. Can you please explain a bit more? Where would save_fp_exception_flags be set, and where would the above masking with it be done? Currently we reset the flags before every FP op, as far as I understand, and the detection of exceptions relies on this and tests fp_status bits. Probably we should not reset the flags, and should instead keep one fp_status accumulating flags, only resetting it if softfloat is forced, but that would need rewriting the exception handling, which I'm not sure yet how to do. Any ideas? Maybe if we could just unbreak inexact exceptions when enabled with the current patch, this rewrite of the exception handling could be done later.

>> +    DEFINE_PROP_BOOL("hardfloat", PowerPCCPU, hardfloat, true),
>
> I would also prefer a different name here -- perhaps x-no-fp-fi.

What's wrong with hardfloat? That's how the code refers to this, so anyone searching for what it does would turn up some meaningful results. Your proposed property name is very cryptic, and nobody would be able to guess what it is for. If you insist on not using hardfloat, maybe we could invert it to softfloat, or try x-use-fpu, maybe. I specifically don't want to tie it to FI, as there's at least one other non-sticky bit in the FPSCR (I don't remember which one now) that might also need to be broken for the best hardfloat results.

Regards,
BALATON Zoltan
On Fri, 21 Feb 2020 at 16:05, BALATON Zoltan <balaton@eik.bme.hu> wrote:
> On Thu, 20 Feb 2020, Richard Henderson wrote:
>> On 2/18/20 9:10 AM, BALATON Zoltan wrote:
>>> +    DEFINE_PROP_BOOL("hardfloat", PowerPCCPU, hardfloat, true),
>>
>> I would also prefer a different name here -- perhaps x-no-fp-fi.
>
> What's wrong with hardfloat? That's how the code refers to this so if
> anyone searches what it does would turn up some meaningful results.

This prompted me to check what you're using the property for. The cover letter says:

> This patch implements a simple way to keep the inexact flag set for
> hardfloat while still allowing to revert to softfloat for workloads
> that need more accurate albeit slower emulation. (Set hardfloat
> property of CPU, i.e. -cpu name,hardfloat=false for that.)

I think that is the wrong approach. Enabling use of the host FPU should not affect the accuracy of the emulation, which should remain bitwise-correct. We should only be using the host FPU to the extent that we can do that without discarding accuracy. As far as I'm aware that's how the hardfloat support for other guest CPUs that use it works.

thanks
-- PMM
On Fri, Feb 21, 2020 at 5:11 PM Peter Maydell <peter.maydell@linaro.org> wrote:
> I think that is the wrong approach. Enabling use of the host
> FPU should not affect the accuracy of the emulation, which
> should remain bitwise-correct. We should only be using the
> host FPU to the extent that we can do that without discarding
> accuracy. As far as I'm aware that's how the hardfloat support
> for other guest CPUs that use it works.

Right, that is my understanding as well. There shouldn't be a "hardfloat" switch at all. If there is, it is either a mistake, or for experimental or other similar purposes.

In my understanding, the hardfloat feature should work entirely transparently, making the decision whether to use host math functions (instead of the softfloat library) by itself, based on the particular execution circumstances. In other words, the accuracy of FP emulation should not be compromised in absolutely any case (there should not be an option "ok, we are happy with approximate calculations"). On top of that, artificially generating the "inexact" flag really seems problematic (or, speaking bluntly, looks like a hack). Perhaps the hardfloat feature needs a little more work, but not in this way.

Yours,
Aleksandar
On Fri, 21 Feb 2020, Peter Maydell wrote:
> I think that is the wrong approach. Enabling use of the host
> FPU should not affect the accuracy of the emulation, which
> should remain bitwise-correct. We should only be using the
> host FPU to the extent that we can do that without discarding
> accuracy. As far as I'm aware that's how the hardfloat support
> for other guest CPUs that use it works.

I don't know of a better approach. Please see section 4.2.2, Floating-Point Status and Control Register, on page 124 of this document:

https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0

especially the definition of the FR and FI bits, and tell me how we can emulate these accurately while using the host FPU. Not using the FPU even when these bits are not needed (which seems to be the case for all workloads we've tested so far) seriously limits the emulation speed, so spending time emulating an obscure and unused part of the architecture when it's not actually needed, just to keep emulation accurate but unusably slow, does not seem to be the right approach.
In an ideal world, of course, this should be both fast and accurate, but we don't seem to have anyone who could achieve that in the past two years, so maybe we could give up some accuracy now to get usable speed and worry about emulating obscure features when we come across some workload that actually needs them (but we have the option to revert to accurate but slow emulation for that until a better way can be devised that's both fast and accurate). Insisting on accuracy without any solution to the current state just hinders making any progress with this. Other PowerPC emulators also seem either not to bother or to have similar optimisations. I've quickly checked three that I know about: https://github.com/mamedev/mame/blob/master/src/devices/cpu/powerpc/ppcdrc.cpp#L1893 https://github.com/mamedev/mame/blob/master/src/devices/cpu/powerpc/ppcdrc.cpp#L3503 there's also something here but no mention of the FI bit that I could notice: https://github.com/mamedev/mame/blob/master/src/devices/cpu/powerpc/ppccom.cpp#L2023 https://github.com/xenia-project/xenia/blob/master/src/xenia/cpu/ppc/ppc_hir_builder.cc#L428 https://github.com/dolphin-emu/dolphin/blob/master/Source/Core/Core/PowerPC/Jit64/Jit_FloatingPoint.cpp But I'm not sure I understand all of the above, so I hope this makes more sense to someone who can advise. Regards, BALATON Zoltan
On Fri, 21 Feb 2020 at 18:04, BALATON Zoltan <balaton@eik.bme.hu> wrote: > On Fri, 21 Feb 2020, Peter Maydell wrote: > > I think that is the wrong approach. Enabling use of the host > > FPU should not affect the accuracy of the emulation, which > > should remain bitwise-correct. We should only be using the > > host FPU to the extent that we can do that without discarding > > accuracy. As far as I'm aware that's how the hardfloat support > > for other guest CPUs that use it works. > > I don't know of a better approach. Please see section 4.2.2 Floating-Point > Status and Control Register on page 124 in this document: > > https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0 > > especially the definition of the FR and FI bits and tell me how can we > emulate these accurately and use host FPU. I don't know much about PPC, but if you can't emulate the guest architecture accurately with the host FPU, then don't use the host FPU. We used to have a kind of 'hardfloat' support that was fast but inaccurate, but it was a mess because it meant that most guest code sort of worked but some guest code would confusingly misbehave. Deliberately not correctly emulating the guest CPU/FPU behaviour is not something I want us to return to. You're right that sometimes you can't get both speed and accuracy; other emulators (and especially ones which are trying to emulate games consoles) may choose to prefer speed over accuracy. For QEMU we prefer to choose accuracy over speed in this area. thanks -- PMM
On Fri, 21 Feb 2020, Peter Maydell wrote: > On Fri, 21 Feb 2020 at 18:04, BALATON Zoltan <balaton@eik.bme.hu> wrote: >> On Fri, 21 Feb 2020, Peter Maydell wrote: >>> I think that is the wrong approach. Enabling use of the host >>> FPU should not affect the accuracy of the emulation, which >>> should remain bitwise-correct. We should only be using the >>> host FPU to the extent that we can do that without discarding >>> accuracy. As far as I'm aware that's how the hardfloat support >>> for other guest CPUs that use it works. >> >> I don't know of a better approach. Please see section 4.2.2 Floating-Point >> Status and Control Register on page 124 in this document: >> >> https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0 >> >> especially the definition of the FR and FI bits and tell me how can we >> emulate these accurately and use host FPU. > > I don't know much about PPC, but if you can't emulate the > guest architecture accurately with the host FPU, then > don't use the host FPU. We used to have a kind of 'hardfloat' I don't know whether it's possible to emulate these accurately and use the FPU, but nobody has done it for QEMU so far. If someone knows a way, please speak up and we can try to implement it. Unfortunately this would require more detailed knowledge of different FPU implementations (at least x86_64, ARM and PPC, which are the most widely used platforms) than I have or am willing to spend time to learn. > support that was fast but inaccurate, but it was a mess > because it meant that most guest code sort of worked but > some guest code would confusingly misbehave. Deliberately > not correctly emulating the guest CPU/FPU behaviour is not > something I want us to return to. > > You're right that sometimes you can't get both speed > and accuracy; other emulators (and especially ones > which are trying to emulate games consoles) may choose > to prefer speed over accuracy. For QEMU we prefer to > choose accuracy over speed in this area.
OK, then how about keeping the default accurate but allowing users to opt in to using the host FPU, even if it's known to break some bits, for workloads where they need speed over accuracy and are happy to live with the limitation? Note that I've found that just removing the define that disables hardfloat for the PPC target makes VMX vector instructions faster while the normal FPU is a little slower, without any other changes, so disabling hardfloat already limits performance for guests using VMX even when the FPU is not being used in cases where it would cause inaccuracy. If you say we want accuracy and don't care about speed, then just don't disable hardfloat, as it helps at least VMX; then we can add an option to let the user say hardfloat may be used even if it's inaccurate, and they can test their workload and decide for themselves. Regards, BALATON Zoltan
> On Feb 19, 2020, at 10:35 AM, BALATON Zoltan <balaton@eik.bme.hu> wrote: > > Hello, > > On Tue, 18 Feb 2020, Programmingkid wrote: >>> On Feb 18, 2020, at 12:10 PM, BALATON Zoltan <balaton@eik.bme.hu> wrote: >>> While other targets take advantage of using host FPU to do floating >>> point computations, this was disabled for PPC target because always >>> clearing exception flags before every FP op made it slightly slower >>> than emulating everyting with softfloat. To emulate some FPSCR bits, >>> clearing of fp_status may be necessary (unless these could be handled >>> e.g. using FP exceptions on host but there's no API for that in QEMU >>> yet) but preserving at least the inexact flag makes hardfloat usable >>> and faster than softfloat. Since most clients don't actually care >>> about this flag, we can gain some speed trading some emulation >>> accuracy. >>> >>> This patch implements a simple way to keep the inexact flag set for >>> hardfloat while still allowing to revert to softfloat for workloads >>> that need more accurate albeit slower emulation. (Set hardfloat >>> property of CPU, i.e. -cpu name,hardfloat=false for that.) There may >>> still be room for further improvement but this seems to increase >>> floating point performance. Unfortunately the softfloat case is slower >>> than before this patch so this patch only makes sense if the default >>> is also set to enable hardfloat. >>> >>> Because of the above this patch at the moment is mainly for testing >>> different workloads to evaluate how viable would this be in practice. >>> Thus, RFC and not ready for merge yet. 
>>> >>> Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu> >>> --- >>> v2: use different approach to avoid needing if () in >>> helper_reset_fpstatus() but this does not seem to change overhead >>> much, also make it a single patch as adding the hardfloat option is >>> only a few lines; with this we can use same value at other places where >>> float_status is reset and maybe enable hardfloat for a few more places >>> for a little more performance but not too much. With this I got: >> >> <snip> >> >> Thank you for working on this. It is about time we have a better FPU. > > Thank you for testing it. I think it would be great if we could come up with some viable approach to improve this before the next freeze. > >> I applied your patch over David Gibson's ppc-for-5.0 branch. It applied cleanly and compiled easily. > > I've heard some preliminary results from others that there's also a difference in performance between v1 and v2 of the patch, where v1 may be faster for some cases, so if you (or someone else) want and have time you could experiment with different versions and combinations as well to find the one that's best on all CPUs. Basically we have these parts: > > 1. Change target/ppc/fpu_helper.c::helper_reset_fpstatus() to force float_flag_inexact on in case hardfloat is enabled. I've tried two approaches for this: > > a. In v1, added an if () in the function > b. In v2, used a variable from env set earlier (I hoped this might be faster but maybe it's not; testing and explanation are welcome) > > 2. Also change places where env->fp_status is copied to a local tstat and then reset (I think this is done to accumulate flags from multiple FP ops that would individually reset env->fp_status, or for some other reason; maybe this could be avoided if we reset fp_status less often, but that would need more understanding of the FP emulation than I have, so I did not try to clean that up yet).
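[Editorial note: the two helper variants described in point 1 above can be sketched roughly as follows. All names here (CPUSketch, fp_default_flags, the flag value) are illustrative stand-ins, not QEMU's actual definitions; the real code lives in target/ppc/fpu_helper.c.]

```c
#include <assert.h>

#define FLOAT_FLAG_INEXACT 0x20  /* stand-in for float_flag_inexact */

typedef struct {
    int hardfloat;         /* the proposed -cpu name,hardfloat=... property */
    int fp_default_flags;  /* v2: reset value precomputed when the CPU is set up */
    int fp_status_flags;   /* stand-in for env->fp_status exception flags */
} CPUSketch;

/* v1: branch on the property inside the hot reset helper */
static void reset_fpstatus_v1(CPUSketch *env)
{
    env->fp_status_flags = env->hardfloat ? FLOAT_FLAG_INEXACT : 0;
}

/* v2: branch-free helper; the reset value was chosen once, when the
 * hardfloat property was applied */
static void init_cpu_v2(CPUSketch *env, int hardfloat)
{
    env->hardfloat = hardfloat;
    env->fp_default_flags = hardfloat ? FLOAT_FLAG_INEXACT : 0;
}

static void reset_fpstatus_v2(CPUSketch *env)
{
    env->fp_status_flags = env->fp_default_flags;
}
```

Both variants produce the same reset state; the question raised in the thread is only which form has less overhead in the hot path.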
> > If v2 is really slower than v1 then I'm not sure whether it is because of also changing the places with tstat or because of the different approach in helper_reset_fpstatus(), so you could try combinations of these as well. > >> Tests were done on a Mac OS 10.4.3 VM. The CPU was set to G3. > > What was the host CPU and OS this was tested on? Please always share CPU info and host OS when sharing benchmark results so they are somewhat comparable. It also depends on CPU features for vector instructions at least, so without CPU info the results could not be understood. Intel Core i5-2500S CPU @ 2.70GHz. > I think G3 does not have AltiVec/VMX so maybe testing with G4 would be better to also test those ops, unless there's a reason to only test G3. I've tested with G4 both FPU-only and FPU+VMX code on a Linux host with an i7-9700K CPU @ 3.60GHz as was noted in the original cover letter, but maybe I've also forgotten some details so I list it here again. Ok, I did test on the G4, here are my results: Git commit: c1e667d2598b9b3ce62b8e89ed22dd38dfe9f57f Mac OS 10.4.3 VM -cpu G4 -USB audio device Hardfloat=false Audio sounds bad when playing midi file. Extraction rate: 1.5x Converting rate: 0.7x Total time: 7:24 Hardfloat=true Midi audio sounded perfect for about 30 seconds, then it went silent! Extraction rate: 1.4x (slower with hard float) Converting rate: 0.7x (same as without hardfloat) Total time: 7:16 (faster time with hardfloat) > >> I did several tests and here are my results: >> >> With hard float: >> - The USB audio device does not produce any sound. > > I've heard this could also be due to some other problem not directly related to the FPU; maybe there's a problem with USB/OHCI emulation as well because problems with that were reported, but it's interesting why you get different results changing FPU related stuff. I think OSX uses float samples so it probably does use the FPU for processing sound and may rely on some peculiarity of the hardware, as it was probably optimised for Apple hardware.
It would be interesting to find out how the FPU stuff is related to this, but since it's broken anyway it's probably not a show stopper at the moment. When I played sound this second time I heard the same broken audio I usually hear with the USB audio device with hardfloat set to false. When playing the same midi file with hardfloat set to true, the audio played perfectly! It only played for 30 seconds before it went silent. > >> - Converting a MIDI file to AAC in iTunes happens at 0.4x (faster than soft float :) ). > Does the resulting file match? As a simple test I've verified the md5sum of the resulting mp3 with the lame benchmark I've tried, just to find any big errors. Even if it does not prove that nothing broke, it should detect if something breaks badly. However that was WAV->MP3 where the results were the same, although the VMX build did produce a different result than the FPU-only build, but did so consistently for multiple runs. With MIDI there might be slight timing differences that could cause different audio results, so you should first verify that doing the conversion multiple times produces the same result at all without any patch. The md5 number for each file does match. Note this time I converted a midi file to MP3. > >> For my FPSCR test program, 21 tests failed. The high number is because the inexact exception is being set for situations it should not be set for. > > Since we force the inexact flag to be set to enable hardfloat this is expected. More interesting is whether, apart from this, there are any differences in the results compared to the softfloat case (that may also be host CPU dependent I think). Do you have more detailed info on the errors and differences found? I can give you the full testing suite if you like. I run it on Mac OS 10.4 but it should compile with gcc on Linux. I will send it to you in a separate email because it is big. <snip> >> With soft float: - Some sound can be heard from the USB audio device. It isn't good sounding.
I had to force quit Quicktime player because it stopped working. >> - Converting a MIDI file to AAC in iTunes happens at 0.3x (slower than hard float). >> - 13 tests failed with my FPSCR test program. >> >> This patch is a good start. I'm not worried about the Floating Point Status and Control Register flags being wrong since hardly any software bothers to check them. I think more optimizations can happen by > > I don't know if guest code checks the FPSCR and what flags it cares about. I also don't know if it's a fact that these are not used, but maybe if we test with more guest code we can find out. That's why I'd like to at least have an option to test with hardfloat. Unfortunately enabling hardfloat without also making it the default would make it slower, so if we go this way we should make sure we can also enable hardfloat as the default. > >> simplifying the FPU. As it is now it makes a lot of calls per operation. > > The question is whether those calls are really needed to emulate the PPC FPU, and if not, why would they be there? If the FPU is really that much different so that all these calls are needed, then there's not much to simplify (but some optimisations may still be possible). This would need someone to understand the current code in full first, which probably none of us do currently (at least I don't for sure, so I can't really make changes either). Another more viable approach is to pick a small part, follow through with that, and try to clean up and optimise that small part only. The exception and FPSCR handling is one such part; another could be round_canonical(), which seems to be high on profiles I've taken. Maybe this could be done by reading and understanding docs only on the small part picked, which may be easier than understanding everything first. I wonder if such smaller tasks could be defined and given out as GSoC or other volunteer projects? I have another idea on how to improve QEMU's performance. What if you enabled more CPUs for the PowerPC target?
Mac OS 9, Mac OS X, and Linux support multiple CPUs. It might actually be easier to do this than to improve the FPU. I imagine the performance increase with multiple emulated CPUs would be much more noticeable.
On Mon, 24 Feb 2020, Programmingkid wrote: > Intel Core i5-2500S CPU @ 2.70GHz. [...] > Ok, I did test on the G4, here are my results: > > Git commit: c1e667d2598b9b3ce62b8e89ed22dd38dfe9f57f > Mac OS 10.4.3 VM > -cpu G4 > -USB audio device > > Hardfloat=false > Audio sounds bad when playing midi file. > Extraction rate: 1.5x > Converting rate: 0.7x > Total time: 7:24 > > Hardfloat=true > Midi audio sounded perfect for about 30 seconds, then it went silent! > Extraction rate: 1.4x (slower with hard float) > Converting rate: 0.7x (same as without hardfloat) > Total time: 7:16 (faster time with hardfloat) How is it that the extraction rate is slower but the total time is less than without hardfloat? There must be other factors here than just FP ops. Maybe a better test is to not play the audio but save it to a file, so other issues with USB are not influencing the test. > When I played sound this second time I heard the same broken audio I > usually hear with the USB audio device with hardfloat set to false. When > playing the same midi file with hardfloat set to true, the audio played > perfectly! It only played for 30 seconds before it went silent. So probably there are at least two problems: FPU emulation is not fast enough to decode audio to fill the buffer, and then there's also something with usb-audio that jams it after a while? I don't think all of this is FPU related. > I can give you the full testing suite if you like. I run it on Mac OS > 10.4 but it should compile with gcc on Linux. I will send it to you in a > separate email because it is big. Thanks, I'll have a look and see if I can make sense of it, but I'm not sure when I will find time. > I have another idea on how to improve QEMU's performance. What if you > enabled more CPUs for the PowerPC target? Mac OS 9, Mac OS X, and Linux > support multiple CPUs. It might actually be easier to do this than to Have you tried whether it works? I think MTTCG is enabled for PPC64 but I'm not sure about 32-bit PPC.
The mac99 machine seems to init multiple CPUs but I'm not sure if they'll use MTTCG. But you could test it to see if it makes any difference. > improve the FPU. I imagine the performance increase with multiple > emulated CPUs would be much more noticeable. The Amiga-like OSes I'm interested in don't use multiple cores, so I'm mainly interested in improving single-core performance. Also, if (part of) your problem is the slow FPU preventing fast enough audio decoding, I'm not sure having multiple CPUs with a slow FPU would help, as the decoding may use a single thread anyway. Regards, BALATON Zoltan
> On Feb 25, 2020, at 7:09 AM, BALATON Zoltan <balaton@eik.bme.hu> wrote: > > On Mon, 24 Feb 2020, Programmingkid wrote: >> Intel Core i5-2500S CPU @ 2.70GHz. > [...] >> Ok, I did test on the G4, here are my results: >> >> Git commit: c1e667d2598b9b3ce62b8e89ed22dd38dfe9f57f >> Mac OS 10.4.3 VM >> -cpu G4 >> -USB audio device >> >> Hardfloat=false >> Audio sounds bad when playing midi file. >> Extraction rate: 1.5x >> Converting rate: 0.7x >> Total time: 7:24 >> >> Hardfloat=true >> Midi audio sounded perfect for about 30 seconds, then it went silent! >> Extraction rate: 1.4x (slower with hard float) >> Converting rate: 0.7x (same as without hardfloat) >> Total time: 7:16 (faster time with hardfloat) > > How is that extraction rate is slower but total time is less than without hardfloat? There must be other factors here than just FP ops. Maybe a better test is to not play the audio just save it to a file so other issues with USB is not influencing the test. It does seem odd to me also. >> When I played sound this second time I hard the same broken audio I usually hear with the USB audio device with hardfloat set to false. When playing the same midi file with hardfloat set to true, the audio played perfectly! It only played for 30 seconds before it went silent. > > So probably there are at least two problems: FPU emulation is not fast enough to decode audio to fill buffer then there's also something with usb-audio that jams it after a while? I don't think all of this is FPU related. I think a timeout takes place and that is why audio stops playing. It is probably a USB OHCI issue. The other USB controller seems to work better. > >> I can give you the full testing suite if you like. I run it on Mac OS 10.4 but it should compile with gcc on Linux. I will send it to you in a separate email because it is big. > > Thanks, I'll have a look and see if I can make sense of it but not sure when will I find time. Please let me know if you have any questions about it.
> >> I have another idea on how to improve QEMU's performance. What if you enabled more CPUs for the PowerPC target? Mac OS 9, Mac OS X, and Linux support multiple CPUs. It might actually be easier to do this than to > > Have you tried if it works? I think MTTCG is enabled for PPC64 but not sure about 32 bit PPC. The mac99 machine seems to init multiple CPUs but not sure if they'll use MTTCG. But you could test it to see if it makes any difference. I had completely forgotten about MTTCG. I think Howard once did some performance testing with it and came back with favorable results. Maybe this is another avenue we should look at. > >> improve the FPU. I imagine the performance increase with multiple emulated CPUs would be much more noticeable. > > The Amiga like OSes I'm interested in don't use multiple cores so I'm mainly interested in improving single core performance. Also I'm not sure if (part of) your problem is slow FPU preventing fast enough audio decoding then having multiple CPUs with slow FPU would help as this may use a single thread anyway. Good point. MTTCG might be the option that really helps with speed improvements.
On Wed, 26 Feb 2020, Programmingkid wrote: > I think a timeout takes place and that is why audio stops playing. It is > probably an USB OHCI issue. The other USB controller seems to work > better. Which other USB controller? Maybe you could try enabling some usb_ohci* traces and see if they reveal anything. >> The Amiga like OSes I'm interested in don't use multiple cores so I'm >> mainly interested in improving single core performance. Also I'm not >> sure if (part of) your problem is slow FPU preventing fast enough audio >> decoding then having multiple CPUs with slow FPU would help as this may >> use a single thread anyway. > > Good point. MTTCG might be the option that really helps with speed improvements. Only if you have a multithreaded workload in the guest, because AFAIK MTTCG only runs different vCPUs in parallel; it won't make a single emulated CPU faster in any way. OSX can probably benefit from having multiple cores emulated, but I don't think MacOS would use it apart from some apps maybe. Regards, BALATON Zoltan
BALATON Zoltan <balaton@eik.bme.hu> writes: > On Fri, 21 Feb 2020, Peter Maydell wrote: >> On Fri, 21 Feb 2020 at 18:04, BALATON Zoltan <balaton@eik.bme.hu> wrote: >>> On Fri, 21 Feb 2020, Peter Maydell wrote: >>>> I think that is the wrong approach. Enabling use of the host >>>> FPU should not affect the accuracy of the emulation, which >>>> should remain bitwise-correct. We should only be using the >>>> host FPU to the extent that we can do that without discarding >>>> accuracy. As far as I'm aware that's how the hardfloat support >>>> for other guest CPUs that use it works. Correct - we only use hardfloat when we know it will give the same result which is broadly when the inexact flag is already set and we are dealing with normal floating point numbers. We added a whole bunch of testing to ensure we maintain accuracy when the code went in. <snip> >> >> I don't know much about PPC, but if you can't emulate the >> guest architecture accurately with the host FPU, then >> don't use the host FPU. We used to have a kind of 'hardfloat' > > I don't know if it's possible or not to emulate these accurately and > use the FPU but nobody did it for QEMU so far. But if someone knows a > way please speak up then we can try to implement it. Unfortunately > this would require more detailed knowledge about different FPU > implementations (at least X86_64, ARM and PPC that are the mostly used > platforms) than what I have or willing to spend time to learn. > >> support that was fast but inaccurate, but it was a mess >> because it meant that most guest code sort of worked but >> some guest code would confusingly misbehave. Deliberately >> not correctly emulating the guest CPU/FPU behaviour is not >> something I want us to return to. >> >> You're right that sometimes you can't get both speed >> and accuracy; other emulators (and especially ones >> which are trying to emulate games consoles) may choose >> to prefer speed over accuracy. 
For QEMU we prefer to >> choose accuracy over speed in this area. > > OK, then how about keeping the default accurate but allow to opt in to > use FPU even if it's known to break some bits for workloads where > users would need speed over accuracy and would be happy to live with > the limitation. About the only comparison I can think of is the thread=single:multi flags for TCG, which is mostly there to help developers eliminate causes of bugs. The default for MTTCG is that it is enabled when it's safe. If you enable it via the command line where QEMU hasn't defaulted it on, you will get lots of loud warnings about potential instability. The most commonly used case is thread=single when you want to check it's not an MTTCG bug. I'm as cautious as Peter here about adding a "faster but broken" command line flag because users will invariably read up to the "faster" and then spend a lot of time scratching their heads when things break. > Note that i've found that just removing the define > that disables hardfloat for PPC target makes VMX vector instructions > faster while normal FPU is a little slower without any other changes > so disabling hardfloat already limits performance for guests using VMX > even when not using the FPU for cases when it would cause inaccuracy. > If you say we want accuracy and don't care about speed, then just > don't disable hardfloat as it helps at least VMX and then we can add > option to allow the user to say we can use hardfloat even if it's > inaccurate then they can test their workload and decide for > themselves. > > Regards, > BALATON Zoltan
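[Editorial note: the gating rule Alex describes earlier in the thread — only take the host-FPU fast path when doing so cannot change guest-visible state — can be sketched roughly like this. This is an illustrative stand-in, not QEMU's actual hardfloat predicate; the flag constant and function name are made up. The real code additionally excludes denormals and non-default NaN behaviour.]

```c
#include <assert.h>
#include <stdbool.h>

#define FLOAT_FLAG_INEXACT 0x20  /* stand-in for float_flag_inexact */

/* The fast path is allowed only when the inexact flag is already set
 * (so a newly inexact result adds no observable flag state) and both
 * operands are ordinary finite numbers; NaNs and infinities are left
 * to softfloat, where propagation rules are emulated exactly. */
static bool can_use_host_fpu(int status_flags, double a, double b)
{
    if (!(status_flags & FLOAT_FLAG_INEXACT)) {
        return false;
    }
    return a == a && b == b &&            /* not NaN: NaN != itself  */
           a - a == 0.0 && b - b == 0.0;  /* finite: inf - inf = NaN */
}
```

This is why other targets get hardfloat "for free" once the guest has raised inexact once, without ever producing a flag state the guest could distinguish from softfloat's.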
Hi Zoltan, I can say Mac OS Leopard uses multiple cores on the PowerMac G5 Quad. Most of the apps made for Panther/Tiger/Leopard use 2 cores in SMP for sure; only apps made for Tiger/Leopard use more than 2 cores. Ciao and thanks, Luigi
On Wed, 26 Feb 2020, Alex Bennée wrote: > BALATON Zoltan <balaton@eik.bme.hu> writes: >> OK, then how about keeping the default accurate but allow to opt in to >> use FPU even if it's known to break some bits for workloads where >> users would need speed over accuracy and would be happy to live with >> the limitation. > > About the only comparison I can think of is the thread=single:multi > flags for TCG which is mostly there to help developers eliminate causes > of bugs. The default for MTTCG is it is enabled when it's safe. If you > enable it via the command line where QEMU hasn't defaulted it on you > will get lots of loud warnings about potential instability. The most > commonly used case is thread=single when you want to check it's not a > MTTCG bug. > > I'm as cautious as Peter here about adding a "faster but broken" command > line flag because users will invariably read up to the "faster" and then > spend a lot of time scratching their heads when things break. OK, understood. However, this is specifically about PPC emulation, where some bits exist in the FPSCR that are hard to emulate with hardfloat, but until recently these weren't correctly emulated (in my understanding) and things still ran OK, so it's believed that these bits are seldom used. It was only fixed because some tests revealed it, not because of actual negative effects seen. Unfortunately enabling hardfloat without any other changes would make the default softfloat case a bit slower, so that's why I was asking about also making the hardfloat case the default. I can try some other approach to avoid always setting the inexact bit and only keep it set as other archs do, but I'm not sure when I will have time to get there. Some help with this from those who know the PPC FPU and are interested in running it faster in QEMU would be appreciated. Regards, BALATON Zoltan
Please let's go with hardfloat ppc support, it's really a good feature to implement. Even if in a first step it could lead to inaccurate results, later it could be solved with other patches. I think it's important for QEMU to be as general as possible and not target only recent hardware. Regards, Dino Papararo From: Qemu-ppc <qemu-ppc-bounces+skizzato73=msn.com@nongnu.org> On behalf of luigi burdo Sent: Wednesday, 26 February 2020 14:01 To: BALATON Zoltan <balaton@eik.bme.hu>; Programmingkid <programmingkidx@gmail.com> Cc: David Gibson <david@gibson.dropbear.id.au>; qemu-ppc@nongnu.org; qemu-devel qemu-devel <qemu-devel@nongnu.org>; Howard Spoelstra <hsp.cat7@gmail.com> Subject: RE: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC Hi Zoltan, i can say MacOs Leopard use multiple cores on PowerMac G5 Quad the most of the apps did for Panter/Tiger/leopard use for sure 2 Core in smtp only apps did for Tiger/leopard use more than 2 Cores. Ciao and thenks Luigi
Dino Papararo <skizzato73@msn.com> writes: > Please let's go with hardfloat pps support, it's really a good feature to implement. > Even if in a first step it could lead to inaccuracy results, later it > could solved with other patches. That's the wrong way around. We have regression tests for a reason. I'll happily accept patches to turn on hardfloat for PPC if: a) they don't cause regressions in our fairly extensive floating point tests b) the PPC maintainers are happy with the new performance profile The way forward would be to: 1. patch to drop #if defined(TARGET_PPC) || defined(__FAST_MATH__) 2. audit target/ppc/fpu_helper.c w.r.t. the chip manual and fix any unneeded splatting of flags (if any) 3. measure the before/after performance effect and decide if on balance it's worth keeping > I think it's important for qemu to as global as possible and don't > target only recent hardware. Are you referring to guests or hosts? For guests we will always favour accuracy over speed of emulation. For hosts we need to have IEEE compliant FPU HW to even stand a chance of using hardfloat.
On Wed, Feb 26, 2020 at 3:29 PM Alex Bennée <alex.bennee@linaro.org> wrote: > > > Dino Papararo <skizzato73@msn.com> writes: > > > Please let's go with hardfloat pps support, it's really a good feature to implement. > > Even if in a first step it could lead to inaccuracy results, later it > > could solved with other patches. > > That's the wrong way around. We have regression tests for a reason. I tend to agree with Alex here, and additionally want to expand more on this topic. In my view (which I think is at least very close to the community consensus): This is *not* a ppc-specific issue. There exists a principle across all targets that QEMU FPU calculations must be accurate - exactly as specified in the applicable ISA document. Any discrepancy is an outright bug. We even recently had several patches for the FPU in the ppc target that handled some fairly obscure cases of inaccuracies; I believe they were authored by Paul Clarke, so there are people in the ppc community that care about FPU accuracy (as I guess is the case for any target). There shouldn't be a target that decides by itself and within itself "ok, we don't need accuracy, let's trade it for speed". This violates the architecture of QEMU. Please allow that for any given software project, there is an architecture that should be respected. This doesn't mean that anybody's experimentation is discouraged. No-one can stop anybody from forking from the QEMU upstream tree and doing whatever is wanted. But this doesn't mean such experimentation will be upstreamed. QEMU upstream should be a collecting place for the best ideas and implementations, not for arbitrary experimentations. Best regards, Aleksandar > I'll happily accept patches to turn on hardfloat for PPC if: > > a) they don't cause regressions in our fairly extensive floating point > tests > b) the PPC maintainers are happy with the new performance profile > > The way forward would be to: > > 1. patch to drop #if defined(TARGET_PPC) || defined(__FAST_MATH__) > 2.
audit target/ppc/fpu_helper.c w.r.t chip manual and fix any unneeded > splatting of flags (if any) > 3. measure the before/after performance effect and decide if on balance > it's worth keeping > > > I think it's important for qemu to as global as possible and don't > > target only recent hardware. > > Are you referring to guests or hosts? For guests we will always favour > accuracy of speed of emulation. For hosts we need to have IEEE compliant > FPU HW to even stand a chance of using hardfloat. > > -- > Alex Bennée >
Accuracy is an important part of the IEEE 754 floating point standard. The whole purpose of this standard is to ensure floating point calculations are consistent across multiple CPUs. I believe referring to this patch as inaccurate is itself inaccurate. That gives the impression that this patch produces calculations that are not in line with established standards. This is not true. The only part of this patch that will produce incorrect values are the flags. There *may* be a program or two out there that depend on these flags, but for the majority of programs that only care about basic floating point arithmetic this patch will produce correct values. Currently the emulated PowerPC's FPU already produces wrong values for the flags. This patch does set the Inexact flag (which I don't like), but since I have never encountered any source code that cares for this flag, I can let it go. I think giving the user the ability to decide which option to use is the best thing to do. On Wed, Feb 26, 2020 at 10:51 AM Aleksandar Markovic <aleksandar.m.mail@gmail.com> wrote: > [snip]
On Wed, Feb 26, 2020 at 6:04 PM G 3 <programmingkidx@gmail.com> wrote: > > Accuracy is an important part of the IEEE 754 floating point standard. The whole purpose of this standard is to ensure floating point calculations are consistent across multiple CPUs. I believe referring to this patch as inaccurate is itself inaccurate. That gives the impression that this patch produces calculations that are not inline with established standards. This is not true. The only part of this patch that will produce incorrect values are the flags. There *may* be a program or two out there that depend on these flags, but for the majority of programs that only care about basic floating point arithmetic this patch will produce correct values. Currently the emulated PowerPC's FPU already produces wrong values for the flags. This patch does set the Inexact flag (which I don't like), but since I have never encountered any source code that cares for this flag, I can let it go. I think giving the user the ability to decide which option to use is the best thing to do. > From the experiments described above, the patch in question changes the behavior of applications (for example, sound is different with and without the patch), which is in contradiction with your claim that you "never encountered any source code that cares for this flag" and that "the only part of this patch that will produce incorrect values are the flags". In other words, and playing further with them: The claim that "referring to this patch as inaccurate is itself inaccurate" is itself inaccurate. Best regards, Aleksandar > On Wed, Feb 26, 2020 at 10:51 AM Aleksandar Markovic <aleksandar.m.mail@gmail.com> wrote: >> [snip]
G 3 <programmingkidx@gmail.com> writes: > Accuracy is an important part of the IEEE 754 floating point standard. The > whole purpose of this standard is to ensure floating point calculations are > consistent across multiple CPUs. I believe referring to this patch as > inaccurate is itself inaccurate. That gives the impression that this patch > produces calculations that are not inline with established standards. This > is not true. The only part of this patch that will produce incorrect values > are the flags. As I stated further up the thread I'd be happy to take a patch that turns this on without messing about with the FPU flags, which AFAICT breaks the architectural compliance of those instructions. The ability to detect inexact results is part of the IEEE spec, even if it is considerably looser about when the flag should be reset. > There *may* be a program or two out there that depend on > these flags, but for the majority of programs that only care about basic > floating point arithmetic this patch will produce correct values. Currently > the emulated PowerPC's FPU already produces wrong values for the > flags. Those are bugs that should be fixed. The state of the flags after a calculation should be considered part of the "values" generated by the FPU. > This patch does set the Inexact flag (which I don't like), but since I have > never encountered any source code that cares for this flag, I can let it > go. I think giving the user the ability to decide which option to use is > the best thing to do. Giving the user the option to break things is a poor flag; most of QEMU's configurable knobs are about trading off size/speed without affecting correctness. If the PPC maintainers are happy that hardfloat's speed trade-offs are worth it for usual workloads (whatever they may be) then I have no objection to defaulting hardfloat to on - perhaps even with an option to force softfloat if it is faster for some workloads.
> [snip]
I think we all agree the best solution is to resolve the powerpc issues with the current hardfloat implementation. I also think powerpc is an important branch of qemu, for historical, present and (why not?) future reasons, and it must NOT be left behind. So I would invite the QEMU community's most skilled programmers to work on this and perhaps solve the issue in a few days. The same group who worked on recent altivec optimizations is able to make a good patch even for this. In a subordinate way, I'd anyway like to implement hardfloat support for powerpc, advising users about the inaccuracy of results/flags and letting them choose. Of course I understand, and in part agree with, all your objections. I simply prefer to always have a choice. Best Regards, Dino Papararo -----Original Message----- From: Aleksandar Markovic <aleksandar.m.mail@gmail.com> Sent: Wednesday, 26 February 2020 18:27 To: G 3 <programmingkidx@gmail.com> Cc: Alex Bennée <alex.bennee@linaro.org>; Dino Papararo <skizzato73@msn.com>; QEMU Developers <qemu-devel@nongnu.org>; qemu-ppc@nongnu.org; Howard Spoelstra <hsp.cat7@gmail.com>; luigi burdo <intermediadc@hotmail.com>; David Gibson <david@gibson.dropbear.id.au> Subject: Re: R: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC On Wed, Feb 26, 2020 at 6:04 PM G 3 <programmingkidx@gmail.com> wrote: [snip]
On Wed, Feb 26, 2020 at 7:14 PM Dino Papararo <skizzato73@msn.com> wrote: > > I think we all agree the best solution is to resolve the powerpc issues with the current hardfloat implementation. > I also think powerpc is an important branch of qemu, for historical, present and (why not?) future reasons, and it must NOT be left behind. > So I would invite the QEMU community's most skilled programmers to work on this and perhaps solve the issue in a few days. > The same group who worked on recent altivec optimizations is able to make a good patch even for this. > > In a subordinate way, I'd anyway like to implement hardfloat support for powerpc, advising users about the inaccuracy of results/flags and letting them choose. Just to be clear, concluding from your surrounding sentences, you don't mean here to implement "hardfloat support for powerpc", but something like "improve performance of ppc FPU emulation by misusing hardfloat support and giving up accuracy". I think we have to be honest with ourselves. Regards, Aleksandar > [snip]
On Wed, 26 Feb 2020, Alex Bennée wrote: > That's the wrong way around. We have regression tests for a reason. I'll > happily accept patches to turn on hardfloat for PPC if: > > a) they don't cause regressions in our fairly extensive floating point > tests Where are these tests and how do I run them? I'm not aware of such tests, so I've only tried running simple guest code to test changes, but if there are more extensive FP tests I'd better use those. > b) the PPC maintainers are happy with the new performance profile > > The way forward would be to: > > 1. patch to drop #if defined(TARGET_PPC) || defined(__FAST_MATH__) This is simple, but I've found that while it seems to make some vector instructions faster it also makes most simple FP ops slower, because it will go through checking whether it can use hardfloat but never can, since fp_status is cleared before every FP op. That's why I set the inexact bit: to let it use hardfloat and be able to test if it would work at all. That's all my RFC patch did; I had a 2nd version trying to avoid the slowdown with the above #if defined() dropped but hardfloat=false, so it only uses softfloat as before, but it did not work out well; some tests said v2 was even slower. Maybe to avoid overhead we should add a variable instead of the QEMU_NO_HARDFLOAT define that can be set at runtime, but probably that won't be faster either. Thus it seems there's no way to enable hardfloat for PPC and not have slower performance for most FP ops without also doing some of the other points below (even if it's beneficial for vector ops). > 2. audit target/ppc/fpu_helper.c w.r.t chip manual and fix any unneeded > splatting of flags (if any) This would either need someone who knows the PPC FPU or someone who can take the time to learn and go through the code. Not sure I want to volunteer for that. But I think the clearing of the flags is mainly to emulate the FI bit, which is a non-sticky inexact bit that should show the inexact status of the last FP op.
(There's another similar bit for fraction rounded as well, but that does not disable hardfloat.) The question is whether we really want to accurately emulate these bits. Is there any software we care about that relies on them? If we can live with not having correct FI bit emulation (which was the case for a long time until these were fixed), then we have an easy way to enable hardfloat without more extensive changes. If we want to accurately emulate these bits as well, then we will probably need changes to softfloat to allow registering FP exception handlers, so that instead of clearing and checking bits we can get an exception from the FPU and set the bits then, but I have no idea how to do that. Regards, BALATON Zoltan
> On Feb 26, 2020, at 12:27 PM, Aleksandar Markovic <aleksandar.m.mail@gmail.com> wrote: > > On Wed, Feb 26, 2020 at 6:04 PM G 3 <programmingkidx@gmail.com> wrote: >> [snip] > > From the experiments described above, the patch in question changes the behavior of applications (for example, sound is different with and without the patch), which is in contradiction with your claim that you "never encountered any source code that cares for this flag" and that "the only part of this patch that will produce incorrect values are the flags". > > In other words, and playing further with them: > > The claim that "referring to this patch as inaccurate is itself inaccurate" is itself inaccurate. > > Best regards, > Aleksandar It is inaccurate to state that, just because the USB audio device seems to play better with the hardfloat feature enabled, this changes the fact that I have yet to see any source code that actually checks the flags.
I have reviewed both the USB audio device and Apple's AppleUSBAudio class code and have not seen any mention of the exception flags.
On Thursday, February 27, 2020, Programmingkid <programmingkidx@gmail.com> wrote: > > > On Feb 26, 2020, at 12:27 PM, Aleksandar Markovic < > aleksandar.m.mail@gmail.com> wrote: > > > > On Wed, Feb 26, 2020 at 6:04 PM G 3 <programmingkidx@gmail.com> wrote: > >> > >> Accuracy is an important part of the IEEE 754 floating point standard. > The whole purpose of this standard is to ensure floating point calculations > are consistent across multiple CPUs. I believe referring to this patch as > inaccurate is itself inaccurate. That gives the impression that this patch > produces calculations that are not inline with established standards. This > is not true. The only part of this patch that will produce incorrect values > are the flags. There *may* be a program or two out there that depend on > these flags, but for the majority of programs that only care about basic > floating point arithmetic this patch will produce correct values. Currently > the emulated PowerPC's FPU already produces wrong values for the flags. > This patch does set the Inexact flag (which I don't like), but since I have > never encountered any source code that cares for this flag, I can let it > go. I think giving the user the ability to decide which option to use is > the best thing to do. > >> > > > > From the experiments described above, the patch in question changes the > behavior > > of applications (for example, sound is different with and without the > > patch), which is > > in contradiction with your claim that you "never encountered any > > source code that > > cares for this flag" and that "the only part of this patch that will > > produce incorrect > > values are the flags". > > > > In other words, and playing further with them: > > > > The claim that "referring to this patch as inaccurate is itself > > inaccurate" is itself inaccurate. 
> > > > Best regards, > > Aleksandar > > It is inaccurate to state that, just because the USB audio device seems to > play better with the hardfloat feature enabled, this changes the fact > that I have yet to see any source code that actually checks the flags. I > have reviewed both the USB audio device and Apple's AppleUSBAudio class > code and have not seen any mention of the exception flags. > > I totally disagree with your using the term "hardfloat feature enabled" in this context, speaking about this particular patch. This may be just wishful thinking. The right wording would be "hardfloat feature hacked", or "hardfloat feature fooled". The patch itself has the wrong, intentionally misleading and confusing title from the outset. It should be something like "target/ppc: Cheat hardfloat feature into believing that inexact flag is always set" Best regards, Aleksandar
On Thu, 27 Feb 2020, Aleksandar Markovic wrote: > I totally disagree with your use of the term "hardfloat feature enabled" in > this context, speaking about this particular patch. This may be just > wishful thinking. The right wording would be "hardfloat feature hacked", or > "hardfloat feature fooled". > > The patch itself has the wrong, intentionally misleading and confusing > title from the outset. It should be something like "target/ppc: Cheat > hardfloat feature into believing that inexact flag is always set" May I point out that the patch is RFC, meaning it's not meant to be merged, only to be tested and provide feedback. Also the limitations were stated in the commit message. There's no other easy way that I know of to test if hardfloat would work with PPC than forcing the inexact bit, to have it run with hardfloat most of the time. Once it's tested what regressions this would cause (other than the expected inexact bit) then we can see if there are any other problems with hardfloat and PPC or only this bit. Then we can either change it to only not clear the inexact bit, like it's done on other archs, or do something else, as even not clearing the sticky inexact bit would break the non-sticky counterpart PPC has. Breakage would be limited to the non-sticky version, and the discussion was about whether even that's unacceptable. Regards, BALATON Zoltan
> On Feb 26, 2020, at 1:09 PM, Alex Bennée <alex.bennee@linaro.org> wrote: > > > G 3 <programmingkidx@gmail.com> writes: > >> Accuracy is an important part of the IEEE 754 floating point standard. The >> whole purpose of this standard is to ensure floating point calculations are >> consistent across multiple CPUs. I believe referring to this patch as >> inaccurate is itself inaccurate. That gives the impression that this patch >> produces calculations that are not in line with established standards. This >> is not true. The only part of this patch that will produce incorrect values >> are the flags. > > As I stated further up the thread I'd be happy to take a patch that > turns this on without the messing about with the FPU flags which AFAICT > breaks the architectural compliance of those instructions. The ability > to detect inexact results is part of the IEEE spec even if it is > considerably looser about when the flag should be reset. > >> There *may* be a program or two out there that depend on >> these flags, but for the majority of programs that only care about basic >> floating point arithmetic this patch will produce correct values. Currently >> the emulated PowerPC's FPU already produces wrong values for the >> flags. > > Those are bugs that should be fixed. The state of the flags after a > calculation should be considered part of the "values" generated by the > FPU. > >> This patch does set the Inexact flag (which I don't like), but since I have >> never encountered any source code that cares for this flag, I can let it >> go. I think giving the user the ability to decide which option to use is >> the best thing to do. > > Giving the user the option to break things is a poor idea, most of QEMU's > configurable knobs are about trading off size/speed without affecting > correctness. 
> > If the PPC maintainers are happy that hardfloat's speed trade-offs are > worth it for usual workloads (whatever they may be) then I have no > objection to defaulting hardfloat to on - perhaps even with an > option to force softfloat if it is faster for some workloads. > >> >> On Wed, Feb 26, 2020 at 10:51 AM Aleksandar Markovic < >> aleksandar.m.mail@gmail.com> wrote: >> >>> >>> >>> On Wed, Feb 26, 2020 at 3:29 PM Alex Bennée <alex.bennee@linaro.org> >>> wrote: >>>> >>>> >>>> Dino Papararo <skizzato73@msn.com> writes: >>>> >>>>> Please let's go with hardfloat ppc support, it's really a good feature >>> to implement. >>>>> Even if in a first step it could lead to inaccurate results, later it >>>>> could be solved with other patches. >>>> >>>> That's the wrong way around. We have regression tests for a reason. >>> >>> I tend to agree with Alex here, and additionally want to expand more on >>> this topic. >>> >>> In my view: (that I think is at least very close to the community >>> consensus) >>> >>> This is *not* a ppc-specific issue. There exists a principle across all >>> targets >>> that QEMU FPU calculation must be accurate - exactly as specified in any >>> applicable particular ISA document. Any discrepancy is an outright bug. >>> >>> We even recently had several patches for FPU in the ppc target that handled >>> some fairly obscure cases of inaccuracies, I believe they were authored >>> by Paul Clarke, so there are people in the ppc community that care about >>> FPU accuracy (as I guess is the case for any target). >>> >>> There shouldn't be a target that decides by itself and within itself >>> "ok, we don't need accuracy, let's trade it for speed". This violates >>> the architecture of QEMU. Please allow that for any given software >>> project, there is an architecture that should be respected. >>> >>> This doesn't mean that anybody's experimentation is discouraged. No-one >>> can stop anybody from forking the QEMU upstream tree and doing whatever >>> is wanted. 
>>> >>> But, this doesn't mean such experimentation will be upstreamed. QEMU >>> upstream should be a collecting place for the best ideas and implementations, >>> not for arbitrary experiments. >>> >>> Best regards, >>> Aleksandar >>> >>> >>>> I'll happily accept patches to turn on hardfloat for PPC if: >>>> >>>> a) they don't cause regressions in our fairly extensive floating point >>>> tests >>>> b) the PPC maintainers are happy with the new performance profile >>>> >>>> The way forward would be to: >>>> >>>> 1. patch to drop #if defined(TARGET_PPC) || defined(__FAST_MATH__) >>>> 2. audit target/ppc/fpu_helper.c w.r.t. the chip manual and fix any unneeded >>>> splatting of flags (if any) >>>> 3. measure the before/after performance effect and decide if on balance >>>> it's worth keeping >>>> >>>>> I think it's important for qemu to be as global as possible and not >>>>> target only recent hardware. >>>> >>>> Are you referring to guests or hosts? For guests we will always favour >>>> accuracy over speed of emulation. For hosts we need to have IEEE-compliant >>>> FPU HW to even stand a chance of using hardfloat. >>>> >>>> -- >>>> Alex Bennée >>>> > > > -- > Alex Bennée Ok, I was just looking at Intel's x87 chip documentation. It supports IEEE 754 floating point operations and exception flags. This leads me to this question. Would simply taking the host exception flags and using them to set the PowerPC FPU's flags be an acceptable solution to this problem? These are the flags that all CPUs that support the IEEE 754 standard implement: - division by zero - inexact - overflow - underflow - invalid operation This could be an API that is used to retrieve the host flags' values: - get_host_div_zero_flag() - get_host_inexact_flag() - get_host_overflow_flag() - get_host_underflow_flag() - get_host_invalid_oper_flag() We could then use this API to set the PowerPC FPU exception flags. Does this sound like a good solution?
On 3/1/20 4:13 PM, Programmingkid wrote:
> Ok, I was just looking at Intel's x87 chip documentation. It supports IEEE 754 floating point operations and exception flags. This leads me to this question. Would simply taking the host exception flags and using them to set the PowerPC's FPU's flag be an acceptable solution to this problem?
No.
The primary issue is the FPSCR.FI flag. This is not an accumulative bit, per
ieee754, but per operation.
The "hardfloat" option works (with other targets) only with ieee745
accumulative exceptions, when the most common of those exceptions, inexact, has
already been raised. And thus need not be raised a second time.
Per the PowerPC architecture, inexact must be recognized afresh for every
operation. Which is cheap in hardware but expensive in software.
And once you're done with FI, FR has been and continues to be emulated incorrectly.
r~
On Sun, 1 Mar 2020, Richard Henderson wrote: > On 3/1/20 4:13 PM, Programmingkid wrote: >> Ok, I was just looking at Intel's x87 chip documentation. It supports >> IEEE 754 floating point operations and exception flags. This leads me >> to this question. Would simply taking the host exception flags and >> using them to set the PowerPC's FPU's flag be an acceptable solution to >> this problem? In my understanding that's what is currently done; the problem with PPC, as Richard said, is the non-sticky versions of some of these bits, which require clearing the FP exception status before every FPU op, which seems to be expensive and slower than using softfloat. So to use hardfloat we either accept that we can't emulate these bits with hardfloat or we need to do something other than clearing flags and checking after every FPU op. While not emulating these bits doesn't seem to matter for most clients and other PPC emulations got away with it, QEMU prefers accuracy over speed even for rarely used features. > No. > > The primary issue is the FPSCR.FI flag. This is not an accumulative bit, per > ieee754, but per operation. > > The "hardfloat" option works (with other targets) only with ieee754 > accumulative exceptions, when the most common of those exceptions, inexact, has > already been raised. And thus need not be raised a second time. Why exactly is it done that way? What are the differences between IEEE FP implementations that prevent using hardfloat most of the time instead of only using it in some (although supposedly common) special cases? > Per the PowerPC architecture, inexact must be recognized afresh for every > operation. Which is cheap in hardware but expensive in software. > > And once you're done with FI, FR has been and continues to be emulated incorrectly. 
I think CPUs can also raise exceptions when they detect the condition in hardware, so maybe we should install our own FPU exception handler and set guest flags from that; then we don't need to check and won't have a problem with these bits either. Why is that not possible, or why isn't it done? The softfloat code has a comment that working with exceptions is not pleasant, but why? Isn't setting flags from a handler easier than checking separately for each op? If this is because of differences in how flags are handled by different targets, we don't have to do that from the host FPU exception handler. That handler could just set a global flag on each exception that can then be checked by targets, which handle the differences. This global flag could then include non-sticky versions if needed, because clearing a global should be less expensive than clearing the FPU status register. But I don't really know, just guessing; someone who knows more about FPUs probably knows a better way. Regards, BALATON Zoltan
On 3/2/20 3:42 AM, BALATON Zoltan wrote: >> The "hardfloat" option works (with other targets) only with ieee745 >> accumulative exceptions, when the most common of those exceptions, inexact, has >> already been raised. And thus need not be raised a second time. > > Why exactly it's done that way? What are the differences between IEEE FP > implementations that prevents using hardfloat most of the time instead of only > using it in some (although supposedly common) special cases? While it is possible to read the host's ieee exception word after the hardfloat operation, there are two reasons that is undesirable: (1) It is *slow*. So slow that it's faster to run the softfloat code instead. I thought it would be easier to find the benchmark numbers that Emilio generated when this was tested, but I can't find it. (2) IEEE has a number of implementation choices for corner cases, and we need to implement the target's choices, not the host's choices. > I think CPUs can also raise exceptions when they detect the condition in > hardware so maybe we should install our FPU exception handler and set guest > flags from that then we don't need to check and won't have problem with these > bits either. Why is that not possible or isn't done? If we have to enable and disable host fpu exceptions going in and out of softfloat routines, we're back to modifying the host fpu control word, which as described above, is *slow*. > That handler could only > set a global flag on each exception that targets can be checked by targets and > handle differences. This global flag then can include non-sticky versions if > needed because clearing a global should be less expensive than clearing FPU > status reg. But I don't really know, just guessing, somone who knows more about > FPUs probably knows a better way. I don't know if anyone has tried that variant, where we simply leave the exceptions enabled, leave the signal handler enabled, and use a global. Feel free to try it and benchmark it. r~
BALATON Zoltan <balaton@eik.bme.hu> writes: > On Sun, 1 Mar 2020, Richard Henderson wrote: >> On 3/1/20 4:13 PM, Programmingkid wrote: >>> Ok, I was just looking at Intel's x87 chip documentation. It >>> supports IEEE 754 floating point operations and exception flags. >>> This leads me to this question. Would simply taking the host >>> exception flags and using them to set the PowerPC's FPU's flag be >>> an acceptable solution to this problem? > > In my understanding that's what is currently done, the problem with > PPC as Richard said is the non-sticky versions of some of these bits > which need clearing FP exception status before every FPU op which > seems to be expensive and slower than using softfloat. So to use > hardfloat we either accept that we can't emulate these bits with > hardfloat or we need to do something else than clearing flags and > checking after every FPU op. > > While not emulating these bits don't seem to matter for most clients > and other PPC emulations got away with it, QEMU prefers accuracy over > speed even for rarely used features. > >> No. >> >> The primary issue is the FPSCR.FI flag. This is not an accumulative bit, per >> ieee754, but per operation. >> >> The "hardfloat" option works (with other targets) only with ieee745 >> accumulative exceptions, when the most common of those exceptions, inexact, has >> already been raised. And thus need not be raised a second time. > > Why exactly it's done that way? What are the differences between IEEE > FP implementations that prevents using hardfloat most of the time > instead of only using it in some (although supposedly common) special > cases? There are a couple of wrinkles. As far as NaN and denormal behaviour goes we have enough slack in the spec that different guests have slightly different behaviour. See pickNaN and friends in the soft float specialisation code. As a result we never try and hand off to hardfloat for NaNs, Infs and Zeros. 
Luckily testing for those cases is a fairly small part of the cost of the calculation. Also things tend to get unstuck on changes to rounding modes. Fortunately it doesn't seem to be super common. You can read even more detail in the paper that originally prompted Emilio's work: "supporting the neon and VFP instruction sets in an LLVM-based binary translator" https://www.thinkmind.org/download.php?articleid=icas_2015_5_20_20033 >> Per the PowerPC architecture, inexact must be recognized afresh for every >> operation. Which is cheap in hardware but expensive in software. >> >> And once you're done with FI, FR has been and continues to be emulated incorrectly. > > I think CPUs can also raise exceptions when they detect the condition > in hardware so maybe we should install our FPU exception handler and > set guest flags from that then we don't need to check and won't have > problem with these bits either. Why is that not possible or isn't > done? One of my original patches did just this: Subject: [PATCH] fpu/softfloat: use hardware sqrt if we can (EXPERIMENT!) Date: Tue, 20 Feb 2018 21:01:37 +0000 Message-Id: <20180220210137.18018-1-alex.bennee@linaro.org> The two problems you run into are: - relying on a trap for inexact will be slow if you keep hitting it - reading host FPU flag registers turns out to be pretty expensive > The softfloat code has a comment that working with exceptions is > not pleasant but why? Isn't setting flags from a handler easier than > checking separately for each op? If this is because of differences in > how flags are handled by different targets we don't have to do that > from the host FPU exception handler. That handler could only set a > global flag on each exception that targets can be checked by targets > and handle differences. This global flag then can include non-sticky > versions if needed because clearing a global should be less expensive > than clearing FPU status reg. 
But I don't really know, just guessing, > someone who knows more about FPUs probably knows a better way. > > Regards, > BALATON Zoltan
On Mon, 2 Mar 2020, Alex Bennée wrote: > BALATON Zoltan <balaton@eik.bme.hu> writes: >> On Sun, 1 Mar 2020, Richard Henderson wrote: >>> On 3/1/20 4:13 PM, Programmingkid wrote: >>>> Ok, I was just looking at Intel's x87 chip documentation. It >>>> supports IEEE 754 floating point operations and exception flags. >>>> This leads me to this question. Would simply taking the host >>>> exception flags and using them to set the PowerPC's FPU's flag be >>>> an acceptable solution to this problem? >> >> In my understanding that's what is currently done, the problem with >> PPC as Richard said is the non-sticky versions of some of these bits >> which need clearing FP exception status before every FPU op which >> seems to be expensive and slower than using softfloat. So to use >> hardfloat we either accept that we can't emulate these bits with >> hardfloat or we need to do something else than clearing flags and >> checking after every FPU op. >> >> While not emulating these bits don't seem to matter for most clients >> and other PPC emulations got away with it, QEMU prefers accuracy over >> speed even for rarely used features. >> >>> No. >>> >>> The primary issue is the FPSCR.FI flag. This is not an accumulative bit, per >>> ieee754, but per operation. >>> >>> The "hardfloat" option works (with other targets) only with ieee745 >>> accumulative exceptions, when the most common of those exceptions, inexact, has >>> already been raised. And thus need not be raised a second time. >> >> Why exactly it's done that way? What are the differences between IEEE >> FP implementations that prevents using hardfloat most of the time >> instead of only using it in some (although supposedly common) special >> cases? > > There are a couple of wrinkles. As far as NaN and denormal behaviour > goes we have enough slack in the spec that different guests have > slightly different behaviour. See pickNaN and friends in the soft float > specialisation code. 
As a result we never try and hand off to hardfloat > for NaNs, Infs and Zeros. Luckily testing for those cases is a fairly > small part of the cost of the calculation. > > Also things tend to get unstuck on changes to rounding modes. > Fortunately it doesn't seem to be super common. OK, but how do these relate to the inexact flag, and why is that the one that's checked for using hardfloat? Also the rounding mode is checked, but why can't we set the same mode on the host, and why only use hardfloat in one specific rounding mode? These two checks seem to further limit hardfloat use beyond the above cases, or are these the same? > You can read even more detail in the paper that originally prompted > Emilio's work: > > "supporting the neon and VFP instruction sets in an LLVM-based > binary translator" > https://www.thinkmind.org/download.php?articleid=icas_2015_5_20_20033 I've only had a quick look at it, but it seems to not discuss all details. They say the ARM instructions they wanted to emulate have some non-standard flush-to-zero behaviour where exceptions (including inexact) are handled differently. Is this related to the check above, and if yes, shouldn't that only apply to the ARM target? Other standard-compliant targets probably should not be limited by this. They've also found that clearing and reading host FP flags is "slower than QEMU", which is what we have for PPC currently. They say the solution is to not use host exceptions at all but to calculate the exception flags in software by looking at inputs and results instead, maybe trying different FP ops that test for the exception cases. Unfortunately this paper does not describe how exactly that's done, just says it may be described later. It seems like a kind of softfloat, but using the FPU for the actual calculation and deducing exceptions without access to the intermediate results that softfloat may be using. So they can use hardware for the calculation, which should be the largest part, and calculate the flags in software. 
This way they claim a 1.24 to 3.36 times speedup compared to the QEMU of the time (using only softfloat I guess, which is still what we have for PPC today). >>> Per the PowerPC architecture, inexact must be recognized afresh for every >>> operation. Which is cheap in hardware but expensive in software. >>> >>> And once you're done with FI, FR has been and continues to be emulated incorrectly. >> >> I think CPUs can also raise exceptions when they detect the condition >> in hardware so maybe we should install our FPU exception handler and >> set guest flags from that then we don't need to check and won't have >> problem with these bits either. Why is that not possible or isn't >> done? > > One of my original patches did just this: > > Subject: [PATCH] fpu/softfloat: use hardware sqrt if we can (EXPERIMENT!) > Date: Tue, 20 Feb 2018 21:01:37 +0000 > Message-Id: <20180220210137.18018-1-alex.bennee@linaro.org> It's this patch: http://patchwork.ozlabs.org/patch/875764/ This at least shows where to hook in FP exception handling, but based on the above paper maybe that's not the best solution after all; still, it may be worth a try anyway in case it's simpler than what they did. > The two problems you run into are: > > - relying on a trap for inexact will be slow if you keep hitting it Which is slower? Clearing exception flags before every op and reading them again, or trapping for exceptions? I'd expect even if exceptions are common they should be less frequent than every op (otherwise they would not be "exceptional"). > - reading host FPU flag registers turns out to be pretty expensive That's what using exceptions should avoid. If we only need to read and clear flags when an exception happens, that should be less frequent than doing it for every FP op. Hopefully even with the additional overhead of calling the handler, if all the handler has to do is set a corresponding flag in a global. Regards, BALATON Zoltan
On Mon, 2 Mar 2020, Richard Henderson wrote: > On 3/2/20 3:42 AM, BALATON Zoltan wrote: >>> The "hardfloat" option works (with other targets) only with ieee754 >>> accumulative exceptions, when the most common of those exceptions, inexact, has >>> already been raised. And thus need not be raised a second time. >> >> Why exactly it's done that way? What are the differences between IEEE FP >> implementations that prevents using hardfloat most of the time instead of only >> using it in some (although supposedly common) special cases? > > While it is possible to read the host's ieee exception word after the hardfloat > operation, there are two reasons that is undesirable: > > (1) It is *slow*. So slow that it's faster to run the softfloat code instead. > I thought it would be easier to find the benchmark numbers that Emilio > generated when this was tested, but I can't find it. I remember those benchmarks too, and this is also what the paper Alex referred to confirmed. Also I've found that enabling hardfloat for PPC without doing anything else is slightly slower (on a recent CPU; on older CPUs it could be even slower). Interestingly, however, it does give a speedup for vector instructions (maybe because they don't clear flags between each sub-operation). Does that mean these vector instruction helpers are also buggy regarding exceptions? > (2) IEEE has a number of implementation choices for corner cases, and we need > to implement the target's choices, not the host's choices. But how is that related to the inexact flag and the float_round_nearest_even rounding mode, which are the only two things the can_use_fpu() function checks for? >> I think CPUs can also raise exceptions when they detect the condition in >> hardware so maybe we should install our FPU exception handler and set guest >> flags from that then we don't need to check and won't have problem with these >> bits either. Why is that not possible or isn't done? 
> > If we have to enable and disable host fpu exceptions going in and out of > softfloat routines, we're back to modifying the host fpu control word, which as > described above, is *slow*. > >> That handler could only >> set a global flag on each exception that targets can be checked by targets and >> handle differences. This global flag then can include non-sticky versions if >> needed because clearing a global should be less expensive than clearing FPU >> status reg. But I don't really know, just guessing, somone who knows more about >> FPUs probably knows a better way. > > I don't know if anyone has tried that variant, where we simply leave the > exceptions enabled, leave the signal handler enabled, and use a global. > > Feel free to try it and benchmark it. I probably won't try any time soon. I have several other half finished stuff to hack on to not take up another one I likely can't finish, but hope this discussion inspires someone to try it. I'm also interested in the results. If nobody tries in the next two years maybe I get there eventually. Regards, BALATON Zoltan
On 3/2/20 3:16 PM, BALATON Zoltan wrote: >> (2) IEEE has a number of implementation choices for corner cases, and we need >> to implement the target's choices, not the host's choices. > > But how is that related to inexact flag and float_round_nearest_even rounding > mode which are the only two things can_use_fpu() function checks for? float_round_nearest_even is the default rounding mode, and is what the host fpu has been set for. In order to use hardfloat for a different rounding mode, we would need to change the host fpu control register, which is slow. Inexact is the only flag that we cannot compute easily from the inputs and output. We check for NaN and Inf as inputs and outputs, which is the clue that we may have to raise Invalid or Overflow. r~
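[Editor's note: in spirit, the gate being described is small. The sketch below paraphrases the condition above with made-up type and constant names, not QEMU's actual softfloat definitions:]

```c
#include <stdbool.h>

/* Paraphrase of the hardfloat fast-path check described above: hardfloat
 * is only attempted when the sticky inexact flag is already set (so the
 * host op cannot change observable flag state) and the guest rounding
 * mode matches the host default, round-to-nearest-even. */
enum { FLAG_INEXACT = 1 };

enum rounding_mode {
    ROUND_NEAREST_EVEN,
    ROUND_UP,
    ROUND_DOWN,
    ROUND_TO_ZERO,
};

struct fp_status {
    unsigned exception_flags;       /* sticky accumulated exception bits */
    enum rounding_mode rounding_mode;
};

static bool can_use_hardfloat(const struct fp_status *s)
{
    return (s->exception_flags & FLAG_INEXACT)
        && s->rounding_mode == ROUND_NEAREST_EVEN;
}
```

[This is why PPC's per-operation FI bit defeats the scheme: clearing inexact before each op means the first condition is almost never true.]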
---------- Forwarded message --------- From: G 3 <programmingkidx@gmail.com> Date: Wed, Mar 4, 2020 at 1:35 PM Subject: Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC To: BALATON Zoltan <balaton@eik.bme.hu> On Mon, Mar 2, 2020 at 6:16 PM BALATON Zoltan <balaton@eik.bme.hu> wrote: > On Mon, 2 Mar 2020, Richard Henderson wrote: > > On 3/2/20 3:42 AM, BALATON Zoltan wrote: > >>> The "hardfloat" option works (with other targets) only with ieee745 > >>> accumulative exceptions, when the most common of those exceptions, > inexact, has > >>> already been raised. And thus need not be raised a second time. > >> > >> Why exactly it's done that way? What are the differences between IEEE FP > >> implementations that prevents using hardfloat most of the time instead > of only > >> using it in some (although supposedly common) special cases? > > > > While it is possible to read the host's ieee exception word after the > hardfloat > > operation, there are two reasons that is undesirable: > > > > (1) It is *slow*. So slow that it's faster to run the softfloat code > instead. > > I thought it would be easier to find the benchmark numbers that Emilio > > generated when this was tested, but I can't find it. > > I remember those benchmarks too and this is also what the paper Alex > referred to also confirmed. Also I've found that enabling hardfloat for > PPC without doing anything else is slightly slower (on a recent CPU, on > older CPUs could be even slower). Interetingly however it does give a > speedup for vector instructions (maybe because they don't clear flags > between each sub operation). Does that mean these vector instruction > helpers are also buggy regarding exceptions? > I am all intrigued by these vector instructions. Apple was really big on using them back in the day so programs like Quicktime and iTunes definitely use them. 
I'm not sure if the PowerPC's altivec vector instructions map to host vector instructions already, but if they don't, mapping them would give us a huge speedup in certain places. Would anyone know if this was already done in QEMU?
On 3/4/20 10:43 AM, G 3 wrote: > I am all intrigued by these vector instructions. Apple was really big on using > them back in the day so programs like Quicktime and iTunes definitely use them. > I'm not sure if the PowerPC's altivec vector instructions map to host vector > instructions already, but if they don't, mapping them would give us a huge > speedup in certain places. Would anyone know if this was already done in QEMU? They are, provided that your x86 host supports AVX. Which should be everything manufactured after about 2011. r~
Is this stable now? I'd like to see hardfloat landed :) On Wed, Feb 19, 2020 at 1:19 AM BALATON Zoltan <balaton@eik.bme.hu> wrote: > While other targets take advantage of using host FPU to do floating > point computations, this was disabled for PPC target because always > clearing exception flags before every FP op made it slightly slower > than emulating everything with softfloat. To emulate some FPSCR bits, > clearing of fp_status may be necessary (unless these could be handled > e.g. using FP exceptions on host but there's no API for that in QEMU > yet) but preserving at least the inexact flag makes hardfloat usable > and faster than softfloat. Since most clients don't actually care > about this flag, we can gain some speed trading some emulation > accuracy. > > This patch implements a simple way to keep the inexact flag set for > hardfloat while still allowing to revert to softfloat for workloads > that need more accurate albeit slower emulation. (Set hardfloat > property of CPU, i.e. -cpu name,hardfloat=false for that.) There may > still be room for further improvement but this seems to increase > floating point performance. Unfortunately the softfloat case is slower > than before this patch so this patch only makes sense if the default > is also set to enable hardfloat. > > Because of the above this patch at the moment is mainly for testing > different workloads to evaluate how viable would this be in practice. > Thus, RFC and not ready for merge yet. > > Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu> > --- > v2: use different approach to avoid needing if () in > helper_reset_fpstatus() but this does not seem to change overhead > much, also make it a single patch as adding the hardfloat option is > only a few lines; with this we can use same value at other places where > float_status is reset and maybe enable hardfloat for a few more places > for a little more performance but not too much. 
With this I got: > > lame: 3:13, lame_vmx: 1:55 (this is probably within jitter though and > still far from the results on real hardware) also tried mplayer test > and got results between 144-146s (this test is more VMX bound). > > I've also done some profiling for hardfloat=true and false cases with > this patch to see what are the hot functions. Results are: > > Overhead Command Symbol > -cpu G4,hardfloat=false, lame: > 9.82% qemu-system-ppc [.] round_canonical > 8.35% qemu-system-ppc [.] soft_f64_muladd > 7.16% qemu-system-ppc [.] soft_f64_addsub > 5.27% qemu-system-ppc [.] float32_to_float64 > 5.20% qemu-system-ppc [.] helper_compute_fprf_float64 > 4.61% qemu-system-ppc [.] helper_frsp > 4.59% qemu-system-ppc [.] soft_f64_mul > 4.01% qemu-system-ppc [.] float_to_float.isra.26 > 3.84% qemu-system-ppc [.] float64_classify > 2.97% qemu-system-ppc [.] do_float_check_status > > -cpu G4,hardfloat=false, lame_vmx: > Overhead Command Symbol > 10.04% qemu-system-ppc [.] float32_muladd > 9.49% qemu-system-ppc [.] helper_vperm > 6.10% qemu-system-ppc [.] round_canonical > 4.13% qemu-system-ppc [.] soft_f64_addsub > 3.23% qemu-system-ppc [.] helper_frsp > 3.13% qemu-system-ppc [.] soft_f64_muladd > 2.88% qemu-system-ppc [.] helper_vmaddfp > 2.69% qemu-system-ppc [.] float32_add > 2.60% qemu-system-ppc [.] float32_to_float64 > 2.52% qemu-system-ppc [.] helper_compute_fprf_float64 > > -cpu G4,hardfloat=true, lame: > 11.59% qemu-system-ppc [.] round_canonical > 6.18% qemu-system-ppc [.] helper_compute_fprf_float64 > 6.01% qemu-system-ppc [.] float32_to_float64 > 4.58% qemu-system-ppc [.] float64_classify > 3.87% qemu-system-ppc [.] helper_frsp > 3.75% qemu-system-ppc [.] float_to_float.isra.26 > 3.48% qemu-system-ppc [.] helper_todouble > 3.31% qemu-system-ppc [.] float64_muladd > 3.21% qemu-system-ppc [.] do_float_check_status > 3.01% qemu-system-ppc [.] float64_mul > > -cpu G4,hardfloat=true, lame_vmx: > 9.34% qemu-system-ppc [.] float32_muladd > 8.83% qemu-system-ppc [.] 
helper_vperm > 5.41% qemu-system-ppc [.] round_canonical > 4.51% qemu-system-ppc [.] page_collection_lock > 3.58% qemu-system-ppc [.] page_trylock_add.isra.17 > 2.71% qemu-system-ppc [.] helper_vmaddfp > 2.53% qemu-system-ppc [.] float32_add > 2.30% qemu-system-ppc [.] helper_compute_fprf_float64 > 2.21% qemu-system-ppc [.] float32_to_float64 > 2.06% qemu-system-ppc [.] helper_frsp > > round_canonical seems to come up frequently in this with large overhead. > > Could those with better test cases or benchmarks give it a test please > on different CPUs to see what else this would break? > > --- > fpu/softfloat.c | 14 +++++++------- > target/ppc/cpu.h | 2 ++ > target/ppc/fpu_helper.c | 32 ++++++++++++++++---------------- > target/ppc/translate_init.inc.c | 3 +++ > 4 files changed, 28 insertions(+), 23 deletions(-) > > diff --git a/fpu/softfloat.c b/fpu/softfloat.c > index 301ce3b537..6d3f4af72a 100644 > --- a/fpu/softfloat.c > +++ b/fpu/softfloat.c > @@ -216,15 +216,15 @@ GEN_INPUT_FLUSH3(float64_input_flush3, float64) > #endif > > /* > - * Some targets clear the FP flags before most FP operations. This > prevents > - * the use of hardfloat, since hardfloat relies on the inexact flag being > - * already set. > + * Disable hardfloat for known problem cases. > + * Additionally, some targets clear the FP flags before most FP > operations. > + * This prevents the use of hardfloat, since it relies on the inexact flag > + * being already set and clearing it often may result in slower > computations. > + * Those targets could also be listed here. 
> */ > -#if defined(TARGET_PPC) || defined(__FAST_MATH__) > -# if defined(__FAST_MATH__) > -# warning disabling hardfloat due to -ffast-math: hardfloat requires an > exact \ > +#if defined(__FAST_MATH__) > +# warning disabling hardfloat due to -ffast-math: hardfloat requires an > exact \ > IEEE implementation > -# endif > # define QEMU_NO_HARDFLOAT 1 > # define QEMU_SOFTFLOAT_ATTR QEMU_FLATTEN > #else > diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h > index b283042515..5f412f9fba 100644 > --- a/target/ppc/cpu.h > +++ b/target/ppc/cpu.h > @@ -1033,6 +1033,7 @@ struct CPUPPCState { > float_status vec_status; > float_status fp_status; /* Floating point execution context */ > target_ulong fpscr; /* Floating point status and control register > */ > + int default_fp_excpt_flags; > > /* Internal devices resources */ > ppc_tb_t *tb_env; /* Time base and decrementer */ > @@ -1163,6 +1164,7 @@ struct PowerPCCPU { > void *machine_data; > int32_t node_id; /* NUMA node this CPU belongs to */ > PPCHash64Options *hash64_opts; > + bool hardfloat; /* use hardfloat (this breaks FPSCR[FI] bit > emulation) */ > > /* Those resources are used only during code translation */ > /* opcode handlers */ > diff --git a/target/ppc/fpu_helper.c b/target/ppc/fpu_helper.c > index ae43b08eb5..bbbd1cb987 100644 > --- a/target/ppc/fpu_helper.c > +++ b/target/ppc/fpu_helper.c > @@ -659,7 +659,7 @@ void helper_float_check_status(CPUPPCState *env) > > void helper_reset_fpstatus(CPUPPCState *env) > { > - set_float_exception_flags(0, &env->fp_status); > + set_float_exception_flags(env->default_fp_excpt_flags, > &env->fp_status); > } > > static void float_invalid_op_addsub(CPUPPCState *env, bool set_fpcc, > @@ -1823,7 +1823,7 @@ void helper_##name(CPUPPCState *env, ppc_vsr_t *xt, > \ > > \ > for (i = 0; i < nels; i++) { > \ > float_status tstat = env->fp_status; > \ > - set_float_exception_flags(0, &tstat); > \ > + set_float_exception_flags(env->default_fp_excpt_flags, &tstat); > \ > t.fld = 
tp##_##op(xa->fld, xb->fld, &tstat); > \ > env->fp_status.float_exception_flags |= > tstat.float_exception_flags; \ > > \ > @@ -1867,7 +1867,7 @@ void helper_xsaddqp(CPUPPCState *env, uint32_t > opcode, > tstat.float_rounding_mode = float_round_to_odd; > } > > - set_float_exception_flags(0, &tstat); > + set_float_exception_flags(env->default_fp_excpt_flags, &tstat); > t.f128 = float128_add(xa->f128, xb->f128, &tstat); > env->fp_status.float_exception_flags |= tstat.float_exception_flags; > > @@ -1902,7 +1902,7 @@ void helper_##op(CPUPPCState *env, ppc_vsr_t *xt, > \ > > \ > for (i = 0; i < nels; i++) { > \ > float_status tstat = env->fp_status; > \ > - set_float_exception_flags(0, &tstat); > \ > + set_float_exception_flags(env->default_fp_excpt_flags, &tstat); > \ > t.fld = tp##_mul(xa->fld, xb->fld, &tstat); > \ > env->fp_status.float_exception_flags |= > tstat.float_exception_flags; \ > > \ > @@ -1942,7 +1942,7 @@ void helper_xsmulqp(CPUPPCState *env, uint32_t > opcode, > tstat.float_rounding_mode = float_round_to_odd; > } > > - set_float_exception_flags(0, &tstat); > + set_float_exception_flags(env->default_fp_excpt_flags, &tstat); > t.f128 = float128_mul(xa->f128, xb->f128, &tstat); > env->fp_status.float_exception_flags |= tstat.float_exception_flags; > > @@ -1976,7 +1976,7 @@ void helper_##op(CPUPPCState *env, ppc_vsr_t *xt, > \ > > \ > for (i = 0; i < nels; i++) { > \ > float_status tstat = env->fp_status; > \ > - set_float_exception_flags(0, &tstat); > \ > + set_float_exception_flags(env->default_fp_excpt_flags, &tstat); > \ > t.fld = tp##_div(xa->fld, xb->fld, &tstat); > \ > env->fp_status.float_exception_flags |= > tstat.float_exception_flags; \ > > \ > @@ -2019,7 +2019,7 @@ void helper_xsdivqp(CPUPPCState *env, uint32_t > opcode, > tstat.float_rounding_mode = float_round_to_odd; > } > > - set_float_exception_flags(0, &tstat); > + set_float_exception_flags(env->default_fp_excpt_flags, &tstat); > t.f128 = float128_div(xa->f128, xb->f128, &tstat); > 
env->fp_status.float_exception_flags |= tstat.float_exception_flags; > > @@ -2095,7 +2095,7 @@ void helper_##op(CPUPPCState *env, ppc_vsr_t *xt, > ppc_vsr_t *xb) \ > > \ > for (i = 0; i < nels; i++) { > \ > float_status tstat = env->fp_status; > \ > - set_float_exception_flags(0, &tstat); > \ > + set_float_exception_flags(env->default_fp_excpt_flags, &tstat); > \ > t.fld = tp##_sqrt(xb->fld, &tstat); > \ > env->fp_status.float_exception_flags |= > tstat.float_exception_flags; \ > > \ > @@ -2143,7 +2143,7 @@ void helper_##op(CPUPPCState *env, ppc_vsr_t *xt, > ppc_vsr_t *xb) \ > > \ > for (i = 0; i < nels; i++) { > \ > float_status tstat = env->fp_status; > \ > - set_float_exception_flags(0, &tstat); > \ > + set_float_exception_flags(env->default_fp_excpt_flags, &tstat); > \ > t.fld = tp##_sqrt(xb->fld, &tstat); > \ > t.fld = tp##_div(tp##_one, t.fld, &tstat); > \ > env->fp_status.float_exception_flags |= > tstat.float_exception_flags; \ > @@ -2305,7 +2305,7 @@ void helper_##op(CPUPPCState *env, ppc_vsr_t *xt, > \ > > \ > for (i = 0; i < nels; i++) { > \ > float_status tstat = env->fp_status; > \ > - set_float_exception_flags(0, &tstat); > \ > + set_float_exception_flags(env->default_fp_excpt_flags, &tstat); > \ > if (r2sp && (tstat.float_rounding_mode == > float_round_nearest_even)) {\ > /* > \ > * Avoid double rounding errors by rounding the intermediate > \ > @@ -2886,7 +2886,7 @@ uint64_t helper_xscvdpspn(CPUPPCState *env, uint64_t > xb) > uint64_t result, sign, exp, frac; > > float_status tstat = env->fp_status; > - set_float_exception_flags(0, &tstat); > + set_float_exception_flags(env->default_fp_excpt_flags, &tstat); > > sign = extract64(xb, 63, 1); > exp = extract64(xb, 52, 11); > @@ -2924,7 +2924,7 @@ uint64_t helper_xscvdpspn(CPUPPCState *env, uint64_t > xb) > uint64_t helper_xscvspdpn(CPUPPCState *env, uint64_t xb) > { > float_status tstat = env->fp_status; > - set_float_exception_flags(0, &tstat); > + 
set_float_exception_flags(env->default_fp_excpt_flags, &tstat); > > return float32_to_float64(xb >> 32, &tstat); > } > @@ -3327,7 +3327,7 @@ void helper_xsrqpi(CPUPPCState *env, uint32_t opcode, > } > > tstat = env->fp_status; > - set_float_exception_flags(0, &tstat); > + set_float_exception_flags(env->default_fp_excpt_flags, &tstat); > set_float_rounding_mode(rmode, &tstat); > t.f128 = float128_round_to_int(xb->f128, &tstat); > env->fp_status.float_exception_flags |= tstat.float_exception_flags; > @@ -3384,7 +3384,7 @@ void helper_xsrqpxp(CPUPPCState *env, uint32_t > opcode, > } > > tstat = env->fp_status; > - set_float_exception_flags(0, &tstat); > + set_float_exception_flags(env->default_fp_excpt_flags, &tstat); > set_float_rounding_mode(rmode, &tstat); > round_res = float128_to_floatx80(xb->f128, &tstat); > t.f128 = floatx80_to_float128(round_res, &tstat); > @@ -3415,7 +3415,7 @@ void helper_xssqrtqp(CPUPPCState *env, uint32_t > opcode, > tstat.float_rounding_mode = float_round_to_odd; > } > > - set_float_exception_flags(0, &tstat); > + set_float_exception_flags(env->default_fp_excpt_flags, &tstat); > t.f128 = float128_sqrt(xb->f128, &tstat); > env->fp_status.float_exception_flags |= tstat.float_exception_flags; > > @@ -3449,7 +3449,7 @@ void helper_xssubqp(CPUPPCState *env, uint32_t > opcode, > tstat.float_rounding_mode = float_round_to_odd; > } > > - set_float_exception_flags(0, &tstat); > + set_float_exception_flags(env->default_fp_excpt_flags, &tstat); > t.f128 = float128_sub(xa->f128, xb->f128, &tstat); > env->fp_status.float_exception_flags |= tstat.float_exception_flags; > > diff --git a/target/ppc/translate_init.inc.c > b/target/ppc/translate_init.inc.c > index 53995f62ea..ab1a6db4f1 100644 > --- a/target/ppc/translate_init.inc.c > +++ b/target/ppc/translate_init.inc.c > @@ -10736,6 +10736,8 @@ static void ppc_cpu_reset(CPUState *s) > /* tininess for underflow is detected before rounding */ > set_float_detect_tininess(float_tininess_before_rounding, > 
&env->fp_status); > + /* hardfloat needs inexact flag already set */ > + env->default_fp_excpt_flags = (cpu->hardfloat ? float_flag_inexact : > 0); > > for (i = 0; i < ARRAY_SIZE(env->spr_cb); i++) { > ppc_spr_t *spr = &env->spr_cb[i]; > @@ -10868,6 +10870,7 @@ static Property ppc_cpu_properties[] = { > false), > DEFINE_PROP_BOOL("pre-3.0-migration", PowerPCCPU, pre_3_0_migration, > false), > + DEFINE_PROP_BOOL("hardfloat", PowerPCCPU, hardfloat, true), > DEFINE_PROP_END_OF_LIST(), > }; > > -- > 2.21.1 > > >
On Fri, 10 Apr 2020, 罗勇刚(Yonggang Luo) wrote:
> Is this stable now? I'd like to see hardfloat for PPC land. :)
If you want to see hardfloat for PPC then you should read the replies to
this patch, which can be found here:
http://patchwork.ozlabs.org/patch/1240235/
to understand what's needed, then try to implement the proposed solution
of caching FP exceptions in a global, which might work. I won't be able
to do that myself, as I said here:
https://lists.nongnu.org/archive/html/qemu-ppc/2020-03/msg00006.html
because I don't have time to learn all the details needed. I think others
are in the same situation, so unless somebody puts in the necessary
effort this won't change.
Regards,
BALATON Zoltan
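The "FP exceptions cached in a global" idea mentioned in the reply above could look roughly like the following sketch. This is hypothetical, not an existing QEMU API; the names (`fp_excpt_cache`, `fp_cache_raise`, `fp_cache_consume`) are invented for illustration:

```c
/*
 * Hypothetical sketch: instead of clearing fp_status before every FP op,
 * each helper ORs the flags its operation raised into one accumulator,
 * and the expensive reconciliation with FPSCR happens only when the
 * guest actually reads the status register.
 */
static int fp_excpt_cache;

/* Called from each FP helper: cheap per-op accumulate. */
static void fp_cache_raise(int raised_flags)
{
    fp_excpt_cache |= raised_flags;
}

/* Called when the guest reads FPSCR: drain the accumulated flags. */
static int fp_cache_consume(void)
{
    int flags = fp_excpt_cache;
    fp_excpt_cache = 0;
    return flags;
}
```

With this shape the per-operation reset that makes hardfloat slow on PPC would disappear; whether such lazy reconciliation can reproduce FPSCR semantics exactly is precisely the open question discussed in the thread.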
diff --git a/fpu/softfloat.c b/fpu/softfloat.c
index 301ce3b537..6d3f4af72a 100644
--- a/fpu/softfloat.c
+++ b/fpu/softfloat.c
@@ -216,15 +216,15 @@ GEN_INPUT_FLUSH3(float64_input_flush3, float64)
 #endif
 
 /*
- * Some targets clear the FP flags before most FP operations. This prevents
- * the use of hardfloat, since hardfloat relies on the inexact flag being
- * already set.
+ * Disable hardfloat for known problem cases.
+ * Additionally, some targets clear the FP flags before most FP operations.
+ * This prevents the use of hardfloat, since it relies on the inexact flag
+ * being already set and clearing it often may result in slower computations.
+ * Those targets could also be listed here.
  */
-#if defined(TARGET_PPC) || defined(__FAST_MATH__)
-# if defined(__FAST_MATH__)
-#  warning disabling hardfloat due to -ffast-math: hardfloat requires an exact \
+#if defined(__FAST_MATH__)
+# warning disabling hardfloat due to -ffast-math: hardfloat requires an exact \
     IEEE implementation
-# endif
 # define QEMU_NO_HARDFLOAT 1
 # define QEMU_SOFTFLOAT_ATTR QEMU_FLATTEN
 #else
diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index b283042515..5f412f9fba 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -1033,6 +1033,7 @@ struct CPUPPCState {
     float_status vec_status;
     float_status fp_status; /* Floating point execution context */
     target_ulong fpscr;     /* Floating point status and control register */
+    int default_fp_excpt_flags;
 
     /* Internal devices resources */
     ppc_tb_t *tb_env;      /* Time base and decrementer */
@@ -1163,6 +1164,7 @@ struct PowerPCCPU {
     void *machine_data;
     int32_t node_id; /* NUMA node this CPU belongs to */
     PPCHash64Options *hash64_opts;
+    bool hardfloat; /* use hardfloat (this breaks FPSCR[FI] bit emulation) */
 
     /* Those resources are used only during code translation */
     /* opcode handlers */
diff --git a/target/ppc/fpu_helper.c b/target/ppc/fpu_helper.c
index ae43b08eb5..bbbd1cb987 100644
--- a/target/ppc/fpu_helper.c
+++ b/target/ppc/fpu_helper.c
@@ -659,7 +659,7 @@ void helper_float_check_status(CPUPPCState *env)
 
 void helper_reset_fpstatus(CPUPPCState *env)
 {
-    set_float_exception_flags(0, &env->fp_status);
+    set_float_exception_flags(env->default_fp_excpt_flags, &env->fp_status);
 }
 
 static void float_invalid_op_addsub(CPUPPCState *env, bool set_fpcc,
@@ -1823,7 +1823,7 @@ void helper_##name(CPUPPCState *env, ppc_vsr_t *xt,                          \
                                                                              \
     for (i = 0; i < nels; i++) {                                             \
         float_status tstat = env->fp_status;                                 \
-        set_float_exception_flags(0, &tstat);                                \
+        set_float_exception_flags(env->default_fp_excpt_flags, &tstat);      \
         t.fld = tp##_##op(xa->fld, xb->fld, &tstat);                         \
         env->fp_status.float_exception_flags |= tstat.float_exception_flags; \
                                                                              \
@@ -1867,7 +1867,7 @@ void helper_xsaddqp(CPUPPCState *env, uint32_t opcode,
         tstat.float_rounding_mode = float_round_to_odd;
     }
 
-    set_float_exception_flags(0, &tstat);
+    set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
     t.f128 = float128_add(xa->f128, xb->f128, &tstat);
     env->fp_status.float_exception_flags |= tstat.float_exception_flags;
 
@@ -1902,7 +1902,7 @@ void helper_##op(CPUPPCState *env, ppc_vsr_t *xt,                            \
                                                                              \
     for (i = 0; i < nels; i++) {                                             \
         float_status tstat = env->fp_status;                                 \
-        set_float_exception_flags(0, &tstat);                                \
+        set_float_exception_flags(env->default_fp_excpt_flags, &tstat);      \
         t.fld = tp##_mul(xa->fld, xb->fld, &tstat);                          \
         env->fp_status.float_exception_flags |= tstat.float_exception_flags; \
                                                                              \
@@ -1942,7 +1942,7 @@ void helper_xsmulqp(CPUPPCState *env, uint32_t opcode,
         tstat.float_rounding_mode = float_round_to_odd;
     }
 
-    set_float_exception_flags(0, &tstat);
+    set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
     t.f128 = float128_mul(xa->f128, xb->f128, &tstat);
     env->fp_status.float_exception_flags |= tstat.float_exception_flags;
 
@@ -1976,7 +1976,7 @@ void helper_##op(CPUPPCState *env, ppc_vsr_t *xt,                            \
                                                                              \
     for (i = 0; i < nels; i++) {                                             \
         float_status tstat = env->fp_status;                                 \
-        set_float_exception_flags(0, &tstat);                                \
+        set_float_exception_flags(env->default_fp_excpt_flags, &tstat);      \
         t.fld = tp##_div(xa->fld, xb->fld, &tstat);                          \
         env->fp_status.float_exception_flags |= tstat.float_exception_flags; \
                                                                              \
@@ -2019,7 +2019,7 @@ void helper_xsdivqp(CPUPPCState *env, uint32_t opcode,
         tstat.float_rounding_mode = float_round_to_odd;
     }
 
-    set_float_exception_flags(0, &tstat);
+    set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
     t.f128 = float128_div(xa->f128, xb->f128, &tstat);
     env->fp_status.float_exception_flags |= tstat.float_exception_flags;
 
@@ -2095,7 +2095,7 @@ void helper_##op(CPUPPCState *env, ppc_vsr_t *xt, ppc_vsr_t *xb)             \
                                                                              \
     for (i = 0; i < nels; i++) {                                             \
         float_status tstat = env->fp_status;                                 \
-        set_float_exception_flags(0, &tstat);                                \
+        set_float_exception_flags(env->default_fp_excpt_flags, &tstat);      \
         t.fld = tp##_sqrt(xb->fld, &tstat);                                  \
         env->fp_status.float_exception_flags |= tstat.float_exception_flags; \
                                                                              \
@@ -2143,7 +2143,7 @@ void helper_##op(CPUPPCState *env, ppc_vsr_t *xt, ppc_vsr_t *xb)             \
                                                                              \
     for (i = 0; i < nels; i++) {                                             \
         float_status tstat = env->fp_status;                                 \
-        set_float_exception_flags(0, &tstat);                                \
+        set_float_exception_flags(env->default_fp_excpt_flags, &tstat);      \
         t.fld = tp##_sqrt(xb->fld, &tstat);                                  \
         t.fld = tp##_div(tp##_one, t.fld, &tstat);                           \
         env->fp_status.float_exception_flags |= tstat.float_exception_flags; \
@@ -2305,7 +2305,7 @@ void helper_##op(CPUPPCState *env, ppc_vsr_t *xt,                            \
                                                                              \
     for (i = 0; i < nels; i++) {                                             \
         float_status tstat = env->fp_status;                                 \
-        set_float_exception_flags(0, &tstat);                                \
+        set_float_exception_flags(env->default_fp_excpt_flags, &tstat);      \
         if (r2sp && (tstat.float_rounding_mode == float_round_nearest_even)) {\
             /*                                                               \
              * Avoid double rounding errors by rounding the intermediate     \
@@ -2886,7 +2886,7 @@ uint64_t helper_xscvdpspn(CPUPPCState *env, uint64_t xb)
     uint64_t result, sign, exp, frac;
 
     float_status tstat = env->fp_status;
-    set_float_exception_flags(0, &tstat);
+    set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
 
     sign = extract64(xb, 63, 1);
     exp = extract64(xb, 52, 11);
@@ -2924,7 +2924,7 @@ uint64_t helper_xscvdpspn(CPUPPCState *env, uint64_t xb)
 uint64_t helper_xscvspdpn(CPUPPCState *env, uint64_t xb)
 {
     float_status tstat = env->fp_status;
-    set_float_exception_flags(0, &tstat);
+    set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
 
     return float32_to_float64(xb >> 32, &tstat);
 }
@@ -3327,7 +3327,7 @@ void helper_xsrqpi(CPUPPCState *env, uint32_t opcode,
     }
 
     tstat = env->fp_status;
-    set_float_exception_flags(0, &tstat);
+    set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
     set_float_rounding_mode(rmode, &tstat);
     t.f128 = float128_round_to_int(xb->f128, &tstat);
     env->fp_status.float_exception_flags |= tstat.float_exception_flags;
@@ -3384,7 +3384,7 @@ void helper_xsrqpxp(CPUPPCState *env, uint32_t opcode,
     }
 
     tstat = env->fp_status;
-    set_float_exception_flags(0, &tstat);
+    set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
     set_float_rounding_mode(rmode, &tstat);
     round_res = float128_to_floatx80(xb->f128, &tstat);
     t.f128 = floatx80_to_float128(round_res, &tstat);
@@ -3415,7 +3415,7 @@ void helper_xssqrtqp(CPUPPCState *env, uint32_t opcode,
         tstat.float_rounding_mode = float_round_to_odd;
     }
 
-    set_float_exception_flags(0, &tstat);
+    set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
     t.f128 = float128_sqrt(xb->f128, &tstat);
     env->fp_status.float_exception_flags |= tstat.float_exception_flags;
 
@@ -3449,7 +3449,7 @@ void helper_xssubqp(CPUPPCState *env, uint32_t opcode,
         tstat.float_rounding_mode = float_round_to_odd;
     }
 
-    set_float_exception_flags(0, &tstat);
+    set_float_exception_flags(env->default_fp_excpt_flags, &tstat);
     t.f128 = float128_sub(xa->f128, xb->f128, &tstat);
     env->fp_status.float_exception_flags |= tstat.float_exception_flags;
 
diff --git a/target/ppc/translate_init.inc.c b/target/ppc/translate_init.inc.c
index 53995f62ea..ab1a6db4f1 100644
--- a/target/ppc/translate_init.inc.c
+++ b/target/ppc/translate_init.inc.c
@@ -10736,6 +10736,8 @@ static void ppc_cpu_reset(CPUState *s)
     /* tininess for underflow is detected before rounding */
     set_float_detect_tininess(float_tininess_before_rounding,
                               &env->fp_status);
+    /* hardfloat needs inexact flag already set */
+    env->default_fp_excpt_flags = (cpu->hardfloat ? float_flag_inexact : 0);
 
     for (i = 0; i < ARRAY_SIZE(env->spr_cb); i++) {
         ppc_spr_t *spr = &env->spr_cb[i];
@@ -10868,6 +10870,7 @@ static Property ppc_cpu_properties[] = {
                      false),
     DEFINE_PROP_BOOL("pre-3.0-migration", PowerPCCPU, pre_3_0_migration,
                      false),
+    DEFINE_PROP_BOOL("hardfloat", PowerPCCPU, hardfloat, true),
     DEFINE_PROP_END_OF_LIST(),
 };
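In miniature, the mechanism the diff above implements can be modeled as follows. The struct and function names here are simplified stand-ins invented for illustration, not QEMU's actual types:

```c
#include <stdbool.h>

enum { FLAG_INEXACT = 1 };  /* stand-in for softfloat's float_flag_inexact */

struct fp_status_model { int float_exception_flags; };

struct cpu_model {
    struct fp_status_model fp_status;
    int default_fp_excpt_flags;
    bool hardfloat;             /* the new CPU property */
};

/* Mirrors ppc_cpu_reset(): pick the default once, from the CPU property,
 * so no per-operation branch is needed later. */
static void cpu_reset_model(struct cpu_model *env)
{
    env->default_fp_excpt_flags = env->hardfloat ? FLAG_INEXACT : 0;
}

/* Mirrors helper_reset_fpstatus(): reset to the precomputed default
 * instead of 0, which keeps the inexact flag set when hardfloat is on. */
static void reset_fpstatus_model(struct cpu_model *env)
{
    env->fp_status.float_exception_flags = env->default_fp_excpt_flags;
}
```

Because the default is computed once at reset time, the per-op reset stays branch-free, which is the point of the v2 approach described in the commit message.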
While other targets take advantage of using the host FPU to do floating
point computations, this was disabled for the PPC target because always
clearing exception flags before every FP op made it slightly slower than
emulating everything with softfloat. To emulate some FPSCR bits, clearing
of fp_status may be necessary (unless these could be handled e.g. using
FP exceptions on the host, but there's no API for that in QEMU yet), but
preserving at least the inexact flag makes hardfloat usable and faster
than softfloat. Since most clients don't actually care about this flag,
we can gain some speed by trading away some emulation accuracy.

This patch implements a simple way to keep the inexact flag set for
hardfloat while still allowing a revert to softfloat for workloads that
need more accurate, albeit slower, emulation. (Set the hardfloat property
of the CPU, i.e. -cpu name,hardfloat=false, for that.) There may still be
room for further improvement, but this seems to increase floating point
performance. Unfortunately, the softfloat case is slower than before this
patch, so this patch only makes sense if the default is also set to
enable hardfloat.

Because of the above, this patch at the moment is mainly for testing
different workloads to evaluate how viable this would be in practice.
Thus, RFC and not ready for merge yet.

Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
---
v2: use a different approach to avoid needing an if () in
helper_reset_fpstatus(), but this does not seem to change the overhead
much; also make it a single patch, as adding the hardfloat option is only
a few lines. With this we can use the same value at other places where
float_status is reset, and maybe enable hardfloat in a few more places
for a little more performance, but not too much.

With this I got: lame: 3:13, lame_vmx: 1:55 (this is probably within
jitter though and still far from the results on real hardware). I also
tried the mplayer test and got results between 144-146s (this test is
more VMX bound).

I've also done some profiling for the hardfloat=true and false cases with
this patch to see which are the hot functions. Results are:

-cpu G4,hardfloat=false, lame:
Overhead  Command          Symbol
   9.82%  qemu-system-ppc  [.] round_canonical
   8.35%  qemu-system-ppc  [.] soft_f64_muladd
   7.16%  qemu-system-ppc  [.] soft_f64_addsub
   5.27%  qemu-system-ppc  [.] float32_to_float64
   5.20%  qemu-system-ppc  [.] helper_compute_fprf_float64
   4.61%  qemu-system-ppc  [.] helper_frsp
   4.59%  qemu-system-ppc  [.] soft_f64_mul
   4.01%  qemu-system-ppc  [.] float_to_float.isra.26
   3.84%  qemu-system-ppc  [.] float64_classify
   2.97%  qemu-system-ppc  [.] do_float_check_status

-cpu G4,hardfloat=false, lame_vmx:
Overhead  Command          Symbol
  10.04%  qemu-system-ppc  [.] float32_muladd
   9.49%  qemu-system-ppc  [.] helper_vperm
   6.10%  qemu-system-ppc  [.] round_canonical
   4.13%  qemu-system-ppc  [.] soft_f64_addsub
   3.23%  qemu-system-ppc  [.] helper_frsp
   3.13%  qemu-system-ppc  [.] soft_f64_muladd
   2.88%  qemu-system-ppc  [.] helper_vmaddfp
   2.69%  qemu-system-ppc  [.] float32_add
   2.60%  qemu-system-ppc  [.] float32_to_float64
   2.52%  qemu-system-ppc  [.] helper_compute_fprf_float64

-cpu G4,hardfloat=true, lame:
Overhead  Command          Symbol
  11.59%  qemu-system-ppc  [.] round_canonical
   6.18%  qemu-system-ppc  [.] helper_compute_fprf_float64
   6.01%  qemu-system-ppc  [.] float32_to_float64
   4.58%  qemu-system-ppc  [.] float64_classify
   3.87%  qemu-system-ppc  [.] helper_frsp
   3.75%  qemu-system-ppc  [.] float_to_float.isra.26
   3.48%  qemu-system-ppc  [.] helper_todouble
   3.31%  qemu-system-ppc  [.] float64_muladd
   3.21%  qemu-system-ppc  [.] do_float_check_status
   3.01%  qemu-system-ppc  [.] float64_mul

-cpu G4,hardfloat=true, lame_vmx:
Overhead  Command          Symbol
   9.34%  qemu-system-ppc  [.] float32_muladd
   8.83%  qemu-system-ppc  [.] helper_vperm
   5.41%  qemu-system-ppc  [.] round_canonical
   4.51%  qemu-system-ppc  [.] page_collection_lock
   3.58%  qemu-system-ppc  [.] page_trylock_add.isra.17
   2.71%  qemu-system-ppc  [.] helper_vmaddfp
   2.53%  qemu-system-ppc  [.] float32_add
   2.30%  qemu-system-ppc  [.] helper_compute_fprf_float64
   2.21%  qemu-system-ppc  [.] float32_to_float64
   2.06%  qemu-system-ppc  [.] helper_frsp

round_canonical seems to come up frequently in this with large overhead.

Could those with better test cases or benchmarks please give it a test on
different CPUs to see what else this would break?
---
 fpu/softfloat.c                 | 14 +++++++-------
 target/ppc/cpu.h                |  2 ++
 target/ppc/fpu_helper.c         | 32 ++++++++++++++++----------------
 target/ppc/translate_init.inc.c |  3 +++
 4 files changed, 28 insertions(+), 23 deletions(-)
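Why hardfloat requires the inexact flag to be set already can be seen from a rough model of a hardfloat-style operation. This is an assumption-laden sketch, not QEMU's actual fpu/softfloat.c code; `soft_add` is a placeholder for the real softfloat slow path:

```c
#include <math.h>

#define FLAG_INEXACT 1  /* stand-in for softfloat's float_flag_inexact */

/* Placeholder for the bit-exact, flag-tracking softfloat slow path. */
static double soft_add(double a, double b)
{
    return a + b;
}

/*
 * Sketch of a hardfloat-style add: when the inexact flag is already set,
 * the host result can be trusted in the common (normal) case, because
 * the only flag most operations would newly raise is inexact, and it is
 * set already. Otherwise the slow path must run to decide whether to
 * raise it.
 */
static double hardfloat_add(double a, double b, const int *flags)
{
    if (*flags & FLAG_INEXACT) {
        double r = a + b;
        if (isnormal(r)) {      /* fast path: trust the host FPU */
            return r;
        }
    }
    return soft_add(a, b);      /* slow path keeps flags bit-accurate */
}
```

This is also why per-op clearing of the flags defeats hardfloat: every operation finds the precondition false and falls through to softfloat, as the pre-patch PPC code effectively did.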