Message ID | 20140705225513.GY18016@ZenIV.linux.org.uk |
---|---|
State | New |
Headers | show |
On 07/05/2014 03:55 PM, Al Viro wrote: > +/* Input handing with software completion. Trap for denorms, > + unless DNZ is set. *IF* we try to support DNOD (which > + none of the produced hardware did, AFAICS), we'll need > + to suppress the trap when FPCR.DNOD is set; then the > + code downstream of that will need to cope with denorms > + sans flush_input_to_zero. Most of it should work sanely, > + but there's nothing to compare with... > +*/ > +void helper_ieee_input_s(CPUAlphaState *env, uint64_t val) > +{ > + if (unlikely(2 * val - 1 < 0x1fffffffffffff)) { > + if (!FP_STATUS.flush_inputs_to_zero) { > + arith_excp(env, GETPC(), EXC_M_INV | EXC_M_SWC, 0); > + } > + } > +} > + A couple of points here: 1) We should never raise this in user-only mode. In that mode, we emulate the whole fpu stack, all the way through from HW to the OS completion handler. 2) Because of that, we have the capability of doing the same thing in system mode. This lets us do more of the computation in the host, and less in the guest, which is faster. The only thing this makes more difficult is debugging the OS completion handlers within the kernel, since they'll only get invoked when SIGFPE needs to be sent. 3) If we do want to implement a mode where we faithfully send SWC for all of the bits of IEEE that real HW didn't implement, do we really need to avoid a store to the output register when signalling this? I.e. can we notice this condition after the fact with float_flag_input_denormal, rather than having another function call to prep the inputs? r~
On Mon, Jul 07, 2014 at 07:11:58AM -0700, Richard Henderson wrote: > A couple of points here: > > 1) We should never raise this in user-only mode. In that mode, we emulate the > whole fpu stack, all the way through from HW to the OS completion handler. How is that different from other cases where we have an exception raised by an fp operation? > 2) Because of that, we have the capability of doing the same thing in system > mode. This lets us do more of the computation in the host, and less in the > guest, which is faster. The only thing this makes more difficult is debugging > the OS completion handlers within the kernel, since they'll only get invoked > when SIGFPE needs to be sent. Umm... The effect of software completion depends on current->ieee_state; how would you keep track of that outside of guest kernel? > 3) If we do want to implement a mode where we faithfully send SWC for all of > the bits of IEEE that real HW didn't implement, do we really need to avoid a > store to the output register when signalling this? I.e. can we notice this > condition after the fact with float_flag_input_denormal, rather than having > another function call to prep the inputs? But flag_input_denormal is raised only when we do have DNZ set. Which is an entirely different case, where we should not (and do not) get an exception at all...
On 07/07/2014 08:06 AM, Al Viro wrote: > On Mon, Jul 07, 2014 at 07:11:58AM -0700, Richard Henderson wrote: >> A couple of points here: >> >> 1) We should never raise this in user-only mode. In that mode, we emulate the >> whole fpu stack, all the way through from HW to the OS completion handler. > > How is that different from other cases where we have an exception raised > by an fp operation? In all other cases we know we're going to send SIGFPE. That's either through a non /S insn which the kernel wouldn't touch, or by having computed the true IEEE result and examined the exceptions to be raised. >> 2) Because of that, we have the capability of doing the same thing in system >> mode. This lets us do more of the computation in the host, and less in the >> guest, which is faster. The only thing this makes more difficult is debugging >> the OS completion handlers within the kernel, since they'll only get invoked >> when SIGFPE needs to be sent. > > Umm... The effect of software completion depends on current->ieee_state; > how would you keep track of that outside of guest kernel? The kernel essentially keeps a copy of IEEE_STATE in the FPCR. I don't see any missing bits in ieee_swcr_to_fpcr, do you? While real hardware might ignore some of those bits once stored, qemu doesn't. While in real hardware one could force the FPCR and IEEE_STATE to differ, honestly that'd be a bug. (Although a silly one; I wish the kernel took the EV6 FPCR as gospel for everything, not just the status flags. That could make certain libm.so computations much faster.) > >> 3) If we do want to implement a mode where we faithfully send SWC for all of >> the bits of IEEE that real HW didn't implement, do we really need to avoid a >> store to the output register when signalling this? I.e. can we notice this >> condition after the fact with float_flag_input_denormal, rather than having >> another function call to prep the inputs? > > But flag_input_denormal is raised only when we do have DNZ set. Which is > an entirely different case, where we should not (and do not) get an exception > at all... Ah, you're right about that. I'd mis-remembered the implementation. r~
On Mon, Jul 07, 2014 at 09:20:28AM -0700, Richard Henderson wrote: > > How is that different from other cases where we have an exception raised > > by an fp operation? > > In all other cases we know we're going to send SIGFPE. That's either through a > non /S insn which the kernel wouldn't touch, or by having computed the true > IEEE result and examined the exceptions to be raised. Umm... Not quite. Consider e.g. CVTQL case. There we have the following picture in case of overflow (real hw with Linux on top of it): no suffix: IOV INE /v: IOV INE SIGFPE /sv, no IEEE INVE: IOV INE INV /sv, IEEE INVE: IOV INE INV SIGFPE This is after the completion had a chance to run. From the hw POV it's no suffix IOV INE no trap /v IOV INE trap<IOV> /sv IOV INE trap<SWC,IOV> and it's alpha_fp_emul() that does the rest in /sv case. Actually, it's even simpler: if overflow FPCR.INE = 1 raise IOV do usual trap suffix handling and I'm reasonably sure that this is what they did internally. You are proposing to do 4 cases in all their messy glory in qemu itself... And I wouldn't bet a dime on not having similar turds in other insns; after all, "it's hard, let's offload it to software" was only a part of motivation for software completions. "We really don't like this part of IEEE standard and we'd love to tell you to see figure 1, but we need conformance, so you can mark an insn with /s and have the kernel do what IEEE requires" is also very visible in their manual. Result is a mess - if you try to fold the kernel-side stuff into "hardware", you end up with a pile of inconsistent behaviours. In principle, it's doable, especially since we are not really constrained by actual hw in terms of what we do in case of FPCR.DNOD being true - no actual hw could set it. So we want * hw behaviour without /s (denorms trap) * hw behaviour with /s without denorms * hw behaviour with /s with denorms with FPCR.DNZ (same as with 0) * kernel completion behaviour with /s with denorm and it might even be what they intended for DNOD to do. But it's going to be messy as hell. And that's not even going into generating the right si_code for that SIGFPE. What produces those TARGET_GEN_FLTINE and friends?
On 07/07/2014 09:20 PM, Al Viro wrote: > and I'm reasonably sure that this is what they did internally. You are > proposing to do 4 cases in all their messy glory in qemu itself... Yes. Primarily because we *have* to do so for the linux-user case. > And that's not even going into generating the right si_code for that SIGFPE. > What produces those TARGET_GEN_FLTINE and friends? linux-user/main.c, cpu_loop. r~
On Mon, Jul 07, 2014 at 11:03:08PM -0700, Richard Henderson wrote: > On 07/07/2014 09:20 PM, Al Viro wrote: > > and I'm reasonably sure that this is what they did internally. You are > > proposing to do 4 cases in all their messy glory in qemu itself... > > Yes. Primarily because we *have* to do so for the linux-user case. > > > And that's not even going into generating the right si_code for that SIGFPE. > > What produces those TARGET_GEN_FLTINE and friends? > > linux-user/main.c, cpu_loop. That's where we consume it; where is it produced? Sure, explicit gentrap in alpha code will lead there, with whatever we have in $16 deciding what'll go into si_code, but where does that happen on fp exception codepaths? IOW, what sets si_code on those?
On Tue, Jul 08, 2014 at 07:54:36AM +0100, Al Viro wrote: > On Mon, Jul 07, 2014 at 11:03:08PM -0700, Richard Henderson wrote: > > On 07/07/2014 09:20 PM, Al Viro wrote: > > > and I'm reasonably sure that this is what they did internally. You are > > > proposing to do 4 cases in all their messy glory in qemu itself... > > > > Yes. Primarily because we *have* to do so for the linux-user case. > > > > > And that's not even going into generating the right si_code for that SIGFPE. > > > What produces those TARGET_GEN_FLTINE and friends? > > > > linux-user/main.c, cpu_loop. > > That's where we consume it; where is it produced? Sure, explicit > gentrap in alpha code will lead there, with whatever we have in > $16 deciding what'll go into si_code, but where does that happen on > fp exception codepaths? IOW, what sets si_code on those? Actually, that's badly worded; what codepath ends up setting si_code on e.g. fp addition overflows? In system mode it's done by completion code in the kernel, but AFAICS in user mode there are only two places where it might happen - one is gentrap handling and another - osf_setsysinfo(2) emulation for TARGET_SSI_IEEE_FP_CONTROL. What I don't understand is how do we get from float_raise(&FP_STATUS, float_flag_overflow) in fpu/softfloat.c to either of those. IOW, suppose I do x = DBL_MAX; feenableexcept(FE_ALL_EXCEPT); x *= x; I understand how I'll get SIGFPE, but what will set correct si_code in siginfo I'll see in the hanler?
On 8 July 2014 08:13, Al Viro <viro@zeniv.linux.org.uk> wrote: > Actually, that's badly worded; what codepath ends up setting si_code on > e.g. fp addition overflows? In system mode it's done by completion code > in the kernel, but AFAICS in user mode there are only two places where it > might happen - one is gentrap handling and another - osf_setsysinfo(2) > emulation for TARGET_SSI_IEEE_FP_CONTROL. What I don't understand is how > do we get from float_raise(&FP_STATUS, float_flag_overflow) in fpu/softfloat.c > to either of those. > > IOW, suppose I do > x = DBL_MAX; > feenableexcept(FE_ALL_EXCEPT); > x *= x; > I understand how I'll get SIGFPE, but what will set correct si_code in > siginfo I'll see in the hanler? The code we have currently may well be buggy, but the correct place to set si_code is (as Richard says) the Alpha cpu_loop() in linux-user/main.c, which has access to the trap type that just caused us to stop executing code, plus the CPUState, which should be enough information to set si_code correctly. In particular the GENTRAP case seems to be setting a variety of different si_code values for SIGFPE. thanks -- PMM
On 07/08/2014 01:05 AM, Peter Maydell wrote: > On 8 July 2014 08:13, Al Viro <viro@zeniv.linux.org.uk> wrote: >> Actually, that's badly worded; what codepath ends up setting si_code on >> e.g. fp addition overflows? In system mode it's done by completion code >> in the kernel, but AFAICS in user mode there are only two places where it >> might happen - one is gentrap handling and another - osf_setsysinfo(2) >> emulation for TARGET_SSI_IEEE_FP_CONTROL. What I don't understand is how >> do we get from float_raise(&FP_STATUS, float_flag_overflow) in fpu/softfloat.c >> to either of those. >> >> IOW, suppose I do >> x = DBL_MAX; >> feenableexcept(FE_ALL_EXCEPT); >> x *= x; >> I understand how I'll get SIGFPE, but what will set correct si_code in >> siginfo I'll see in the hanler? > > The code we have currently may well be buggy, but the correct > place to set si_code is (as Richard says) the Alpha cpu_loop() in > linux-user/main.c, which has access to the trap type that just > caused us to stop executing code, plus the CPUState, which > should be enough information to set si_code correctly. In > particular the GENTRAP case seems to be setting a variety > of different si_code values for SIGFPE. The gentrap case is a red-herring. The case you're looking for is EXC_ARITH. The path is from arith_excp dynamic_excp cpu_loop_exit longjmp cpu_exec cpu_loop It's also true that we don't install the correct si_code there, but we could. Mostly the gcc/glibc test cases really only care that SIGFPE gets raised, not what the codes are, so I haven't bothered. r~
On 8 July 2014 17:13, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Tue, Jul 08, 2014 at 09:05:10AM +0100, Peter Maydell wrote: > >> The code we have currently may well be buggy, but the correct > > It is ;-/ We set TARGET_FPE_FLTINV unconditionally there. BTW, what's > the reason why all these cpu_loop() instances can't go into > linux-user/<arch>/something? It's just ancient code nobody's cleaned up yet. I do have "move all this stuff into arch directories" on my "would like to do" list, but I just haven't got round to it yet, since it's not actually actively broken (unlike many other areas of our codebase :-/). > BTW, are there any more or less uptodate docs on qemu profiling? I mean, > things like perf/oprofile on the host obviously end up lumping all tcg > output together. Is there any way to get information beyond "~40% of time > is spent in generated code, ~15% - in tb_find_fast(), and the rest is very > much colder"? Alex, did you say you'd done something with profiling recently? > Incidentally, combination of --enable-gprof and (default) --enable-pie > won't build - it dies with ld(1) complaining about relocs in gcrt1.o. This sounds like a toolchain bug to me :-) -- PMM
On Tue, Jul 08, 2014 at 05:33:16PM +0100, Peter Maydell wrote: > > Incidentally, combination of --enable-gprof and (default) --enable-pie > > won't build - it dies with ld(1) complaining about relocs in gcrt1.o. > > This sounds like a toolchain bug to me :-) Debian stable/amd64, gcc 4.7.2, binutils 2.22. And google search finds this, for example: http://osdir.com/ml/qemu-devel/2013-05/msg00710.html. That one has gcc 4.4.3. Anyway, adding --disable-pie to --enable-gprof gets it to build, but as I said, gprof is no better than perf and oprofile - same problem. Stats I quoted were from qemu-system-alpha booting debian/lenny (5.10) and going through their kernel package build. I have perf report in front of me right now; the top ones are 41.77% qemu-system-alp perf-24701.map [.] 0x7fbbee558930 11.78% qemu-system-alp qemu-system-alpha [.] cpu_alpha_exec 4.95% qemu-system-alp [vdso] [.] 0x7fffdd7ff8de 2.40% qemu-system-alp qemu-system-alpha [.] phys_page_find 1.49% qemu-system-alp qemu-system-alpha [.] address_space_translate_internal 1.34% qemu-system-alp [kernel.kallsyms] [k] read_hpet 1.26% qemu-system-alp qemu-system-alpha [.] tlb_set_page 1.23% qemu-system-alp qemu-system-alpha [.] find_next_bit 1.04% qemu-system-alp qemu-system-alpha [.] get_page_addr_code 1.01% qemu-system-alp libpthread-2.13.so [.] pthread_mutex_lock 0.88% qemu-system-alp qemu-system-alpha [.] helper_cmpbge 0.80% qemu-system-alp libc-2.13.so [.] __memset_sse2 0.72% qemu-system-alp libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0.70% qemu-system-alp qemu-system-alpha [.] get_physical_address 0.69% qemu-system-alp qemu-system-alpha [.] address_space_translate 0.68% qemu-system-alp qemu-system-alpha [.] tcg_optimize 0.67% qemu-system-alp qemu-system-alpha [.] ldq_phys 0.63% qemu-system-alp qemu-system-alpha [.] qemu_get_ram_ptr 0.62% qemu-system-alp qemu-system-alpha [.] helper_le_ldq_mmu 0.57% qemu-system-alp qemu-system-alpha [.] memory_region_is_ram and cpu_alpha_exec() spends most of the time in inlined tb_find_fast(). It might be worth checking the actual distribution of the hash of virt address used by that sucker - I wonder if dividing its argument by 4 wouldn't improve the things, but I don't have stats on actual frequency of conflicts, etc. In any case, the first lump (42%) seems to be tastier ;-) There are all kinds of microoptimizations possible (e.g. helper_cmpbge() could be done by a couple of MMX insns on amd64 host[1]), but it would be nice to have some details on what we spend the time on in tcg output... [1] The reason why helper_cmpbge() shows up is that string functions on alpha use that insn a lot; it _might_ be worth optimizing.
On 07/08/2014 09:13 AM, Al Viro wrote: > Frankly, I suspect that it's better to have qemu-system-alpha behave like > the actual hardware does (including "FPCR.DNOD can't be set") and keep the > linux-user behaviour as is, for somebody brave and masochistic enough to > fight that one. And no, it's nowhere near "just let denorms ride through > the normal softfloat code and play a bit with the flags it might raise". > And then there's netbsd/alpha and openbsd/alpha, so in theory somebody might > want to play with their software completion semantics (not identical to Linux > one) for the sake of yet-to-be-written bsd-user alpha support... You're probably right there. I've pushed a couple more patches to the branch, split out from your patch here. I believe I've got it all, and havn't mucked things up in the process. I'll run some tests later today when I've got time. r~
On Tue, Jul 08, 2014 at 11:12:20AM -0700, Richard Henderson wrote: > On 07/08/2014 09:13 AM, Al Viro wrote: > > Frankly, I suspect that it's better to have qemu-system-alpha behave like > > the actual hardware does (including "FPCR.DNOD can't be set") and keep the > > linux-user behaviour as is, for somebody brave and masochistic enough to > > fight that one. And no, it's nowhere near "just let denorms ride through > > the normal softfloat code and play a bit with the flags it might raise". > > And then there's netbsd/alpha and openbsd/alpha, so in theory somebody might > > want to play with their software completion semantics (not identical to Linux > > one) for the sake of yet-to-be-written bsd-user alpha support... > > You're probably right there. > > I've pushed a couple more patches to the branch, split out from your patch > here. I believe I've got it all, and havn't mucked things up in the process. > I'll run some tests later today when I've got time. Just one thing - 0x1fffffffffffff will make 32bit hosts whine about integer constant being too large. So will 0x1ffffffffffffful, unfortunately - it really ought to be ull.
On 07/08/2014 12:02 PM, Al Viro wrote: > On Tue, Jul 08, 2014 at 11:12:20AM -0700, Richard Henderson wrote: >> On 07/08/2014 09:13 AM, Al Viro wrote: >>> Frankly, I suspect that it's better to have qemu-system-alpha behave like >>> the actual hardware does (including "FPCR.DNOD can't be set") and keep the >>> linux-user behaviour as is, for somebody brave and masochistic enough to >>> fight that one. And no, it's nowhere near "just let denorms ride through >>> the normal softfloat code and play a bit with the flags it might raise". >>> And then there's netbsd/alpha and openbsd/alpha, so in theory somebody might >>> want to play with their software completion semantics (not identical to Linux >>> one) for the sake of yet-to-be-written bsd-user alpha support... >> >> You're probably right there. >> >> I've pushed a couple more patches to the branch, split out from your patch >> here. I believe I've got it all, and havn't mucked things up in the process. >> I'll run some tests later today when I've got time. > > Just one thing - 0x1fffffffffffff will make 32bit hosts whine about integer > constant being too large. So will 0x1ffffffffffffful, unfortunately - it > really ought to be ull. > I did use ull on the branch. r~
On 8 July 2014 18:20, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Tue, Jul 08, 2014 at 05:33:16PM +0100, Peter Maydell wrote: > >> > Incidentally, combination of --enable-gprof and (default) --enable-pie >> > won't build - it dies with ld(1) complaining about relocs in gcrt1.o. >> >> This sounds like a toolchain bug to me :-) > > Debian stable/amd64, gcc 4.7.2, binutils 2.22. And google search finds > this, for example: http://osdir.com/ml/qemu-devel/2013-05/msg00710.html. > That one has gcc 4.4.3. That just makes it a long-standing toolchain bug. I don't see any reason why PIE + gprof shouldn't work, it just looks like gprof doesn't ship and link a PIE runtime. > Stats I quoted were from qemu-system-alpha booting debian/lenny (5.10) and > going through their kernel package build. I have perf report in front of > me right now; the top ones are > 41.77% qemu-system-alp perf-24701.map [.] 0x7fbbee558930 > 11.78% qemu-system-alp qemu-system-alpha [.] cpu_alpha_exec > and cpu_alpha_exec() spends most of the time in inlined tb_find_fast(). > It might be worth checking the actual distribution of the hash of virt > address used by that sucker - I wonder if dividing its argument by 4 > wouldn't improve the things, but I don't have stats on actual frequency > of conflicts, etc. In any case, the first lump (42%) seems to be tastier ;-) Depends on your point of view -- arguably we ought to be spending *more* time executing translated guest code... (As you say, the problem is that we don't have any breakdown of what things might turn out to be hotspots in the translated code.) -- PMM
On Tue, Jul 08, 2014 at 08:32:55PM +0100, Peter Maydell wrote: On 8 July 2014 18:20, Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Tue, Jul 08, 2014 at 05:33:16PM +0100, Peter Maydell wrote: > > > >> > Incidentally, combination of --enable-gprof and (default) --enable-pie > >> > won't build - it dies with ld(1) complaining about relocs in gcrt1.o. > >> > >> This sounds like a toolchain bug to me :-) > > > > Debian stable/amd64, gcc 4.7.2, binutils 2.22. And google search finds > > this, for example: http://osdir.com/ml/qemu-devel/2013-05/msg00710.html. > > That one has gcc 4.4.3. > > That just makes it a long-standing toolchain bug. I don't see any > reason why PIE + gprof shouldn't work, it just looks like gprof > doesn't ship and link a PIE runtime. *nod* It's not a huge itch to scratch for me, and I'm not even sure whether the bug should be filed for gcc or for libc (probably the latter). In any case, having that information findable in list archives would probably be a good thing. Again, gprof isn't particulary useful - kernel-side profilers are at least as good. So I suspect that most of the people running into that simply shrug and use those instead. Narrowing it down to -pie didn't take long and I can confirm that this is the root cause of that breakage. Should make debugging said toolchain bug a bit easier, if anybody cares to do that... > > Stats I quoted were from qemu-system-alpha booting debian/lenny (5.10) and > > going through their kernel package build. I have perf report in front of > > me right now; the top ones are > > 41.77% qemu-system-alp perf-24701.map [.] 0x7fbbee558930 > > 11.78% qemu-system-alp qemu-system-alpha [.] cpu_alpha_exec > > > and cpu_alpha_exec() spends most of the time in inlined tb_find_fast(). > > It might be worth checking the actual distribution of the hash of virt > > address used by that sucker - I wonder if dividing its argument by 4 > > wouldn't improve the things, but I don't have stats on actual frequency > > of conflicts, etc. In any case, the first lump (42%) seems to be tastier ;-) > > Depends on your point of view -- arguably we ought to be spending *more* > time executing translated guest code... (As you say, the problem is that > we don't have any breakdown of what things might turn out to be hotspots > in the translated code.) Might be a fun project to teach perf that hits in such-and-such page should lead to lookup in a table annotating it. As in "offsets 42..69 should be recorded as (<this address> + offset - 42). Then tcg could generate such tables and we'd get information like "that much time is spent in the second host insn of instances of that code pattern generated by tcg_gen_shr_i64", etc. No idea if anything of that sort exists - qemu is not the only possible user for that; looks like it might be useful for any JIT profiling, so somebody could've done that already...
On Tue, Jul 08, 2014 at 12:04:10PM -0700, Richard Henderson wrote: > > Just one thing - 0x1fffffffffffff will make 32bit hosts whine about integer > > constant being too large. So will 0x1ffffffffffffful, unfortunately - it > > really ought to be ull. > > > > I did use ull on the branch. Aha... So you've caught that one already... I've looked at your branch; AFAICS, the only thing missing there is treating stores to FPCR.DNOD in system mode as "not implemented" (which it is in the code as well as in 21[0-3]64 hardware). Other than that everything seems to be fine; you are right about cvtql treatment - since that sucker doesn't have /i in any allowed trap suffices, we might as well just raise Inexact and let it be masked out - float_flag_inexact will be present in 'ignore'. And yes, folding the calculation itself in there obviously makes sense.
On 07/08/2014 01:20 PM, Al Viro wrote: > Aha... So you've caught that one already... I've looked at your branch; > AFAICS, the only thing missing there is treating stores to FPCR.DNOD in > system mode as "not implemented" (which it is in the code as well as in > 21[0-3]64 hardware). Is it loaded and stored on 21264, or it is read-as-zero/write-ignore? Is UNDZ not required to be paired with DNOD? r~
On Tue, Jul 08, 2014 at 09:59:33PM -0700, Richard Henderson wrote: > On 07/08/2014 01:20 PM, Al Viro wrote: > > Aha... So you've caught that one already... I've looked at your branch; > > AFAICS, the only thing missing there is treating stores to FPCR.DNOD in > > system mode as "not implemented" (which it is in the code as well as in > > 21[0-3]64 hardware). > > Is it loaded and stored on 21264, or it is read-as-zero/write-ignore? RAZ, and the same on 21364 if Compaq manual for compiler-writers is to be believed. On 21264 bits 48..62 are writable, bit 63 is disjunction of bits 52..57 (stores are ignored), bits 0..47 are RAZ. AARM requires RAZ bits 0..46 and RAZ on everything optional that is unimplemented. IOW, DNOD is unimplemented there, all other optional ones are implemented. And according to https://archive.org/details/dec-comp_guide_v2 21364 doesn't implement DNOD either... > Is UNDZ not required to be paired with DNOD? There are 4 bits having some relation to handling of denorms. DNZ and DNOD are about denorm inputs; UNDZ and UNFD - about denorm output. All of them have effect only for IEEE insns with /S in trap suffix. Rules: * if DNZ, denorm inputs are silently replaced with zero. * if !DNZ && !DNOD, denorm inputs trigger trap (invalid). Same as what would happen without /S. * if !DNZ && DNOD, perform operation on denorm(s). And I would like to play with whatever you are using to bring hardware from alternative universes. * if !UNFD, denorm output triggers trap (underflow). Same as what would happen without /S. * if UNFD && UNDZ, denorm output is replaced with zero. * if UNFD && !UNDZ, denorm output remains as is. So env->fpcr_flush_to_zero = env->fpcr_dnod & env->fpcr_undz; is another bug - needs s/dnod/unfd/ there...
Peter Maydell writes: > On 8 July 2014 17:13, Al Viro <viro@zeniv.linux.org.uk> wrote: >> On Tue, Jul 08, 2014 at 09:05:10AM +0100, Peter Maydell wrote: >> <snip> >> BTW, are there any more or less uptodate docs on qemu profiling? I mean, >> things like perf/oprofile on the host obviously end up lumping all tcg >> output together. Is there any way to get information beyond "~40% of time >> is spent in generated code, ~15% - in tb_find_fast(), and the rest is very >> much colder"? > > Alex, did you say you'd done something with profiling recently? I posted some RFC patches up a while back that spit out the perf /tmp/perf.<pid> JIT maps that helps with breaking down which TCG TBs are the most executed. There is another set of patches which allow you to selectively dump translation blocks so you don't end up with multi-gigabyte log files. I'm going to be doing some profiling myself over the next few days so I'll clean-up the patches and re-submit to the list soon.
Al Viro writes: > On Tue, Jul 08, 2014 at 08:32:55PM +0100, Peter Maydell wrote: > On 8 July 2014 18:20, Al Viro <viro@zeniv.linux.org.uk> wrote: >> > On Tue, Jul 08, 2014 at 05:33:16PM +0100, Peter Maydell wrote: <snip> > Again, gprof isn't particulary useful - kernel-side profilers are at least as > good. So I suspect that most of the people running into that simply shrug and > use those instead. Narrowing it down to -pie didn't take long and I can > confirm that this is the root cause of that breakage. Should make debugging > said toolchain bug a bit easier, if anybody cares to do that... > >> > Stats I quoted were from qemu-system-alpha booting debian/lenny (5.10) and >> > going through their kernel package build. I have perf report in front of >> > me right now; the top ones are >> > 41.77% qemu-system-alp perf-24701.map [.] 0x7fbbee558930 >> > 11.78% qemu-system-alp qemu-system-alpha [.] cpu_alpha_exec >> >> > and cpu_alpha_exec() spends most of the time in inlined tb_find_fast(). >> > It might be worth checking the actual distribution of the hash of virt >> > address used by that sucker - I wonder if dividing its argument by 4 >> > wouldn't improve the things, but I don't have stats on actual frequency >> > of conflicts, etc. In any case, the first lump (42%) seems to be tastier ;-) >> >> Depends on your point of view -- arguably we ought to be spending *more* >> time executing translated guest code... (As you say, the problem is that >> we don't have any breakdown of what things might turn out to be hotspots >> in the translated code.) > > Might be a fun project to teach perf that hits in such-and-such page should > lead to lookup in a table annotating it. As in "offsets 42..69 should be > recorded as (<this address> + offset - 42). Then tcg could generate > such tables and we'd get information like "that much time is spent in > the second host insn of instances of that code pattern generated by > tcg_gen_shr_i64", etc. > > No idea if anything of that sort exists - qemu is not the only possible user > for that; looks like it might be useful for any JIT profiling, so somebody > could've done that already... Handily our patch tracker has remembered what I couldn't find ;-) https://patches.linaro.org/27229/ As I mentioned previously I plan to clean these up over the next week.
On 07/08/2014 10:47 PM, Al Viro wrote: > So env->fpcr_flush_to_zero = env->fpcr_dnod & env->fpcr_undz; is another > bug - needs s/dnod/unfd/ there... That's exactly what I was looking at, thanks. r~
On Wed, Jul 09, 2014 at 08:14:12AM -0700, Richard Henderson wrote: > On 07/08/2014 10:47 PM, Al Viro wrote: > > So env->fpcr_flush_to_zero = env->fpcr_dnod & env->fpcr_undz; is another > > bug - needs s/dnod/unfd/ there... > > That's exactly what I was looking at, thanks. BTW, that (unimplementeds being RAZ) is why AARM insists on having FP_C in software - FPCR isn't guaranteed to have the "trap disable" bits and, in fact, doesn't have anywhere to store IEEE_TRAP_ENABLE_DNO on actual hw. The software completion is where it has to be dealt with; note that both swcr_update_status() and ieee_swcr_to_fpcr() treat ->ieee_state (i.e. our FP_C) as authoritative wrt trap enable bits, 21264 or not. Trap _status_ bits are different - there (on 21264) FPCR is considered authoritative, but that's it. Unimplemented trap bits are treated as "trap enabled", so the completion gets to decide what it wants to do. If you want to keep FPCR authoritative for all those bits in linux-user case, we have to treat FPCR.DNOD as writable bit for that mode, which is why my variant slapped an ifdef CONFIG_USER_ONLY around env->fpcr_dnod = (val & FPCR_DNOD) != 0; instead of ripping it out completely...
diff --git a/target-alpha/fpu_helper.c b/target-alpha/fpu_helper.c index 9b297de..25c83b5 100644 --- a/target-alpha/fpu_helper.c +++ b/target-alpha/fpu_helper.c @@ -44,6 +44,12 @@ uint32_t helper_fp_exc_get(CPUAlphaState *env) return get_float_exception_flags(&FP_STATUS); } +enum { + Exc_Mask = float_flag_invalid | float_flag_int_overflow | + float_flag_divbyzero | float_flag_overflow | + float_flag_underflow | float_flag_inexact +}; + static inline void fp_exc_raise1(CPUAlphaState *env, uintptr_t retaddr, uint32_t exc, uint32_t regno, uint32_t hw_exc) { @@ -73,7 +79,7 @@ static inline void fp_exc_raise1(CPUAlphaState *env, uintptr_t retaddr, doesn't apply. */ void helper_fp_exc_raise(CPUAlphaState *env, uint32_t ignore, uint32_t regno) { - uint32_t exc = (uint8_t)env->fp_status.float_exception_flags; + uint32_t exc = (uint8_t)env->fp_status.float_exception_flags & Exc_Mask; if (exc) { env->fpcr_exc_status |= exc; exc &= ~ignore; @@ -86,7 +92,7 @@ void helper_fp_exc_raise(CPUAlphaState *env, uint32_t ignore, uint32_t regno) /* Raise exceptions for ieee fp insns with software completion. */ void helper_fp_exc_raise_s(CPUAlphaState *env, uint32_t ignore, uint32_t regno) { - uint32_t exc = (uint8_t)env->fp_status.float_exception_flags; + uint32_t exc = (uint8_t)env->fp_status.float_exception_flags & Exc_Mask; if (exc) { env->fpcr_exc_status |= exc; exc &= ~ignore; @@ -105,16 +111,14 @@ void helper_ieee_input(CPUAlphaState *env, uint64_t val) uint64_t frac = val & 0xfffffffffffffull; if (exp == 0) { - /* Denormals without DNZ set raise an exception. */ - if (frac != 0 && !env->fp_status.flush_inputs_to_zero) { - arith_excp(env, GETPC(), EXC_M_UNF, 0); + /* Denormals without /S raise an exception. */ + if (frac != 0) { + arith_excp(env, GETPC(), EXC_M_INV, 0); } } else if (exp == 0x7ff) { /* Infinity or NaN. */ - /* ??? I'm not sure these exception bit flags are correct. I do - know that the Linux kernel, at least, doesn't rely on them and - just emulates the insn to figure out what exception to use. */ - arith_excp(env, GETPC(), frac ? EXC_M_INV : EXC_M_FOV, 0); + env->fpcr_exc_status |= float_flag_invalid; + arith_excp(env, GETPC(), EXC_M_INV, 0); } } @@ -125,16 +129,34 @@ void helper_ieee_input_cmp(CPUAlphaState *env, uint64_t val) uint64_t frac = val & 0xfffffffffffffull; if (exp == 0) { - /* Denormals without DNZ set raise an exception. */ - if (frac != 0 && !env->fp_status.flush_inputs_to_zero) { - arith_excp(env, GETPC(), EXC_M_UNF, 0); + /* Denormals raise an exception. */ + if (frac != 0) { + arith_excp(env, GETPC(), EXC_M_INV, 0); } } else if (exp == 0x7ff && frac) { /* NaN. */ + env->fpcr_exc_status |= float_flag_invalid; arith_excp(env, GETPC(), EXC_M_INV, 0); } } +/* Input handing with software completion. Trap for denorms, + unless DNZ is set. *IF* we try to support DNOD (which + none of the produced hardware did, AFAICS), we'll need + to suppress the trap when FPCR.DNOD is set; then the + code downstream of that will need to cope with denorms + sans flush_input_to_zero. Most of it should work sanely, + but there's nothing to compare with... +*/ +void helper_ieee_input_s(CPUAlphaState *env, uint64_t val) +{ + if (unlikely(2 * val - 1 < 0x1fffffffffffff)) { + if (!FP_STATUS.flush_inputs_to_zero) { + arith_excp(env, GETPC(), EXC_M_INV | EXC_M_SWC, 0); + } + } +} + /* F floating (VAX) */ static uint64_t float32_to_f(float32 fa) { @@ -707,7 +729,8 @@ static inline uint64_t inline_cvttq(CPUAlphaState *env, uint64_t a, frac = a & 0xfffffffffffffull; if (exp == 0) { - if (unlikely(frac != 0)) { + if (unlikely(frac != 0) && !FP_STATUS.flush_inputs_to_zero) { + /* not going to happen without working DNOD; ifdef out, perhaps? */ goto do_underflow; } } else if (exp == 0x7ff) { @@ -826,7 +849,9 @@ uint64_t helper_cvtqg(CPUAlphaState *env, uint64_t a) void helper_fcvtql_v_input(CPUAlphaState *env, uint64_t val) { + set_float_exception_flags(0, &FP_STATUS); if (val != (int32_t)val) { - arith_excp(env, GETPC(), EXC_M_IOV, 0); + float_raise(float_flag_int_overflow, &FP_STATUS); + env->fpcr_exc_status |= float_flag_inexact; } } diff --git a/target-alpha/helper.h b/target-alpha/helper.h index 2cc100b..596f24d 100644 --- a/target-alpha/helper.h +++ b/target-alpha/helper.h @@ -88,6 +88,7 @@ DEF_HELPER_FLAGS_3(fp_exc_raise_s, TCG_CALL_NO_WG, void, env, i32, i32) DEF_HELPER_FLAGS_2(ieee_input, TCG_CALL_NO_WG, void, env, i64) DEF_HELPER_FLAGS_2(ieee_input_cmp, TCG_CALL_NO_WG, void, env, i64) +DEF_HELPER_FLAGS_2(ieee_input_s, TCG_CALL_NO_WG, void, env, i64) DEF_HELPER_FLAGS_2(fcvtql_v_input, TCG_CALL_NO_WG, void, env, i64) #if !defined (CONFIG_USER_ONLY) diff --git a/target-alpha/translate.c b/target-alpha/translate.c index 6ea33f3..611b293 100644 --- a/target-alpha/translate.c +++ b/target-alpha/translate.c @@ -655,7 +655,8 @@ static TCGv gen_ieee_input(DisasContext *ctx, int reg, int fn11, int is_cmp) } else { gen_helper_ieee_input(cpu_env, val); } - } + } else + gen_helper_ieee_input_s(cpu_env, val); } return val; } @@ -2256,24 +2257,15 @@ static ExitStatus translate_one(DisasContext *ctx, uint32_t insn) gen_fcmov(ctx, TCG_COND_GT, ra, rb, rc); break; case 0x030: - /* CVTQL */ - REQUIRE_REG_31(ra); - vc = dest_fpr(ctx, rc); - vb = load_fpr(ctx, rb); - gen_fcvtql(vc, vb); - break; case 0x130: - /* CVTQL/V */ case 0x530: - /* CVTQL/SV */ + /* CVTQL, CVTQL/V, CVTQL/SV */ REQUIRE_REG_31(ra); - /* ??? I'm pretty sure there's nothing that /sv needs to do that - /v doesn't do. The only thing I can think is that /sv is a - valid instruction merely for completeness in the ISA. */ vc = dest_fpr(ctx, rc); vb = load_fpr(ctx, rb); gen_helper_fcvtql_v_input(cpu_env, vb); gen_fcvtql(vc, vb); + gen_fp_exc_raise(rc, fn11); break; default: goto invalid_opc;