diff mbox

[RFC] alpha qemu arithmetic exceptions

Message ID 20140705225513.GY18016@ZenIV.linux.org.uk
State New
Headers show

Commit Message

Al Viro July 5, 2014, 10:55 p.m. UTC
On Sat, Jul 05, 2014 at 10:09:51PM +0100, Al Viro wrote:

> Anyway, the current delta (on top of 26f86) follows; seems to get IEEE
> insns behave on non-finite arguments as they do on 21264.  The main
> exception is that register bitmask supplied to trap isn't calculated in
> a bunch of cases; since its main purpose is to help locating the trapping
> insn and we report precise traps (amask feature bit 9), it's probably not
> an interesting problem.  Current Linux kernel definitely won't look at that
> thing under qemu; an old one might, but it would have to be something
> older than 2.3... <checks the history> than 2.2.8, actually.  And the impact
> is that insns with /S getting a denorm argument won't be properly emulated
> and you'll get SIGFPE.  Again, it has to be a really old kernel (older than
> May 1999) to be affected at all.

... and a followup (and the last part of exception handling for non-VAX
insn inputs, AFAICS) - CVTQL.

* whether it triggers a trap or not, it sets IOV and INE on overflow.
* in case of trap it does *not* bugger off immediately - result is
calculated, stored and only then we trap.
* trap summary word is different from cvtql/v and cvtql/sv.  IOW, it's yet
another case of "we think that IEEE semantics is stupid and if you need it,
you'd damn better ask for it explicitly".  Note that cvtql/v sets IOV|INE and
hits SIGFPE no matter what, while cvtql/sv set INV instead and triggers SIGFPE
only if FP_INVALID is enabled.  All difference is kernel-side and it's
triggered by EXC_M_SWC in summary word.

AFAICS, that should be it for IEEE and shared insns, as far as exceptions
on inputs are concerned.

Combined delta follows:

Comments

Richard Henderson July 7, 2014, 2:11 p.m. UTC | #1
On 07/05/2014 03:55 PM, Al Viro wrote:
> +/* Input handing with software completion.  Trap for denorms,
> +   unless DNZ is set.  *IF* we try to support DNOD (which
> +   none of the produced hardware did, AFAICS), we'll need
> +   to suppress the trap when FPCR.DNOD is set; then the
> +   code downstream of that will need to cope with denorms
> +   sans flush_input_to_zero.  Most of it should work sanely,
> +   but there's nothing to compare with...
> +*/
> +void helper_ieee_input_s(CPUAlphaState *env, uint64_t val)
> +{
> +    if (unlikely(2 * val - 1 < 0x1fffffffffffff)) {
> +	if (!FP_STATUS.flush_inputs_to_zero) {
> +	    arith_excp(env, GETPC(), EXC_M_INV | EXC_M_SWC, 0);
> +	}
> +    }
> +}
> +

A couple of points here:

1) We should never raise this in user-only mode.  In that mode, we emulate the
whole fpu stack, all the way through from HW to the OS completion handler.

2) Because of that, we have the capability of doing the same thing in system
mode.  This lets us do more of the computation in the host, and less in the
guest, which is faster.  The only thing this makes more difficult is debugging
the OS completion handlers within the kernel, since they'll only get invoked
when SIGFPE needs to be sent.

3) If we do want to implement a mode where we faithfully send SWC for all of
the bits of IEEE that real HW didn't implement, do we really need to avoid a
store to the output register when signalling this?  I.e. can we notice this
condition after the fact with float_flag_input_denormal, rather than having
another function call to prep the inputs?


r~
Al Viro July 7, 2014, 3:06 p.m. UTC | #2
On Mon, Jul 07, 2014 at 07:11:58AM -0700, Richard Henderson wrote:
> A couple of points here:
> 
> 1) We should never raise this in user-only mode.  In that mode, we emulate the
> whole fpu stack, all the way through from HW to the OS completion handler.

How is that different from other cases where we have an exception raised
by an fp operation?

> 2) Because of that, we have the capability of doing the same thing in system
> mode.  This lets us do more of the computation in the host, and less in the
> guest, which is faster.  The only thing this makes more difficult is debugging
> the OS completion handlers within the kernel, since they'll only get invoked
> when SIGFPE needs to be sent.

Umm...  The effect of software completion depends on current->ieee_state;
how would you keep track of that outside of guest kernel?

> 3) If we do want to implement a mode where we faithfully send SWC for all of
> the bits of IEEE that real HW didn't implement, do we really need to avoid a
> store to the output register when signalling this?  I.e. can we notice this
> condition after the fact with float_flag_input_denormal, rather than having
> another function call to prep the inputs?

But flag_input_denormal is raised only when we do have DNZ set.  Which is
an entirely different case, where we should not (and do not) get an exception
at all...
Richard Henderson July 7, 2014, 4:20 p.m. UTC | #3
On 07/07/2014 08:06 AM, Al Viro wrote:
> On Mon, Jul 07, 2014 at 07:11:58AM -0700, Richard Henderson wrote:
>> A couple of points here:
>>
>> 1) We should never raise this in user-only mode.  In that mode, we emulate the
>> whole fpu stack, all the way through from HW to the OS completion handler.
> 
> How is that different from other cases where we have an exception raised
> by an fp operation?

In all other cases we know we're going to send SIGFPE.  That's either through a
non /S insn which the kernel wouldn't touch, or by having computed the true
IEEE result and examined the exceptions to be raised.

>> 2) Because of that, we have the capability of doing the same thing in system
>> mode.  This lets us do more of the computation in the host, and less in the
>> guest, which is faster.  The only thing this makes more difficult is debugging
>> the OS completion handlers within the kernel, since they'll only get invoked
>> when SIGFPE needs to be sent.
> 
> Umm...  The effect of software completion depends on current->ieee_state;
> how would you keep track of that outside of guest kernel?

The kernel essentially keeps a copy of IEEE_STATE in the FPCR.  I don't see any
missing bits in ieee_swcr_to_fpcr, do you?

While real hardware might ignore some of those bits once stored, qemu doesn't.

While in real hardware one could force the FPCR and IEEE_STATE to differ,
honestly that'd be a bug.  (Although a silly one; I wish the kernel took the
EV6 FPCR as gospel for everything, not just the status flags.  That could make
certain libm.so computations much faster.)

> 
>> 3) If we do want to implement a mode where we faithfully send SWC for all of
>> the bits of IEEE that real HW didn't implement, do we really need to avoid a
>> store to the output register when signalling this?  I.e. can we notice this
>> condition after the fact with float_flag_input_denormal, rather than having
>> another function call to prep the inputs?
> 
> But flag_input_denormal is raised only when we do have DNZ set.  Which is
> an entirely different case, where we should not (and do not) get an exception
> at all...

Ah, you're right about that.  I'd mis-remembered the implementation.


r~
Al Viro July 8, 2014, 4:20 a.m. UTC | #4
On Mon, Jul 07, 2014 at 09:20:28AM -0700, Richard Henderson wrote:
> > How is that different from other cases where we have an exception raised
> > by an fp operation?
> 
> In all other cases we know we're going to send SIGFPE.  That's either through a
> non /S insn which the kernel wouldn't touch, or by having computed the true
> IEEE result and examined the exceptions to be raised.

Umm...  Not quite.  Consider e.g. CVTQL case.  There we have the following
picture in case of overflow (real hw with Linux on top of it):
	no suffix:		IOV INE
	/v:			IOV INE SIGFPE
	/sv, no IEEE INVE:	IOV INE INV
	/sv, IEEE INVE:		IOV INE INV SIGFPE
This is after the completion had a chance to run.  From the hw POV it's
	no suffix		IOV INE no trap
	/v			IOV INE trap<IOV>
	/sv			IOV INE trap<SWC,IOV>
and it's alpha_fp_emul() that does the rest in /sv case.  Actually, it's even
simpler:
	if overflow
		FPCR.INE = 1
		raise IOV
	do usual trap suffix handling
and I'm reasonably sure that this is what they did internally.  You are
proposing to do 4 cases in all their messy glory in qemu itself...

And I wouldn't bet a dime on not having similar turds in other insns; after
all, "it's hard, let's offload it to software" was only a part of motivation
for software completions.  "We really don't like this part of IEEE standard
and we'd love to tell you to see figure 1, but we need conformance, so you
can mark an insn with /s and have the kernel do what IEEE requires" is also
very visible in their manual.

Result is a mess - if you try to fold the kernel-side stuff into "hardware",
you end up with a pile of inconsistent behaviours.  In principle, it's
doable, especially since we are not really constrained by actual hw in terms
of what we do in case of FPCR.DNOD being true - no actual hw could set it.
So we want
	* hw behaviour without /s (denorms trap)
	* hw behaviour with /s without denorms
	* hw behaviour with /s with denorms with FPCR.DNZ (same as with 0) 
	* kernel completion behaviour with /s with denorm
and it might even be what they intended for DNOD to do.  But it's going
to be messy as hell.

And that's not even going into generating the right si_code for that SIGFPE.
What produces those TARGET_GEN_FLTINE and friends?
Richard Henderson July 8, 2014, 6:03 a.m. UTC | #5
On 07/07/2014 09:20 PM, Al Viro wrote:
> and I'm reasonably sure that this is what they did internally.  You are
> proposing to do 4 cases in all their messy glory in qemu itself...

Yes.  Primarily because we *have* to do so for the linux-user case.

> And that's not even going into generating the right si_code for that SIGFPE.
> What produces those TARGET_GEN_FLTINE and friends?

linux-user/main.c, cpu_loop.


r~
Al Viro July 8, 2014, 6:54 a.m. UTC | #6
On Mon, Jul 07, 2014 at 11:03:08PM -0700, Richard Henderson wrote:
> On 07/07/2014 09:20 PM, Al Viro wrote:
> > and I'm reasonably sure that this is what they did internally.  You are
> > proposing to do 4 cases in all their messy glory in qemu itself...
> 
> Yes.  Primarily because we *have* to do so for the linux-user case.
> 
> > And that's not even going into generating the right si_code for that SIGFPE.
> > What produces those TARGET_GEN_FLTINE and friends?
> 
> linux-user/main.c, cpu_loop.

That's where we consume it; where is it produced?  Sure, explicit
gentrap in alpha code will lead there, with whatever we have in
$16 deciding what'll go into si_code, but where does that happen on
fp exception codepaths?  IOW, what sets si_code on those?
Al Viro July 8, 2014, 7:13 a.m. UTC | #7
On Tue, Jul 08, 2014 at 07:54:36AM +0100, Al Viro wrote:
> On Mon, Jul 07, 2014 at 11:03:08PM -0700, Richard Henderson wrote:
> > On 07/07/2014 09:20 PM, Al Viro wrote:
> > > and I'm reasonably sure that this is what they did internally.  You are
> > > proposing to do 4 cases in all their messy glory in qemu itself...
> > 
> > Yes.  Primarily because we *have* to do so for the linux-user case.
> > 
> > > And that's not even going into generating the right si_code for that SIGFPE.
> > > What produces those TARGET_GEN_FLTINE and friends?
> > 
> > linux-user/main.c, cpu_loop.
> 
> That's where we consume it; where is it produced?  Sure, explicit
> gentrap in alpha code will lead there, with whatever we have in
> $16 deciding what'll go into si_code, but where does that happen on
> fp exception codepaths?  IOW, what sets si_code on those?

Actually, that's badly worded; what codepath ends up setting si_code on
e.g. fp addition overflows?  In system mode it's done by completion code
in the kernel, but AFAICS in user mode there are only two places where it
might happen - one is gentrap handling and another - osf_setsysinfo(2)
emulation for TARGET_SSI_IEEE_FP_CONTROL.  What I don't understand is how
do we get from float_raise(&FP_STATUS, float_flag_overflow) in fpu/softfloat.c
to either of those.

IOW, suppose I do
	x = DBL_MAX;
	feenableexcept(FE_ALL_EXCEPT);
	x *= x;
I understand how I'll get SIGFPE, but what will set correct si_code in
siginfo I'll see in the hanler?
Peter Maydell July 8, 2014, 8:05 a.m. UTC | #8
On 8 July 2014 08:13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> Actually, that's badly worded; what codepath ends up setting si_code on
> e.g. fp addition overflows?  In system mode it's done by completion code
> in the kernel, but AFAICS in user mode there are only two places where it
> might happen - one is gentrap handling and another - osf_setsysinfo(2)
> emulation for TARGET_SSI_IEEE_FP_CONTROL.  What I don't understand is how
> do we get from float_raise(&FP_STATUS, float_flag_overflow) in fpu/softfloat.c
> to either of those.
>
> IOW, suppose I do
>         x = DBL_MAX;
>         feenableexcept(FE_ALL_EXCEPT);
>         x *= x;
> I understand how I'll get SIGFPE, but what will set correct si_code in
> siginfo I'll see in the hanler?

The code we have currently may well be buggy, but the correct
place to set si_code is (as Richard says) the Alpha cpu_loop() in
linux-user/main.c, which has access to the trap type that just
caused us to stop executing code, plus the CPUState, which
should be enough information to set si_code correctly. In
particular the GENTRAP case seems to be setting a variety
of different si_code values for SIGFPE.

thanks
-- PMM
Richard Henderson July 8, 2014, 2:53 p.m. UTC | #9
On 07/08/2014 01:05 AM, Peter Maydell wrote:
> On 8 July 2014 08:13, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> Actually, that's badly worded; what codepath ends up setting si_code on
>> e.g. fp addition overflows?  In system mode it's done by completion code
>> in the kernel, but AFAICS in user mode there are only two places where it
>> might happen - one is gentrap handling and another - osf_setsysinfo(2)
>> emulation for TARGET_SSI_IEEE_FP_CONTROL.  What I don't understand is how
>> do we get from float_raise(&FP_STATUS, float_flag_overflow) in fpu/softfloat.c
>> to either of those.
>>
>> IOW, suppose I do
>>         x = DBL_MAX;
>>         feenableexcept(FE_ALL_EXCEPT);
>>         x *= x;
>> I understand how I'll get SIGFPE, but what will set correct si_code in
>> siginfo I'll see in the hanler?
> 
> The code we have currently may well be buggy, but the correct
> place to set si_code is (as Richard says) the Alpha cpu_loop() in
> linux-user/main.c, which has access to the trap type that just
> caused us to stop executing code, plus the CPUState, which
> should be enough information to set si_code correctly. In
> particular the GENTRAP case seems to be setting a variety
> of different si_code values for SIGFPE.

The gentrap case is a red-herring.

The case you're looking for is EXC_ARITH.  The path is from

    arith_excp
      dynamic_excp
        cpu_loop_exit
          longjmp
  cpu_exec
cpu_loop

It's also true that we don't install the correct si_code there, but we could.
Mostly the gcc/glibc test cases really only care that SIGFPE gets raised, not
what the codes are, so I haven't bothered.


r~
Peter Maydell July 8, 2014, 4:33 p.m. UTC | #10
On 8 July 2014 17:13, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Jul 08, 2014 at 09:05:10AM +0100, Peter Maydell wrote:
>
>> The code we have currently may well be buggy, but the correct
>
> It is ;-/  We set TARGET_FPE_FLTINV unconditionally there.  BTW, what's
> the reason why all these cpu_loop() instances can't go into
> linux-user/<arch>/something?

It's just ancient code nobody's cleaned up yet. I do have
"move all this stuff into arch directories" on my "would like
to do" list, but I just haven't got round to it yet, since it's
not actually actively broken (unlike many other areas of
our codebase :-/).

> BTW, are there any more or less uptodate docs on qemu profiling?  I mean,
> things like perf/oprofile on the host obviously end up lumping all tcg
> output together.  Is there any way to get information beyond "~40% of time
> is spent in generated code, ~15% - in tb_find_fast(), and the rest is very
> much colder"?

Alex, did you say you'd done something with profiling recently?

> Incidentally, combination of --enable-gprof and (default) --enable-pie
> won't build - it dies with ld(1) complaining about relocs in gcrt1.o.

This sounds like a toolchain bug to me :-)

-- PMM
Al Viro July 8, 2014, 5:20 p.m. UTC | #11
On Tue, Jul 08, 2014 at 05:33:16PM +0100, Peter Maydell wrote:

> > Incidentally, combination of --enable-gprof and (default) --enable-pie
> > won't build - it dies with ld(1) complaining about relocs in gcrt1.o.
> 
> This sounds like a toolchain bug to me :-)

Debian stable/amd64, gcc 4.7.2, binutils 2.22.  And google search finds
this, for example: http://osdir.com/ml/qemu-devel/2013-05/msg00710.html.
That one has gcc 4.4.3.

Anyway, adding --disable-pie to --enable-gprof gets it to build, but
as I said, gprof is no better than perf and oprofile - same problem.

Stats I quoted were from qemu-system-alpha booting debian/lenny (5.10) and
going through their kernel package build.  I have perf report in front of
me right now; the top ones are
 41.77%  qemu-system-alp  perf-24701.map           [.] 0x7fbbee558930
 11.78%  qemu-system-alp  qemu-system-alpha        [.] cpu_alpha_exec
  4.95%  qemu-system-alp  [vdso]                   [.] 0x7fffdd7ff8de
  2.40%  qemu-system-alp  qemu-system-alpha        [.] phys_page_find
  1.49%  qemu-system-alp  qemu-system-alpha        [.] address_space_translate_internal
  1.34%  qemu-system-alp  [kernel.kallsyms]        [k] read_hpet
  1.26%  qemu-system-alp  qemu-system-alpha        [.] tlb_set_page
  1.23%  qemu-system-alp  qemu-system-alpha        [.] find_next_bit
  1.04%  qemu-system-alp  qemu-system-alpha        [.] get_page_addr_code
  1.01%  qemu-system-alp  libpthread-2.13.so       [.] pthread_mutex_lock
  0.88%  qemu-system-alp  qemu-system-alpha        [.] helper_cmpbge
  0.80%  qemu-system-alp  libc-2.13.so             [.] __memset_sse2
  0.72%  qemu-system-alp  libpthread-2.13.so       [.] __pthread_mutex_unlock_usercnt
  0.70%  qemu-system-alp  qemu-system-alpha        [.] get_physical_address
  0.69%  qemu-system-alp  qemu-system-alpha        [.] address_space_translate
  0.68%  qemu-system-alp  qemu-system-alpha        [.] tcg_optimize
  0.67%  qemu-system-alp  qemu-system-alpha        [.] ldq_phys
  0.63%  qemu-system-alp  qemu-system-alpha        [.] qemu_get_ram_ptr
  0.62%  qemu-system-alp  qemu-system-alpha        [.] helper_le_ldq_mmu
  0.57%  qemu-system-alp  qemu-system-alpha        [.] memory_region_is_ram

and cpu_alpha_exec() spends most of the time in inlined tb_find_fast().
It might be worth checking the actual distribution of the hash of virt
address used by that sucker - I wonder if dividing its argument by 4
wouldn't improve the things, but I don't have stats on actual frequency
of conflicts, etc.  In any case, the first lump (42%) seems to be tastier ;-)
There are all kinds of microoptimizations possible (e.g. helper_cmpbge() could
be done by a couple of MMX insns on amd64 host[1]), but it would be nice to
have some details on what we spend the time on in tcg output...

[1] The reason why helper_cmpbge() shows up is that string functions on alpha
use that insn a lot; it _might_ be worth optimizing.
Richard Henderson July 8, 2014, 6:12 p.m. UTC | #12
On 07/08/2014 09:13 AM, Al Viro wrote:
> Frankly, I suspect that it's better to have qemu-system-alpha behave like
> the actual hardware does (including "FPCR.DNOD can't be set") and keep the
> linux-user behaviour as is, for somebody brave and masochistic enough to
> fight that one.  And no, it's nowhere near "just let denorms ride through
> the normal softfloat code and play a bit with the flags it might raise".
> And then there's netbsd/alpha and openbsd/alpha, so in theory somebody might
> want to play with their software completion semantics (not identical to Linux
> one) for the sake of yet-to-be-written bsd-user alpha support...

You're probably right there.

I've pushed a couple more patches to the branch, split out from your patch
here.  I believe I've got it all, and havn't mucked things up in the process.
I'll run some tests later today when I've got time.


r~
Al Viro July 8, 2014, 7:02 p.m. UTC | #13
On Tue, Jul 08, 2014 at 11:12:20AM -0700, Richard Henderson wrote:
> On 07/08/2014 09:13 AM, Al Viro wrote:
> > Frankly, I suspect that it's better to have qemu-system-alpha behave like
> > the actual hardware does (including "FPCR.DNOD can't be set") and keep the
> > linux-user behaviour as is, for somebody brave and masochistic enough to
> > fight that one.  And no, it's nowhere near "just let denorms ride through
> > the normal softfloat code and play a bit with the flags it might raise".
> > And then there's netbsd/alpha and openbsd/alpha, so in theory somebody might
> > want to play with their software completion semantics (not identical to Linux
> > one) for the sake of yet-to-be-written bsd-user alpha support...
> 
> You're probably right there.
> 
> I've pushed a couple more patches to the branch, split out from your patch
> here.  I believe I've got it all, and havn't mucked things up in the process.
> I'll run some tests later today when I've got time.

Just one thing - 0x1fffffffffffff will make 32bit hosts whine about integer
constant being too large.  So will 0x1ffffffffffffful, unfortunately - it
really ought to be ull.
Richard Henderson July 8, 2014, 7:04 p.m. UTC | #14
On 07/08/2014 12:02 PM, Al Viro wrote:
> On Tue, Jul 08, 2014 at 11:12:20AM -0700, Richard Henderson wrote:
>> On 07/08/2014 09:13 AM, Al Viro wrote:
>>> Frankly, I suspect that it's better to have qemu-system-alpha behave like
>>> the actual hardware does (including "FPCR.DNOD can't be set") and keep the
>>> linux-user behaviour as is, for somebody brave and masochistic enough to
>>> fight that one.  And no, it's nowhere near "just let denorms ride through
>>> the normal softfloat code and play a bit with the flags it might raise".
>>> And then there's netbsd/alpha and openbsd/alpha, so in theory somebody might
>>> want to play with their software completion semantics (not identical to Linux
>>> one) for the sake of yet-to-be-written bsd-user alpha support...
>>
>> You're probably right there.
>>
>> I've pushed a couple more patches to the branch, split out from your patch
>> here.  I believe I've got it all, and havn't mucked things up in the process.
>> I'll run some tests later today when I've got time.
> 
> Just one thing - 0x1fffffffffffff will make 32bit hosts whine about integer
> constant being too large.  So will 0x1ffffffffffffful, unfortunately - it
> really ought to be ull.
> 

I did use ull on the branch.


r~
Peter Maydell July 8, 2014, 7:32 p.m. UTC | #15
On 8 July 2014 18:20, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Jul 08, 2014 at 05:33:16PM +0100, Peter Maydell wrote:
>
>> > Incidentally, combination of --enable-gprof and (default) --enable-pie
>> > won't build - it dies with ld(1) complaining about relocs in gcrt1.o.
>>
>> This sounds like a toolchain bug to me :-)
>
> Debian stable/amd64, gcc 4.7.2, binutils 2.22.  And google search finds
> this, for example: http://osdir.com/ml/qemu-devel/2013-05/msg00710.html.
> That one has gcc 4.4.3.

That just makes it a long-standing toolchain bug. I don't see any
reason why PIE + gprof shouldn't work, it just looks like gprof
doesn't ship and link a PIE runtime.

> Stats I quoted were from qemu-system-alpha booting debian/lenny (5.10) and
> going through their kernel package build.  I have perf report in front of
> me right now; the top ones are
>  41.77%  qemu-system-alp  perf-24701.map           [.] 0x7fbbee558930
>  11.78%  qemu-system-alp  qemu-system-alpha        [.] cpu_alpha_exec

> and cpu_alpha_exec() spends most of the time in inlined tb_find_fast().
> It might be worth checking the actual distribution of the hash of virt
> address used by that sucker - I wonder if dividing its argument by 4
> wouldn't improve the things, but I don't have stats on actual frequency
> of conflicts, etc.  In any case, the first lump (42%) seems to be tastier ;-)

Depends on your point of view -- arguably we ought to be spending *more*
time executing translated guest code... (As you say, the problem is that
we don't have any breakdown of what things might turn out to be hotspots
in the translated code.)

-- PMM
Al Viro July 8, 2014, 8:12 p.m. UTC | #16
On Tue, Jul 08, 2014 at 08:32:55PM +0100, Peter Maydell wrote:
 On 8 July 2014 18:20, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Tue, Jul 08, 2014 at 05:33:16PM +0100, Peter Maydell wrote:
> >
> >> > Incidentally, combination of --enable-gprof and (default) --enable-pie
> >> > won't build - it dies with ld(1) complaining about relocs in gcrt1.o.
> >>
> >> This sounds like a toolchain bug to me :-)
> >
> > Debian stable/amd64, gcc 4.7.2, binutils 2.22.  And google search finds
> > this, for example: http://osdir.com/ml/qemu-devel/2013-05/msg00710.html.
> > That one has gcc 4.4.3.
> 
> That just makes it a long-standing toolchain bug. I don't see any
> reason why PIE + gprof shouldn't work, it just looks like gprof
> doesn't ship and link a PIE runtime.

*nod*

It's not a huge itch to scratch for me, and I'm not even sure whether the
bug should be filed for gcc or for libc (probably the latter).  In any case,
having that information findable in list archives would probably be a good
thing.

Again, gprof isn't particulary useful - kernel-side profilers are at least as
good.  So I suspect that most of the people running into that simply shrug and
use those instead.  Narrowing it down to -pie didn't take long and I can
confirm that this is the root cause of that breakage.  Should make debugging
said toolchain bug a bit easier, if anybody cares to do that...

> > Stats I quoted were from qemu-system-alpha booting debian/lenny (5.10) and
> > going through their kernel package build.  I have perf report in front of
> > me right now; the top ones are
> >  41.77%  qemu-system-alp  perf-24701.map           [.] 0x7fbbee558930
> >  11.78%  qemu-system-alp  qemu-system-alpha        [.] cpu_alpha_exec
> 
> > and cpu_alpha_exec() spends most of the time in inlined tb_find_fast().
> > It might be worth checking the actual distribution of the hash of virt
> > address used by that sucker - I wonder if dividing its argument by 4
> > wouldn't improve the things, but I don't have stats on actual frequency
> > of conflicts, etc.  In any case, the first lump (42%) seems to be tastier ;-)
> 
> Depends on your point of view -- arguably we ought to be spending *more*
> time executing translated guest code... (As you say, the problem is that
> we don't have any breakdown of what things might turn out to be hotspots
> in the translated code.)

Might be a fun project to teach perf that hits in such-and-such page should
lead to lookup in a table annotating it.  As in "offsets 42..69 should be
recorded as (<this address> + offset - 42).  Then tcg could generate
such tables and we'd get information like "that much time is spent in
the second host insn of instances of that code pattern generated by
tcg_gen_shr_i64", etc.

No idea if anything of that sort exists - qemu is not the only possible user
for that; looks like it might be useful for any JIT profiling, so somebody
could've done that already...
Al Viro July 8, 2014, 8:20 p.m. UTC | #17
On Tue, Jul 08, 2014 at 12:04:10PM -0700, Richard Henderson wrote:

> > Just one thing - 0x1fffffffffffff will make 32bit hosts whine about integer
> > constant being too large.  So will 0x1ffffffffffffful, unfortunately - it
> > really ought to be ull.
> > 
> 
> I did use ull on the branch.

Aha...  So you've caught that one already...  I've looked at your branch;
AFAICS, the only thing missing there is treating stores to FPCR.DNOD in
system mode as "not implemented" (which it is in the code as well as in
21[0-3]64 hardware).  Other than that everything seems to be fine; you are
right about cvtql treatment - since that sucker doesn't have /i in any
allowed trap suffices, we might as well just raise Inexact and let it be
masked out - float_flag_inexact will be present in 'ignore'.  And yes,
folding the calculation itself in there obviously makes sense.
Richard Henderson July 9, 2014, 4:59 a.m. UTC | #18
On 07/08/2014 01:20 PM, Al Viro wrote:
> Aha...  So you've caught that one already...  I've looked at your branch;
> AFAICS, the only thing missing there is treating stores to FPCR.DNOD in
> system mode as "not implemented" (which it is in the code as well as in
> 21[0-3]64 hardware).

Is it loaded and stored on 21264, or it is read-as-zero/write-ignore?
Is UNDZ not required to be paired with DNOD?


r~
Al Viro July 9, 2014, 5:47 a.m. UTC | #19
On Tue, Jul 08, 2014 at 09:59:33PM -0700, Richard Henderson wrote:
> On 07/08/2014 01:20 PM, Al Viro wrote:
> > Aha...  So you've caught that one already...  I've looked at your branch;
> > AFAICS, the only thing missing there is treating stores to FPCR.DNOD in
> > system mode as "not implemented" (which it is in the code as well as in
> > 21[0-3]64 hardware).
> 
> Is it loaded and stored on 21264, or it is read-as-zero/write-ignore?

RAZ, and the same on 21364 if Compaq manual for compiler-writers is to be
believed.

On 21264 bits 48..62 are writable, bit 63 is disjunction of bits 52..57
(stores are ignored), bits 0..47 are RAZ.  AARM requires RAZ bits 0..46
and RAZ on everything optional that is unimplemented.  IOW, DNOD is
unimplemented there, all other optional ones are implemented.  And
according to https://archive.org/details/dec-comp_guide_v2 21364 doesn't
implement DNOD either...

> Is UNDZ not required to be paired with DNOD?

There are 4 bits having some relation to handling of denorms.  DNZ and DNOD
are about denorm inputs; UNDZ and UNFD - about denorm output.  All of
them have effect only for IEEE insns with /S in trap suffix.

Rules:
	* if DNZ, denorm inputs are silently replaced with zero.
	* if !DNZ && !DNOD, denorm inputs trigger trap (invalid).  Same
as what would happen without /S.
	* if !DNZ && DNOD, perform operation on denorm(s).  And I would like
to play with whatever you are using to bring hardware from alternative
universes.
	* if !UNFD, denorm output triggers trap (underflow).  Same as what
would happen without /S.
	* if UNFD && UNDZ, denorm output is replaced with zero.
	* if UNFD && !UNDZ, denorm output remains as is.

So env->fpcr_flush_to_zero = env->fpcr_dnod & env->fpcr_undz; is another
bug - needs s/dnod/unfd/ there...
Alex Bennée July 9, 2014, 9:04 a.m. UTC | #20
Peter Maydell writes:

> On 8 July 2014 17:13, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Tue, Jul 08, 2014 at 09:05:10AM +0100, Peter Maydell wrote:
>>
<snip>
>> BTW, are there any more or less uptodate docs on qemu profiling?  I mean,
>> things like perf/oprofile on the host obviously end up lumping all tcg
>> output together.  Is there any way to get information beyond "~40% of time
>> is spent in generated code, ~15% - in tb_find_fast(), and the rest is very
>> much colder"?
>
> Alex, did you say you'd done something with profiling recently?

I posted some RFC patches up a while back that spit out the perf
/tmp/perf.<pid> JIT maps that helps with breaking down which TCG TBs are
the most executed.

There is another set of patches which allow you to selectively dump
translation blocks so you don't end up with multi-gigabyte log files.

I'm going to be doing some profiling myself over the next few days so
I'll clean-up the patches and re-submit to the list soon.
Alex Bennée July 9, 2014, 9:19 a.m. UTC | #21
Al Viro writes:

> On Tue, Jul 08, 2014 at 08:32:55PM +0100, Peter Maydell wrote:
>  On 8 July 2014 18:20, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> > On Tue, Jul 08, 2014 at 05:33:16PM +0100, Peter Maydell wrote:
<snip>
> Again, gprof isn't particulary useful - kernel-side profilers are at least as
> good.  So I suspect that most of the people running into that simply shrug and
> use those instead.  Narrowing it down to -pie didn't take long and I can
> confirm that this is the root cause of that breakage.  Should make debugging
> said toolchain bug a bit easier, if anybody cares to do that...
>
>> > Stats I quoted were from qemu-system-alpha booting debian/lenny (5.10) and
>> > going through their kernel package build.  I have perf report in front of
>> > me right now; the top ones are
>> >  41.77%  qemu-system-alp  perf-24701.map           [.] 0x7fbbee558930
>> >  11.78%  qemu-system-alp  qemu-system-alpha        [.] cpu_alpha_exec
>> 
>> > and cpu_alpha_exec() spends most of the time in inlined tb_find_fast().
>> > It might be worth checking the actual distribution of the hash of virt
>> > address used by that sucker - I wonder if dividing its argument by 4
>> > wouldn't improve the things, but I don't have stats on actual frequency
>> > of conflicts, etc.  In any case, the first lump (42%) seems to be tastier ;-)
>> 
>> Depends on your point of view -- arguably we ought to be spending *more*
>> time executing translated guest code... (As you say, the problem is that
>> we don't have any breakdown of what things might turn out to be hotspots
>> in the translated code.)
>
> Might be a fun project to teach perf that hits in such-and-such page should
> lead to lookup in a table annotating it.  As in "offsets 42..69 should be
> recorded as (<this address> + offset - 42).  Then tcg could generate
> such tables and we'd get information like "that much time is spent in
> the second host insn of instances of that code pattern generated by
> tcg_gen_shr_i64", etc.
>
> No idea if anything of that sort exists - qemu is not the only possible user
> for that; looks like it might be useful for any JIT profiling, so somebody
> could've done that already...

Handily our patch tracker has remembered what I couldn't find ;-)

https://patches.linaro.org/27229/

As I mentioned previously I plan to clean these up over the next week.
Richard Henderson July 9, 2014, 3:14 p.m. UTC | #22
On 07/08/2014 10:47 PM, Al Viro wrote:
> So env->fpcr_flush_to_zero = env->fpcr_dnod & env->fpcr_undz; is another
> bug - needs s/dnod/unfd/ there...

That's exactly what I was looking at, thanks.


r~
Al Viro July 9, 2014, 4:41 p.m. UTC | #23
On Wed, Jul 09, 2014 at 08:14:12AM -0700, Richard Henderson wrote:
> On 07/08/2014 10:47 PM, Al Viro wrote:
> > So env->fpcr_flush_to_zero = env->fpcr_dnod & env->fpcr_undz; is another
> > bug - needs s/dnod/unfd/ there...
> 
> That's exactly what I was looking at, thanks.

BTW, that (unimplementeds being RAZ) is why AARM insists on having FP_C in
software - FPCR isn't guaranteed to have the "trap disable" bits and, in fact,
doesn't have anywhere to store IEEE_TRAP_ENABLE_DNO on actual hw.  The
software completion is where it has to be dealt with; note that both
swcr_update_status() and ieee_swcr_to_fpcr() treat ->ieee_state (i.e. our FP_C)
as authoritative wrt trap enable bits, 21264 or not.  Trap _status_ bits are
different - there (on 21264) FPCR is considered authoritative, but that's it.

Unimplemented trap bits are treated as "trap enabled", so the completion gets
to decide what it wants to do.  If you want to keep FPCR authoritative for
all those bits in linux-user case, we have to treat FPCR.DNOD as writable
bit for that mode, which is why my variant slapped an ifdef CONFIG_USER_ONLY
around 
     env->fpcr_dnod = (val & FPCR_DNOD) != 0;
instead of ripping it out completely...
diff mbox

Patch

diff --git a/target-alpha/fpu_helper.c b/target-alpha/fpu_helper.c
index 9b297de..25c83b5 100644
--- a/target-alpha/fpu_helper.c
+++ b/target-alpha/fpu_helper.c
@@ -44,6 +44,12 @@  uint32_t helper_fp_exc_get(CPUAlphaState *env)
     return get_float_exception_flags(&FP_STATUS);
 }
 
+enum {
+	Exc_Mask = float_flag_invalid | float_flag_int_overflow |
+		   float_flag_divbyzero | float_flag_overflow |
+		   float_flag_underflow | float_flag_inexact
+};
+
 static inline void fp_exc_raise1(CPUAlphaState *env, uintptr_t retaddr,
                                  uint32_t exc, uint32_t regno, uint32_t hw_exc)
 {
@@ -73,7 +79,7 @@  static inline void fp_exc_raise1(CPUAlphaState *env, uintptr_t retaddr,
    doesn't apply.  */
 void helper_fp_exc_raise(CPUAlphaState *env, uint32_t ignore, uint32_t regno)
 {
-    uint32_t exc = (uint8_t)env->fp_status.float_exception_flags;
+    uint32_t exc = (uint8_t)env->fp_status.float_exception_flags & Exc_Mask;
     if (exc) {
         env->fpcr_exc_status |= exc;
         exc &= ~ignore;
@@ -86,7 +92,7 @@  void helper_fp_exc_raise(CPUAlphaState *env, uint32_t ignore, uint32_t regno)
 /* Raise exceptions for ieee fp insns with software completion.  */
 void helper_fp_exc_raise_s(CPUAlphaState *env, uint32_t ignore, uint32_t regno)
 {
-    uint32_t exc = (uint8_t)env->fp_status.float_exception_flags;
+    uint32_t exc = (uint8_t)env->fp_status.float_exception_flags & Exc_Mask;
     if (exc) {
         env->fpcr_exc_status |= exc;
         exc &= ~ignore;
@@ -105,16 +111,14 @@  void helper_ieee_input(CPUAlphaState *env, uint64_t val)
     uint64_t frac = val & 0xfffffffffffffull;
 
     if (exp == 0) {
-        /* Denormals without DNZ set raise an exception.  */
-        if (frac != 0 && !env->fp_status.flush_inputs_to_zero) {
-            arith_excp(env, GETPC(), EXC_M_UNF, 0);
+        /* Denormals without /S raise an exception.  */
+        if (frac != 0) {
+            arith_excp(env, GETPC(), EXC_M_INV, 0);
         }
     } else if (exp == 0x7ff) {
         /* Infinity or NaN.  */
-        /* ??? I'm not sure these exception bit flags are correct.  I do
-           know that the Linux kernel, at least, doesn't rely on them and
-           just emulates the insn to figure out what exception to use.  */
-        arith_excp(env, GETPC(), frac ? EXC_M_INV : EXC_M_FOV, 0);
+        env->fpcr_exc_status |= float_flag_invalid;
+        arith_excp(env, GETPC(), EXC_M_INV, 0);
     }
 }
 
@@ -125,16 +129,34 @@  void helper_ieee_input_cmp(CPUAlphaState *env, uint64_t val)
     uint64_t frac = val & 0xfffffffffffffull;
 
     if (exp == 0) {
-        /* Denormals without DNZ set raise an exception.  */
-        if (frac != 0 && !env->fp_status.flush_inputs_to_zero) {
-            arith_excp(env, GETPC(), EXC_M_UNF, 0);
+        /* Denormals raise an exception.  */
+        if (frac != 0) {
+            arith_excp(env, GETPC(), EXC_M_INV, 0);
         }
     } else if (exp == 0x7ff && frac) {
         /* NaN.  */
+        env->fpcr_exc_status |= float_flag_invalid;
         arith_excp(env, GETPC(), EXC_M_INV, 0);
     }
 }
 
+/* Input handing with software completion.  Trap for denorms,
+   unless DNZ is set.  *IF* we try to support DNOD (which
+   none of the produced hardware did, AFAICS), we'll need
+   to suppress the trap when FPCR.DNOD is set; then the
+   code downstream of that will need to cope with denorms
+   sans flush_input_to_zero.  Most of it should work sanely,
+   but there's nothing to compare with...
+*/
+void helper_ieee_input_s(CPUAlphaState *env, uint64_t val)
+{
+    if (unlikely(2 * val - 1 < 0x1fffffffffffff)) {
+	if (!FP_STATUS.flush_inputs_to_zero) {
+	    arith_excp(env, GETPC(), EXC_M_INV | EXC_M_SWC, 0);
+	}
+    }
+}
+
 /* F floating (VAX) */
 static uint64_t float32_to_f(float32 fa)
 {
@@ -707,7 +729,8 @@  static inline uint64_t inline_cvttq(CPUAlphaState *env, uint64_t a,
     frac = a & 0xfffffffffffffull;
 
     if (exp == 0) {
-        if (unlikely(frac != 0)) {
+        if (unlikely(frac != 0) && !FP_STATUS.flush_inputs_to_zero) {
+	    /* not going to happen without working DNOD; ifdef out, perhaps? */
             goto do_underflow;
         }
     } else if (exp == 0x7ff) {
@@ -826,7 +849,9 @@  uint64_t helper_cvtqg(CPUAlphaState *env, uint64_t a)
 
 void helper_fcvtql_v_input(CPUAlphaState *env, uint64_t val)
 {
+    set_float_exception_flags(0, &FP_STATUS);
     if (val != (int32_t)val) {
-        arith_excp(env, GETPC(), EXC_M_IOV, 0);
+        float_raise(float_flag_int_overflow, &FP_STATUS);
+	env->fpcr_exc_status |= float_flag_inexact;
     }
 }
diff --git a/target-alpha/helper.h b/target-alpha/helper.h
index 2cc100b..596f24d 100644
--- a/target-alpha/helper.h
+++ b/target-alpha/helper.h
@@ -88,6 +88,7 @@  DEF_HELPER_FLAGS_3(fp_exc_raise_s, TCG_CALL_NO_WG, void, env, i32, i32)
 
 DEF_HELPER_FLAGS_2(ieee_input, TCG_CALL_NO_WG, void, env, i64)
 DEF_HELPER_FLAGS_2(ieee_input_cmp, TCG_CALL_NO_WG, void, env, i64)
+DEF_HELPER_FLAGS_2(ieee_input_s, TCG_CALL_NO_WG, void, env, i64)
 DEF_HELPER_FLAGS_2(fcvtql_v_input, TCG_CALL_NO_WG, void, env, i64)
 
 #if !defined (CONFIG_USER_ONLY)
diff --git a/target-alpha/translate.c b/target-alpha/translate.c
index 6ea33f3..611b293 100644
--- a/target-alpha/translate.c
+++ b/target-alpha/translate.c
@@ -655,7 +655,8 @@  static TCGv gen_ieee_input(DisasContext *ctx, int reg, int fn11, int is_cmp)
             } else {
                 gen_helper_ieee_input(cpu_env, val);
             }
-        }
+        } else
+            gen_helper_ieee_input_s(cpu_env, val);
     }
     return val;
 }
@@ -2256,24 +2257,15 @@  static ExitStatus translate_one(DisasContext *ctx, uint32_t insn)
             gen_fcmov(ctx, TCG_COND_GT, ra, rb, rc);
             break;
         case 0x030:
-            /* CVTQL */
-            REQUIRE_REG_31(ra);
-            vc = dest_fpr(ctx, rc);
-            vb = load_fpr(ctx, rb);
-            gen_fcvtql(vc, vb);
-            break;
         case 0x130:
-            /* CVTQL/V */
         case 0x530:
-            /* CVTQL/SV */
+            /* CVTQL, CVTQL/V, CVTQL/SV */
             REQUIRE_REG_31(ra);
-            /* ??? I'm pretty sure there's nothing that /sv needs to do that
-               /v doesn't do.  The only thing I can think is that /sv is a
-               valid instruction merely for completeness in the ISA.  */
             vc = dest_fpr(ctx, rc);
             vb = load_fpr(ctx, rb);
             gen_helper_fcvtql_v_input(cpu_env, vb);
             gen_fcvtql(vc, vb);
+            gen_fp_exc_raise(rc, fn11);
             break;
         default:
             goto invalid_opc;