Message ID: 4E802DDD.8090100@siemens.com
State: New
Hi...

On Mon, Sep 26, 2011 at 14:46, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> This increases the overhead of frequently executed helpers.
>
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>

IMHO, the stack protector setup puts more stuff in the epilogue, but quite likely it is negligible unless it causes too many L1 cache misses. So I think this micro tuning is somewhat unnecessary, but still okay. Security-wise, I think it's better to just leave it as it is now.
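For readers unfamiliar with the overhead being discussed, here is a rough model of what `-fstack-protector-all` makes the compiler insert around every function. This is an illustrative sketch, not actual compiler output: the `guard` variable and `stack_chk_fail` hook are hypothetical stand-ins for glibc's `__stack_chk_guard` and `__stack_chk_fail`.

```c
#include <stdint.h>
#include <stdlib.h>

static uintptr_t guard = 0xdeadbeefu;   /* hypothetical fixed canary value */

static void stack_chk_fail(void) {
    abort();                            /* the real hook aborts the process */
}

static int helper_add(int a, int b) {
    uintptr_t canary = guard;           /* prologue: stash the canary */
    int r = a + b;                      /* the helper's real work */
    if (canary != guard)                /* epilogue: recheck before return */
        stack_chk_fail();
    return r;
}
```

The extra load, compare, and conditional branch run on every call, which is why it shows up in frequently executed TCG helpers.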
On Mon, Sep 26, 2011 at 10:01 AM, Mulyadi Santosa <mulyadi.santosa@gmail.com> wrote:
> Hi...
>
> On Mon, Sep 26, 2011 at 14:46, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> This increases the overhead of frequently executed helpers.
>>
>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>
> IMHO, the stack protector setup puts more stuff in the epilogue, but quite
> likely it is negligible unless it causes too many L1 cache misses. So
> I think this micro tuning is somewhat unnecessary, but still okay.

The impact of stack protection is very high, for instance when running FFmpeg ARM with NEON optimizations: a few months ago I measured that removing stack protection improved the run time by more than 10%. Of course it's an extreme case, since the proportion of NEON instructions (and hence of helper calls) is very high.


Laurent
On 09/26/2011 11:15 AM, Laurent Desnogues wrote:
> On Mon, Sep 26, 2011 at 10:01 AM, Mulyadi Santosa
> <mulyadi.santosa@gmail.com> wrote:
>> Hi...
>>
>> On Mon, Sep 26, 2011 at 14:46, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>>> This increases the overhead of frequently executed helpers.
>>>
>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>>
>> IMHO, the stack protector setup puts more stuff in the epilogue, but quite
>> likely it is negligible unless it causes too many L1 cache misses. So
>> I think this micro tuning is somewhat unnecessary, but still okay.
>
> The impact of stack protection is very high, for instance when running
> FFmpeg ARM with NEON optimizations: a few months ago I
> measured that removing stack protection improved the run time
> by more than 10%. Of course it's an extreme case, since the proportion
> of NEON instructions (and hence of helper calls) is very high.

I saw a lot of helper calls for SSE in ordinary x86_64 code, likely for memcpy/memcmp and friends. Native tcg ops for common vector instructions would probably be quite a speedup.
On 09/26/2011 10:41 AM, Avi Kivity wrote:
> Native tcg ops for common vector instructions would probably be quite a speedup.
It's very possible to simply open-code many of the vector operations.
I've done a port of qemu to the SPU (aka Cell) processor. This core
has no scalar operations; all operations are on vectors. It turned
out fairly well for the basic arithmetic. I only have to fall back
on helpers for the more esoteric operations.
That said, all FP vector operations should of course continue to be
done completely via helpers, since one would need helpers for the
individual FP operations anyway.
r~
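One common way to open-code an integer vector operation on a host without matching SIMD support is a SWAR trick like the following. This is an illustrative sketch, not QEMU code: it adds eight 8-bit lanes packed into a 64-bit integer, the kind of inline sequence a code generator could emit instead of calling a helper for a paddb-style instruction.

```c
#include <stdint.h>

/* Byte-wise packed add without letting carries cross lane boundaries:
 * mask off every lane's MSB so the low 7 bits add safely, then patch
 * the MSBs back in with XOR (addition without carry-in equals XOR). */
static uint64_t addv8i8(uint64_t a, uint64_t b) {
    uint64_t low7 = (a & 0x7f7f7f7f7f7f7f7fULL)
                  + (b & 0x7f7f7f7f7f7f7f7fULL);     /* lane-safe partial sum */
    return low7 ^ ((a ^ b) & 0x8080808080808080ULL); /* restore lane MSBs */
}
```

Each lane wraps modulo 256 independently, exactly as a packed byte add does.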
On 09/26/2011 10:43 PM, Richard Henderson wrote:
> On 09/26/2011 10:41 AM, Avi Kivity wrote:
>> Native tcg ops for common vector instructions would probably be quite a speedup.
>
> It's very possible to simply open-code many of the vector operations.
>
> I've done a port of qemu to the SPU (aka Cell) processor. This core
> has no scalar operations; all operations are on vectors. It turned
> out fairly well for the basic arithmetic. I only have to fall back
> on helpers for the more esoteric operations.
>
> That said, all FP vector operations should of course continue to be
> done completely via helpers, since one would need helpers for the
> individual FP operations anyway.

Why do floating point ops need helpers? At least if all the edge cases match? (i.e. NaNs and denormals)
On 09/26/2011 12:52 PM, Avi Kivity wrote:
> Why do floating point ops need helpers?
Because TCG doesn't do any native floating point.
r~
On 26 September 2011 20:52, Avi Kivity <avi@redhat.com> wrote:
> Why do floating point ops need helpers? At least if all the edge cases
> match? (i.e. NaNs and denormals)

The answer is that the edge cases basically never match. No CPU architecture does handling of NaNs and input denormals and output denormals and underflow checks and all the rest of it in exactly the same way as anybody else. (In particular x86 is pretty crazy, which is unfortunate given that it's the most significant host arch at the moment.)

So any kind of TCG native floating point support would probably have to translate to "check if either input is a special case; if not, try the op; check if the output was a special case; if any of those checks fired, back off to the softfloat helper function". Which is quite a lot of inline code, and also annoyingly bounces between fp ops and integer bitpattern checks on the fp values.

-- PMM
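The fast-path/slow-path scheme Peter describes could be sketched roughly like this. It is a toy model using C classification macros rather than hardware status flags, and `slowpath_add` is a hypothetical stand-in for a softfloat helper such as QEMU's `float64_add()`:

```c
#include <math.h>

static int used_slow;                     /* records which path ran */

/* placeholder for the bit-exact softfloat implementation */
static double slowpath_add(double a, double b) {
    used_slow = 1;
    return a + b;
}

static double fast_add(double a, double b) {
    used_slow = 0;
    /* check if either input is a special case */
    if (isnan(a) || isnan(b) ||
        fpclassify(a) == FP_SUBNORMAL || fpclassify(b) == FP_SUBNORMAL)
        return slowpath_add(a, b);
    double r = a + b;                     /* try the op natively */
    /* check if the output was a special case */
    if (isnan(r) || isinf(r) || fpclassify(r) == FP_SUBNORMAL)
        return slowpath_add(a, b);
    return r;
}
```

Even in this stripped-down form, the special-case checks roughly triple the code for a single add, which illustrates Peter's point about the amount of inline code.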
On 09/26/2011 10:53 PM, Richard Henderson wrote:
> On 09/26/2011 12:52 PM, Avi Kivity wrote:
>> Why do floating point ops need helpers?
>
> Because TCG doesn't do any native floating point.

Well, it could be made to do it.
On 09/26/2011 11:19 PM, Peter Maydell wrote:
> On 26 September 2011 20:52, Avi Kivity <avi@redhat.com> wrote:
>> Why do floating point ops need helpers? At least if all the edge cases
>> match? (i.e. NaNs and denormals)
>
> The answer is that the edge cases basically never match.

Surely they do when host == target. Although there you can virtualize.

> No CPU
> architecture does handling of NaNs and input denormals and output
> denormals and underflow checks and all the rest of it in exactly
> the same way as anybody else. (In particular x86 is pretty crazy,
> which is unfortunate given that it's the most significant host
> arch at the moment.) So any kind of TCG native floating point
> support would probably have to translate to "check if either
> input is a special case; if not, try the op; check if the output
> was a special case; if any of those checks fired, back off to
> the softfloat helper function". Which is quite a lot of inline
> code, and also annoyingly bouncing between fp ops and integer
> bitpattern checks on the fp values.

Alternatively, configure the FPU to trap on these cases and handle them in a slow path. At least x86 SSE allows this (though perhaps not for quiet NaNs? Does it matter in practice?). Perhaps we can have a fast-and-loose mode for the FPU (gcc does).
Peter Maydell <peter.maydell@linaro.org> writes:
> The answer is that the edge cases basically never match. No CPU
> architecture does handling of NaNs and input denormals and output
> denormals and underflow checks and all the rest of it in exactly
> the same way as anybody else. (In particular x86 is pretty crazy,

Can you clarify this?

IEEE754 is pretty strict on how all these things are handled, and to my knowledge all serious x86 implementations are fully IEEE compliant. Or are you referring to the x87 80-bit extended precision? While useful, that's not used anymore with SSE.

On the other hand QEMU is not very good at it, e.g. with x87 it doesn't even pass paranoia.

-Andi
On 27 September 2011 05:29, Andi Kleen <andi@firstfloor.org> wrote:
> Peter Maydell <peter.maydell@linaro.org> writes:
>> The answer is that the edge cases basically never match. No CPU
>> architecture does handling of NaNs and input denormals and output
>> denormals and underflow checks and all the rest of it in exactly
>> the same way as anybody else. (In particular x86 is pretty crazy,
>
> Can you clarify this?
>
> IEEE754 is pretty strict on how all these things are handled, and to my
> knowledge all serious x86 implementations are fully IEEE compliant.
> Or are you referring to the x87 80-bit extended precision? While useful,
> that's not used anymore with SSE.

IEEE leaves some leeway for implementations. Just off the top of my head:

* if two NaNs are passed to an op then which one is propagated is implementation defined
* the value of the 'default NaN' is imp-def
* whether the msbit of the significand is 1 or 0 to indicate an SNaN is imp-def
* how an SNaN is converted to a QNaN is imp-def
* tininess can be detected either before or after rounding

These are things different architectures vary on (and some are better at documenting their choices than others). Also implementations often have extra non-IEEE modes (which may even be the default, for speed):

* squashing denormal outputs to zero
* squashing denormal inputs to zero

and there's even less agreement here.

Common-but-not-officially-IEEE ops like 'max' and 'min' can also vary: for instance Intel's SSE MAXSD/MINSD etc have very weird behaviour for the special cases: if both operands are 0.0s of any sign you always get the second operand, so max(-0,+0) != max(+0,-0), and if only one operand is a NaN then the second operand is returned whether it is the NaN or not (so max(NaN, 3) != max(3, NaN)).

> On the other hand QEMU is not very good at it, e.g. with x87
> it doesn't even pass paranoia.

This is only because nobody cares much about x86 TCG.
ARM floating point emulation is (now) bit-for-bit correct apart from a handful of operations which don't set the right set of exception flags. -- PMM
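The MAXSD/MINSD special cases mentioned above can be modelled in one line of C, assuming Intel's documented "if dst > src then dst else src" definition (any unordered or equal compare picks the second operand):

```c
#include <math.h>

/* Model of MAXSD: keep the first operand only when it compares
 * strictly greater; ties between +/-0.0 and any NaN comparison
 * fall through to the second operand, which is why the operation
 * is asymmetric for signed zeroes and NaNs. */
static double maxsd(double dst, double src) {
    return dst > src ? dst : src;
}
```

So `maxsd(-0.0, +0.0)` yields +0.0 while `maxsd(+0.0, -0.0)` yields -0.0, and a NaN in the first operand silently disappears while a NaN in the second propagates — none of which matches a naive IEEE-style maximum.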
diff --git a/Makefile.target b/Makefile.target
index 88d2f1f..1cc758d 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -91,7 +91,7 @@ tcg/tcg.o: cpu.h
 
 # HELPER_CFLAGS is used for all the code compiled with static register
 # variables
-op_helper.o user-exec.o: QEMU_CFLAGS += $(HELPER_CFLAGS)
+op_helper.o user-exec.o: QEMU_CFLAGS := $(subst -fstack-protector-all,,$(QEMU_CFLAGS)) $(HELPER_CFLAGS)
 
 # Note: this is a workaround. The real fix is to avoid compiling
 # cpu_signal_handler() in user-exec.c.
This increases the overhead of frequently executed helpers.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
Maybe this should be applied to more hot-path functions, but I haven't done any thorough analysis yet.

 Makefile.target |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)