Patchwork tcg: Remove stack protection from helper functions

Submitter Jan Kiszka
Date Sept. 26, 2011, 7:46 a.m.
Message ID <4E802DDD.8090100@siemens.com>
Permalink /patch/116376/
State New

Comments

Jan Kiszka - Sept. 26, 2011, 7:46 a.m.
Stack protection increases the overhead of frequently executed helpers.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---

Maybe this should be applied to more hot-path functions, but I haven't
done any thorough analysis yet.

 Makefile.target |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)
Mulyadi Santosa - Sept. 26, 2011, 8:01 a.m.
Hi...

On Mon, Sep 26, 2011 at 14:46, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> Stack protection increases the overhead of frequently executed helpers.
>
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>

IMHO, the stack protector setup adds more work to the epilogue, but quite
likely it is negligible unless it causes too many L1 cache misses. So,
I think this micro-tuning is somewhat unnecessary but still okay.
Security-wise, I think it's better to just leave it as it is now.
Laurent Desnogues - Sept. 26, 2011, 8:15 a.m.
On Mon, Sep 26, 2011 at 10:01 AM, Mulyadi Santosa
<mulyadi.santosa@gmail.com> wrote:
> Hi...
>
> On Mon, Sep 26, 2011 at 14:46, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> Stack protection increases the overhead of frequently executed helpers.
>>
>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>
> IMHO, the stack protector setup adds more work to the epilogue, but quite
> likely it is negligible unless it causes too many L1 cache misses. So,
> I think this micro-tuning is somewhat unnecessary but still okay.

The impact of stack protection is very high for instance running
FFmpeg ARM with NEON optimizations:  a few months ago I
measured that removing stack protection improved the run time
by more than 10%.  Of course it's extreme since the proportion
of NEON instructions (and hence of helper calls) is very high.


Laurent
Avi Kivity - Sept. 26, 2011, 5:41 p.m.
On 09/26/2011 11:15 AM, Laurent Desnogues wrote:
> On Mon, Sep 26, 2011 at 10:01 AM, Mulyadi Santosa
> <mulyadi.santosa@gmail.com>  wrote:
> >  Hi...
> >
> >  On Mon, Sep 26, 2011 at 14:46, Jan Kiszka<jan.kiszka@siemens.com>  wrote:
> >>  Stack protection increases the overhead of frequently executed helpers.
> >>
> >>  Signed-off-by: Jan Kiszka<jan.kiszka@siemens.com>
> >
> >  IMHO, the stack protector setup adds more work to the epilogue, but quite
> >  likely it is negligible unless it causes too many L1 cache misses. So,
> >  I think this micro-tuning is somewhat unnecessary but still okay.
>
> The impact of stack protection is very high for instance running
> FFmpeg ARM with NEON optimizations:  a few months ago I
> measured that removing stack protection improved the run time
> by more than 10%.  Of course it's extreme since the proportion
> of NEON instructions (and hence of helper calls) is very high.

I saw a lot of helper calls for sse in ordinary x86_64 code, likely for 
memcpy/cmp and friends.  Native tcg ops for common vector instructions 
would probably be quite a speedup.
Richard Henderson - Sept. 26, 2011, 7:43 p.m.
On 09/26/2011 10:41 AM, Avi Kivity wrote:
> Native tcg ops for common vector instructions would probably be quite a speedup.

It's very possible to simply open-code many of the vector operations.

I've done a port of qemu to the SPU (aka Cell) processor.  This core
has no scalar operations; all operations are on vectors.  It turned
out fairly well for the basic arithmetic.  I only have to fall back
on helpers for the more esoteric operations.

That said, all FP vector operations should of course continue to be
done completely via helpers, since one would need helpers for the
individual FP operations anyway.


r~
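
To make "open-coding" basic vector arithmetic concrete, here is a minimal
sketch of a lane-wise add done with plain 64-bit integer operations and no
helper call. It assumes four 16-bit lanes packed into one 64-bit word; the
name vec_add16x4 is hypothetical, and this is not claimed to be how the SPU
port mentioned above does it.

```c
#include <assert.h>
#include <stdint.h>

/* Add four packed 16-bit lanes without letting carries cross lanes.
 * Hypothetical name; illustration only. */
uint64_t vec_add16x4(uint64_t a, uint64_t b)
{
    /* Top bit of each 16-bit lane. */
    const uint64_t msb = 0x8000800080008000ULL;

    /* Add the low 15 bits of every lane. The cleared top bit absorbs
     * any carry, so nothing leaks into the neighbouring lane. */
    uint64_t low = (a & ~msb) + (b & ~msb);

    /* Fold the original top bits back in with XOR, which is addition
     * modulo 2 and therefore drops the carry out of each lane. */
    return low ^ ((a ^ b) & msb);
}

/* Usage: the 0xFFFF lane wraps to 0 and leaves its neighbours alone. */
void vec_add16x4_demo(void)
{
    assert(vec_add16x4(0x0001FFFF00020003ULL, 0x0001000100010001ULL)
           == 0x0002000000030004ULL);
}
```

The same masking idea extends to other lane widths by changing the msb
constant.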
Avi Kivity - Sept. 26, 2011, 7:52 p.m.
On 09/26/2011 10:43 PM, Richard Henderson wrote:
> On 09/26/2011 10:41 AM, Avi Kivity wrote:
> >  Native tcg ops for common vector instructions would probably be quite a speedup.
>
> It's very possible to simply open-code many of the vector operations.
>
> I've done a port of qemu to the SPU (aka Cell) processor.  This core
> has no scalar operations; all operations are on vectors.  It turned
> out fairly well for the basic arithmetic.  I only have to fall back
> on helpers for the more esoteric operations.
>
> That said, all FP vector operations should of course continue to be
> done completely via helpers, since one would need helpers for the
> individual FP operations anyway.

Why do floating point ops need helpers?  At least if all the edge cases 
match? (i.e. NaNs and denormals)
Richard Henderson - Sept. 26, 2011, 7:53 p.m.
On 09/26/2011 12:52 PM, Avi Kivity wrote:
> Why do floating point ops need helpers?

Because TCG doesn't do any native floating point.


r~
Peter Maydell - Sept. 26, 2011, 8:19 p.m.
On 26 September 2011 20:52, Avi Kivity <avi@redhat.com> wrote:
> Why do floating point ops need helpers?  At least if all the edge cases
> match? (i.e. NaNs and denormals)

The answer is that the edge cases basically never match. No CPU
architecture does handling of NaNs and input denormals and output
denormals and underflow checks and all the rest of it in exactly
the same way as anybody else. (In particular x86 is pretty crazy,
which is unfortunate given that it's the most significant host
arch at the moment.) So any kind of TCG native floating point
support would probably have to translate to "check if either
input is a special case; if not, try the op; check if the output
was a special case; if any of those checks fired, back off to
the softfloat helper function". Which is quite a lot of inline
code, and also annoyingly bouncing between fp ops and integer
bitpattern checks on the fp values.

-- PMM
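
The check/try/check/fall-back sequence described above can be sketched in
plain C. This is an illustration only, not TCG code: it assumes the host
float type is IEEE 754 binary32, and slow_path_add, f32_is_special,
guarded_add and the slow_path_calls counter are all hypothetical names. A
real softfloat helper would implement the guest's exact NaN, denormal and
flag rules; the stand-in here just counts calls.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Demo-only counter so the usage function can show which path ran. */
int slow_path_calls = 0;

/* Hypothetical stand-in for a softfloat helper. */
float slow_path_add(float a, float b)
{
    slow_path_calls++;
    return a + b;
}

/* Integer bit-pattern check on an fp value: true for NaN or infinity
 * (exponent all ones) and for denormals (exponent zero, fraction
 * nonzero). Assumes IEEE 754 binary32 layout. */
bool f32_is_special(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    uint32_t exponent = (bits >> 23) & 0xff;
    uint32_t fraction = bits & 0x7fffff;
    return exponent == 0xff || (exponent == 0 && fraction != 0);
}

float guarded_add(float a, float b)
{
    /* 1. Check the inputs. */
    if (f32_is_special(a) || f32_is_special(b)) {
        return slow_path_add(a, b);
    }
    /* 2. Try the operation on the host FPU. */
    float result = a + b;
    /* 3. Check the output; back off to the helper if it is special. */
    if (f32_is_special(result)) {
        return slow_path_add(a, b);
    }
    return result;
}

/* Usage: ordinary values stay on the fast path; a NaN input or an
 * overflowing result takes the slow path. */
void guarded_add_demo(void)
{
    slow_path_calls = 0;
    assert(guarded_add(1.0f, 2.0f) == 3.0f);
    assert(slow_path_calls == 0);
    assert(isnan(guarded_add(NAN, 1.0f)));
    assert(slow_path_calls == 1);
    assert(isinf(guarded_add(3.0e38f, 3.0e38f)));
    assert(slow_path_calls == 2);
}
```

Even in this reduced form, every operation carries three bit-pattern checks
and two possible exits, which is the inline-code cost referred to above.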
Avi Kivity - Sept. 26, 2011, 8:20 p.m.
On 09/26/2011 10:53 PM, Richard Henderson wrote:
> On 09/26/2011 12:52 PM, Avi Kivity wrote:
> >  Why do floating point ops need helpers?
>
> Because TCG doesn't do any native floating point.
>

Well, it could be made to do it.
Avi Kivity - Sept. 26, 2011, 8:26 p.m.
On 09/26/2011 11:19 PM, Peter Maydell wrote:
> On 26 September 2011 20:52, Avi Kivity<avi@redhat.com>  wrote:
> >  Why do floating point ops need helpers?  At least if all the edge cases
> >  match? (i.e. NaNs and denormals)
>
> The answer is that the edge cases basically never match.

Surely they do when host == target.  Although there you can virtualize.

>   No CPU
> architecture does handling of NaNs and input denormals and output
> denormals and underflow checks and all the rest of it in exactly
> the same way as anybody else. (In particular x86 is pretty crazy,
> which is unfortunate given that it's the most significant host
> arch at the moment.) So any kind of TCG native floating point
> support would probably have to translate to "check if either
> input is a special case; if not, try the op; check if the output
> was a special case; if any of those checks fired, back off to
> the softfloat helper function". Which is quite a lot of inline
> code, and also annoyingly bouncing between fp ops and integer
> bitpattern checks on the fp values.
>

Alternatively, configure the fpu to trap on these cases, and handle them 
in a slow path.  At least x86 sse allows this (though perhaps not for 
"quiet NaN"s?).

Does it matter in practice?  Perhaps we can have a fast-and-loose mode 
for the fpu (gcc does).
Andi Kleen - Sept. 27, 2011, 4:29 a.m.
Peter Maydell <peter.maydell@linaro.org> writes:
>
> The answer is that the edge cases basically never match. No CPU
> architecture does handling of NaNs and input denormals and output
> denormals and underflow checks and all the rest of it in exactly
> the same way as anybody else. (In particular x86 is pretty crazy,

Can you clarify this? 

IEEE754 is pretty strict on how all these things are handled
and to my knowledge all serious x86 are fully IEEE compliant.
Or are you referring to the x87 80bits expansion? While useful
that's not used anymore with SSE.

On the other hand qemu is not very good at it, e.g. with x87
it doesn't even pass paranoia.

-Andi
Peter Maydell - Sept. 27, 2011, 7:58 a.m.
On 27 September 2011 05:29, Andi Kleen <andi@firstfloor.org> wrote:
> Peter Maydell <peter.maydell@linaro.org> writes:
>> The answer is that the edge cases basically never match. No CPU
>> architecture does handling of NaNs and input denormals and output
>> denormals and underflow checks and all the rest of it in exactly
>> the same way as anybody else. (In particular x86 is pretty crazy,
>
> Can you clarify this?
>
> IEEE754 is pretty strict on how all these things are handled
> and to my knowledge all serious x86 are fully IEEE compliant.
> Or are you referring to the x87 80bits expansion? While useful
> that's not used anymore with SSE.

IEEE leaves some leeway for implementations. Just off the top
of my head:
 * if two NaNs are passed to an op then which one is propagated
   is implementation defined
 * value of the 'default NaN' is imp-def
 * whether the msbit of the significand is 1 or 0 to indicate
   an SNaN is imp-def
 * how an SNaN is converted to a QNaN is imp-def
 * tininess can be detected either before or after rounding

which different architectures vary on (and some are better at
documenting their choices than others).

Also implementations often have extra non-IEEE modes (which
may even be the default, for speed):
 * squashing denormal outputs to zero
 * squashing denormal inputs to zero
and there's even less agreement here.

Common-but-not-officially-ieee ops like 'max' and 'min'
can also vary: for instance Intel's SSE MAXSD/MINSD etc
have very weird behaviour for the special cases: if both
operands are 0.0s of any sign you always get the second
operand, so max(-0,+0) != max(+0,-0), and if only one operand
is a NaN then the second operand is returned whether it is
the NaN or not (so max(NaN, 3) != max(3, NaN)).

> On the other hand qemu is not very good at it, e.g. with x87
> it doesn't even pass paranoia.

This is only because nobody cares much about x86 TCG. ARM floating
point emulation is (now) bit-for-bit correct apart from a handful
of operations which don't set the right set of exception flags.

-- PMM
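
The MAXSD special cases listed above follow from a single rule: return the
first operand only if it compares strictly greater, otherwise return the
second. A minimal C model of that rule is sketched below. The name sse_maxsd
is hypothetical, and the sketch models only which operand is selected, not
exception flags or SNaN signalling. It assumes the code is compiled without
options such as -ffast-math that relax NaN and signed-zero handling.

```c
#include <assert.h>
#include <math.h>

/* Model of the MAXSD selection rule (hypothetical name).
 * Any comparison involving a NaN is false, and +0.0 > -0.0 is false,
 * so in all those cases the second operand is returned. */
double sse_maxsd(double a, double b)
{
    return a > b ? a : b;
}

/* Usage: operand order changes the result for zeros and NaNs. */
void sse_maxsd_demo(void)
{
    assert(!signbit(sse_maxsd(-0.0, 0.0)));   /* second operand: +0.0 */
    assert(signbit(sse_maxsd(0.0, -0.0)));    /* second operand: -0.0 */
    assert(sse_maxsd(NAN, 3.0) == 3.0);       /* second operand: 3.0  */
    assert(isnan(sse_maxsd(3.0, NAN)));       /* second operand: NaN  */
}
```

A guest architecture whose max operation is symmetric in its operands cannot
be emulated by issuing this host operation directly for those inputs.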

Patch

diff --git a/Makefile.target b/Makefile.target
index 88d2f1f..1cc758d 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -91,7 +91,7 @@  tcg/tcg.o: cpu.h
 
 # HELPER_CFLAGS is used for all the code compiled with static register
 # variables
-op_helper.o user-exec.o: QEMU_CFLAGS += $(HELPER_CFLAGS)
+op_helper.o user-exec.o: QEMU_CFLAGS := $(subst -fstack-protector-all,,$(QEMU_CFLAGS)) $(HELPER_CFLAGS)
 
 # Note: this is a workaround. The real fix is to avoid compiling
 # cpu_signal_handler() in user-exec.c.