bitops.h: Compile out asserts without --enable-debug

Message ID	20230520205444.887287457E7@zero.eik.bme.hu
State	New
Headers	show Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org> From: BALATON Zoltan <balaton@eik.bme.hu> Subject: [PATCH] bitops.h: Compile out asserts without --enable-debug MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit To: qemu-devel@nongnu.org Cc: qemu-trivial@nongnu.org Message-Id: <20230520205444.887287457E7@zero.eik.bme.hu> Date: Sat, 20 May 2023 22:54:44 +0200 (CEST) Received-SPF: pass client-ip=152.66.115.2; envelope-from=balaton@eik.bme.hu; helo=zero.eik.bme.hu X-Spam_score_int: -18 X-Spam_score: -1.9 X-Spam_bar: - X-Spam_report: (-1.9 / 5.0 requ) BAYES_00=-1.9, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action Precedence: list Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org
Series	bitops.h: Compile out asserts without --enable-debug \| expand bitops.h: Compile out asserts without --enable-debug

BALATON Zoltan May 20, 2023, 8:54 p.m. UTC

The low level extract and deposit funtions provided by bitops.h are
used in performance critical places. It crept into target/ppc via
FIELD_EX64 and also used by softfloat so PPC code using a lot of FPU
where hardfloat is also disabled is doubly affected.

Normally asserts should be compiled out from release builds with
-DNDEBUG but that cannot be used in QEMU because some places still
rely on asserts instead of proper error checking. To resolve this,
compile out asserts from deposit/extract functions in optimised builds
which improves performance for target/ppc and possibly others too.

Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
---
 include/qemu/bitops.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

Peter Maydell May 22, 2023, 9:16 a.m. UTC | #1

On Sat, 20 May 2023 at 21:55, BALATON Zoltan <balaton@eik.bme.hu> wrote:
>
> The low level extract and deposit funtions provided by bitops.h are
> used in performance critical places. It crept into target/ppc via
> FIELD_EX64 and also used by softfloat so PPC code using a lot of FPU
> where hardfloat is also disabled is doubly affected.
>
> Normally asserts should be compiled out from release builds with
> -DNDEBUG but that cannot be used in QEMU because some places still
> rely on asserts instead of proper error checking. To resolve this,
> compile out asserts from deposit/extract functions in optimised builds
> which improves performance for target/ppc and possibly others too.

Can we have some figures for performance improvements,
please? General QEMU policy is that asserts remain,
even in non-debug builds, so exceptions from that policy
should come with justification with figures attached.

thanks
-- PMM

Alex Bennée May 22, 2023, 11:26 a.m. UTC | #2

BALATON Zoltan <balaton@eik.bme.hu> writes:

> The low level extract and deposit funtions provided by bitops.h are
> used in performance critical places. It crept into target/ppc via
> FIELD_EX64 and also used by softfloat so PPC code using a lot of FPU
> where hardfloat is also disabled is doubly affected.

Most of these asserts compile out to nothing if the compiler is able to
verify the constants are in the range. For example examining
the start of float64_add:

  Dump of assembler code for function float64_add:
  ../../fpu/softfloat.c:
  1979	{
     0x00000000007ac9b0 <+0>:	movabs $0xfffffffffffff,%r9
     0x00000000007ac9ba <+10>:	push   %rbx

  /home/alex/lsrc/qemu.git/include/qemu/bitops.h:
  396	    return (value >> start) & (~0ULL >> (64 - length));
     0x00000000007ac9bb <+11>:	mov    %rdi,%rcx
     0x00000000007ac9be <+14>:	shr    $0x34,%rcx
     0x00000000007ac9c2 <+18>:	and    $0x7ff,%ecx

  ../../fpu/softfloat.c:
  1979	{
     0x00000000007ac9c8 <+24>:	sub    $0x30,%rsp

  /home/alex/lsrc/qemu.git/include/qemu/bitops.h:
  396	    return (value >> start) & (~0ULL >> (64 - length));
     0x00000000007ac9cc <+28>:	mov    %fs:0x28,%rax
     0x00000000007ac9d5 <+37>:	mov    %rax,0x28(%rsp)
     0x00000000007ac9da <+42>:	mov    %rdi,%rax
     0x00000000007ac9dd <+45>:	and    %r9,%rdi

  ../../fpu/softfloat.c:
  588	    *r = (FloatParts64) {
     0x00000000007ac9e0 <+48>:	mov    %ecx,0x4(%rsp)
     0x00000000007ac9e4 <+52>:	mov    %rdi,0x8(%rsp)

  /home/alex/lsrc/qemu.git/include/qemu/bitops.h:
  396	    return (value >> start) & (~0ULL >> (64 - length));
     0x00000000007ac9e9 <+57>:	shr    $0x3f,%rax

  ../../fpu/softfloat.c:
  588	    *r = (FloatParts64) {
     0x00000000007ac9ed <+61>:	mov    %al,0x1(%rsp)

  589	        .cls = float_class_unclassified,
  590	        .sign = extract64(raw, f_size + e_size, 1),
     0x00000000007ac9f1 <+65>:	mov    %rax,%r8

I don't see any check and abort steps because all the shift and mask
values are known at compile time. The softfloat compilation certainly
does have some assert points though:

 readelf -s ./libqemu-ppc64-softmmu.fa.p/fpu_softfloat.c.o  |grep assert
   136: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND g_assertion_mess[...]
   138: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND __assert_fail

but the references are for the ISRA segments so its tricky to know if
they get used or are just there for LTO purposes.

If there are hot-paths that show up the extract/deposit functions I
suspect a better approach would be to implement _nocheck variants (or
maybe _noassert?) and use them where required rather than turning off
the assert checking for these utility functions.

BALATON Zoltan May 22, 2023, noon UTC | #3

On Mon, 22 May 2023, Peter Maydell wrote:
> On Sat, 20 May 2023 at 21:55, BALATON Zoltan <balaton@eik.bme.hu> wrote:
>>
>> The low level extract and deposit funtions provided by bitops.h are
>> used in performance critical places. It crept into target/ppc via
>> FIELD_EX64 and also used by softfloat so PPC code using a lot of FPU
>> where hardfloat is also disabled is doubly affected.
>>
>> Normally asserts should be compiled out from release builds with
>> -DNDEBUG but that cannot be used in QEMU because some places still
>> rely on asserts instead of proper error checking. To resolve this,
>> compile out asserts from deposit/extract functions in optimised builds
>> which improves performance for target/ppc and possibly others too.
>
> Can we have some figures for performance improvements,
> please? General QEMU policy is that asserts remain,
> even in non-debug builds, so exceptions from that policy
> should come with justification with figures attached.

Here are some figures converting a 10MB wav file to mp3 with lame on 
AmigaOS pegasos2 which is using a lot of FPU operations (which is using
softfloat on TCG target/ppc due to no hardfloat so it's very slow):

    8.0: 1:11 0.8264x, 1:11 0x8258x
master: 1:12 0.8117x, 1:12 0.8103x
  patch: 1:02 0.9541x, 1:02 0.9506x

The numbers are time minute:seconds and speed compared to play speed (lame 
calls this play/CPU). I did two runs, first after booting the guest and 
another one after the first. Despite the second run should use more cache 
and less compile overhead the first runs seem to be a bit faster. Real 
hardware gets about 11x speed although that's with Altivec where on QEMU 
Altivec does not help much so we're still much behind even on an Intel 
Core i7 3.6GHz CPU for FPU ops where integer ops are much better. I got 
similar reports from Apple silicon hardware with macOS too. Eventually 
fixing hardfloat for target/ppc would help even more but this is 
independent of that and significant enough to remove this overhead now. 
I've never seen these asserts firing and unlikely to depend on run time 
values so leaving them in non-debug builds seems unnecessary overkill that 
also hurts performance.

Regards,
BALATON Zoltan

Peter Maydell May 22, 2023, 12:05 p.m. UTC | #4

On Mon, 22 May 2023 at 13:00, BALATON Zoltan <balaton@eik.bme.hu> wrote:
> On Mon, 22 May 2023, Peter Maydell wrote:
> > Can we have some figures for performance improvements,
> > please? General QEMU policy is that asserts remain,
> > even in non-debug builds, so exceptions from that policy
> > should come with justification with figures attached.
>
> Here are some figures converting a 10MB wav file to mp3 with lame on
> AmigaOS pegasos2 which is using a lot of FPU operations (which is using
> softfloat on TCG target/ppc due to no hardfloat so it's very slow):
>
>     8.0: 1:11 0.8264x, 1:11 0x8258x
> master: 1:12 0.8117x, 1:12 0.8103x
>   patch: 1:02 0.9541x, 1:02 0.9506x

Thanks; that does seem like quite a lot of overhead.

> I've never seen these asserts firing and unlikely to depend on run time
> values so leaving them in non-debug builds seems unnecessary overkill that
> also hurts performance.

Where they don't depend on runtime values, the compiler will be
able to detect that the assert condition is always false and
won't emit any code. It's only the cases where the start or
length are variable that will be causing the slowdown.

thanks
-- PMM

BALATON Zoltan May 22, 2023, 1:25 p.m. UTC | #5

On Mon, 22 May 2023, Alex Bennée wrote:
> BALATON Zoltan <balaton@eik.bme.hu> writes:
>
>> The low level extract and deposit funtions provided by bitops.h are
>> used in performance critical places. It crept into target/ppc via
>> FIELD_EX64 and also used by softfloat so PPC code using a lot of FPU
>> where hardfloat is also disabled is doubly affected.
>
> Most of these asserts compile out to nothing if the compiler is able to
> verify the constants are in the range. For example examining
> the start of float64_add:
>
>  Dump of assembler code for function float64_add:
>  ../../fpu/softfloat.c:
>  1979	{
>     0x00000000007ac9b0 <+0>:	movabs $0xfffffffffffff,%r9
>     0x00000000007ac9ba <+10>:	push   %rbx
>
>  /home/alex/lsrc/qemu.git/include/qemu/bitops.h:
>  396	    return (value >> start) & (~0ULL >> (64 - length));
>     0x00000000007ac9bb <+11>:	mov    %rdi,%rcx
>     0x00000000007ac9be <+14>:	shr    $0x34,%rcx
>     0x00000000007ac9c2 <+18>:	and    $0x7ff,%ecx
>
>  ../../fpu/softfloat.c:
>  1979	{
>     0x00000000007ac9c8 <+24>:	sub    $0x30,%rsp
>
>  /home/alex/lsrc/qemu.git/include/qemu/bitops.h:
>  396	    return (value >> start) & (~0ULL >> (64 - length));
>     0x00000000007ac9cc <+28>:	mov    %fs:0x28,%rax
>     0x00000000007ac9d5 <+37>:	mov    %rax,0x28(%rsp)
>     0x00000000007ac9da <+42>:	mov    %rdi,%rax
>     0x00000000007ac9dd <+45>:	and    %r9,%rdi
>
>  ../../fpu/softfloat.c:
>  588	    *r = (FloatParts64) {
>     0x00000000007ac9e0 <+48>:	mov    %ecx,0x4(%rsp)
>     0x00000000007ac9e4 <+52>:	mov    %rdi,0x8(%rsp)
>
>  /home/alex/lsrc/qemu.git/include/qemu/bitops.h:
>  396	    return (value >> start) & (~0ULL >> (64 - length));
>     0x00000000007ac9e9 <+57>:	shr    $0x3f,%rax
>
>  ../../fpu/softfloat.c:
>  588	    *r = (FloatParts64) {
>     0x00000000007ac9ed <+61>:	mov    %al,0x1(%rsp)
>
>  589	        .cls = float_class_unclassified,
>  590	        .sign = extract64(raw, f_size + e_size, 1),
>     0x00000000007ac9f1 <+65>:	mov    %rax,%r8
>
> I don't see any check and abort steps because all the shift and mask
> values are known at compile time. The softfloat compilation certainly
> does have some assert points though:
>
> readelf -s ./libqemu-ppc64-softmmu.fa.p/fpu_softfloat.c.o  |grep assert
>   136: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND g_assertion_mess[...]
>   138: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND __assert_fail
>
> but the references are for the ISRA segments so its tricky to know if
> they get used or are just there for LTO purposes.
>
> If there are hot-paths that show up the extract/deposit functions I
> suspect a better approach would be to implement _nocheck variants (or
> maybe _noassert?) and use them where required rather than turning off
> the assert checking for these utility functions.

Just to clarify again, the asserts are still there when compiled with 
--enable-debug. The patch only turns them off for optimised release builds 
which I think makes sense if these asserts are to catch programming 
errors. I think I've also suggested adding noassert versions of these but 
that wasn't a popular idea and it may also not be easy to convert all 
places to use that like for example the register fields related usage in 
target/ppc as that would also affect other places. So this seems to be the 
simplest and most effective approach.

The softfloat related usage in these tests I've done seem to mostly come 
from unpacking and repacking floats in softfloat which is done for every 
operation, e.g. muladd which mp3 encoding mostly uses does 3 unpacks and 1 
pack for each call and each unpack is 3 extracts so even small overheads 
add app quickly. Just 1 muladd will result in 9 extracts and 2 deposits at 
least plus updating PPC flags for each FPU op adds a bunch more. I did 
some profiling with perf to find these.

Regards,
BALATON Zoltan

Alex Bennée May 22, 2023, 4:48 p.m. UTC | #6

(ajb: add Richard for his compiler-fu)

BALATON Zoltan <balaton@eik.bme.hu> writes:

> On Mon, 22 May 2023, Alex Bennée wrote:
>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>
>>> The low level extract and deposit funtions provided by bitops.h are
>>> used in performance critical places. It crept into target/ppc via
>>> FIELD_EX64 and also used by softfloat so PPC code using a lot of FPU
>>> where hardfloat is also disabled is doubly affected.
>>
>> Most of these asserts compile out to nothing if the compiler is able to
>> verify the constants are in the range. For example examining
>> the start of float64_add:
>>
<snip>
>>
>> I don't see any check and abort steps because all the shift and mask
>> values are known at compile time. The softfloat compilation certainly
>> does have some assert points though:
>>
>> readelf -s ./libqemu-ppc64-softmmu.fa.p/fpu_softfloat.c.o  |grep assert
>>   136: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND g_assertion_mess[...]
>>   138: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND __assert_fail
>>
>> but the references are for the ISRA segments so its tricky to know if
>> they get used or are just there for LTO purposes.
>>
>> If there are hot-paths that show up the extract/deposit functions I
>> suspect a better approach would be to implement _nocheck variants (or
>> maybe _noassert?) and use them where required rather than turning off
>> the assert checking for these utility functions.
>
> Just to clarify again, the asserts are still there when compiled with
> --enable-debug. The patch only turns them off for optimised release
> builds which I think makes sense if these asserts are to catch
> programming errors.

Well as Peter said the general policy is to keep asserts in but I
appreciate this is a hotpath case.

> I think I've also suggested adding noassert
> versions of these but that wasn't a popular idea and it may also not
> be easy to convert all places to use that like for example the
> register fields related usage in target/ppc as that would also affect
> other places.

Is code generation or device emulation really on the hot-path. Generally
a well predicted assert is in the noise for those operations.

> So this seems to be the simplest and most effective
> approach.
>
> The softfloat related usage in these tests I've done seem to mostly
> come from unpacking and repacking floats in softfloat which is done
> for every operation, e.g. muladd which mp3 encoding mostly uses does 3
> unpacks and 1 pack for each call and each unpack is 3 extracts so even
> small overheads add app quickly. Just 1 muladd will result in 9
> extracts and 2 deposits at least plus updating PPC flags for each FPU
> op adds a bunch more. I did some profiling with perf to find these.

After some messing about trying to get lame to cross compile to a static
binary I was able to replicate what you've seen:

  11.44%  qemu-ppc64  qemu-ppc64               [.] unpack_raw64.isra.0
  11.03%  qemu-ppc64  qemu-ppc64               [.] parts64_uncanon_normal
   8.26%  qemu-ppc64  qemu-ppc64               [.] helper_compute_fprf_float64
   6.75%  qemu-ppc64  qemu-ppc64               [.] do_float_check_status
   5.34%  qemu-ppc64  qemu-ppc64               [.] parts64_muladd
   4.75%  qemu-ppc64  qemu-ppc64               [.] pack_raw64.isra.0
   4.38%  qemu-ppc64  qemu-ppc64               [.] parts64_canonicalize
   3.62%  qemu-ppc64  qemu-ppc64               [.] float64r32_round_pack_canonical
   3.32%  qemu-ppc64  qemu-ppc64               [.] helper_todouble
   2.68%  qemu-ppc64  qemu-ppc64               [.] float64_add
   2.51%  qemu-ppc64  qemu-ppc64               [.] float64_hs_compare
   2.30%  qemu-ppc64  qemu-ppc64               [.] float64r32_muladd
   1.80%  qemu-ppc64  qemu-ppc64               [.] float64r32_mul
   1.40%  qemu-ppc64  qemu-ppc64               [.] float64r32_add
   1.34%  qemu-ppc64  qemu-ppc64               [.] parts64_mul
   1.16%  qemu-ppc64  qemu-ppc64               [.] parts64_addsub
   1.14%  qemu-ppc64  qemu-ppc64               [.] helper_reset_fpstatus
   1.06%  qemu-ppc64  qemu-ppc64               [.] helper_float_check_status
   1.04%  qemu-ppc64  qemu-ppc64               [.] float64_muladd

what I find confusing is the fact the parts extraction and packing
should all be known constants which should cause the asserts to
disappear. However it looks like the compiler decided to bring in a copy
of the whole inline function (ISRA = >interprocedural scalar replacement
of aggregates) which obviously can't fold the constants and eliminate
the assert.

Richard,

Any idea of why the compiler might decide to do something like this?

>
> Regards,
> BALATON Zoltan


--
Alex Bennée
Virtualisation Tech Lead @ Linaro

BALATON Zoltan May 22, 2023, 10:26 p.m. UTC | #7

On Mon, 22 May 2023, Alex Bennée wrote:
> (ajb: add Richard for his compiler-fu)
> BALATON Zoltan <balaton@eik.bme.hu> writes:
>> On Mon, 22 May 2023, Alex Bennée wrote:
>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>
>>>> The low level extract and deposit funtions provided by bitops.h are
>>>> used in performance critical places. It crept into target/ppc via
>>>> FIELD_EX64 and also used by softfloat so PPC code using a lot of FPU
>>>> where hardfloat is also disabled is doubly affected.
>>>
>>> Most of these asserts compile out to nothing if the compiler is able to
>>> verify the constants are in the range. For example examining
>>> the start of float64_add:
>>>
> <snip>
>>>
>>> I don't see any check and abort steps because all the shift and mask
>>> values are known at compile time. The softfloat compilation certainly
>>> does have some assert points though:
>>>
>>> readelf -s ./libqemu-ppc64-softmmu.fa.p/fpu_softfloat.c.o  |grep assert
>>>   136: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND g_assertion_mess[...]
>>>   138: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND __assert_fail
>>>
>>> but the references are for the ISRA segments so its tricky to know if
>>> they get used or are just there for LTO purposes.
>>>
>>> If there are hot-paths that show up the extract/deposit functions I
>>> suspect a better approach would be to implement _nocheck variants (or
>>> maybe _noassert?) and use them where required rather than turning off
>>> the assert checking for these utility functions.
>>
>> Just to clarify again, the asserts are still there when compiled with
>> --enable-debug. The patch only turns them off for optimised release
>> builds which I think makes sense if these asserts are to catch
>> programming errors.
>
> Well as Peter said the general policy is to keep asserts in but I
> appreciate this is a hotpath case.
>
>> I think I've also suggested adding noassert
>> versions of these but that wasn't a popular idea and it may also not
>> be easy to convert all places to use that like for example the
>> register fields related usage in target/ppc as that would also affect
>> other places.
>
> Is code generation or device emulation really on the hot-path. Generally
> a well predicted assert is in the noise for those operations.

They aren't in code generation but in helpers as you can also see in the 
profile below and so they can be on hot path. Also I've noticed that 
extract8 and extract16 just call extract32 after adding another assert on 
their own in addition to the one in extract32 which is double overhead for 
really no reason. I'd delete all these asserts as the likelhood of bugs 
these could catch is very low anyway (how often do you expect somebody to 
call these with out of bound values that would not be obvious from the 
results otherwise?) but leaving them in non-debug builds is totally 
useless in my opinion.

>> So this seems to be the simplest and most effective
>> approach.
>>
>> The softfloat related usage in these tests I've done seem to mostly
>> come from unpacking and repacking floats in softfloat which is done
>> for every operation, e.g. muladd which mp3 encoding mostly uses does 3
>> unpacks and 1 pack for each call and each unpack is 3 extracts so even
>> small overheads add app quickly. Just 1 muladd will result in 9
>> extracts and 2 deposits at least plus updating PPC flags for each FPU
>> op adds a bunch more. I did some profiling with perf to find these.
>
> After some messing about trying to get lame to cross compile to a static
> binary I was able to replicate what you've seen:
>
>  11.44%  qemu-ppc64  qemu-ppc64               [.] unpack_raw64.isra.0
>  11.03%  qemu-ppc64  qemu-ppc64               [.] parts64_uncanon_normal
>   8.26%  qemu-ppc64  qemu-ppc64               [.] helper_compute_fprf_float64
>   6.75%  qemu-ppc64  qemu-ppc64               [.] do_float_check_status
>   5.34%  qemu-ppc64  qemu-ppc64               [.] parts64_muladd
>   4.75%  qemu-ppc64  qemu-ppc64               [.] pack_raw64.isra.0
>   4.38%  qemu-ppc64  qemu-ppc64               [.] parts64_canonicalize
>   3.62%  qemu-ppc64  qemu-ppc64               [.] float64r32_round_pack_canonical
>   3.32%  qemu-ppc64  qemu-ppc64               [.] helper_todouble
>   2.68%  qemu-ppc64  qemu-ppc64               [.] float64_add
>   2.51%  qemu-ppc64  qemu-ppc64               [.] float64_hs_compare
>   2.30%  qemu-ppc64  qemu-ppc64               [.] float64r32_muladd
>   1.80%  qemu-ppc64  qemu-ppc64               [.] float64r32_mul
>   1.40%  qemu-ppc64  qemu-ppc64               [.] float64r32_add
>   1.34%  qemu-ppc64  qemu-ppc64               [.] parts64_mul
>   1.16%  qemu-ppc64  qemu-ppc64               [.] parts64_addsub
>   1.14%  qemu-ppc64  qemu-ppc64               [.] helper_reset_fpstatus
>   1.06%  qemu-ppc64  qemu-ppc64               [.] helper_float_check_status
>   1.04%  qemu-ppc64  qemu-ppc64               [.] float64_muladd

I've run 32 bit PPC version in qemu-system-ppc so the profile is a bit 
different (has more system related overhead that I plan to look at 
separately) but this part is similar to the above. I also wonder what 
makes helper_compute_fprf_float64 a bottleneck as that does not seem to 
have much extract/deposit, only a call to clz but it's hard to tell what 
it really does due to nested calls and macros. I've also seen this 
function among the top contenders in my profiling.

> what I find confusing is the fact the parts extraction and packing
> should all be known constants which should cause the asserts to
> disappear. However it looks like the compiler decided to bring in a copy
> of the whole inline function (ISRA = >interprocedural scalar replacement
> of aggregates) which obviously can't fold the constants and eliminate
> the assert.

Could it be related to that while the parts size and start are marked 
const but pulled out of a struct field so the compiler may not know their 
actual value until run time?

> Richard,
>
> Any idea of why the compiler might decide to do something like this?
>
>>
>> Regards,
>> BALATON Zoltan
>
>
> --
> Alex Bennée
> Virtualisation Tech Lead @ Linaro
>
>

Richard Henderson May 22, 2023, 11:34 p.m. UTC | #8

On 5/22/23 15:26, BALATON Zoltan wrote:
> On Mon, 22 May 2023, Alex Bennée wrote:
>> (ajb: add Richard for his compiler-fu)
>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>> On Mon, 22 May 2023, Alex Bennée wrote:
>>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>>
>>>>> The low level extract and deposit funtions provided by bitops.h are
>>>>> used in performance critical places. It crept into target/ppc via
>>>>> FIELD_EX64 and also used by softfloat so PPC code using a lot of FPU
>>>>> where hardfloat is also disabled is doubly affected.
>>>>
>>>> Most of these asserts compile out to nothing if the compiler is able to
>>>> verify the constants are in the range. For example examining
>>>> the start of float64_add:
>>>>
>> <snip>
>>>>
>>>> I don't see any check and abort steps because all the shift and mask
>>>> values are known at compile time. The softfloat compilation certainly
>>>> does have some assert points though:
>>>>
>>>> readelf -s ./libqemu-ppc64-softmmu.fa.p/fpu_softfloat.c.o  |grep assert
>>>>   136: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND g_assertion_mess[...]
>>>>   138: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND __assert_fail
>>>>
>>>> but the references are for the ISRA segments so its tricky to know if
>>>> they get used or are just there for LTO purposes.
>>>>
>>>> If there are hot-paths that show up the extract/deposit functions I
>>>> suspect a better approach would be to implement _nocheck variants (or
>>>> maybe _noassert?) and use them where required rather than turning off
>>>> the assert checking for these utility functions.
>>>
>>> Just to clarify again, the asserts are still there when compiled with
>>> --enable-debug. The patch only turns them off for optimised release
>>> builds which I think makes sense if these asserts are to catch
>>> programming errors.
>>
>> Well as Peter said the general policy is to keep asserts in but I
>> appreciate this is a hotpath case.
>>
>>> I think I've also suggested adding noassert
>>> versions of these but that wasn't a popular idea and it may also not
>>> be easy to convert all places to use that like for example the
>>> register fields related usage in target/ppc as that would also affect
>>> other places.
>>
>> Is code generation or device emulation really on the hot-path. Generally
>> a well predicted assert is in the noise for those operations.
> 
> They aren't in code generation but in helpers as you can also see in the profile below and 
> so they can be on hot path. Also I've noticed that extract8 and extract16 just call 
> extract32 after adding another assert on their own in addition to the one in extract32 
> which is double overhead for really no reason. I'd delete all these asserts as the 
> likelhood of bugs these could catch is very low anyway (how often do you expect somebody 
> to call these with out of bound values that would not be obvious from the results 
> otherwise?) but leaving them in non-debug builds is totally useless in my opinion.
> 
>>> So this seems to be the simplest and most effective
>>> approach.
>>>
>>> The softfloat related usage in these tests I've done seem to mostly
>>> come from unpacking and repacking floats in softfloat which is done
>>> for every operation, e.g. muladd which mp3 encoding mostly uses does 3
>>> unpacks and 1 pack for each call and each unpack is 3 extracts so even
>>> small overheads add app quickly. Just 1 muladd will result in 9
>>> extracts and 2 deposits at least plus updating PPC flags for each FPU
>>> op adds a bunch more. I did some profiling with perf to find these.
>>
>> After some messing about trying to get lame to cross compile to a static
>> binary I was able to replicate what you've seen:
>>
>>  11.44%  qemu-ppc64  qemu-ppc64               [.] unpack_raw64.isra.0
>>  11.03%  qemu-ppc64  qemu-ppc64               [.] parts64_uncanon_normal
>>   8.26%  qemu-ppc64  qemu-ppc64               [.] helper_compute_fprf_float64
>>   6.75%  qemu-ppc64  qemu-ppc64               [.] do_float_check_status
>>   5.34%  qemu-ppc64  qemu-ppc64               [.] parts64_muladd
>>   4.75%  qemu-ppc64  qemu-ppc64               [.] pack_raw64.isra.0
>>   4.38%  qemu-ppc64  qemu-ppc64               [.] parts64_canonicalize
>>   3.62%  qemu-ppc64  qemu-ppc64               [.] float64r32_round_pack_canonical
>>   3.32%  qemu-ppc64  qemu-ppc64               [.] helper_todouble
>>   2.68%  qemu-ppc64  qemu-ppc64               [.] float64_add
>>   2.51%  qemu-ppc64  qemu-ppc64               [.] float64_hs_compare
>>   2.30%  qemu-ppc64  qemu-ppc64               [.] float64r32_muladd
>>   1.80%  qemu-ppc64  qemu-ppc64               [.] float64r32_mul
>>   1.40%  qemu-ppc64  qemu-ppc64               [.] float64r32_add
>>   1.34%  qemu-ppc64  qemu-ppc64               [.] parts64_mul
>>   1.16%  qemu-ppc64  qemu-ppc64               [.] parts64_addsub
>>   1.14%  qemu-ppc64  qemu-ppc64               [.] helper_reset_fpstatus
>>   1.06%  qemu-ppc64  qemu-ppc64               [.] helper_float_check_status
>>   1.04%  qemu-ppc64  qemu-ppc64               [.] float64_muladd
> 
> I've run 32 bit PPC version in qemu-system-ppc so the profile is a bit different (has more 
> system related overhead that I plan to look at separately) but this part is similar to the 
> above. I also wonder what makes helper_compute_fprf_float64 a bottleneck as that does not 
> seem to have much extract/deposit, only a call to clz but it's hard to tell what it really 
> does due to nested calls and macros. I've also seen this function among the top contenders 
> in my profiling.
> 
>> what I find confusing is the fact the parts extraction and packing
>> should all be known constants which should cause the asserts to
>> disappear. However it looks like the compiler decided to bring in a copy
>> of the whole inline function (ISRA = >interprocedural scalar replacement
>> of aggregates) which obviously can't fold the constants and eliminate
>> the assert.
> 
> Could it be related to that while the parts size and start are marked const but pulled out 
> of a struct field so the compiler may not know their actual value until run time?
> 
>> Richard,
>>
>> Any idea of why the compiler might decide to do something like this?

Try this.


r~

Alex Bennée May 23, 2023, 6:43 a.m. UTC | #9

Richard Henderson <richard.henderson@linaro.org> writes:

> On 5/22/23 15:26, BALATON Zoltan wrote:
>> On Mon, 22 May 2023, Alex Bennée wrote:
>>> (ajb: add Richard for his compiler-fu)
>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>> On Mon, 22 May 2023, Alex Bennée wrote:
>>>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>>>
>>>>>> The low level extract and deposit funtions provided by bitops.h are
>>>>>> used in performance critical places. It crept into target/ppc via
>>>>>> FIELD_EX64 and also used by softfloat so PPC code using a lot of FPU
>>>>>> where hardfloat is also disabled is doubly affected.
>>>>>
>>>>> Most of these asserts compile out to nothing if the compiler is able to
>>>>> verify the constants are in the range. For example examining
>>>>> the start of float64_add:
>>>>>
>>> <snip>
>>>>>
>>>>> I don't see any check and abort steps because all the shift and mask
>>>>> values are known at compile time. The softfloat compilation certainly
>>>>> does have some assert points though:
>>>>>
>>>>> readelf -s ./libqemu-ppc64-softmmu.fa.p/fpu_softfloat.c.o  |grep assert
>>>>>   136: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND g_assertion_mess[...]
>>>>>   138: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND __assert_fail
>>>>>
>>>>> but the references are for the ISRA segments so its tricky to know if
>>>>> they get used or are just there for LTO purposes.
>>>>>
>>>>> If there are hot-paths that show up the extract/deposit functions I
>>>>> suspect a better approach would be to implement _nocheck variants (or
>>>>> maybe _noassert?) and use them where required rather than turning off
>>>>> the assert checking for these utility functions.
>>>>
>>>> Just to clarify again, the asserts are still there when compiled with
>>>> --enable-debug. The patch only turns them off for optimised release
>>>> builds which I think makes sense if these asserts are to catch
>>>> programming errors.
>>>
>>> Well as Peter said the general policy is to keep asserts in but I
>>> appreciate this is a hotpath case.
>>>
>>>> I think I've also suggested adding noassert
>>>> versions of these but that wasn't a popular idea and it may also not
>>>> be easy to convert all places to use that like for example the
>>>> register fields related usage in target/ppc as that would also affect
>>>> other places.
>>>
>>> Is code generation or device emulation really on the hot-path. Generally
>>> a well predicted assert is in the noise for those operations.
>> They aren't in code generation but in helpers as you can also see in
>> the profile below and so they can be on hot path. Also I've noticed
>> that extract8 and extract16 just call extract32 after adding another
>> assert on their own in addition to the one in extract32 which is
>> double overhead for really no reason. I'd delete all these asserts
>> as the likelhood of bugs these could catch is very low anyway (how
>> often do you expect somebody to call these with out of bound values
>> that would not be obvious from the results otherwise?) but leaving
>> them in non-debug builds is totally useless in my opinion.
>> 
>>>> So this seems to be the simplest and most effective
>>>> approach.
>>>>
>>>> The softfloat related usage in these tests I've done seem to mostly
>>>> come from unpacking and repacking floats in softfloat which is done
>>>> for every operation, e.g. muladd which mp3 encoding mostly uses does 3
>>>> unpacks and 1 pack for each call and each unpack is 3 extracts so even
>>>> small overheads add app quickly. Just 1 muladd will result in 9
>>>> extracts and 2 deposits at least plus updating PPC flags for each FPU
>>>> op adds a bunch more. I did some profiling with perf to find these.
>>>
>>> After some messing about trying to get lame to cross compile to a static
>>> binary I was able to replicate what you've seen:
>>>
>>>  11.44%  qemu-ppc64  qemu-ppc64               [.] unpack_raw64.isra.0
>>>  11.03%  qemu-ppc64  qemu-ppc64               [.] parts64_uncanon_normal
>>>   8.26%  qemu-ppc64  qemu-ppc64               [.] helper_compute_fprf_float64
>>>   6.75%  qemu-ppc64  qemu-ppc64               [.] do_float_check_status
>>>   5.34%  qemu-ppc64  qemu-ppc64               [.] parts64_muladd
>>>   4.75%  qemu-ppc64  qemu-ppc64               [.] pack_raw64.isra.0
>>>   4.38%  qemu-ppc64  qemu-ppc64               [.] parts64_canonicalize
>>>   3.62%  qemu-ppc64  qemu-ppc64               [.] float64r32_round_pack_canonical
>>>   3.32%  qemu-ppc64  qemu-ppc64               [.] helper_todouble
>>>   2.68%  qemu-ppc64  qemu-ppc64               [.] float64_add
>>>   2.51%  qemu-ppc64  qemu-ppc64               [.] float64_hs_compare
>>>   2.30%  qemu-ppc64  qemu-ppc64               [.] float64r32_muladd
>>>   1.80%  qemu-ppc64  qemu-ppc64               [.] float64r32_mul
>>>   1.40%  qemu-ppc64  qemu-ppc64               [.] float64r32_add
>>>   1.34%  qemu-ppc64  qemu-ppc64               [.] parts64_mul
>>>   1.16%  qemu-ppc64  qemu-ppc64               [.] parts64_addsub
>>>   1.14%  qemu-ppc64  qemu-ppc64               [.] helper_reset_fpstatus
>>>   1.06%  qemu-ppc64  qemu-ppc64               [.] helper_float_check_status
>>>   1.04%  qemu-ppc64  qemu-ppc64               [.] float64_muladd
>> I've run 32 bit PPC version in qemu-system-ppc so the profile is a
>> bit different (has more system related overhead that I plan to look
>> at separately) but this part is similar to the above. I also wonder
>> what makes helper_compute_fprf_float64 a bottleneck as that does not
>> seem to have much extract/deposit, only a call to clz but it's hard
>> to tell what it really does due to nested calls and macros. I've
>> also seen this function among the top contenders in my profiling.
>> 
>>> what I find confusing is the fact the parts extraction and packing
>>> should all be known constants which should cause the asserts to
>>> disappear. However it looks like the compiler decided to bring in a copy
>>> of the whole inline function (ISRA = >interprocedural scalar replacement
>>> of aggregates) which obviously can't fold the constants and eliminate
>>> the assert.
>> Could it be related to that while the parts size and start are
>> marked const but pulled out of a struct field so the compiler may
>> not know their actual value until run time?
>> 
>>> Richard,
>>>
>>> Any idea of why the compiler might decide to do something like this?
>
> Try this.

That seems to have done the trick, translated code is now dominating the
profile:

+   14.12%     0.00%  qemu-ppc64  [unknown]                [.] 0x0000004000619420
+   13.30%     0.00%  qemu-ppc64  [unknown]                [.] 0x0000004000616850
+   12.58%    12.19%  qemu-ppc64  qemu-ppc64               [.] parts64_uncanon_normal
+   10.62%     0.00%  qemu-ppc64  [unknown]                [.] 0x000000400061bf70
+    9.91%     9.73%  qemu-ppc64  qemu-ppc64               [.] helper_compute_fprf_float64
+    7.84%     7.82%  qemu-ppc64  qemu-ppc64               [.] do_float_check_status
+    6.47%     5.78%  qemu-ppc64  qemu-ppc64               [.] parts64_canonicalize.constprop.0
+    6.46%     0.00%  qemu-ppc64  [unknown]                [.] 0x0000004000620130
+    6.42%     0.00%  qemu-ppc64  [unknown]                [.] 0x0000004000619400
+    6.17%     6.04%  qemu-ppc64  qemu-ppc64               [.] parts64_muladd
+    5.85%     0.00%  qemu-ppc64  [unknown]                [.] 0x00000040006167e0
+    5.74%     0.00%  qemu-ppc64  [unknown]                [.] 0x0000b693fcffffd3
+    5.45%     4.78%  qemu-ppc64  qemu-ppc64               [.] float64r32_round_pack_canonical
+    4.79%     0.00%  qemu-ppc64  [unknown]                [.] 0xfaff203940fe8240
+    4.29%     3.82%  qemu-ppc64  qemu-ppc64               [.] float64r32_muladd
+    4.20%     3.87%  qemu-ppc64  qemu-ppc64               [.] helper_todouble
+    3.51%     2.98%  qemu-ppc64  qemu-ppc64               [.] float64r32_mul
+    3.12%     2.97%  qemu-ppc64  qemu-ppc64               [.] float64_hs_compare
+    3.06%     2.95%  qemu-ppc64  qemu-ppc64               [.] float64_add
+    2.88%     2.35%  qemu-ppc64  qemu-ppc64               [.] float64r32_add
+    2.65%     0.00%  qemu-ppc64  [unknown]                [.] 0x6c555e9100004039
+    2.55%     1.30%  qemu-ppc64  qemu-ppc64               [.] helper_float_check_status

eyeballing the float functions and I can't see any asserts expressed.

Before:

Executed in  721.65 secs    fish           external
   usr time  721.38 secs  561.00 micros  721.38 secs
   sys time    0.20 secs  261.00 micros    0.20 secs

After:

Executed in  650.38 secs    fish           external
   usr time  650.11 secs    0.00 micros  650.11 secs
   sys time    0.20 secs  989.00 micros    0.20 secs



>
>
> r~
>
> [2. text/x-patch; z.patch]...

BALATON Zoltan June 5, 2023, 10:06 p.m. UTC | #10

On Tue, 23 May 2023, Alex Bennée wrote:
> Richard Henderson <richard.henderson@linaro.org> writes:
>> On 5/22/23 15:26, BALATON Zoltan wrote:
>>> On Mon, 22 May 2023, Alex Bennée wrote:
>>>> (ajb: add Richard for his compiler-fu)
>>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>>> On Mon, 22 May 2023, Alex Bennée wrote:
>>>>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>>>>
>>>>>>> The low level extract and deposit funtions provided by bitops.h are
>>>>>>> used in performance critical places. It crept into target/ppc via
>>>>>>> FIELD_EX64 and also used by softfloat so PPC code using a lot of FPU
>>>>>>> where hardfloat is also disabled is doubly affected.
>>>>>>
>>>>>> Most of these asserts compile out to nothing if the compiler is able to
>>>>>> verify the constants are in the range. For example examining
>>>>>> the start of float64_add:
>>>>>>
>>>> <snip>
>>>>>>
>>>>>> I don't see any check and abort steps because all the shift and mask
>>>>>> values are known at compile time. The softfloat compilation certainly
>>>>>> does have some assert points though:
>>>>>>
>>>>>> readelf -s ./libqemu-ppc64-softmmu.fa.p/fpu_softfloat.c.o  |grep assert
>>>>>>   136: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND g_assertion_mess[...]
>>>>>>   138: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND __assert_fail
>>>>>>
>>>>>> but the references are for the ISRA segments so its tricky to know if
>>>>>> they get used or are just there for LTO purposes.
>>>>>>
>>>>>> If there are hot-paths that show up the extract/deposit functions I
>>>>>> suspect a better approach would be to implement _nocheck variants (or
>>>>>> maybe _noassert?) and use them where required rather than turning off
>>>>>> the assert checking for these utility functions.
>>>>>
>>>>> Just to clarify again, the asserts are still there when compiled with
>>>>> --enable-debug. The patch only turns them off for optimised release
>>>>> builds which I think makes sense if these asserts are to catch
>>>>> programming errors.
>>>>
>>>> Well as Peter said the general policy is to keep asserts in but I
>>>> appreciate this is a hotpath case.
>>>>
>>>>> I think I've also suggested adding noassert
>>>>> versions of these but that wasn't a popular idea and it may also not
>>>>> be easy to convert all places to use that like for example the
>>>>> register fields related usage in target/ppc as that would also affect
>>>>> other places.
>>>>
>>>> Is code generation or device emulation really on the hot-path. Generally
>>>> a well predicted assert is in the noise for those operations.
>>> They aren't in code generation but in helpers as you can also see in
>>> the profile below and so they can be on hot path. Also I've noticed
>>> that extract8 and extract16 just call extract32 after adding another
>>> assert on their own in addition to the one in extract32 which is
>>> double overhead for really no reason. I'd delete all these asserts
>>> as the likelhood of bugs these could catch is very low anyway (how
>>> often do you expect somebody to call these with out of bound values
>>> that would not be obvious from the results otherwise?) but leaving
>>> them in non-debug builds is totally useless in my opinion.
>>>
>>>>> So this seems to be the simplest and most effective
>>>>> approach.
>>>>>
>>>>> The softfloat related usage in these tests I've done seem to mostly
>>>>> come from unpacking and repacking floats in softfloat which is done
>>>>> for every operation, e.g. muladd which mp3 encoding mostly uses does 3
>>>>> unpacks and 1 pack for each call and each unpack is 3 extracts so even
>>>>> small overheads add app quickly. Just 1 muladd will result in 9
>>>>> extracts and 2 deposits at least plus updating PPC flags for each FPU
>>>>> op adds a bunch more. I did some profiling with perf to find these.
>>>>
>>>> After some messing about trying to get lame to cross compile to a static
>>>> binary I was able to replicate what you've seen:
>>>>
>>>>  11.44%  qemu-ppc64  qemu-ppc64               [.] unpack_raw64.isra.0
>>>>  11.03%  qemu-ppc64  qemu-ppc64               [.] parts64_uncanon_normal
>>>>   8.26%  qemu-ppc64  qemu-ppc64               [.] helper_compute_fprf_float64
>>>>   6.75%  qemu-ppc64  qemu-ppc64               [.] do_float_check_status
>>>>   5.34%  qemu-ppc64  qemu-ppc64               [.] parts64_muladd
>>>>   4.75%  qemu-ppc64  qemu-ppc64               [.] pack_raw64.isra.0
>>>>   4.38%  qemu-ppc64  qemu-ppc64               [.] parts64_canonicalize
>>>>   3.62%  qemu-ppc64  qemu-ppc64               [.] float64r32_round_pack_canonical
>>>>   3.32%  qemu-ppc64  qemu-ppc64               [.] helper_todouble
>>>>   2.68%  qemu-ppc64  qemu-ppc64               [.] float64_add
>>>>   2.51%  qemu-ppc64  qemu-ppc64               [.] float64_hs_compare
>>>>   2.30%  qemu-ppc64  qemu-ppc64               [.] float64r32_muladd
>>>>   1.80%  qemu-ppc64  qemu-ppc64               [.] float64r32_mul
>>>>   1.40%  qemu-ppc64  qemu-ppc64               [.] float64r32_add
>>>>   1.34%  qemu-ppc64  qemu-ppc64               [.] parts64_mul
>>>>   1.16%  qemu-ppc64  qemu-ppc64               [.] parts64_addsub
>>>>   1.14%  qemu-ppc64  qemu-ppc64               [.] helper_reset_fpstatus
>>>>   1.06%  qemu-ppc64  qemu-ppc64               [.] helper_float_check_status
>>>>   1.04%  qemu-ppc64  qemu-ppc64               [.] float64_muladd
>>> I've run 32 bit PPC version in qemu-system-ppc so the profile is a
>>> bit different (has more system related overhead that I plan to look
>>> at separately) but this part is similar to the above. I also wonder
>>> what makes helper_compute_fprf_float64 a bottleneck as that does not
>>> seem to have much extract/deposit, only a call to clz but it's hard
>>> to tell what it really does due to nested calls and macros. I've
>>> also seen this function among the top contenders in my profiling.
>>>
>>>> what I find confusing is the fact the parts extraction and packing
>>>> should all be known constants which should cause the asserts to
>>>> disappear. However it looks like the compiler decided to bring in a copy
>>>> of the whole inline function (ISRA = >interprocedural scalar replacement
>>>> of aggregates) which obviously can't fold the constants and eliminate
>>>> the assert.
>>> Could it be related to that while the parts size and start are
>>> marked const but pulled out of a struct field so the compiler may
>>> not know their actual value until run time?
>>>
>>>> Richard,
>>>>
>>>> Any idea of why the compiler might decide to do something like this?
>>
>> Try this.
>
> That seems to have done the trick, translated code is now dominating the
> profile:
>
> +   14.12%     0.00%  qemu-ppc64  [unknown]                [.] 0x0000004000619420
> +   13.30%     0.00%  qemu-ppc64  [unknown]                [.] 0x0000004000616850
> +   12.58%    12.19%  qemu-ppc64  qemu-ppc64               [.] parts64_uncanon_normal
> +   10.62%     0.00%  qemu-ppc64  [unknown]                [.] 0x000000400061bf70
> +    9.91%     9.73%  qemu-ppc64  qemu-ppc64               [.] helper_compute_fprf_float64
> +    7.84%     7.82%  qemu-ppc64  qemu-ppc64               [.] do_float_check_status
> +    6.47%     5.78%  qemu-ppc64  qemu-ppc64               [.] parts64_canonicalize.constprop.0
> +    6.46%     0.00%  qemu-ppc64  [unknown]                [.] 0x0000004000620130
> +    6.42%     0.00%  qemu-ppc64  [unknown]                [.] 0x0000004000619400
> +    6.17%     6.04%  qemu-ppc64  qemu-ppc64               [.] parts64_muladd
> +    5.85%     0.00%  qemu-ppc64  [unknown]                [.] 0x00000040006167e0
> +    5.74%     0.00%  qemu-ppc64  [unknown]                [.] 0x0000b693fcffffd3
> +    5.45%     4.78%  qemu-ppc64  qemu-ppc64               [.] float64r32_round_pack_canonical
> +    4.79%     0.00%  qemu-ppc64  [unknown]                [.] 0xfaff203940fe8240
> +    4.29%     3.82%  qemu-ppc64  qemu-ppc64               [.] float64r32_muladd
> +    4.20%     3.87%  qemu-ppc64  qemu-ppc64               [.] helper_todouble
> +    3.51%     2.98%  qemu-ppc64  qemu-ppc64               [.] float64r32_mul
> +    3.12%     2.97%  qemu-ppc64  qemu-ppc64               [.] float64_hs_compare
> +    3.06%     2.95%  qemu-ppc64  qemu-ppc64               [.] float64_add
> +    2.88%     2.35%  qemu-ppc64  qemu-ppc64               [.] float64r32_add
> +    2.65%     0.00%  qemu-ppc64  [unknown]                [.] 0x6c555e9100004039
> +    2.55%     1.30%  qemu-ppc64  qemu-ppc64               [.] helper_float_check_status
>
> eyeballing the float functions and I can't see any asserts expressed.
>
> Before:
>
> Executed in  721.65 secs    fish           external
>   usr time  721.38 secs  561.00 micros  721.38 secs
>   sys time    0.20 secs  261.00 micros    0.20 secs
>
> After:
>
> Executed in  650.38 secs    fish           external
>   usr time  650.11 secs    0.00 micros  650.11 secs
>   sys time    0.20 secs  989.00 micros    0.20 secs
>>
>>
>> r~
>>
>> [2. text/x-patch; z.patch]...

Will this be submitted as a proper patch and merged? This improves 
softfloat important for target/ppc where hardfloat is disabled. I guess 
you can also add

Tested-by: BALATON Zoltan <balaton@eik.bme.hu>

and Alex also did testing above. This is just pinging it so it's not 
forgotten. Thank you.

Regards,
BALATON Zoltan

bitops.h: Compile out asserts without --enable-debug

Commit Message

Comments

Patch