diff mbox

[00/14] softfloat: Use POSIX integer types - benchmarked

Message ID 1326674823-13069-1-git-send-email-afaerber@suse.de
State New
Headers show

Commit Message

Andreas Färber Jan. 16, 2012, 12:46 a.m. UTC
Hello,

Based on a suggestion from Alex earlier this week, I managed to run a
simple benchmark of softfloat performance with qemu-arm, as requested by
Peter.

I went for the Whetstone floating point benchmark:
http://en.wikipedia.org/wiki/Whetstone_%28benchmark%29

For a loop count of 100,000 and 5 runs I got the following results:

  current:        138.9-204.1 Whetstone-MIPS
  [u]int*_t:      185.2-188.7 Whetstone-MIPS
  [u]int_fast*_t: 285.7-294.1 Whetstone-MIPS

  Toshiba AC100:  833.3-909.1 Whetstone-MIPS

These results seem to indicate that the "fast" POSIX types are indeed
somewhat faster, both compared to exact-size POSIX types and to the
current state.

As a short summary of previous discussions, softfloat had these typedefs:
  [s]bits{8,16,32,64} - exact-size semantics, => [u]int{8,16,32,64}_t
  [u]int{8,16,32,64}  - minimum-width semantics, host-independent
AIX, Mac OS X and BeOS/Haiku have some or all of the latter already,
leading to type conflicts.
I had originally suggested the POSIX [u]int_least*_t types but we
rather preferred [u]int_fast*_t, worried about performance, to get the
fastest least-width types. For either of these the actual width depends
on system headers. On the other hand it would be futile to try to
performance-optimize integer types for every possible host inside QEMU.
If we don't mind performance or dislike host-dependencies, we could just
use [u]int*_t, but then with everything [u]int*_t we can't easily change
this in case it bites us in the future.

Personally I prefer those 'fast' types, because that way we get to
keep the distinction between what was formerly [s]bits* and [u]int* types.
The clear name distinction would even allow to do conversions by regex.

However, I noticed that target-mips/cpu.h had typedefs for
uint_fast{8,16}_t for Solaris <= 9. So I moved these to osdep.h and
added typedefs for the newly needed types.

Coccinelle needed some tweaking for the macros and only did the uint16
conversion fully. For the others I was too impatient, so I hand-converted
the remaining occurrences that Coccinelle did not catch.

Since Coccinelle stripped leading whitespace before types replaced anyway,
I hand-edited (and git-am'ed) the patches to make lines touched adhere to
Coding Style.

Patches 1-3 fix misuses of softfloat types that were introduced since the
last series of fixes. These could be cherry-picked.

Patch 4 fixes a misuse of int inside softfloat. This could be cherry-picked.

Patch 5 moves the Solaris typedefs to a central header, resolving an XXX
present since their introduction in r1979.

Patches 6-13 convert the softfloat integer types.

Patch 14 converts the 'flag' type to bool, removing the last softfloat type.

Please test that this doesn't break / unbreaks Your Favorite Host:
http://repo.or.cz/w/qemu/afaerber.git/shortlog/refs/heads/softfloat

Regards,
Andreas

Cc: Alexander Graf <agraf@suse.de>
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: Aurélien Jarno <aurelien@aurel32.net>
Cc: malc <av1474@comtv.ru>
Cc: Ben Taylor <bentaylor.solx86@gmail.com>

Cc: Rui Carmo <rui.carmo@gmail.com>
Cc: Eric Sunshine <sunshine@sunshineco.com>
Cc: Pavel Borzenkov <pavel.borzenkov@gmail.com>
Cc: Juan Pineda <juan@logician.com>

Host: HP Envy w/ Intel Core i7-2630QM 2.0 GHz (openSUSE 12.1)
QEMU: 2be276242135eac6e86be2a8259545e620c94107 plus + 2 patches
Configuration: --prefix=/usr/local
Command: arm-linux-user/qemu-arm path/to/whetstone -c 100000

Source: http://www.netlib.org/benchmark/whetstone.c
Compiler: gcc (Ubuntu/Linaro 4.4.4-14ubuntu5) 4.4.5
Compilation: gcc -O -s whetstone.c -o whetstone -lm -static

[u]int*_t:
---
 #define INLINE static inline
---

spatch(1) -macro_file_builtins:
---
#define STATUS_PARAM
#define STATUS_VAR

#define INLINE static inline

#define MINMAX
---

Andreas Färber (13):
  lm32: Fix mixup of uint32 and uint32_t
  target-sparc: Fix mixup of uint64 and uint64_t
  qemu-tool: Fix mixup of int64 and int64_t
  softfloat: Fix mixups of int and int16
  softfloat: Replace uint16 type with uint_fast16_t
  softfloat: Replace int16 type with int_fast16_t
  softfloat: Remove unused uint8 type
  softfloat: Replace int8 type with int_fast8_t
  softfloat: Replace uint32 type with uint_fast32_t
  softfloat: Replace int32 type with int_fast32_t
  softfloat: Replace uint64 type with uint_fast64_t
  softfloat: Replace int64 type with int_fast64_t
  softfloat: Replace flag type with bool

 fpu/softfloat-macros.h        |   52 ++--
 fpu/softfloat-specialize.h    |   56 ++--
 fpu/softfloat.c               |  626 ++++++++++++++++++++--------------------
 fpu/softfloat.h               |  115 ++++-----
 hw/milkymist-vgafb_template.h |    2 +-
 qemu-tool.c                   |    4 +-
 target-sparc/vis_helper.c     |    2 +-
 7 files changed, 419 insertions(+), 438 deletions(-)

Comments

Peter Maydell Jan. 16, 2012, 7:02 p.m. UTC | #1
On 16 January 2012 00:46, Andreas Färber <afaerber@suse.de> wrote:
> Based on a suggestion from Alex earlier this week, I managed to run a
> simple benchmark of softfloat performance with qemu-arm, as requested by
> Peter.
>
> I went for the Whetstone floating point benchmark:
> http://en.wikipedia.org/wiki/Whetstone_%28benchmark%29
>
> For a loop count of 100,000 and 5 runs I got the following results:
>
>  current:        138.9-204.1 Whetstone-MIPS
>  [u]int*_t:      185.2-188.7 Whetstone-MIPS
>  [u]int_fast*_t: 285.7-294.1 Whetstone-MIPS
>
>  Toshiba AC100:  833.3-909.1 Whetstone-MIPS

This is much better than I was expecting when I suggested running
a benchmark :-) and indicates that it's worth a bit of pain to
switch to the fast* types.

I've tested this series with the ARM VFP/Neon tests I have (random
instruction sequence testing). I found a few extra bugs which I've
sent a separate patch series for: those patches need to be
incorporated into this series or otherwise applied before it.

I've commented on patches 01, 05, 09; the rest are
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

with the following caveat.

The changes from int32/uint32 to int_fast32_t/uint_fast32_t are
the potentially dangerous ones in this set, since they change a
type that was easily mistaken for "exactly 32 bits" and happened
to be 32 bits to one that is now 64 bits (on 64 bit hosts).
The other type changes are either "not a width change in practice"
(int64) or "change from something that was already wider than the
number in it" (int16) or "wasn't being used for things where we
really cared about the type width" (int8, flag).

Specifically, it's this change that revealed the latent bugs I
sent a three-patch set for, and so it doesn't seem too unlikely
that other platforms may have some similar bugs in their
int-to-float/float-to-int code that will now be revealed on
64 bit hosts.

I don't think that means we shouldn't apply this patch set but
I would recommend some testing of other architectures :-)

-- PMM
Alexander Graf Jan. 16, 2012, 7:12 p.m. UTC | #2
On 16.01.2012, at 20:02, Peter Maydell wrote:

> On 16 January 2012 00:46, Andreas Färber <afaerber@suse.de> wrote:
>> Based on a suggestion from Alex earlier this week, I managed to run a
>> simple benchmark of softfloat performance with qemu-arm, as requested by
>> Peter.
>> 
>> I went for the Whetstone floating point benchmark:
>> http://en.wikipedia.org/wiki/Whetstone_%28benchmark%29
>> 
>> For a loop count of 100,000 and 5 runs I got the following results:
>> 
>>  current:        138.9-204.1 Whetstone-MIPS
>>  [u]int*_t:      185.2-188.7 Whetstone-MIPS
>>  [u]int_fast*_t: 285.7-294.1 Whetstone-MIPS
>> 
>>  Toshiba AC100:  833.3-909.1 Whetstone-MIPS
> 
> This is much better than I was expecting when I suggested running
> a benchmark :-) and indicates that it's worth a bit of pain to
> switch to the fast* types.
> 
> I've tested this series with the ARM VFP/Neon tests I have (random
> instruction sequence testing). I found a few extra bugs which I've
> sent a separate patch series for: those patches need to be
> incorporated into this series or otherwise applied before it.
> 
> I've commented on patches 01, 05, 09; the rest are
> Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
> 
> with the following caveat.
> 
> The changes from int32/uint32 to int_fast32_t/uint_fast32_t are
> the potentially dangerous ones in this set, since they change a
> type that was easily mistaken for "exactly 32 bits" and happened
> to be 32 bits to one that is now 64 bits (on 64 bit hosts).
> The other type changes are either "not a width change in practice"
> (int64) or "change from something that was already wider than the
> number in it" (int16) or "wasn't being used for things where we
> really cared about the type width" (int8, flag).
> 
> Specifically, it's this change that revealed the latent bugs I
> sent a three-patch set for, and so it doesn't seem too unlikely
> that other platforms may have some similar bugs in their
> int-to-float/float-to-int code that will now be revealed on
> 64 bit hosts.
> 
> I don't think that means we shouldn't apply this patch set but
> I would recommend some testing of other architectures :-)

So what if we leave uint32_t be uint32_t and make the other ones _fast_?


Alex
Peter Maydell Jan. 16, 2012, 7:17 p.m. UTC | #3
On 16 January 2012 19:12, Alexander Graf <agraf@suse.de> wrote:
>
> On 16.01.2012, at 20:02, Peter Maydell wrote:
>> The changes from int32/uint32 to int_fast32_t/uint_fast32_t are
>> the potentially dangerous ones in this set, since they change a
>> type that was easily mistaken for "exactly 32 bits" and happened
>> to be 32 bits to one that is now 64 bits (on 64 bit hosts).
>> The other type changes are either "not a width change in practice"
>> (int64) or "change from something that was already wider than the
>> number in it" (int16) or "wasn't being used for things where we
>> really cared about the type width" (int8, flag).
>>
>> Specifically, it's this change that revealed the latent bugs I
>> sent a three-patch set for, and so it doesn't seem too unlikely
>> that other platforms may have some similar bugs in their
>> int-to-float/float-to-int code that will now be revealed on
>> 64 bit hosts.
>>
>> I don't think that means we shouldn't apply this patch set but
>> I would recommend some testing of other architectures :-)
>
> So what if we leave uint32_t be uint32_t and make the other ones _fast_?

We'd need to get Andreas to do another benchmark run to check that
this didn't lose the perf gain. Also it's kind of aesthetically
not very pleasing...

-- PMM
Alexander Graf Jan. 16, 2012, 7:18 p.m. UTC | #4
On 16.01.2012, at 20:17, Peter Maydell wrote:

> On 16 January 2012 19:12, Alexander Graf <agraf@suse.de> wrote:
>> 
>> On 16.01.2012, at 20:02, Peter Maydell wrote:
>>> The changes from int32/uint32 to int_fast32_t/uint_fast32_t are
>>> the potentially dangerous ones in this set, since they change a
>>> type that was easily mistaken for "exactly 32 bits" and happened
>>> to be 32 bits to one that is now 64 bits (on 64 bit hosts).
>>> The other type changes are either "not a width change in practice"
>>> (int64) or "change from something that was already wider than the
>>> number in it" (int16) or "wasn't being used for things where we
>>> really cared about the type width" (int8, flag).
>>> 
>>> Specifically, it's this change that revealed the latent bugs I
>>> sent a three-patch set for, and so it doesn't seem too unlikely
>>> that other platforms may have some similar bugs in their
>>> int-to-float/float-to-int code that will now be revealed on
>>> 64 bit hosts.
>>> 
>>> I don't think that means we shouldn't apply this patch set but
>>> I would recommend some testing of other architectures :-)
>> 
>> So what if we leave uint32_t be uint32_t and make the other ones _fast_?
> 
> We'd need to get Andreas to do another benchmark run to check that
> this didn't lose the perf gain. Also it's kind of aesthetically
> not very pleasing...

Yeah, but it's safe. And I would assume that 32bit integers are better optimized for than smaller ones in today's CPUs.


Alex
Peter Maydell Jan. 16, 2012, 11:52 p.m. UTC | #5
On 16 January 2012 19:18, Alexander Graf <agraf@suse.de> wrote:
> On 16.01.2012, at 20:17, Peter Maydell wrote:
>> On 16 January 2012 19:12, Alexander Graf <agraf@suse.de> wrote:
>>> So what if we leave uint32_t be uint32_t and make the other ones _fast_?
>>
>> We'd need to get Andreas to do another benchmark run to check that
>> this didn't lose the perf gain. Also it's kind of aesthetically
>> not very pleasing...
>
> Yeah, but it's safe.

Safety is for people with insufficiently comprehensive test suites :-)

-- PMM
Alexander Graf Jan. 17, 2012, 12:05 a.m. UTC | #6
On 17.01.2012, at 00:52, Peter Maydell wrote:

> On 16 January 2012 19:18, Alexander Graf <agraf@suse.de> wrote:
>> On 16.01.2012, at 20:17, Peter Maydell wrote:
>>> On 16 January 2012 19:12, Alexander Graf <agraf@suse.de> wrote:
>>>> So what if we leave uint32_t be uint32_t and make the other ones _fast_?
>>> 
>>> We'd need to get Andreas to do another benchmark run to check that
>>> this didn't lose the perf gain. Also it's kind of aesthetically
>>> not very pleasing...
>> 
>> Yeah, but it's safe.
> 
> Safety is for people with insufficiently comprehensive test suites :-)

You mean all the non-arm targets? :)

Alex
Peter Maydell Jan. 20, 2012, 1:05 p.m. UTC | #7
On 16 January 2012 00:46, Andreas Färber <afaerber@suse.de> wrote:
> For a loop count of 100,000 and 5 runs I got the following results:
>
>  current:        138.9-204.1 Whetstone-MIPS
>  [u]int*_t:      185.2-188.7 Whetstone-MIPS
>  [u]int_fast*_t: 285.7-294.1 Whetstone-MIPS
>
>  Toshiba AC100:  833.3-909.1 Whetstone-MIPS
>
> These results seem to indicate that the "fast" POSIX types are indeed
> somewhat faster, both compared to exact-size POSIX types and to the
> current state.

OTOH I did a run of scimark2 and got:
current tree:
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:           12.98
FFT             Mflops:     7.66    (N=1024)
SOR             Mflops:    19.49    (100 x 100)
MonteCarlo:     Mflops:     6.12
Sparse matmult  Mflops:    15.34    (N=1000, nz=5000)
LU              Mflops:    16.28    (M=100, N=100)

with patches (yours and mine):
**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:           11.87
FFT             Mflops:     7.12    (N=1024)
SOR             Mflops:    17.66    (100 x 100)
MonteCarlo:     Mflops:     5.75
Sparse matmult  Mflops:    14.03    (N=1000, nz=5000)
LU              Mflops:    14.81    (M=100, N=100)

Hmmm...

-- PMM
Andreas Färber Jan. 23, 2012, 5:41 p.m. UTC | #8
Am 20.01.2012 14:05, schrieb Peter Maydell:
> On 16 January 2012 00:46, Andreas Färber <afaerber@suse.de> wrote:
>> For a loop count of 100,000 and 5 runs I got the following results:
>>
>>  current:        138.9-204.1 Whetstone-MIPS
>>  [u]int*_t:      185.2-188.7 Whetstone-MIPS
>>  [u]int_fast*_t: 285.7-294.1 Whetstone-MIPS
>>
>>  Toshiba AC100:  833.3-909.1 Whetstone-MIPS
>>
>> These results seem to indicate that the "fast" POSIX types are indeed
>> somewhat faster, both compared to exact-size POSIX types and to the
>> current state.
> 
> OTOH I did a run of scimark2 and got:
> current tree:
> **                                                              **
> ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
> ** for details. (Results can be submitted to pozo@nist.gov)     **
> **                                                              **
> Using       2.00 seconds min time per kenel.
> Composite Score:           12.98
> FFT             Mflops:     7.66    (N=1024)
> SOR             Mflops:    19.49    (100 x 100)
> MonteCarlo:     Mflops:     6.12
> Sparse matmult  Mflops:    15.34    (N=1000, nz=5000)
> LU              Mflops:    16.28    (M=100, N=100)
> 
> with patches (yours and mine):
> **                                                              **
> ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
> ** for details. (Results can be submitted to pozo@nist.gov)     **
> **                                                              **
> Using       2.00 seconds min time per kenel.
> Composite Score:           11.87
> FFT             Mflops:     7.12    (N=1024)
> SOR             Mflops:    17.66    (100 x 100)
> MonteCarlo:     Mflops:     5.75
> Sparse matmult  Mflops:    14.03    (N=1000, nz=5000)
> LU              Mflops:    14.81    (M=100, N=100)

One difference between our test environments comes to mind: I had tested
only typedefs for the int types, whereas my series also converts flag.
The fixes for lm32, sparc and qemu-tool shouldn't matter here. Your
patches "degrade" some variables from fast types to exact types of course.

Anyway, here's my Whetstone and CoreMark results for
520c0d8d2772ccc9f9275bd934e13ec9b15774e4 (target-sparc: Fix mixup of
uint64 and uint64_t) plus our patches. The margin has shrunk for
Whetstone, and for CoreMark I see a slight degradation of the max.
(I also note a slight Whetstone degradation between Natty and Oneiric.)

Have you tried benchmarking after our preceding patches but before the
actual fast conversion for comparison?

Andreas


master:

C Converted Double Precision Whetstones: 200.0 MIPS
C Converted Double Precision Whetstones: 204.1 MIPS

CoreMark 1.0 : 1287.747086 / GCC4.6.2 -O2 -static -DPERFORMANCE_RUN=1
-lrt / Heap
CoreMark 1.0 : 1336.094595 / GCC4.6.2 -O2 -static -DPERFORMANCE_RUN=1
-lrt / Heap
CoreMark 1.0 : 1339.943722 / GCC4.6.2 -O2 -static -DPERFORMANCE_RUN=1
-lrt / Heap


master + PMM + fast:

C Converted Double Precision Whetstones: 204.1 MIPS
C Converted Double Precision Whetstones: 208.3 MIPS

CoreMark 1.0 : 1297.690112 / GCC4.6.2 -O2 -static -DPERFORMANCE_RUN=1
-lrt / Heap
CoreMark 1.0 : 1299.629606 / GCC4.6.2 -O2 -static -DPERFORMANCE_RUN=1
-lrt / Heap
CoreMark 1.0 : 1309.071868 / GCC4.6.2 -O2 -static -DPERFORMANCE_RUN=1
-lrt / Heap
CoreMark 1.0 : 1315.270288 / GCC4.6.2 -O2 -static -DPERFORMANCE_RUN=1
-lrt / Heap
CoreMark 1.0 : 1318.913216 / GCC4.6.2 -O2 -static -DPERFORMANCE_RUN=1
-lrt / Heap
CoreMark 1.0 : 1319.870653 / GCC4.6.2 -O2 -static -DPERFORMANCE_RUN=1
-lrt / Heap
CoreMark 1.0 : 1321.527686 / GCC4.6.2 -O2 -static -DPERFORMANCE_RUN=1
-lrt / Heap


AC100 (Ubuntu Oneiric):

C Converted Double Precision Whetstones: 769.2 MIPS
C Converted Double Precision Whetstones: 833.3 MIPS

CoreMark 1.0 : 2257.506208 / GCC4.6.1 -O2 -static -DPERFORMANCE_RUN=1
-lrt / Heap
CoreMark 1.0 : 2303.086135 / GCC4.6.1 -O2 -static -DPERFORMANCE_RUN=1
-lrt / Heap
CoreMark 1.0 : 2326.122354 / GCC4.6.1 -O2 -static -DPERFORMANCE_RUN=1
-lrt / Heap
CoreMark 1.0 : 2349.624060 / GCC4.6.1 -O2 -static -DPERFORMANCE_RUN=1
-lrt / Heap
CoreMark 1.0 : 2350.360389 / GCC4.6.1 -O2 -static -DPERFORMANCE_RUN=1
-lrt / Heap


Pandaboard (openSUSE Factory):

C Converted Double Precision Whetstones: 833.3 MIPS

CoreMark 1.0 : 2304.855562 / GCC4.6.2 -O2 -static -DPERFORMANCE_RUN=1
-lrt / Heap
CoreMark 1.0 : 2305.209774 / GCC4.6.2 -O2 -static -DPERFORMANCE_RUN=1
-lrt / Heap
CoreMark 1.0 : 2306.273063 / GCC4.6.2 -O2 -static -DPERFORMANCE_RUN=1
-lrt / Heap
CoreMark 1.0 : 2306.627710 / GCC4.6.2 -O2 -static -DPERFORMANCE_RUN=1
-lrt / Heap


Results are sorted, duplicates removed.

whetstone.c was compiled with -mfloat-abi=hard this time, using GCC
(SUSE Linux) 4.6.2, and so was CoreMark except on the AC100 with GCC
(Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1.

coremark.exe was run with parameters 0x0 0x0 0x66 0.
diff mbox

Patch

diff --git a/fpu/softfloat.h b/fpu/softfloat.h
index 07c2929..6fe19d7 100644
--- a/fpu/softfloat.h
+++ b/fpu/softfloat.h
@@ -57,11 +57,11 @@  typedef uint8_t flag;
 typedef uint8_t uint8;
 typedef int8_t int8;
 #ifndef _AIX
-typedef int uint16;
-typedef int int16;
+typedef uint16_t uint16;
+typedef int16_t int16;
 #endif
-typedef unsigned int uint32;
-typedef signed int int32;
+typedef uint32_t uint32;
+typedef int32_t int32;
 typedef uint64_t uint64;
 typedef int64_t int64;

---

[u]int_fast*_t:
---
diff --git a/fpu/softfloat.h b/fpu/softfloat.h
index 07c2929..43486aa 100644
--- a/fpu/softfloat.h
+++ b/fpu/softfloat.h
@@ -54,16 +54,16 @@  these four paragraphs for those parts of this code
that are retained.
 | to the same as `int'.
 *----------------------------------------------------------------------------*/
 typedef uint8_t flag;
-typedef uint8_t uint8;
-typedef int8_t int8;
+typedef uint_fast8_t uint8;
+typedef int_fast8_t int8;
 #ifndef _AIX
-typedef int uint16;
-typedef int int16;
+typedef uint_fast16_t uint16;
+typedef int_fast16_t int16;
 #endif
-typedef unsigned int uint32;
-typedef signed int int32;
-typedef uint64_t uint64;
-typedef int64_t int64;
+typedef uint_fast32_t uint32;
+typedef int_fast32_t int32;
+typedef uint_fast64_t uint64;
+typedef int_fast64_t int64;

 #define LIT64( a ) a##LL