[i386]: Use reciprocal sequences for vectorized SFmode division and sqrtf(x) for -ffast-math

Submitter Uros Bizjak
Date Oct. 20, 2011, 8:20 a.m.
Message ID <CAFULd4Zd0=NwVWZwOUvsD9AWWsGjEzjXsRezTL-Pe-_MDvM46w@mail.gmail.com>
Permalink /patch/120759/
State New

Comments

Uros Bizjak - Oct. 20, 2011, 8:20 a.m.
Hello!

This patch builds on the recent patch by Michael (which implemented
fine-grained control of the -mrecip option). With -ffast-math, it emits
reciprocal sequences with an additional NR (Newton-Raphson) step for
vectorized SFmode division and vectorized sqrtf(x).
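
For readers unfamiliar with the technique, here is a minimal scalar
sketch of one Newton-Raphson refinement step for a reciprocal and for a
reciprocal square root. It is the textbook form of the iteration, not
necessarily the exact operation order GCC emits. The function names and
the hand-picked seed values are hypothetical; in the real sequence the
seed comes from a hardware estimate instruction.

```c
#include <assert.h>

/* One Newton-Raphson step refining x0, an estimate of 1/b.  */
static float nr_recip(float b, float x0)
{
  return x0 * (2.0f - b * x0);
}

/* One Newton-Raphson step refining y0, an estimate of 1/sqrt(a).  */
static float nr_rsqrt(float a, float y0)
{
  return 0.5f * y0 * (3.0f - a * y0 * y0);
}

/* a / b computed as a * (refined estimate of 1/b).  */
static float div_via_recip(float a, float b, float x0)
{
  return a * nr_recip(b, x0);
}

/* sqrt(a) computed as a * (refined estimate of 1/sqrt(a)); assumes a > 0.  */
static float sqrt_via_rsqrt(float a, float y0)
{
  return a * nr_rsqrt(a, y0);
}

static float abs_err(float got, float want)
{
  float d = got - want;
  return d < 0.0f ? -d : d;
}

/* Seeds that are off by 0.04% and 0.02% stand in for the hardware
   estimates (hypothetical values, chosen only for illustration).  */
static void nr_demo(void)
{
  assert(abs_err(0.2499f, 0.25f) > 5e-5f);  /* the raw seed is coarse */
  assert(abs_err(div_via_recip(1.0f, 4.0f, 0.2499f), 0.25f) < 1e-6f);
  assert(abs_err(sqrt_via_rsqrt(4.0f, 0.4999f), 2.0f) < 1e-5f);
}
```

Each step roughly squares the relative error of the seed, which is why
a single step is enough to bring a coarse estimate close to
single-precision accuracy.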

2011-10-20  Uros Bizjak  <ubizjak@gmail.com>

	* config/i386/i386.h (RECIP_MASK_DEFAULT): New define.
	* config/i386/i386.opt (recip_mask): Initialize with RECIP_MASK_DEFAULT.
	* doc/invoke.texi (mrecip): Document that GCC implements vectorized
	single float division and vectorized sqrtf(x) with reciprocal sequence
	with additional Newton-Raphson step with -ffast-math.
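
As a rough standalone illustration of what these entries amount to, the
sketch below models recip_mask as a plain global initialized to the new
default. Only RECIP_MASK_VEC_SQRT's value (0x08) is visible in the
patch context; the other three mask values, and the TARGET_RECIP_VEC_*
test macros, are assumptions written by analogy with the macros the
patch does show. The real variable is generated from i386.opt, and the
-ffast-math gating is not modeled here.

```c
#include <assert.h>

#define RECIP_MASK_DIV       0x01  /* assumed value, not shown in the patch */
#define RECIP_MASK_SQRT      0x02  /* assumed value, not shown in the patch */
#define RECIP_MASK_VEC_DIV   0x04  /* assumed value, not shown in the patch */
#define RECIP_MASK_VEC_SQRT  0x08  /* value taken from the patch context */
#define RECIP_MASK_ALL  (RECIP_MASK_DIV | RECIP_MASK_SQRT \
                         | RECIP_MASK_VEC_DIV | RECIP_MASK_VEC_SQRT)
#define RECIP_MASK_DEFAULT (RECIP_MASK_VEC_DIV | RECIP_MASK_VEC_SQRT)

/* Stand-in for the TargetVariable declared in i386.opt.  */
static int recip_mask = RECIP_MASK_DEFAULT;

#define TARGET_RECIP_DIV       ((recip_mask & RECIP_MASK_DIV) != 0)
#define TARGET_RECIP_SQRT      ((recip_mask & RECIP_MASK_SQRT) != 0)
/* The two macros below are assumed by analogy with the two above.  */
#define TARGET_RECIP_VEC_DIV   ((recip_mask & RECIP_MASK_VEC_DIV) != 0)
#define TARGET_RECIP_VEC_SQRT  ((recip_mask & RECIP_MASK_VEC_SQRT) != 0)

/* With the default mask, only the vectorized variants are enabled.  */
static void recip_mask_demo(void)
{
  assert(TARGET_RECIP_VEC_DIV);
  assert(TARGET_RECIP_VEC_SQRT);
  assert(!TARGET_RECIP_DIV);
  assert(!TARGET_RECIP_SQRT);
}
```

The point of the default is that the scalar bits stay clear, so scalar
division and sqrtf(x) keep their existing behavior unless -mrecip is
given.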

The patch was tested on x86_64-pc-linux-gnu, but I would like Joseph
to check that I didn't mess something up in the options handling.

With the patch, gas_dyn from the polyhedron testsuite runs 7% faster
on corei7-avx.

Uros.
Michael Matz - Oct. 20, 2011, 9:31 a.m.
Hi,

On Thu, 20 Oct 2011, Uros Bizjak wrote:

> This patch builds on the recent patch by Michael (which implemented
> fine-grained control of the -mrecip option). With -ffast-math, it emits
> reciprocal sequences with an additional NR (Newton-Raphson) step for
> vectorized SFmode division and vectorized sqrtf(x).

FWIW, I haven't yet gotten around to doing the same for cpu2006, but here
are the two polyhedron results (sandybridge, with baseflags "-Ofast
-funroll-loops -fpeel-loops -march=corei7-avx -mveclibabi=svml -flto
-fwhole-program", i.e. without increasing the inline limits, and linking
against libimf and libsvml).  With the above flags:

  Benchmark   Compile  Executable   Ave Run  Number   Estim
        Name    (secs)     (bytes)    (secs) Repeats   Err %
   ---------   -------  ----------   ------- -------  ------
          ac      4.68     4086864      6.16       2  0.0211
      aermod     68.22     5603956     13.40       5  0.1864
         air     10.46     4961134      3.78       5  0.2888
    capacita      3.74     4213850     19.24       3  0.0998
     channel      1.44     4808524      1.22       5  0.2898
       doduc     12.64     4288238     19.91       5  0.1128
     fatigue      4.47     4217301      3.71       5  0.0989
     gas_dyn      6.92     4211997      3.43       5  2.8640
      induct      7.44     4385543     10.33       5  0.2719
       linpk      1.28     4053798      5.88       2  0.0647
        mdbx      3.97     4114107      7.63       5  0.1365
          nf      4.89     4147809      7.90       2  0.0380
     protein     15.07     5049415     20.70       5  0.7615
      rnflow     11.89     4260434     16.05       5  0.1359
    test_fpu      8.11     4207868      3.69       5  0.6687
        tfft      0.99     4110713      0.84       5  0.3024

Geometric Mean Execution Time =       6.35 seconds

With the above flags plus "-mrecip=vec-sqrt,vec-div":

   Benchmark   Compile  Executable   Ave Run  Number   Estim
        Name    (secs)     (bytes)    (secs) Repeats   Err %
   ---------   -------  ----------   ------- -------  ------
          ac      3.85     4086864      6.17       2  0.0227
      aermod     68.31     5603956     13.38       2  0.0019
         air     10.92     4961134      3.77       5  0.1367
    capacita      3.71     4213850     18.68       2  0.0391
     channel      1.41     4808524      1.22       5  0.3327
       doduc     12.66     4288238     19.93       5  0.2391
     fatigue      4.36     4217301      3.70       2  0.0567
     gas_dyn      6.91     4211997      2.31       2  0.0867
      induct      7.46     4385543     10.31       5  0.1201
       linpk      1.70     4053798      5.88       2  0.0383
        mdbx      3.98     4114107      7.68       5  0.4000
          nf      4.89     4147809      7.89       2  0.0348
     protein     14.00     5049415     20.51       2  0.0478
      rnflow     11.89     4260434     16.05       4  0.0837
    test_fpu      8.09     4207868      3.71       5  0.7097
        tfft      1.13     4110713      0.83       5  0.2290

Geometric Mean Execution Time =       6.18 seconds

I.e. gas_dyn improves quite a bit (as expected), and the rest still works.
I know that cpu2006 also works but, as I said, I have no recent measurements
for it; I'm going to take them now.
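
To put numbers on the two tables above, a small sketch of the
arithmetic: the relative reduction in run time for gas_dyn (3.43 s to
2.31 s) and for the geometric mean (6.35 s to 6.18 s). The helper name
is hypothetical; the inputs are copied from the tables.

```c
#include <assert.h>

/* Percentage by which the run time dropped, relative to the baseline.  */
static double percent_reduction(double before_secs, double after_secs)
{
  return 100.0 * (before_secs - after_secs) / before_secs;
}

static void speedup_demo(void)
{
  double gas_dyn = percent_reduction(3.43, 2.31);  /* about 32.7 */
  double geomean = percent_reduction(6.35, 6.18);  /* about 2.7 */

  assert(gas_dyn > 32.6 && gas_dyn < 32.7);
  assert(geomean > 2.6 && geomean < 2.7);
}
```

So on this setup gas_dyn's run time drops by roughly a third, while the
geometric mean over all sixteen benchmarks moves by under 3%.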


Ciao,
Michael.
Joseph S. Myers - Oct. 20, 2011, 2:45 p.m.
On Thu, 20 Oct 2011, Uros Bizjak wrote:

> The patch was tested on x86_64-pc-linux-gnu, but I would like Joseph
> to check that I didn't mess something up in the options handling.

I have no comments on the option handling in this patch.

> +for vectorized single float division and vectorized sqrtf(x) already with

@code{sqrtf (@var{x})}

Patch

Index: config/i386/i386.h
===================================================================
--- config/i386/i386.h	(revision 180176)
+++ config/i386/i386.h	(working copy)
@@ -2322,6 +2322,7 @@ 
 #define RECIP_MASK_VEC_SQRT	0x08
 #define RECIP_MASK_ALL	(RECIP_MASK_DIV | RECIP_MASK_SQRT \
 			 | RECIP_MASK_VEC_DIV | RECIP_MASK_VEC_SQRT)
+#define RECIP_MASK_DEFAULT (RECIP_MASK_VEC_DIV | RECIP_MASK_VEC_SQRT)
 
 #define TARGET_RECIP_DIV	((recip_mask & RECIP_MASK_DIV) != 0)
 #define TARGET_RECIP_SQRT	((recip_mask & RECIP_MASK_SQRT) != 0)
Index: config/i386/i386.opt
===================================================================
--- config/i386/i386.opt	(revision 180176)
+++ config/i386/i386.opt	(working copy)
@@ -32,7 +32,7 @@ 
 HOST_WIDE_INT ix86_isa_flags_explicit
 
 TargetVariable
-int recip_mask
+int recip_mask = RECIP_MASK_DEFAULT
 
 Variable
 int recip_mask_explicit
Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi	(revision 180176)
+++ doc/invoke.texi	(working copy)
@@ -12927,6 +12927,11 @@ 
 already with @option{-ffast-math} (or the above option combination), and
 doesn't need @option{-mrecip}.
 
+Also note that GCC emits the above sequence with additional Newton-Raphson step
+for vectorized single float division and vectorized sqrtf(x) already with
+@option{-ffast-math} (or the above option combination), and doesn't need
+@option{-mrecip}.
+
 @item -mrecip=@var{opt}
 @opindex mrecip=opt
 This option allows to control which reciprocal estimate instructions