Work around PR 50031, sphinx3 slowdown in powerpc on GCC 4.6 and GCC 4.7

Message ID 20110809180750.GA19356@hungry-tiger.westford.ibm.com
State New

Commit Message

Michael Meissner Aug. 9, 2011, 6:07 p.m. UTC
This is an initial patch to work around the slowdown of sphinx3 on power7 VSX
that first shows up in GCC 4.6 and is still present in the current GCC 4.7
trunk.  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50031

The key part of the slowdown is in this inner loop in the
vector_gautbl_eval_logs3  function in sphinx3 vector.c:

{
  int32 i, r;
  float64 f;
  int32 end, veclen;
  float32 *m1, *m2, *v1, *v2;
  float64 dval1, dval2, diff1, diff2;

    /* ... */

    for (i = 0; i < veclen; i++) {
      diff1 = x[i] - m1[i];
      dval1 -= diff1 * diff1 * v1[i];
      diff2 = x[i] - m2[i];
      dval2 -= diff2 * diff2 * v2[i];
    }

    /* ... */

}

In particular, GCC 4.6 and later vectorizes this inner loop.  Because
it doesn't know the alignment of the float pointers, it generates code to use
unaligned vector loads unconditionally, which on the powerpc involves an
aligned vector load followed by a vperm instruction to permute the
bytes.  Since the code first does the calculation in 32-bit floating point and
then converts it to 64-bit floating point, the compiler does a vector convert
of V4SF to V2DF in the loop.  On the powerpc, this involves two more permutes
and then the vector conversion.  Thus in the inner loop, there are:

    4 vector loads
    4 vector permutes to do the unaligned load
    8 vector permutes to get things in the right registers for conversion
    4 vector conversions

This patch offers a new option (-mno-vector-convert-32bit-to-64bit) that
disables the vector float/int conversions to double.  Overall this is a win:

GCC 4.6, 32-bit:
    12% improvement, 464.h264ref
     5% improvement, 450.soplex
     3% regression,  465.tonto
     2% improvement, 481.wrf
     9% improvement, 482.sphinx3

GCC 4.6, 64-bit:
     5% improvement, 456.hmmer
     6% improvement, 464.h264ref
    14% improvement, 482.sphinx3

GCC 4.7, 32-bit:
      2% improvement, 437.leslie3d
      9% improvement, 482.sphinx3

I haven't measured GCC 4.7 64-bit mode at the present time, but I can do so if
desired.
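
For reference, the new switch is used like any other target option, e.g.
(-mcpu=power7 here is just an illustrative way of enabling VSX):

    gcc -O3 -mcpu=power7 -mno-vector-convert-32bit-to-64bit -c vector.c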

While I don't think this is the only solution to 50031, it at least helps us.
It is encouraging that GCC 4.7 doesn't have the regression in tonto.

I have bootstrapped and run make check on both the 4.6 and 4.7 compilers with
no regressions.  At present, the default is to generate the vectorized
conversion, but it may make sense to flip the default.  Is this patch ok to
install in the 4.7 tree?  Given that it affects 4.6, did you want to see it in
4.6 as well?

[gcc]
2011-08-09  Michael Meissner  <meissner@linux.vnet.ibm.com>

	PR tree-optimization/50031
	* doc/invoke.texi (RS/6000 and PowerPC Options): Add
	-mvector-convert-32bit-to-64bit switch.

	* config/rs6000/vector.md (vec_unpacks_hi_v4sf): Add conditions on
	-mvector-convert-32bit-to-64bit switch.
	(vec_unpacks_lo_v4sf): Ditto.
	(vec_unpacks_float_hi_v4si): Ditto.
	(vec_unpacks_float_lo_v4si): Ditto.
	(vec_unpacku_float_hi_v4si): Ditto.
	(vec_unpacku_float_lo_v4si): Ditto.

	* config/rs6000/rs6000.opt (-mvector-convert-32bit-to-64bit): New
	switch to control whether the compiler does 32->64 bit conversions.

[gcc/testsuite]
2011-08-09  Michael Meissner  <meissner@linux.vnet.ibm.com>

	PR tree-optimization/50031
	* gcc.target/powerpc/vsx-vector-7.c: New test for
	-mvector-convert-32bit-to-64bit.
	* gcc.target/powerpc/vsx-vector-8.c: Ditto.

Comments

David Edelsohn Aug. 9, 2011, 6:30 p.m. UTC | #1
On Tue, Aug 9, 2011 at 2:07 PM, Michael Meissner
<meissner@linux.vnet.ibm.com> wrote:
> This is an initial patch to work around the slowdown of sphinx3 on power7 VSX
> that first shows up in GCC 4.6 and is still present in the current GCC 4.7
> trunk.  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50031
>
> The key part of the slowdown is in this inner loop in the
> vector_gautbl_eval_logs3  function in sphinx3 vector.c:
>
> {
>  int32 i, r;
>  float64 f;
>  int32 end, veclen;
>  float32 *m1, *m2, *v1, *v2;
>  float64 dval1, dval2, diff1, diff2;
>
>    /* ... */
>
>    for (i = 0; i < veclen; i++) {
>      diff1 = x[i] - m1[i];
>      dval1 -= diff1 * diff1 * v1[i];
>      diff2 = x[i] - m2[i];
>      dval2 -= diff2 * diff2 * v2[i];
>    }
>
>    /* ... */
>
> }
>
> In particular, GCC 4.6 and later vectorizes this inner loop.  Because
> it doesn't know the alignment of the float pointers, it generates code to use
> unaligned vector loads unconditionally, which on the powerpc involves an
> aligned vector load followed by a vperm instruction to permute the
> bytes.  Since the code first does the calculation in 32-bit floating point and
> then converts it to 64-bit floating point, the compiler does a vector convert
> of V4SF to V2DF in the loop.  On the powerpc, this involves two more permutes
> and then the vector conversion.  Thus in the inner loop, there are:
>
>    4 vector loads
>    4 vector permutes to do the unaligned load
>    8 vector permutes to get things in the right registers for conversion
>    4 vector conversions
>
> This patch offers a new option (-mno-vector-convert-32bit-to-64bit) that
> disables the vector float/int conversions to double.  Overall this is a win:
>
> GCC 4.6, 32-bit:
>    12% improvement, 464.h264ref
>     5% improvement, 450.soplex
>     3% regression,  465.tonto
>     2% improvement, 481.wrf
>     9% improvement, 482.sphinx3
>
> GCC 4.6, 64-bit:
>     5% improvement, 456.hmmer
>     6% improvement, 464.h264ref
>    14% improvement, 482.sphinx3
>
> GCC 4.7, 32-bit:
>      2% improvement, 437.leslie3d
>      9% improvement, 482.sphinx3
>
> I haven't measured GCC 4.7 64-bit mode at the present time, but I can do so if
> desired.

Mike,

Your analysis pinpoints the problem, but the patch is a work-around.
Unlike -mpointers-to-nested-functions, this patch does not change the
ABI and the user should not be aware of it.  This is a problem in the
cost model of the auto-vectorizer or instruction selection.  GCC
should not be generating this sequence for vectors that are unaligned
or whose alignment is unknown.  Introducing a new option that we need
to maintain going forward is not the correct solution.

Thanks, David
Richard Biener Aug. 10, 2011, 8:08 a.m. UTC | #2
On Tue, Aug 9, 2011 at 8:07 PM, Michael Meissner
<meissner@linux.vnet.ibm.com> wrote:
> This is an initial patch to work around the slowdown of sphinx3 on power7 VSX
> that first shows up in GCC 4.6 and is still present in the current GCC 4.7
> trunk.  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50031
>
> The key part of the slowdown is in this inner loop in the
> vector_gautbl_eval_logs3  function in sphinx3 vector.c:
>
> {
>  int32 i, r;
>  float64 f;
>  int32 end, veclen;
>  float32 *m1, *m2, *v1, *v2;
>  float64 dval1, dval2, diff1, diff2;
>
>    /* ... */
>
>    for (i = 0; i < veclen; i++) {
>      diff1 = x[i] - m1[i];
>      dval1 -= diff1 * diff1 * v1[i];
>      diff2 = x[i] - m2[i];
>      dval2 -= diff2 * diff2 * v2[i];
>    }
>
>    /* ... */
>
> }
>
> In particular, GCC 4.6 and later vectorizes this inner loop.  Because
> it doesn't know the alignment of the float pointers, it generates code to use
> unaligned vector loads unconditionally, which on the powerpc involves an
> aligned vector load followed by a vperm instruction to permute the
> bytes.  Since the code first does the calculation in 32-bit floating point and
> then converts it to 64-bit floating point, the compiler does a vector convert
> of V4SF to V2DF in the loop.  On the powerpc, this involves two more permutes
> and then the vector conversion.  Thus in the inner loop, there are:
>
>    4 vector loads
>    4 vector permutes to do the unaligned load
>    8 vector permutes to get things in the right registers for conversion
>    4 vector conversions

Are the arrays all well-aligned in practice?  Thus, would versioning the loop
for all-good-alignment help?

If we have 4 permutes and then 8 further ones - can we combine for example
an unaligned load permute and the following permute for the sf->df conversion?

Does ppc have a VSX tuned cost-model and is it applied correctly in this case?
Maybe we need more fine-grained costs?

Thanks,
Richard.

Michael Meissner Aug. 10, 2011, 11:07 p.m. UTC | #3
On Wed, Aug 10, 2011 at 10:08:54AM +0200, Richard Guenther wrote:
> Are the arrays all well-aligned in practice?  Thus, would versioning the loop
> for all-good-alignment help?

I suspect yes on 64-bit, but no on 32-bit, because malloc does not return
128-bit aligned memory in 32-bit mode.  It only guarantees memory aligned to
twice the alignment of size_t (8 bytes in 32-bit mode).  Long doubles on the
powerpc are 128 bits, as are the vector types.
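
(If an application can change its allocation calls, it can guarantee the
alignment itself; here is a minimal sketch using posix_memalign, where the
helper name is purely illustrative:

    #include <stdlib.h>

    /* Allocate n floats aligned to 16 bytes, so the aligned vector path
       can be used even in 32-bit mode where malloc only guarantees
       8-byte alignment.  */
    static float *
    alloc_floats_aligned16 (size_t n)
    {
      void *p;

      if (posix_memalign (&p, 16, n * sizeof (float)) != 0)
        return NULL;
      return (float *) p;
    }

Of course, we can't change the SPEC sources, so the compiler has to assume
unknown alignment here.)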

I did a test that eliminates the vec_realign support under switch control.
This has the effect of versioning the loop into a vector loop that is run when
all of the pointers are aligned, and a scalar loop that is run when they
aren't.  I ran SPEC 2006 in 32-bit mode, and I see the following differences
(omitting the benchmarks that are close enough):

Benchmark	   % of baseline (above 100% is an improvement)
=========	   ============================================
400.perlbench        96.09%
429.mcf             104.50%
456.hmmer            95.85%
458.sjeng           104.23%
464.h264ref         112.18%
483.xalancbmk       102.35%
410.bwaves          107.02%
416.gamess           96.01%
433.milc             98.90%
434.zeusmp           94.92%
435.gromacs         105.55%
450.soplex          108.58%
453.povray          103.71%
454.calculix         97.54%
459.GemsFDTD         97.35%
465.tonto            97.79%
470.lbm              98.56%
481.wrf              87.11%
482.sphinx3         110.33%

I had hoped that versioning into an aligned loop and an unaligned loop would
eliminate the percentages under 100%.
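
The alignment check that the versioned loop performs is essentially the
following (a sketch; the helper is illustrative, not what the vectorizer
literally emits):

    #include <stdint.h>

    /* Run the vector loop only when every input pointer is 16-byte
       aligned; otherwise fall back to the scalar loop.  */
    static int
    all_aligned_16 (const float *x, const float *m1, const float *m2,
                    const float *v1, const float *v2)
    {
      uintptr_t bits = ((uintptr_t) x | (uintptr_t) m1 | (uintptr_t) m2
                        | (uintptr_t) v1 | (uintptr_t) v2);

      return (bits & 15) == 0;
    }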

Note, the powerpc VSX memory instructions for V4SF/V4SI types can run if the
pointer is not aligned to a 128-bit boundary, but there is a slowdown if they
get pointers that aren't aligned to a 64-bit boundary.  I'm doing a run right
now, with movmisalign enabled for V4SF/V4SI, and I am seeing some regressions
in the run.

> If we have 4 permutes and then 8 further ones - can we combine for example
> an unaligned load permute and the following permute for the sf->df conversion?

I don't think so.

The unaligned support loads a 128-bit value into a register by combining a
left half and a right half using a permute mask.  The Altivec instruction set
has an instruction (lvsl) that computes the mask based on the address, and the
vector loads and stores ignore the bottom 4 bits of the address.

The unaligned loop looks something like:

	left = vector_load (addr & -16)
	mask = lvsl (addr)
	for (...) {
	    addr += 16;
	    right = vector_load (addr & -16)
	    value = permute (left, right, mask);
	    	  /* ... */
	    left = right;
	}
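
In intrinsics form, a minimal compilable sketch of the idiom (using the
standard altivec.h intrinsics; the function name and loop shape are
illustrative, not the exact code GCC emits):

    #include <altivec.h>

    /* Load nvec vectors of floats starting at a possibly unaligned addr.
       vec_ld ignores the bottom 4 bits of the address, so every load is
       aligned; vec_perm stitches the realigned value together.  */
    void
    load_realigned (const float *addr, vector float *out, int nvec)
    {
      vector float left = vec_ld (0, addr);            /* first aligned chunk */
      vector unsigned char mask = vec_lvsl (0, addr);  /* mask from address */
      int i;

      for (i = 0; i < nvec; i++)
        {
          vector float right = vec_ld (16 * (i + 1), addr);
          out[i] = vec_perm (left, right, mask);       /* realigned value */
          left = right;
        }
    }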

The two permutes for the conversion get the values into the correct place for
the conversion instruction, i.e. if you have a vector with the parts:

	+====+====+====+====+
	|  A |  B |  C |  D |
	+====+====+====+====+

The first permute (xxmrghw) in the conversion would create a vector:

	+====+====+====+====+
	|  A |  A |  B |  B |
	+====+====+====+====+

and the second (xxmrglw) would create:

	+====+====+====+====+
	|  C |  C |  D |  D |
	+====+====+====+====+

Note that each value appears twice because the instruction takes 2 registers
as input, and we just give the same register for both inputs.

The xvcvspdp instruction then takes a vector of the form (ignoring the 2nd and
4th fields):

	+====+====+====+====+
	|  X | ?? |  Y | ?? |
	+====+====+====+====+

and converts it to double precision:

	+=========+=========+
	|       X |       Y |
	+=========+=========+
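
In intrinsics form, a sketch of that expansion (vec_mergeh/vec_mergel are the
standard altivec.h merge intrinsics; __builtin_vsx_xvcvspdp is the builtin
name I'm assuming for the conversion):

    #include <altivec.h>

    /* Sketch of the V4SF -> 2 x V2DF unpack described above.  The merges
       duplicate each lane so that A/B and C/D land in the fields that
       xvcvspdp converts (it ignores the 2nd and 4th fields).
       (__builtin_vsx_xvcvspdp is an assumed builtin name.)  */
    void
    unpack_sf_to_df (vector float v, vector double *hi, vector double *lo)
    {
      vector float vhi = vec_mergeh (v, v);   /* { A, A, B, B } */
      vector float vlo = vec_mergel (v, v);   /* { C, C, D, D } */

      *hi = __builtin_vsx_xvcvspdp (vhi);     /* { A, B } */
      *lo = __builtin_vsx_xvcvspdp (vlo);     /* { C, D } */
    }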

> Does ppc have a VSX tuned cost-model and is it applied correctly in this case?
> Maybe we need more fine-grained costs?

The ppc has a cost model, but as I said in PR 50031, I think it needs to be
improved.

Patch

Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi	(revision 177467)
+++ gcc/doc/invoke.texi	(working copy)
@@ -813,7 +813,8 @@  See RS/6000 and PowerPC Options.
 -mrecip -mrecip=@var{opt} -mno-recip -mrecip-precision @gol
 -mno-recip-precision @gol
 -mveclibabi=@var{type} -mfriz -mno-friz @gol
--mpointers-to-nested-functions -mno-pointers-to-nested-functions}
+-mpointers-to-nested-functions -mno-pointers-to-nested-functions @gol
+-mvector-convert-32bit-to-64bit -mno-vector-convert-32bit-to-64bit}
 
 @emph{RX Options}
 @gccoptlist{-m64bit-doubles  -m32bit-doubles  -fpu  -nofpu@gol
@@ -16426,6 +16427,13 @@  static chain value to be loaded in regis
 not be able to call through pointers to nested functions or pointers
 to functions compiled in other languages that use the static chain if
 you use the @option{-mno-pointers-to-nested-functions}.
+
+@item -mvector-convert-32bit-to-64bit
+@itemx -mno-vector-convert-32bit-to-64bit
+@opindex mvector-convert-32bit-to-64bit
+Generate (do not generate) code to use VSX vector instructions when
+converting 32-bit types to 64-bit types.  The default is
+@option{-mvector-convert-32bit-to-64bit}.
 @end table
 
 @node RX Options
Index: gcc/config/rs6000/vector.md
===================================================================
--- gcc/config/rs6000/vector.md	(revision 177467)
+++ gcc/config/rs6000/vector.md	(working copy)
@@ -797,7 +797,8 @@  (define_expand "vec_pack_ufix_trunc_v2df
 (define_expand "vec_unpacks_hi_v4sf"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SF 1 "vfloat_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SFmode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SFmode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SFmode);
 
@@ -809,7 +810,8 @@  (define_expand "vec_unpacks_hi_v4sf"
 (define_expand "vec_unpacks_lo_v4sf"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SF 1 "vfloat_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SFmode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SFmode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SFmode);
 
@@ -821,7 +823,8 @@  (define_expand "vec_unpacks_lo_v4sf"
 (define_expand "vec_unpacks_float_hi_v4si"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SI 1 "vint_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SImode);
 
@@ -833,7 +836,8 @@  (define_expand "vec_unpacks_float_hi_v4s
 (define_expand "vec_unpacks_float_lo_v4si"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SI 1 "vint_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SImode);
 
@@ -845,7 +849,8 @@  (define_expand "vec_unpacks_float_lo_v4s
 (define_expand "vec_unpacku_float_hi_v4si"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SI 1 "vint_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SImode);
 
@@ -857,7 +862,8 @@  (define_expand "vec_unpacku_float_hi_v4s
 (define_expand "vec_unpacku_float_lo_v4si"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SI 1 "vint_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SImode);
 
Index: gcc/config/rs6000/rs6000.opt
===================================================================
--- gcc/config/rs6000/rs6000.opt	(revision 177467)
+++ gcc/config/rs6000/rs6000.opt	(working copy)
@@ -203,6 +203,10 @@  mvsx-scalar-memory
 Target Undocumented Report Var(TARGET_VSX_SCALAR_MEMORY)
 ; If -mvsx, use VSX scalar memory reference instructions for scalar double (off by default)
 
+mvector-convert-32bit-to-64bit
+Target Report Var(TARGET_VECTOR_CONVERT_32BIT_TO_64BIT) Init(1)
+If -mvsx, enable conversion of 32-bit types to 64-bit types with vector instructions
+
 mvsx-align-128
 Target Undocumented Report Var(TARGET_VSX_ALIGN_128)
 ; If -mvsx, set alignment to 128 bits instead of 32/64