Message ID | 20110809180750.GA19356@hungry-tiger.westford.ibm.com
---|---
State | New
On Tue, Aug 9, 2011 at 2:07 PM, Michael Meissner <meissner@linux.vnet.ibm.com> wrote:
> This is an initial patch to work around the slowdown of sphinx3 on power7 VSX
> that first shows up in GCC 4.6 and is still present in the current GCC 4.7
> trunk.  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50031
>
> The key part of the slowdown is this inner loop in the
> vector_gautbl_eval_logs3 function in sphinx3's vector.c:
>
>     {
>         int32 i, r;
>         float64 f;
>         int32 end, veclen;
>         float32 *m1, *m2, *v1, *v2;
>         float64 dval1, dval2, diff1, diff2;
>
>         /* ... */
>
>         for (i = 0; i < veclen; i++) {
>             diff1 = x[i] - m1[i];
>             dval1 -= diff1 * diff1 * v1[i];
>             diff2 = x[i] - m2[i];
>             dval2 -= diff2 * diff2 * v2[i];
>         }
>
>         /* ... */
>     }
>
> In particular, GCC 4.6 and beyond vectorizes this inner loop.  Because the
> compiler doesn't know the alignment of the float pointers, it generates
> unaligned vector loads unconditionally, which on the powerpc involves a load
> through an aligned pointer followed by a vperm instruction to permute the
> bytes.  Since the code first does the calculation in 32-bit floating point and
> then converts it to 64-bit floating point, the compiler does a vector convert
> of V4SF to V2DF in the loop.  On the powerpc, this involves two more permutes,
> and then the vector conversion.  Thus in the inner loop, there are:
>
>     4 vector loads
>     4 vector permutes to do the unaligned loads
>     8 vector permutes to get things in the right registers for conversion
>     4 vector conversions
>
> This patch offers a new option (-mno-vector-convert-32bit-to-64bit) that
> disables the vector float/int conversions to double.
> Overall this is a win:
>
> GCC 4.6, 32-bit:
>     12% improvement, 464.h264ref
>      5% improvement, 450.soplex
>      3% regression,  465.tonto
>      2% improvement, 481.wrf
>      9% improvement, 482.sphinx3
>
> GCC 4.6, 64-bit:
>      5% improvement, 456.hmmer
>      6% improvement, 464.h264ref
>     14% improvement, 482.sphinx3
>
> GCC 4.7, 32-bit:
>      2% improvement, 437.leslie3d
>      9% improvement, 482.sphinx3
>
> I haven't measured GCC 4.7 in 64-bit mode at the present time, but I can do
> so if desired.

Mike,

Your analysis pinpoints the problem, but the patch is a work-around.  Unlike
-mpointers-to-nested-functions, this patch does not change the ABI, and the
user should not have to be aware of it.  This is a problem in the cost model of
the auto-vectorizer or in instruction selection: GCC should not be generating
this sequence for vectors that are unaligned or whose alignment is unknown.
Introducing a new option that we need to maintain going forward is not the
correct solution.

Thanks, David
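For reference, the inner loop Michael quotes can be reproduced as a standalone, compilable C function. This is an illustrative sketch only: sphinx3's int32/float32/float64 typedefs are spelled out with standard C types, and the function name is invented. The point it demonstrates is the mixed-precision pattern (32-bit loads and arithmetic, 64-bit accumulation) that forces the vectorizer to emit V4SF-to-V2DF conversions inside the loop.

```c
/* Standalone sketch of the sphinx3 inner loop (hypothetical name).
   All loads are 32-bit floats, but the differences are accumulated as
   64-bit doubles, so a vectorizer must widen V4SF lanes to V2DF pairs
   on every iteration.  */
static void
eval_two_gaussians (const float *x, const float *m1, const float *m2,
                    const float *v1, const float *v2, int veclen,
                    double *dval1, double *dval2)
{
  int i;
  double diff1, diff2;

  for (i = 0; i < veclen; i++)
    {
      diff1 = x[i] - m1[i];
      *dval1 -= diff1 * diff1 * v1[i];
      diff2 = x[i] - m2[i];
      *dval2 -= diff2 * diff2 * v2[i];
    }
}
```

Compiling this sketch with `-O3 -mcpu=power7 -S` is a quick way to inspect the permute/convert sequence the thread is discussing.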
On Tue, Aug 9, 2011 at 8:07 PM, Michael Meissner <meissner@linux.vnet.ibm.com> wrote:
> This is an initial patch to work around the slowdown of sphinx3 on power7 VSX
> that first shows up in GCC 4.6 and is still present in the current GCC 4.7
> trunk.  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50031
>
> The key part of the slowdown is this inner loop in the
> vector_gautbl_eval_logs3 function in sphinx3's vector.c:
>
>     {
>         int32 i, r;
>         float64 f;
>         int32 end, veclen;
>         float32 *m1, *m2, *v1, *v2;
>         float64 dval1, dval2, diff1, diff2;
>
>         /* ... */
>
>         for (i = 0; i < veclen; i++) {
>             diff1 = x[i] - m1[i];
>             dval1 -= diff1 * diff1 * v1[i];
>             diff2 = x[i] - m2[i];
>             dval2 -= diff2 * diff2 * v2[i];
>         }
>
>         /* ... */
>     }
>
> In particular, GCC 4.6 and beyond vectorizes this inner loop.  Because the
> compiler doesn't know the alignment of the float pointers, it generates
> unaligned vector loads unconditionally, which on the powerpc involves a load
> through an aligned pointer followed by a vperm instruction to permute the
> bytes.  Since the code first does the calculation in 32-bit floating point and
> then converts it to 64-bit floating point, the compiler does a vector convert
> of V4SF to V2DF in the loop.  On the powerpc, this involves two more permutes,
> and then the vector conversion.  Thus in the inner loop, there are:
>
>     4 vector loads
>     4 vector permutes to do the unaligned loads
>     8 vector permutes to get things in the right registers for conversion
>     4 vector conversions

Are the arrays all well-aligned in practice?  Thus, would versioning the loop
for all-good-alignment help?

If we have 4 permutes and then 8 further ones - can we combine, for example,
an unaligned load permute and the following permute for the sf->df conversion?

Does ppc have a VSX-tuned cost model, and is it applied correctly in this
case?  Maybe we need more fine-grained costs?

Thanks,
Richard.
> This patch offers a new option (-mno-vector-convert-32bit-to-64bit) that
> disables the vector float/int conversions to double.  Overall this is a win:
>
> GCC 4.6, 32-bit:
>     12% improvement, 464.h264ref
>      5% improvement, 450.soplex
>      3% regression,  465.tonto
>      2% improvement, 481.wrf
>      9% improvement, 482.sphinx3
>
> GCC 4.6, 64-bit:
>      5% improvement, 456.hmmer
>      6% improvement, 464.h264ref
>     14% improvement, 482.sphinx3
>
> GCC 4.7, 32-bit:
>      2% improvement, 437.leslie3d
>      9% improvement, 482.sphinx3
>
> I haven't measured GCC 4.7 in 64-bit mode at the present time, but I can do
> so if desired.
>
> While I don't think this is the only solution to 50031, it at least helps us.
> It is encouraging that GCC 4.7 doesn't have the regression in tonto.
>
> I have bootstrapped and run make check on both the 4.6 and 4.7 compilers with
> no regressions.  Is it ok to install in the 4.7 tree?  At present, I have made
> the default to generate the vectorized conversion, but it may make sense to
> flip the default.  Is this patch ok to apply?  Given it affects 4.6, did you
> want to see it in 4.6 as well?
>
> [gcc]
> 2011-08-09  Michael Meissner  <meissner@linux.vnet.ibm.com>
>
> 	PR tree-optimization/50031
> 	* doc/invoke.texi (RS/6000 and PowerPC Options): Add
> 	-mvector-convert-32bit-to-64bit switch.
>
> 	* config/rs6000/vector.md (vec_unpacks_hi_v4sf): Add condition on
> 	-mvector-convert-32bit-to-64bit switch.
> 	(vec_unpacks_lo_v4sf): Ditto.
> 	(vec_unpacks_float_hi_v4si): Ditto.
> 	(vec_unpacks_float_lo_v4si): Ditto.
> 	(vec_unpacku_float_hi_v4si): Ditto.
> 	(vec_unpacku_float_lo_v4si): Ditto.
>
> 	* config/rs6000/rs6000.opt (-mvector-convert-32bit-to-64bit): New
> 	switch to control whether the compiler does 32->64 bit conversions.
>
> [gcc/testsuite]
> 2011-08-09  Michael Meissner  <meissner@linux.vnet.ibm.com>
>
> 	PR tree-optimization/50031
> 	* gcc.target/powerpc/vsx-vector-7.c: New test for
> 	-mvector-convert-32bit-to-64bit.
> 	* gcc.target/powerpc/vsx-vector-8.c: Ditto.
>
> --
> Michael Meissner, IBM
> 5 Technology Place Drive, M/S 2757, Westford, MA 01886-3141, USA
> meissner@linux.vnet.ibm.com	fax +1 (978) 399-6899
On Wed, Aug 10, 2011 at 10:08:54AM +0200, Richard Guenther wrote:
> Are the arrays all well-aligned in practice?  Thus, would versioning the loop
> for all-good-alignment help?

I suspect yes on 64-bit, but no on 32-bit, because malloc does not return
128-bit aligned memory in 32-bit mode.  It only returns memory that is aligned
to double the alignment of size_t.  Long doubles on powerpc are 128 bits, as
are the vector types.

I did a test, eliminating the vec_realign stuff under switch control.  This
has the effect of versioning the loop into a vector loop that is run when all
pointers are aligned, and a scalar loop that is run when they aren't.  I ran
SPEC 2006 in 32-bit mode, and I see the following differences (omitting the
benchmarks that are close enough to baseline):

    Benchmark       % of baseline
    =========       =============
    400.perlbench        96.09%
    429.mcf             104.50%
    456.hmmer            95.85%
    458.sjeng           104.23%
    464.h264ref         112.18%
    483.xalancbmk       102.35%
    410.bwaves          107.02%
    416.gamess           96.01%
    433.milc             98.90%
    434.zeusmp           94.92%
    435.gromacs         105.55%
    450.soplex          108.58%
    453.povray          103.71%
    454.calculix         97.54%
    459.GemsFDTD         97.35%
    465.tonto            97.79%
    470.lbm              98.56%
    481.wrf              87.11%
    482.sphinx3         110.33%

I was hoping that versioning into an aligned loop and an unaligned loop would
eliminate the percentages under 100%.  Note, the powerpc VSX memory
instructions for V4SF/V4SI types can run if the pointer is not aligned to a
128-bit boundary, but there is a slowdown if they get pointers that aren't
aligned to a 64-bit boundary.  I'm doing a run right now with movmisalign
enabled for V4SF/V4SI, and I am seeing some regressions in that run.

> If we have 4 permutes and then 8 further ones - can we combine for example
> an unaligned load permute and the following permute for the sf->df conversion?

I don't think so.  The unaligned load support assembles a 128-bit value in a
register from a left half, a right half, and a mask.
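Richard's versioning idea can be sketched in plain C. This is an illustrative model only (the helper and function names are invented, not GCC output): the vector-friendly copy of the loop runs only when a cheap runtime check proves every pointer is 16-byte aligned, and the scalar copy runs otherwise; on 32-bit powerpc, where malloc only guarantees 8-byte alignment, the scalar fallback would be taken often.

```c
#include <stdint.h>

/* Hypothetical sketch of loop versioning on alignment.  The test ORs
   the addresses together so one AND checks all three pointers.  */
static int
all_aligned_16 (const void *a, const void *b, const void *c)
{
  return (((uintptr_t) a | (uintptr_t) b | (uintptr_t) c) & 15) == 0;
}

static double
sum_sq_diff (const float *x, const float *m, const float *v, int n)
{
  double acc = 0.0;
  int i;

  if (all_aligned_16 (x, m, v))
    {
      /* Aligned version: the vectorizer can use plain aligned vector
         loads here, with no realignment permutes.  */
      for (i = 0; i < n; i++)
        {
          double d = x[i] - m[i];
          acc -= d * d * v[i];
        }
    }
  else
    {
      /* Unaligned fallback: left as a scalar loop.  */
      for (i = 0; i < n; i++)
        {
          double d = x[i] - m[i];
          acc -= d * d * v[i];
        }
    }
  return acc;
}
```

Both copies compute the same result; the benefit comes entirely from which copy the compiler can vectorize cheaply, which is why the benchmark numbers above move in both directions when the unaligned path is removed.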
The Altivec instruction set has an instruction (lvsl) that computes the mask
based on the address, and the loads and stores ignore the bottom 4 bits.  The
unaligned loop looks something like:

    left = vector_load (addr & -16);
    mask = lvsl (addr);
    for (...) {
        addr += 16;
        right = vector_load (addr & -16);
        value = permute (left, right, mask);
        /* ... */
        left = right;
    }

The two permutes for the conversion get the values into the correct place for
the conversion instruction, i.e. if you have a vector with the parts:

    +====+====+====+====+
    | A  | B  | C  | D  |
    +====+====+====+====+

the first permute (xxmrghw) in the conversion would create a vector:

    +====+====+====+====+
    | A  | A  | B  | B  |
    +====+====+====+====+

and the second (xxmrglw) would create:

    +====+====+====+====+
    | C  | C  | D  | D  |
    +====+====+====+====+

Note, the values are doubled because the instruction takes two registers as
input, and we just give the same register for both inputs.  The xvcvspdp
instruction then takes a vector of the form (ignoring the 2nd and 4th fields):

    +====+====+====+====+
    | X  | ?? | Y  | ?? |
    +====+====+====+====+

and converts it to double precision:

    +=========+=========+
    |    X    |    Y    |
    +=========+=========+

> Does ppc have a VSX tuned cost-model and is it applied correctly in this case?
> Maybe we need more fine-grained costs?

The ppc has a cost model, but as I said in 50031, I think it needs to be
improved.
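Both data movements described above can be modeled in portable C. This is a behavioral sketch only (real code would use the Altivec lvsl/vperm instructions and the VSX merge/convert instructions, and both function names here are invented): the realignment mask is just the byte offsets o..o+15 where o = addr & 15, selecting from the two surrounding aligned quadwords, and the merge-then-convert sequence picks words 0 and 2 of each merged vector.

```c
#include <stdint.h>
#include <string.h>

/* Behavioral model of lvsl/vperm realignment: permuting the two aligned
   16-byte loads around addr through the lvsl mask reassembles the 16
   unaligned bytes starting at addr.  */
static void
realign_load (const uint8_t *addr, uint8_t out[16])
{
  const uint8_t *aligned = (const uint8_t *) ((uintptr_t) addr & ~(uintptr_t) 15);
  uint8_t left[16], right[16], mask[16];
  int i;

  memcpy (left, aligned, 16);            /* left  = vector_load (addr & -16) */
  memcpy (right, aligned + 16, 16);      /* right = next aligned quadword    */
  for (i = 0; i < 16; i++)               /* mask  = lvsl (addr)              */
    mask[i] = (uint8_t) (((uintptr_t) addr & 15) + i);
  for (i = 0; i < 16; i++)               /* value = permute (left, right, mask) */
    out[i] = mask[i] < 16 ? left[mask[i]] : right[mask[i] - 16];
}

/* Behavioral model of the V4SF -> 2x V2DF sequence: xxmrghw doubles the
   two high words, xxmrglw the two low words, and xvcvspdp converts
   words 0 and 2 of its input to double precision.  */
static void
v4sf_to_2x_v2df (const float in[4], double hi[2], double lo[2])
{
  float merged_hi[4] = { in[0], in[0], in[1], in[1] };  /* xxmrghw t,x,x */
  float merged_lo[4] = { in[2], in[2], in[3], in[3] };  /* xxmrglw t,x,x */

  hi[0] = merged_hi[0];                                 /* xvcvspdp      */
  hi[1] = merged_hi[2];
  lo[0] = merged_lo[0];
  lo[1] = merged_lo[2];
}
```

The model makes Richard's question concrete: the realignment permute and the merge permutes are both byte shuffles, so in principle a single combined permute per output vector could replace one of each pair.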
Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi	(revision 177467)
+++ gcc/doc/invoke.texi	(working copy)
@@ -813,7 +813,8 @@ See RS/6000 and PowerPC Options.
 -mrecip -mrecip=@var{opt} -mno-recip -mrecip-precision @gol
 -mno-recip-precision @gol
 -mveclibabi=@var{type} -mfriz -mno-friz @gol
--mpointers-to-nested-functions -mno-pointers-to-nested-functions}
+-mpointers-to-nested-functions -mno-pointers-to-nested-functions @gol
+-mvector-convert-32bit-to-64bit -mno-vector-convert-32bit-to-64bit}
 
 @emph{RX Options}
 @gccoptlist{-m64bit-doubles -m32bit-doubles -fpu -nofpu@gol
@@ -16426,6 +16427,13 @@ static chain value to be loaded in regis
 not be able to call through pointers to nested functions or pointers to
 functions compiled in other languages that use the static chain if you
 use the @option{-mno-pointers-to-nested-functions}.
+
+@item -mvector-convert-32bit-to-64bit
+@itemx -mno-vector-convert-32bit-to-64bit
+@opindex mvector-convert-32bit-to-64bit
+Generate (do not generate) code to use VSX vector instructions when
+converting 32-bit types to 64-bit types.  The default is
+@option{-mvector-convert-32bit-to-64bit}.
 
 @end table
 
 @node RX Options

Index: gcc/config/rs6000/vector.md
===================================================================
--- gcc/config/rs6000/vector.md	(revision 177467)
+++ gcc/config/rs6000/vector.md	(working copy)
@@ -797,7 +797,8 @@ (define_expand "vec_pack_ufix_trunc_v2df
 (define_expand "vec_unpacks_hi_v4sf"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SF 1 "vfloat_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SFmode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SFmode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SFmode);
 
@@ -809,7 +810,8 @@ (define_expand "vec_unpacks_hi_v4sf"
 (define_expand "vec_unpacks_lo_v4sf"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SF 1 "vfloat_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SFmode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SFmode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SFmode);
 
@@ -821,7 +823,8 @@ (define_expand "vec_unpacks_lo_v4sf"
 (define_expand "vec_unpacks_float_hi_v4si"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SI 1 "vint_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SImode);
 
@@ -833,7 +836,8 @@ (define_expand "vec_unpacks_float_hi_v4s
 (define_expand "vec_unpacks_float_lo_v4si"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SI 1 "vint_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SImode);
 
@@ -845,7 +849,8 @@ (define_expand "vec_unpacks_float_lo_v4s
 (define_expand "vec_unpacku_float_hi_v4si"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SI 1 "vint_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SImode);
 
@@ -857,7 +862,8 @@ (define_expand "vec_unpacku_float_hi_v4s
 (define_expand "vec_unpacku_float_lo_v4si"
   [(match_operand:V2DF 0 "vfloat_operand" "")
    (match_operand:V4SI 1 "vint_operand" "")]
-  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)"
+  "VECTOR_UNIT_VSX_P (V2DFmode) && VECTOR_UNIT_ALTIVEC_OR_VSX_P (V4SImode)
+   && TARGET_VECTOR_CONVERT_32BIT_TO_64BIT"
 {
   rtx reg = gen_reg_rtx (V4SImode);

Index: gcc/config/rs6000/rs6000.opt
===================================================================
--- gcc/config/rs6000/rs6000.opt	(revision 177467)
+++ gcc/config/rs6000/rs6000.opt	(working copy)
@@ -203,6 +203,10 @@ mvsx-scalar-memory
 Target Undocumented Report Var(TARGET_VSX_SCALAR_MEMORY)
 ; If -mvsx, use VSX scalar memory reference instructions for scalar double (off by default)
 
+mvector-convert-32bit-to-64bit
+Target Report Var(TARGET_VECTOR_CONVERT_32BIT_TO_64BIT) Init(1)
+If -mvsx, enable conversion of 32-bit types to 64-bit types with vector instructions
+
 mvsx-align-128
 Target Undocumented Report Var(TARGET_VSX_ALIGN_128)
 ; If -mvsx, set alignment to 128 bits instead of 32/64