Patchwork PING Re: [PATCH, MIPS] add new peephole for 74k dspr2

login
register
mail settings
Submitter Richard Sandiford
Date Oct. 10, 2012, 7:59 p.m.
Message ID <87zk3uksj2.fsf@talisman.home>
Download mbox | patch
Permalink /patch/190752/
State New
Headers show

Comments

Richard Sandiford - Oct. 10, 2012, 7:59 p.m.
"Maciej W. Rozycki" <macro@codesourcery.com> writes:
> On Sun, 7 Oct 2012, Richard Sandiford wrote:
>
>> >  So I think this can't really be selected automatically for all cores, 
>> > some human-supplied knowledge about the MD unit used is required -- that 
>> > obviously affects other operations too, e.g. some multiplications 
>> > involving a constant that may be cheaper to do either directly or with a 
>> > sequence of additions depending on the MD unit present (unless optimising 
>> > for size, of course).
>> 
>> Yeah, as far as this optimisation goes, I think your original suggestion
>> of using the DFA to work out the best sequence is still right.  If the DFA
>> says that multiplications take a handful of cycles when they actually take
>> 32 then we're not going to optimise multiplication very well whatever happens.
>> 
>> In practice, code destined for non-4kc cores with iterative multipliers
>> would probably tune well with -mtune=4kp.
>
>  Hmm, maybe, although I find requiring people to tune for an obsolete 
> MIPS32 rev. 1 processor to get the right MDU performance figures for 
> modern processors a bit odd (the 4Kp DFA undoubtedly lacks reservations 
> for newer instructions introduced with the MIPS32r2 architecture update).
>
>  I've thought of -miterative-mdu or suchlike a knob that would override 
> the default cost of multiplication/division as appropriate (i.e. to 32/64 
> plus any extra operation-specific constant as required), perhaps by 
> forcing the 4Kp insn reservation (extended to 64 bits as required) over 
> the usual one; of course there would be nothing to override the -mtune=4Kp 
> arrangement with.  Nothing urgent though, just an idea to ponder.

FWIW, using the 4Kp reservation itself wouldn't necessarily work,
because all modern scheduling descriptions define their own CPU units.
You'd probably need to add new reservations to each affected MIPS32
and MIPS64 .md file.

>> Index: gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c
>> ===================================================================
>> --- /dev/null	2012-10-06 09:26:21.400998949 +0100
>> +++ gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c	2012-10-06 22:24:24.857713316 +0100
>> @@ -0,0 +1,22 @@
>> +/* { dg-options "-mdspr2 -mgp32 -mtune=4kp" } */
>> +/* References to RESULT within the loop need to have a higher frequency than
>> +   references to RESULT outside the loop, otherwise there is no reason
>> +   to prefer multiply/accumulator registers over GPRs.  */
>> +/* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
>> +
>> +/* Check that the zero-initialization of the accumulator feeding into
>> +   the madd is done by means of a mult instruction instead of mthi/mtlo.  */
>
>  This comment looks reversed to me as in this case we do NOT want a MULT 
> to be used...
>
>> +
>> +NOMIPS16 long long f (int n, int *v, int m)
>> +{
>> +  long long result = 0;
>> +  int i;
>> +
>> +  for (i = 0; i < n; i++)
>> +    result = __builtin_mips_madd (result, v[i], m);
>> +  return result;
>> +}
>> +
>> +/* { dg-final { scan-assembler-not "mult\t\[^\n\]*\\\$0" } } */
>> +/* { dg-final { scan-assembler "\tmthi\t" } } */
>> +/* { dg-final { scan-assembler "\tmtlo\t" } } */
>
>  ... and in fact you do check that we didn't get one.

Thanks, fixed as follows.

Richard


gcc/testsuite/
	* gcc.target/mips/mips32-dsp-accinit-2.c: Fix test description.
Maciej W. Rozycki - Oct. 11, 2012, 3:26 p.m.
On Wed, 10 Oct 2012, Richard Sandiford wrote:

> >  I've thought of -miterative-mdu or suchlike a knob that would override 
> > the default cost of multiplication/division as appropriate (i.e. to 32/64 
> > plus any extra operation-specific constant as required), perhaps by 
> > forcing the 4Kp insn reservation (extended to 64 bits as required) over 
> > the usual one; of course there would be nothing to override the -mtune=4Kp 
> > arrangement with.  Nothing urgent though, just an idea to ponder.
> 
> FWIW, using the 4Kp reservation itself wouldn't necessarily work,
> because all modern scheduling descriptions define their own CPU units.
> You'd probably need to add new reservations to each affected MIPS32
> and MIPS64 .md file.

 Thanks for the hint.  I have skimmed through documentation I have 
available and the following MTI processors implement an iterative MDU as 
an RTL option: M4K, M14K, M14Kc, and the following ones always have one: 
4Kp, 4KEp.  No 64-bit MTI processor had it so far it would seem (the 5K 
chips have an iterative divider and a high-performance multiplier).

 We currently have no pipeline descriptions for any of the M cores, so my 
concern seemed to be a little bit premature (though still valid of 
course).  The option proposed will only be needed once/if we get any of 
the respective pipeline descriptions.  Arguably anyone targetting them is 
likely going to be after -Os anyway, although at least with the M14Kc I 
think it should not be taken for granted.

 We're missing a pipeline description for the 4KE-family cores as well.  
If not adding costs for the extra MIPS32r2 additions I think at the very 
least we should trivially enable the existing bits of 4k.md for these 
cores; the pipeline is the same across the whole 4K/4KE family and it's 
only the extra instructions that have been added with the E update.

  Maciej

Patch

Index: gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c
===================================================================
--- gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c	2012-10-07 09:45:04.000000000 +0100
+++ gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c	2012-10-10 20:51:44.402528138 +0100
@@ -5,7 +5,8 @@ 
 /* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
 
 /* Check that the zero-initialization of the accumulator feeding into
-   the madd is done by means of a mult instruction instead of mthi/mtlo.  */
+   the madd is done by means of an mthi & mtlo pair instead of a
+   "mult $0,$0" instruction.  */
 
 NOMIPS16 long long f (int n, int *v, int m)
 {