diff mbox

PING Re: [PATCH, MIPS] add new peephole for 74k dspr2

Message ID 87lig719ig.fsf@talisman.home
State New
Headers show

Commit Message

Richard Sandiford Sept. 18, 2012, 6:16 p.m. UTC
Sandra Loosemore <sandra@codesourcery.com> writes:
> On 08/27/2012 10:36 AM, Richard Sandiford wrote:
>> Sandra Loosemore<sandra@codesourcery.com>  writes:
>>> On 08/19/2012 11:22 AM, Richard Sandiford wrote:
>>>>
>>>> Not sure whether a peephole is the right choice here.  In practice,
>>>> I'd imagine these opportunities would only come from a DImode move of
>>>> $0 into a doubleword register, so we could simply emit the pattern in
>>>> mips_split_doubleword_move.
>>>>
>>>> That would also allow us to use it for plain HI and LO.  It wasn't
>>>> obvious from the patch why it was restricted to the DSP extension
>>>> registers.
>>>>
>>>> Please also add a scan-assembler test.
>>>
>>> How is this version of the fix?
>>
>> Just to say that I've not forgotten about this.  I'd still like to
>> remove the !TARGET_64BIT and ISA_HAS_DSP_MULT tests, because the
>> idea isn't specific to either.
>>
>> Also, reviewing the patch made me realise that it might be better
>> to keep the move intact and simply use "mult" in the output code.
>> That's my fault for suggesting the wrong thing, though, so I was
>> hoping to find time this weekend to try it myself.  The testsuite
>> stuff ended up taking up all the available time instead.
>
> Richard,
>
> Have you had time to think about this some more?  I am not sure I can 
> guess how you'd like me to fix this patch now without some more specific 
> review and/or suggestions about where the optimization should happen and 
> what cases it should be extended to detect in addition to the dsp 
> accumulator multiplies.

The patch below is the one I've been testing.  But I got sidetracked
by looking into the possibility of removing the MD0_REG and MD1_REG
classes, in order to get more sensible costs.  I think that was needed
for the madd-9.c test to pass.

Richard

Comments

Maciej W. Rozycki Sept. 24, 2012, 3:08 p.m. UTC | #1
On Tue, 18 Sep 2012, Richard Sandiford wrote:

> > Have you had time to think about this some more?  I am not sure I can 
> > guess how you'd like me to fix this patch now without some more specific 
> > review and/or suggestions about where the optimization should happen and 
> > what cases it should be extended to detect in addition to the dsp 
> > accumulator multiplies.
> 
> The patch below is the one I've been testing.  But I got sidetracked
> by looking into the possibility of removing the MD0_REG and MD1_REG
> classes, in order to get more sensible costs.  I think that was needed
> for the madd-9.c test to pass.

 Sorry to come up with this so late -- I have only now noticed this being 
discussed.

> @@ -4105,39 +4105,55 @@ mips_subword (rtx op, bool high_p)
>    return simplify_gen_subreg (word_mode, op, mode, byte);
>  }
>  
> -/* Return true if a 64-bit move from SRC to DEST should be split into two.  */
> +/* Return true if SRC can be moved into DEST using MULT $0, $0.  */
> +
> +static bool
> +mips_mult_move_p (rtx dest, rtx src)
> +{
> +  return (src == const0_rtx
> +	  && REG_P (dest)
> +	  && GET_MODE_SIZE (GET_MODE (dest)) == 2 * UNITS_PER_WORD
> +	  && (ISA_HAS_DSP_MULT
> +	      ? ACC_REG_P (REGNO (dest))
> +	      : MD_REG_P (REGNO (dest))));
> +}
> +
> +/* Return true if a move from SRC to DEST should be split into two.  */

 Does the DSP ASE guarantee that a MULT $0, $0 is going not to be slower 
than MTHI $0/MTLO $0?  The latency of multiplication varies among 
implementations, for example the original R3000 took 12 cycles (of course 
the R3000 itself is not relevant for this change, but you see the 
picture!).  On the other hand in some (but not all!) processors 
multiplication runs in parallel to the main pipeline so it is the 
difference, if positive, between the number of cycles consumed by other 
instructions up to the next HI/LO access instruction and the latency of 
MULT run in the background that matters.

 From the context I am assuming none of this matters for the 74K (and 
presumably the 24KE/34K) and a MULT $0, $0 is indeed faster, but overall 
isn't it something that should be decided based on instruction costs from 
DFA schedulers?  Is there anything that I've missed here?  It doesn't 
appear to me your (and neither the original) proposal takes instruction 
cost calculation into consideration.

  Maciej
Richard Sandiford Sept. 24, 2012, 8:48 p.m. UTC | #2
"Maciej W. Rozycki" <macro@codesourcery.com> writes:
> On Tue, 18 Sep 2012, Richard Sandiford wrote:
>
>> > Have you had time to think about this some more?  I am not sure I can 
>> > guess how you'd like me to fix this patch now without some more specific 
>> > review and/or suggestions about where the optimization should happen and 
>> > what cases it should be extended to detect in addition to the dsp 
>> > accumulator multiplies.
>> 
>> The patch below is the one I've been testing.  But I got sidetracked
>> by looking into the possibility of removing the MD0_REG and MD1_REG
>> classes, in order to get more sensible costs.  I think that was needed
>> for the madd-9.c test to pass.
>
>  Sorry to come up with this so late -- I have only now noticed this being 
> discussed.
>
>> @@ -4105,39 +4105,55 @@ mips_subword (rtx op, bool high_p)
>>    return simplify_gen_subreg (word_mode, op, mode, byte);
>>  }
>>  
>> -/* Return true if a 64-bit move from SRC to DEST should be split into two.  */
>> +/* Return true if SRC can be moved into DEST using MULT $0, $0.  */
>> +
>> +static bool
>> +mips_mult_move_p (rtx dest, rtx src)
>> +{
>> +  return (src == const0_rtx
>> +	  && REG_P (dest)
>> +	  && GET_MODE_SIZE (GET_MODE (dest)) == 2 * UNITS_PER_WORD
>> +	  && (ISA_HAS_DSP_MULT
>> +	      ? ACC_REG_P (REGNO (dest))
>> +	      : MD_REG_P (REGNO (dest))));
>> +}
>> +
>> +/* Return true if a move from SRC to DEST should be split into two.  */
>
>  Does the DSP ASE guarantee that a MULT $0, $0 is going not to be slower 
> than MTHI $0/MTLO $0?  The latency of multiplication varies among 
> implementations, for example the original R3000 took 12 cycles (of course 
> the R3000 itself is not relevant for this change, but you see the 
> picture!).  On the other hand in some (but not all!) processors 
> multiplication runs in parallel to the main pipeline so it is the 
> difference, if positive, between the number of cycles consumed by other 
> instructions up to the next HI/LO access instruction and the latency of 
> MULT run in the background that matters.
>
>  From the context I am assuming none of this matters for the 74K (and 
> presumably the 24KE/34K) and a MULT $0, $0 is indeed faster, but overall 
> isn't it something that should be decided based on instruction costs from 
> DFA schedulers?  Is there anything that I've missed here?  It doesn't 
> appear to me your (and neither the original) proposal takes instruction 
> cost calculation into consideration.

In practice, we only move 0 into HI and LO for MADD- and MSUB-style
operations.  We deliberately don't use HI and LO as scratch space.

I think it's a reasonable default assumption that anything that supports
those instructions also has a fast path from MULT to MADD or MULT to MSUB.
I certainly don't know of any counter-examples.  The decision is deliberately
centeralised in one place so that the condition can be tweaked in future
if necessary.

Richard
Maciej W. Rozycki Sept. 24, 2012, 10:09 p.m. UTC | #3
On Mon, 24 Sep 2012, Richard Sandiford wrote:

> >  From the context I am assuming none of this matters for the 74K (and 
> > presumably the 24KE/34K) and a MULT $0, $0 is indeed faster, but overall 
> > isn't it something that should be decided based on instruction costs from 
> > DFA schedulers?  Is there anything that I've missed here?  It doesn't 
> > appear to me your (and neither the original) proposal takes instruction 
> > cost calculation into consideration.
> 
> In practice, we only move 0 into HI and LO for MADD- and MSUB-style
> operations.  We deliberately don't use HI and LO as scratch space.
> 
> I think it's a reasonable default assumption that anything that supports
> those instructions also has a fast path from MULT to MADD or MULT to MSUB.

 According to my sources the R4650 has a 4-cycle MULT latency (MAD is 3-4 
cycles on that processor).  An MTHI/MTLO pair will take 2 cycles; 
obviously the resulting larger code may adversely affect cache performance 
in some scenarios.

> I certainly don't know of any counter-examples.  The decision is deliberately
> centeralised in one place so that the condition can be tweaked in future
> if necessary.

 Sure, the cost comparison could be done in that single place as well.

  Maciej
Richard Sandiford Sept. 25, 2012, 8:07 a.m. UTC | #4
"Maciej W. Rozycki" <macro@codesourcery.com> writes:
> On Mon, 24 Sep 2012, Richard Sandiford wrote:
>
>> >  From the context I am assuming none of this matters for the 74K (and 
>> > presumably the 24KE/34K) and a MULT $0, $0 is indeed faster, but overall 
>> > isn't it something that should be decided based on instruction costs from 
>> > DFA schedulers?  Is there anything that I've missed here?  It doesn't 
>> > appear to me your (and neither the original) proposal takes instruction 
>> > cost calculation into consideration.
>> 
>> In practice, we only move 0 into HI and LO for MADD- and MSUB-style
>> operations.  We deliberately don't use HI and LO as scratch space.
>> 
>> I think it's a reasonable default assumption that anything that supports
>> those instructions also has a fast path from MULT to MADD or MULT to MSUB.
>
>  According to my sources the R4650 has a 4-cycle MULT latency (MAD is 3-4 
> cycles on that processor).  An MTHI/MTLO pair will take 2 cycles; 
> obviously the resulting larger code may adversely affect cache performance 
> in some scenarios.

That's not how the 4650 DFA models it though.

(define_insn_reservation "generic_hilo" 1
  (eq_attr "type" "mfhi,mflo,mthi,mtlo")
  "imuldiv*3")

(define_insn_reservation "r4650_imul" 4
  (and (eq_attr "cpu" "r4650")
       (eq_attr "type" "imul,imul3,imadd"))
  "imuldiv*4")

So if we believed the DFA, MTLO + MTHI would occupy the muldiv unit for 6
rather than 4 cycles.  Any attempt to use the DFA would still favour MULT.

Richard
Richard Sandiford Sept. 25, 2012, 8:36 a.m. UTC | #5
Richard Sandiford <rdsandiford@googlemail.com> writes:
> "Maciej W. Rozycki" <macro@codesourcery.com> writes:
>> On Mon, 24 Sep 2012, Richard Sandiford wrote:
>>
>>> >  From the context I am assuming none of this matters for the 74K (and 
>>> > presumably the 24KE/34K) and a MULT $0, $0 is indeed faster, but overall 
>>> > isn't it something that should be decided based on instruction costs from 
>>> > DFA schedulers?  Is there anything that I've missed here?  It doesn't 
>>> > appear to me your (and neither the original) proposal takes instruction 
>>> > cost calculation into consideration.
>>> 
>>> In practice, we only move 0 into HI and LO for MADD- and MSUB-style
>>> operations.  We deliberately don't use HI and LO as scratch space.
>>> 
>>> I think it's a reasonable default assumption that anything that supports
>>> those instructions also has a fast path from MULT to MADD or MULT to MSUB.
>>
>>  According to my sources the R4650 has a 4-cycle MULT latency (MAD is 3-4 
>> cycles on that processor).  An MTHI/MTLO pair will take 2 cycles; 
>> obviously the resulting larger code may adversely affect cache performance 
>> in some scenarios.
>
> That's not how the 4650 DFA models it though.
>
> (define_insn_reservation "generic_hilo" 1
>   (eq_attr "type" "mfhi,mflo,mthi,mtlo")
>   "imuldiv*3")
>
> (define_insn_reservation "r4650_imul" 4
>   (and (eq_attr "cpu" "r4650")
>        (eq_attr "type" "imul,imul3,imadd"))
>   "imuldiv*4")
>
> So if we believed the DFA, MTLO + MTHI would occupy the muldiv unit for 6
> rather than 4 cycles.  Any attempt to use the DFA would still favour MULT.

Although I see the 4kp with its 32-cycle MULTs and MADDs is one where
MULT $0,$0 would be a really bad choice.  Sigh.  The amount of effort
required for this optimisation is getting a bit ridiculous.

Richard
Maciej W. Rozycki Sept. 25, 2012, 5:57 p.m. UTC | #6
On Tue, 25 Sep 2012, Richard Sandiford wrote:

> >>  According to my sources the R4650 has a 4-cycle MULT latency (MAD is 3-4 
> >> cycles on that processor).  An MTHI/MTLO pair will take 2 cycles; 
> >> obviously the resulting larger code may adversely affect cache performance 
> >> in some scenarios.
> >
> > That's not how the 4650 DFA models it though.
> >
> > (define_insn_reservation "generic_hilo" 1
> >   (eq_attr "type" "mfhi,mflo,mthi,mtlo")
> >   "imuldiv*3")
> >
> > (define_insn_reservation "r4650_imul" 4
> >   (and (eq_attr "cpu" "r4650")
> >        (eq_attr "type" "imul,imul3,imadd"))
> >   "imuldiv*4")
> >
> > So if we believed the DFA, MTLO + MTHI would occupy the muldiv unit for 6
> > rather than 4 cycles.  Any attempt to use the DFA would still favour MULT.

 I can't track a reference on R4650 MTHI/MTLO latency; I'd be happy to 
learn of one, or otherwise I wonder where the delay is coming from.  Also 
a small update: apparently MULT is 3 clocks only on the R4650 where 
operands are 16 bits (unsure if it is enough if only one is; for a zero by 
zero multiplication it surely does not matter though).  So I think using a 
MULT here is at least reasonable.

> Although I see the 4kp with its 32-cycle MULTs and MADDs is one where
> MULT $0,$0 would be a really bad choice.  Sigh.  The amount of effort
> required for this optimisation is getting a bit ridiculous.

 I have double-checked some documentation, and in fact many MIPS cores, 
including the current ones, have a configuration option to include either 
a high-performance or an area-efficient MD unit.  Take the M14Kc for 
example -- its high-performance unit has a one-cycle latency/issue rate 
for 16-bit multiplication (two-cycle for full 32 bits; here the width of 
rt is explicitly named) and the area-efficient has a 32-cycle 
latency/issue rate only regardless of the operand size (obviously 
iterating over addition one bit at a time).  The latency of MTHI/MTLO is 1 
across both units.

 So I think this can't really be selected automatically for all cores, 
some human-supplied knowledge about the MD unit used is required -- that 
obviously affects other operations too, e.g. some multiplications 
involving a constant that may be cheaper to do either directly or with a 
sequence of additions depending on the MD unit present (unless optimising 
for size, of course).

  Maciej
diff mbox

Patch

Index: gcc/config/mips/mips-protos.h
===================================================================
--- gcc/config/mips/mips-protos.h	2012-09-03 07:49:57.319188985 +0100
+++ gcc/config/mips/mips-protos.h	2012-09-04 20:15:10.240130458 +0100
@@ -212,8 +212,8 @@  extern int m16_simm8_8 (rtx, enum machin
 extern int m16_nsimm8_8 (rtx, enum machine_mode);
 
 extern rtx mips_subword (rtx, bool);
-extern bool mips_split_64bit_move_p (rtx, rtx);
-extern void mips_split_doubleword_move (rtx, rtx);
+extern bool mips_split_move_p (rtx, rtx);
+extern void mips_split_move (rtx, rtx);
 extern const char *mips_output_move (rtx, rtx);
 extern bool mips_cfun_has_cprestore_slot_p (void);
 extern bool mips_cprestore_address_p (rtx, bool);
Index: gcc/config/mips/mips.c
===================================================================
--- gcc/config/mips/mips.c	2012-09-04 20:15:08.191130518 +0100
+++ gcc/config/mips/mips.c	2012-09-04 20:15:17.173130256 +0100
@@ -2395,11 +2395,11 @@  mips_load_store_insns (rtx mem, rtx insn
   mode = GET_MODE (mem);
 
   /* Try to prove that INSN does not need to be split.  */
-  might_split_p = true;
-  if (GET_MODE_BITSIZE (mode) == 64)
+  might_split_p = GET_MODE_SIZE (mode) > UNITS_PER_WORD;
+  if (might_split_p)
     {
       set = single_set (insn);
-      if (set && !mips_split_64bit_move_p (SET_DEST (set), SET_SRC (set)))
+      if (set && !mips_split_move_p (SET_DEST (set), SET_SRC (set)))
 	might_split_p = false;
     }
 
@@ -4105,39 +4105,55 @@  mips_subword (rtx op, bool high_p)
   return simplify_gen_subreg (word_mode, op, mode, byte);
 }
 
-/* Return true if a 64-bit move from SRC to DEST should be split into two.  */
+/* Return true if SRC can be moved into DEST using MULT $0, $0.  */
+
+static bool
+mips_mult_move_p (rtx dest, rtx src)
+{
+  return (src == const0_rtx
+	  && REG_P (dest)
+	  && GET_MODE_SIZE (GET_MODE (dest)) == 2 * UNITS_PER_WORD
+	  && (ISA_HAS_DSP_MULT
+	      ? ACC_REG_P (REGNO (dest))
+	      : MD_REG_P (REGNO (dest))));
+}
+
+/* Return true if a move from SRC to DEST should be split into two.  */
 
 bool
-mips_split_64bit_move_p (rtx dest, rtx src)
+mips_split_move_p (rtx dest, rtx src)
 {
-  if (TARGET_64BIT)
+  /* Check whether the move can be done using some variant of MULT $0,$0.  */
+  if (mips_mult_move_p (dest, src))
     return false;
 
   /* FPR-to-FPR moves can be done in a single instruction, if they're
      allowed at all.  */
-  if (FP_REG_RTX_P (src) && FP_REG_RTX_P (dest))
+  unsigned int size = GET_MODE_SIZE (GET_MODE (dest));
+  if (size == 8 && FP_REG_RTX_P (src) && FP_REG_RTX_P (dest))
     return false;
 
   /* Check for floating-point loads and stores.  */
-  if (ISA_HAS_LDC1_SDC1)
+  if (size == 8 && ISA_HAS_LDC1_SDC1)
     {
       if (FP_REG_RTX_P (dest) && MEM_P (src))
 	return false;
       if (FP_REG_RTX_P (src) && MEM_P (dest))
 	return false;
     }
-  return true;
+
+  /* Otherwise split all multiword moves.  */
+  return size > UNITS_PER_WORD;
 }
 
-/* Split a doubleword move from SRC to DEST.  On 32-bit targets,
-   this function handles 64-bit moves for which mips_split_64bit_move_p
-   holds.  For 64-bit targets, this function handles 128-bit moves.  */
+/* Split a move from SRC to DEST, given that mips_split_move_p holds.  */
 
 void
-mips_split_doubleword_move (rtx dest, rtx src)
+mips_split_move (rtx dest, rtx src)
 {
   rtx low_dest;
 
+  gcc_assert (mips_split_move_p (dest, src));
   if (FP_REG_RTX_P (dest) || FP_REG_RTX_P (src))
     {
       if (!TARGET_64BIT && GET_MODE (dest) == DImode)
@@ -4209,7 +4225,7 @@  mips_output_move (rtx dest, rtx src)
   mode = GET_MODE (dest);
   dbl_p = (GET_MODE_SIZE (mode) == 8);
 
-  if (dbl_p && mips_split_64bit_move_p (dest, src))
+  if (mips_split_move_p (dest, src))
     return "#";
 
   if ((src_code == REG && GP_REG_P (REGNO (src)))
@@ -4220,6 +4236,14 @@  mips_output_move (rtx dest, rtx src)
 	  if (GP_REG_P (REGNO (dest)))
 	    return "move\t%0,%z1";
 
+	  if (mips_mult_move_p (dest, src))
+	    {
+	      if (ISA_HAS_DSP_MULT)
+		return "mult\t%q0,%.,%.";
+	      else
+		return "mult\t%.,%.";
+	    }
+
 	  /* Moves to HI are handled by special .md insns.  */
 	  if (REGNO (dest) == LO_REGNUM)
 	    return "mtlo\t%z1";
@@ -10430,8 +10454,8 @@  mips_save_reg (rtx reg, rtx mem)
     {
       rtx x1, x2;
 
-      if (mips_split_64bit_move_p (mem, reg))
-	mips_split_doubleword_move (mem, reg);
+      if (mips_split_move_p (mem, reg))
+	mips_split_move (mem, reg);
       else
 	mips_emit_move (mem, reg);
 
Index: gcc/config/mips/mips.md
===================================================================
--- gcc/config/mips/mips.md	2012-09-03 07:49:57.318188985 +0100
+++ gcc/config/mips/mips.md	2012-09-04 20:15:10.293130456 +0100
@@ -204,7 +204,7 @@  (define_attr "jal_macro" "no,yes"
 ;; the split instructions; in some cases, it is more appropriate for the
 ;; scheduling type to be "multi" instead.
 (define_attr "move_type"
-  "unknown,load,fpload,store,fpstore,mtc,mfc,mtlo,mflo,move,fmove,
+  "unknown,load,fpload,store,fpstore,mtc,mfc,mtlo,mflo,imul,move,fmove,
    const,constN,signext,ext_ins,logical,arith,sll0,andi,loadpool,
    shift_shift"
   (const_string "unknown"))
@@ -369,6 +369,7 @@  (define_attr "type"
 	 (eq_attr "move_type" "mflo") (const_string "mflo")
 
 	 ;; These types of move are always single insns.
+	 (eq_attr "move_type" "imul") (const_string "imul")
 	 (eq_attr "move_type" "fmove") (const_string "fmove")
 	 (eq_attr "move_type" "loadpool") (const_string "load")
 	 (eq_attr "move_type" "signext") (const_string "signext")
@@ -4254,13 +4255,13 @@  (define_insn "*mov<mode>_ra"
    (set_attr "mode" "<MODE>")])
 
 (define_insn "*movdi_32bit"
-  [(set (match_operand:DI 0 "nonimmediate_operand" "=d,d,d,m,*a,*d,*f,*f,*d,*m,*B*C*D,*B*C*D,*d,*m")
-	(match_operand:DI 1 "move_operand" "d,i,m,d,*J*d,*a,*J*d,*m,*f,*f,*d,*m,*B*C*D,*B*C*D"))]
+  [(set (match_operand:DI 0 "nonimmediate_operand" "=d,d,d,m,*a,*a,*d,*f,*f,*d,*m,*B*C*D,*B*C*D,*d,*m")
+	(match_operand:DI 1 "move_operand" "d,i,m,d,*J,*d,*a,*J*d,*m,*f,*f,*d,*m,*B*C*D,*B*C*D"))]
   "!TARGET_64BIT && !TARGET_MIPS16
    && (register_operand (operands[0], DImode)
        || reg_or_0_operand (operands[1], DImode))"
   { return mips_output_move (operands[0], operands[1]); }
-  [(set_attr "move_type" "move,const,load,store,mtlo,mflo,mtc,fpload,mfc,fpstore,mtc,fpload,mfc,fpstore")
+  [(set_attr "move_type" "move,const,load,store,imul,mtlo,mflo,mtc,fpload,mfc,fpstore,mtc,fpload,mfc,fpstore")
    (set_attr "mode" "DI")])
 
 (define_insn "*movdi_32bit_mips16"
@@ -4707,14 +4708,14 @@  (define_expand "movti"
 })
 
 (define_insn "*movti"
-  [(set (match_operand:TI 0 "nonimmediate_operand" "=d,d,d,m,*a,*d")
-	(match_operand:TI 1 "move_operand" "d,i,m,dJ,*d*J,*a"))]
+  [(set (match_operand:TI 0 "nonimmediate_operand" "=d,d,d,m,*a,*a,*d")
+	(match_operand:TI 1 "move_operand" "d,i,m,dJ,*J,*d,*a"))]
   "TARGET_64BIT
    && !TARGET_MIPS16
    && (register_operand (operands[0], TImode)
        || reg_or_0_operand (operands[1], TImode))"
-  "#"
-  [(set_attr "move_type" "move,const,load,store,mtlo,mflo")
+  { return mips_output_move (operands[0], operands[1]); }
+  [(set_attr "move_type" "move,const,load,store,imul,mtlo,mflo")
    (set_attr "mode" "TI")])
 
 (define_insn "*movti_mips16"
@@ -4765,21 +4766,20 @@  (define_insn "*movtf_mips16"
 (define_split
   [(set (match_operand:MOVE64 0 "nonimmediate_operand")
 	(match_operand:MOVE64 1 "move_operand"))]
-  "reload_completed && !TARGET_64BIT
-   && mips_split_64bit_move_p (operands[0], operands[1])"
+  "reload_completed && mips_split_move_p (operands[0], operands[1])"
   [(const_int 0)]
 {
-  mips_split_doubleword_move (operands[0], operands[1]);
+  mips_split_move (operands[0], operands[1]);
   DONE;
 })
 
 (define_split
   [(set (match_operand:MOVE128 0 "nonimmediate_operand")
 	(match_operand:MOVE128 1 "move_operand"))]
-  "TARGET_64BIT && reload_completed"
+  "reload_completed && mips_split_move_p (operands[0], operands[1])"
   [(const_int 0)]
 {
-  mips_split_doubleword_move (operands[0], operands[1]);
+  mips_split_move (operands[0], operands[1]);
   DONE;
 })
 
Index: gcc/testsuite/gcc.target/mips/madd-9.c
===================================================================
--- gcc/testsuite/gcc.target/mips/madd-9.c	2012-09-03 07:49:57.318188985 +0100
+++ gcc/testsuite/gcc.target/mips/madd-9.c	2012-09-04 21:21:17.132015119 +0100
@@ -1,7 +1,13 @@ 
 /* { dg-do compile } */
 /* { dg-options "isa_rev>=1 -mgp32" } */
-/* { dg-skip-if "code quality test" { *-*-* } { "-O0" } { "" } } */
+/* References to X within the loop need to have a higher frequency than
+   references to X outside the loop, otherwise there is no reason
+   to prefer multiply/accumulator registers over GPRs.  */
+/* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
 /* { dg-final { scan-assembler-not "\tmul\t" } } */
+/* { dg-final { scan-assembler-not "\tmthi" } } */
+/* { dg-final { scan-assembler-not "\tmtlo" } } */
+/* { dg-final { scan-assembler "\tmult\t" } } */
 /* { dg-final { scan-assembler "\tmadd\t" } } */
 
 NOMIPS16 long long
Index: gcc/testsuite/gcc.target/mips/mips32-dsp-accinit.c
===================================================================
--- /dev/null	2012-08-19 20:42:12.842999468 +0100
+++ gcc/testsuite/gcc.target/mips/mips32-dsp-accinit.c	2012-09-04 21:04:52.697043742 +0100
@@ -0,0 +1,20 @@ 
+/* { dg-options "-mdspr2 -mgp32" } */
+/* References to RESULT within the loop need to have a higher frequency than
+   references to RESULT outside the loop, otherwise there is no reason
+   to prefer multiply/accumulator registers over GPRs.  */
+/* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
+
+/* Check that the zero-initialization of the accumulator feeding into
+   the madd is done by means of a mult instruction instead of mthi/mtlo.  */
+
+NOMIPS16 long long f (int n, int *v, int m)
+{
+  long long result = 0;
+  int i;
+
+  for (i = 0; i < n; i++)
+    result = __builtin_mips_madd (result, v[i], m);
+  return result;
+}
+
+/* { dg-final { scan-assembler "\tmult\t\\\$ac.,\\\$0,\\\$0" } } */