Patchwork PING Re: [PATCH, MIPS] add new peephole for 74k dspr2

login
register
mail settings
Submitter Richard Sandiford
Date Oct. 7, 2012, 8:45 a.m.
Message ID <87bogeoekw.fsf@talisman.home>
Download mbox | patch
Permalink /patch/189794/
State New
Headers show

Comments

Richard Sandiford - Oct. 7, 2012, 8:45 a.m.
"Maciej W. Rozycki" <macro@codesourcery.com> writes:
> On Tue, 25 Sep 2012, Richard Sandiford wrote:
>> Although I see the 4kp with its 32-cycle MULTs and MADDs is one where
>> MULT $0,$0 would be a really bad choice.  Sigh.  The amount of effort
>> required for this optimisation is getting a bit ridiculous.
>
>  I have double-checked some documentation, and in fact many MIPS cores, 
> including the current ones, have a configuration option to include either 
> a high-performance or an area-efficient MD unit.  Take the M14Kc for 
> example -- its high-performance unit has a one-cycle latency/issue rate 
> for 16-bit multiplication (two-cycle for full 32 bits; here the width of 
> rt is explicitly named) and the area-efficient has a 32-cycle 
> latency/issue rate only regardless of the operand size (obviously 
> iterating over addition one bit at a time).  The latency of MTHI/MTLO is 1 
> across both units.
>
>  So I think this can't really be selected automatically for all cores, 
> some human-supplied knowledge about the MD unit used is required -- that 
> obviously affects other operations too, e.g. some multiplications 
> involving a constant that may be cheaper to do either directly or with a 
> sequence of additions depending on the MD unit present (unless optimising 
> for size, of course).

Yeah, as far as this optimisation goes, I think your original suggestion
of using the DFA to work out the best sequence is still right.  If the DFA
says that multiplications take a handful of cycles when they actually take
32 then we're not going to optimise multiplication very well whatever happens.

In practice, code destined for non-4kc cores with iterative multipliers
would probably tune well with -mtune=4kp.

Anyway, I said upthread that I was looking into removing MD0_REG and
MD1_REG because I thought I'd seen a case where that was needed for
madd-9.c to pass.  I can no longer reproduce that, so it was probably
pilot error.

The final patch ended up being much more complicated than I'd have liked,
but at least it should be easier to add other automatically-derived tuning
costs in future.

Tested on mipsisa64-elf and mipsisa32-elf.  Applied.

Richard


gcc/
	* config/mips/mips-protos.h (mips_split_type): New enum.
	(mips_split_64bit_move_p, mips_split_doubleword_move): Delete.
	(mips_split_move_p, mips_split_move, mips_split_move_insn_p)
	(mips_split_move_insn): Declare.
	* config/mips/mips.c (mips_tuning_info): New variable.
	(mips_load_store_insns): Use mips_split_move_insn_p instead of
	mips_split_64bit_move_p.
	(mips_emit_move_or_split, mips_mult_move_p): New functions.
	(mips_split_64bit_move_p): Rename to...
	(mips_split_move_p): ...this and take a mips_split_type argument.
	Generalize to all moves.  Call mips_mult_move_p.
	(mips_split_doubleword_move): Rename to...
	(mips_split_move): ...this and take a mips_split_type argument.
	Assert that mips_split_move_p holds.
	(mips_insn_split_type, mips_split_move_insn_p, mips_split_move_insn):
	New functions.
	(mips_output_move): Use mips_split_move_p instead of
	mips_split_64bit_move_p.  Handle MULT $0, $0 moves.
	(mips_save_reg): Use mips_emit_move_or_split.
	(mips_sim_reset): Assign to curr_state.  Call targetm.sched.init
	and advance_state.
	(mips_sim_init): Call targetm.sched.init_dfa_pre_cycle_insn and
	targetm.sched.init_dfa_post_cycle_insn, if defined.
	(mips_sim_next_cycle): Assign to curr_state.  Use advance_state
	instead of state_transition.
	(mips_sim_issue_insn): Assign to curr_state.  Use
	targetm.sched.variable_issue to see how many more insns
	can be issued.
	(mips_seq_time, mips_mult_zero_zero_cost)
	(mips_set_fast_mult_zero_zero_p, mips_set_tuning_info)
	(mips_expand_to_rtl_hook): New functions.
	(TARGET_EXPAND_TO_RTL_HOOK): Define.
	* config/mips/mips.md (move_type): Add imul.
	(type): Map imul move_types to imul.
	(*movdi_32bit, *movti): Add imul alternatives.
	Use mips_split_move_insn_p and mips_split_move_insn instead of
	mips_split_64bit_move_p and mips_split_doubleword_move in move
	splitters.

gcc/testsuite/
2012-10-07  Richard Sandiford  <rdsandiford@googlemail.com>
	    Sandra Loosemore  <sandra@codesourcery.com>

	* gcc.target/mips/madd-9.c: Force code to be tuned for the 4kc
	and test that the accumulator is initialized using MULT.
	* gcc.target/mips/mips32-dsp-accinit-1.c: New test.
	* gcc.target/mips/mips32-dsp-accinit-2.c: Likewise.
Maciej W. Rozycki - Oct. 8, 2012, 11:23 p.m.
On Sun, 7 Oct 2012, Richard Sandiford wrote:

> >  So I think this can't really be selected automatically for all cores, 
> > some human-supplied knowledge about the MD unit used is required -- that 
> > obviously affects other operations too, e.g. some multiplications 
> > involving a constant that may be cheaper to do either directly or with a 
> > sequence of additions depending on the MD unit present (unless optimising 
> > for size, of course).
> 
> Yeah, as far as this optimisation goes, I think your original suggestion
> of using the DFA to work out the best sequence is still right.  If the DFA
> says that multiplications take a handful of cycles when they actually take
> 32 then we're not going to optimise multiplication very well whatever happens.
> 
> In practice, code destined for non-4kc cores with iterative multipliers
> would probably tune well with -mtune=4kp.

 Hmm, maybe, although I find requiring people to tune for an obsolete 
MIPS32 rev. 1 processor to get the right MDU performance figures for 
modern processors a bit odd (the 4Kp DFA undoubtedly lacks reservations 
for newer instructions introduced with the MIPS32r2 architecture update).

 I've thought of -miterative-mdu or suchlike a knob that would override 
the default cost of multiplication/division as appropriate (i.e. to 32/64 
plus any extra operation-specific constant as required), perhaps by 
forcing the 4Kp insn reservation (extended to 64 bits as required) over 
the usual one; of course there would be nothing to override the -mtune=4Kp 
arrangement with.  Nothing urgent though, just an idea to ponder.

> The final patch ended up being much more complicated than I'd have liked,
> but at least it should be easier to add other automatically-derived tuning
> costs in future.

 Thanks for taking my concerns into account.  One nit below.

> Index: gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c
> ===================================================================
> --- /dev/null	2012-10-06 09:26:21.400998949 +0100
> +++ gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c	2012-10-06 22:24:24.857713316 +0100
> @@ -0,0 +1,22 @@
> +/* { dg-options "-mdspr2 -mgp32 -mtune=4kp" } */
> +/* References to RESULT within the loop need to have a higher frequency than
> +   references to RESULT outside the loop, otherwise there is no reason
> +   to prefer multiply/accumulator registers over GPRs.  */
> +/* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
> +
> +/* Check that the zero-initialization of the accumulator feeding into
> +   the madd is done by means of a mult instruction instead of mthi/mtlo.  */

 This comment looks reversed to me as in this case we do NOT want a MULT 
to be used...

> +
> +NOMIPS16 long long f (int n, int *v, int m)
> +{
> +  long long result = 0;
> +  int i;
> +
> +  for (i = 0; i < n; i++)
> +    result = __builtin_mips_madd (result, v[i], m);
> +  return result;
> +}
> +
> +/* { dg-final { scan-assembler-not "mult\t\[^\n\]*\\\$0" } } */
> +/* { dg-final { scan-assembler "\tmthi\t" } } */
> +/* { dg-final { scan-assembler "\tmtlo\t" } } */

 ... and in fact you do check that we didn't get one.

  Maciej

Patch

Index: gcc/config/mips/mips-protos.h
===================================================================
--- gcc/config/mips/mips-protos.h	2012-10-06 17:26:43.408618293 +0100
+++ gcc/config/mips/mips-protos.h	2012-10-07 08:21:40.892791332 +0100
@@ -173,6 +173,25 @@  enum mips_call_type {
   MIPS_CALL_EPILOGUE
 };
 
+/* Controls the conditions under which certain instructions are split.
+
+   SPLIT_IF_NECESSARY
+	Only perform splits that are necessary for correctness
+	(because no unsplit version exists).
+
+   SPLIT_FOR_SPEED
+	Perform splits that are necessary for correctness or
+	beneficial for code speed.
+
+   SPLIT_FOR_SIZE
+	Perform splits that are necessary for correctness or
+	beneficial for code size.  */
+enum mips_split_type {
+  SPLIT_IF_NECESSARY,
+  SPLIT_FOR_SPEED,
+  SPLIT_FOR_SIZE
+};
+
 extern bool mips_symbolic_constant_p (rtx, enum mips_symbol_context,
 				      enum mips_symbol_type *);
 extern int mips_regno_mode_ok_for_base_p (int, enum machine_mode, bool);
@@ -212,8 +231,10 @@  extern int m16_simm8_8 (rtx, enum machin
 extern int m16_nsimm8_8 (rtx, enum machine_mode);
 
 extern rtx mips_subword (rtx, bool);
-extern bool mips_split_64bit_move_p (rtx, rtx);
-extern void mips_split_doubleword_move (rtx, rtx);
+extern bool mips_split_move_p (rtx, rtx, enum mips_split_type);
+extern void mips_split_move (rtx, rtx, enum mips_split_type);
+extern bool mips_split_move_insn_p (rtx, rtx, rtx);
+extern void mips_split_move_insn (rtx, rtx, rtx);
 extern const char *mips_output_move (rtx, rtx);
 extern bool mips_cfun_has_cprestore_slot_p (void);
 extern bool mips_cprestore_address_p (rtx, bool);
Index: gcc/config/mips/mips.c
===================================================================
--- gcc/config/mips/mips.c	2012-10-06 22:24:11.262714074 +0100
+++ gcc/config/mips/mips.c	2012-10-07 09:38:39.656797917 +0100
@@ -265,6 +265,24 @@  static const char *const mips_fp_conditi
   MIPS_FP_CONDITIONS (STRINGIFY)
 };
 
+/* Tuning information that is automatically derived from other sources
+   (such as the scheduler).  */
+static struct {
+  /* The architecture and tuning settings that this structure describes.  */
+  enum processor arch;
+  enum processor tune;
+
+  /* True if this structure describes MIPS16 settings.  */
+  bool mips16_p;
+
+  /* True if the structure has been initialized.  */
+  bool initialized_p;
+
+  /* True if "MULT $0, $0" is preferable to "MTLO $0; MTHI $0"
+     when optimizing for speed.  */
+  bool fast_mult_zero_zero_p;
+} mips_tuning_info;
+
 /* Information about a function's frame layout.  */
 struct GTY(())  mips_frame_info {
   /* The size of the frame in bytes.  */
@@ -2395,11 +2413,11 @@  mips_load_store_insns (rtx mem, rtx insn
   mode = GET_MODE (mem);
 
   /* Try to prove that INSN does not need to be split.  */
-  might_split_p = true;
-  if (GET_MODE_BITSIZE (mode) == 64)
+  might_split_p = GET_MODE_SIZE (mode) > UNITS_PER_WORD;
+  if (might_split_p)
     {
       set = single_set (insn);
-      if (set && !mips_split_64bit_move_p (SET_DEST (set), SET_SRC (set)))
+      if (set && !mips_split_move_insn_p (SET_DEST (set), SET_SRC (set), insn))
 	might_split_p = false;
     }
 
@@ -2441,6 +2459,18 @@  mips_emit_move (rtx dest, rtx src)
 	  : emit_move_insn_1 (dest, src));
 }
 
+/* Emit a move from SRC to DEST, splitting compound moves into individual
+   instructions.  SPLIT_TYPE is the type of split to perform.  */
+
+static void
+mips_emit_move_or_split (rtx dest, rtx src, enum mips_split_type split_type)
+{
+  if (mips_split_move_p (dest, src, split_type))
+    mips_split_move (dest, src, split_type);
+  else
+    mips_emit_move (dest, src);
+}
+
 /* Emit an instruction of the form (set TARGET (CODE OP0)).  */
 
 static void
@@ -4119,39 +4149,60 @@  mips_subword (rtx op, bool high_p)
   return simplify_gen_subreg (word_mode, op, mode, byte);
 }
 
-/* Return true if a 64-bit move from SRC to DEST should be split into two.  */
+/* Return true if SRC should be moved into DEST using "MULT $0, $0".
+   SPLIT_TYPE is the condition under which moves should be split.  */
+
+static bool
+mips_mult_move_p (rtx dest, rtx src, enum mips_split_type split_type)
+{
+  return ((split_type != SPLIT_FOR_SPEED
+	   || mips_tuning_info.fast_mult_zero_zero_p)
+	  && src == const0_rtx
+	  && REG_P (dest)
+	  && GET_MODE_SIZE (GET_MODE (dest)) == 2 * UNITS_PER_WORD
+	  && (ISA_HAS_DSP_MULT
+	      ? ACC_REG_P (REGNO (dest))
+	      : MD_REG_P (REGNO (dest))));
+}
+
+/* Return true if a move from SRC to DEST should be split into two.
+   SPLIT_TYPE describes the split condition.  */
 
 bool
-mips_split_64bit_move_p (rtx dest, rtx src)
+mips_split_move_p (rtx dest, rtx src, enum mips_split_type split_type)
 {
-  if (TARGET_64BIT)
+  /* Check whether the move can be done using some variant of MULT $0,$0.  */
+  if (mips_mult_move_p (dest, src, split_type))
     return false;
 
   /* FPR-to-FPR moves can be done in a single instruction, if they're
      allowed at all.  */
-  if (FP_REG_RTX_P (src) && FP_REG_RTX_P (dest))
+  unsigned int size = GET_MODE_SIZE (GET_MODE (dest));
+  if (size == 8 && FP_REG_RTX_P (src) && FP_REG_RTX_P (dest))
     return false;
 
   /* Check for floating-point loads and stores.  */
-  if (ISA_HAS_LDC1_SDC1)
+  if (size == 8 && ISA_HAS_LDC1_SDC1)
     {
       if (FP_REG_RTX_P (dest) && MEM_P (src))
 	return false;
       if (FP_REG_RTX_P (src) && MEM_P (dest))
 	return false;
     }
-  return true;
+
+  /* Otherwise split all multiword moves.  */
+  return size > UNITS_PER_WORD;
 }
 
-/* Split a doubleword move from SRC to DEST.  On 32-bit targets,
-   this function handles 64-bit moves for which mips_split_64bit_move_p
-   holds.  For 64-bit targets, this function handles 128-bit moves.  */
+/* Split a move from SRC to DEST, given that mips_split_move_p holds.
+   SPLIT_TYPE describes the split condition.  */
 
 void
-mips_split_doubleword_move (rtx dest, rtx src)
+mips_split_move (rtx dest, rtx src, enum mips_split_type split_type)
 {
   rtx low_dest;
 
+  gcc_checking_assert (mips_split_move_p (dest, src, split_type));
   if (FP_REG_RTX_P (dest) || FP_REG_RTX_P (src))
     {
       if (!TARGET_64BIT && GET_MODE (dest) == DImode)
@@ -4206,6 +4257,41 @@  mips_split_doubleword_move (rtx dest, rt
 	}
     }
 }
+
+/* Return the split type for instruction INSN.  */
+
+static enum mips_split_type
+mips_insn_split_type (rtx insn)
+{
+  basic_block bb = BLOCK_FOR_INSN (insn);
+  if (bb)
+    {
+      if (optimize_bb_for_speed_p (bb))
+	return SPLIT_FOR_SPEED;
+      else
+	return SPLIT_FOR_SIZE;
+    }
+  /* Once CFG information has been removed, we should trust the optimization
+     decisions made by previous passes and only split where necessary.  */
+  return SPLIT_IF_NECESSARY;
+}
+
+/* Return true if a move from SRC to DEST in INSN should be split.  */
+
+bool
+mips_split_move_insn_p (rtx dest, rtx src, rtx insn)
+{
+  return mips_split_move_p (dest, src, mips_insn_split_type (insn));
+}
+
+/* Split a move from SRC to DEST in INSN, given that mips_split_move_insn_p
+   holds.  */
+
+void
+mips_split_move_insn (rtx dest, rtx src, rtx insn)
+{
+  mips_split_move (dest, src, mips_insn_split_type (insn));
+}
 
 /* Return the appropriate instructions to move SRC into DEST.  Assume
    that SRC is operand 1 and DEST is operand 0.  */
@@ -4223,7 +4309,7 @@  mips_output_move (rtx dest, rtx src)
   mode = GET_MODE (dest);
   dbl_p = (GET_MODE_SIZE (mode) == 8);
 
-  if (dbl_p && mips_split_64bit_move_p (dest, src))
+  if (mips_split_move_p (dest, src, SPLIT_IF_NECESSARY))
     return "#";
 
   if ((src_code == REG && GP_REG_P (REGNO (src)))
@@ -4234,6 +4320,14 @@  mips_output_move (rtx dest, rtx src)
 	  if (GP_REG_P (REGNO (dest)))
 	    return "move\t%0,%z1";
 
+	  if (mips_mult_move_p (dest, src, SPLIT_IF_NECESSARY))
+	    {
+	      if (ISA_HAS_DSP_MULT)
+		return "mult\t%q0,%.,%.";
+	      else
+		return "mult\t%.,%.";
+	    }
+
 	  /* Moves to HI are handled by special .md insns.  */
 	  if (REGNO (dest) == LO_REGNUM)
 	    return "mtlo\t%z1";
@@ -10444,10 +10538,7 @@  mips_save_reg (rtx reg, rtx mem)
     {
       rtx x1, x2;
 
-      if (mips_split_64bit_move_p (mem, reg))
-	mips_split_doubleword_move (mem, reg);
-      else
-	mips_emit_move (mem, reg);
+      mips_emit_move_or_split (mem, reg, SPLIT_IF_NECESSARY);
 
       x1 = mips_frame_set (mips_subword (mem, false),
 			   mips_subword (reg, false));
@@ -14907,10 +14998,15 @@  struct mips_sim {
 static void
 mips_sim_reset (struct mips_sim *state)
 {
+  curr_state = state->dfa_state;
+
   state->time = 0;
   state->insns_left = state->issue_rate;
   memset (&state->last_set, 0, sizeof (state->last_set));
-  state_reset (state->dfa_state);
+  state_reset (curr_state);
+
+  targetm.sched.init (0, false, 0);
+  advance_state (curr_state);
 }
 
 /* Initialize STATE before its first use.  DFA_STATE points to an
@@ -14919,6 +15015,12 @@  mips_sim_reset (struct mips_sim *state)
 static void
 mips_sim_init (struct mips_sim *state, state_t dfa_state)
 {
+  if (targetm.sched.init_dfa_pre_cycle_insn)
+    targetm.sched.init_dfa_pre_cycle_insn ();
+
+  if (targetm.sched.init_dfa_post_cycle_insn)
+    targetm.sched.init_dfa_post_cycle_insn ();
+
   state->issue_rate = mips_issue_rate ();
   state->dfa_state = dfa_state;
   mips_sim_reset (state);
@@ -14929,9 +15031,11 @@  mips_sim_init (struct mips_sim *state, s
 static void
 mips_sim_next_cycle (struct mips_sim *state)
 {
+  curr_state = state->dfa_state;
+
   state->time++;
   state->insns_left = state->issue_rate;
-  state_transition (state->dfa_state, 0);
+  advance_state (curr_state);
 }
 
 /* Advance simulation state STATE until instruction INSN can read
@@ -15037,8 +15141,11 @@  mips_sim_record_set (rtx x, const_rtx pa
 static void
 mips_sim_issue_insn (struct mips_sim *state, rtx insn)
 {
-  state_transition (state->dfa_state, insn);
-  state->insns_left--;
+  curr_state = state->dfa_state;
+
+  state_transition (curr_state, insn);
+  state->insns_left = targetm.sched.variable_issue (0, false, insn,
+						    state->insns_left);
 
   mips_sim_insn = insn;
   note_stores (PATTERN (insn), mips_sim_record_set, state);
@@ -15089,6 +15196,109 @@  mips_sim_finish_insn (struct mips_sim *s
       break;
     }
 }
+
+/* Use simulator state STATE to calculate the execution time of
+   instruction sequence SEQ.  */
+
+static unsigned int
+mips_seq_time (struct mips_sim *state, rtx seq)
+{
+  mips_sim_reset (state);
+  for (rtx insn = seq; insn; insn = NEXT_INSN (insn))
+    {
+      mips_sim_wait_insn (state, insn);
+      mips_sim_issue_insn (state, insn);
+    }
+  return state->time;
+}
+
+/* Return the execution-time cost of mips_tuning_info.fast_mult_zero_zero_p
+   setting SETTING, using STATE to simulate instruction sequences.  */
+
+static unsigned int
+mips_mult_zero_zero_cost (struct mips_sim *state, bool setting)
+{
+  mips_tuning_info.fast_mult_zero_zero_p = setting;
+  start_sequence ();
+
+  enum machine_mode dword_mode = TARGET_64BIT ? TImode : DImode;
+  rtx hilo = gen_rtx_REG (dword_mode, MD_REG_FIRST);
+  mips_emit_move_or_split (hilo, const0_rtx, SPLIT_FOR_SPEED);
+
+  /* If the target provides mulsidi3_32bit then that's the most likely
+     consumer of the result.  Test for bypasses.  */
+  if (dword_mode == DImode && HAVE_maddsidi4)
+    {
+      rtx gpr = gen_rtx_REG (SImode, GP_REG_FIRST + 4);
+      emit_insn (gen_maddsidi4 (hilo, gpr, gpr, hilo));
+    }
+
+  unsigned int time = mips_seq_time (state, get_insns ());
+  end_sequence ();
+  return time;
+}
+
+/* Check the relative speeds of "MULT $0,$0" and "MTLO $0; MTHI $0"
+   and set up mips_tuning_info.fast_mult_zero_zero_p accordingly.
+   Prefer MULT -- which is shorter -- in the event of a tie.  */
+
+static void
+mips_set_fast_mult_zero_zero_p (struct mips_sim *state)
+{
+  if (TARGET_MIPS16)
+    /* No MTLO or MTHI available.  */
+    mips_tuning_info.fast_mult_zero_zero_p = true;
+  else
+    {
+      unsigned int true_time = mips_mult_zero_zero_cost (state, true);
+      unsigned int false_time = mips_mult_zero_zero_cost (state, false);
+      mips_tuning_info.fast_mult_zero_zero_p = (true_time <= false_time);
+    }
+}
+
+/* Set up costs based on the current architecture and tuning settings.  */
+
+static void
+mips_set_tuning_info (void)
+{
+  if (mips_tuning_info.initialized_p
+      && mips_tuning_info.arch == mips_arch
+      && mips_tuning_info.tune == mips_tune
+      && mips_tuning_info.mips16_p == TARGET_MIPS16)
+    return;
+
+  mips_tuning_info.arch = mips_arch;
+  mips_tuning_info.tune = mips_tune;
+  mips_tuning_info.mips16_p = TARGET_MIPS16;
+  mips_tuning_info.initialized_p = true;
+
+  dfa_start ();
+
+  struct mips_sim state;
+  mips_sim_init (&state, alloca (state_size ()));
+
+  mips_set_fast_mult_zero_zero_p (&state);
+
+  dfa_finish ();
+}
+
+/* Implement TARGET_EXPAND_TO_RTL_HOOK.  */
+
+static void
+mips_expand_to_rtl_hook (void)
+{
+  /* We need to call this at a point where we can safely create sequences
+     of instructions, so TARGET_OVERRIDE_OPTIONS is too early.  We also
+     need to call it at a point where the DFA infrastructure is not
+     already in use, so we can't just call it lazily on demand.
+
+     At present, mips_tuning_info is only needed during post-expand
+     RTL passes such as split_insns, so this hook should be early enough.
+     We may need to move the call elsewhere if mips_tuning_info starts
+     to be used for other things (such as rtx_costs, or expanders that
+     could be called during gimple optimization).  */
+  mips_set_tuning_info ();
+}
 
 /* The VR4130 pipeline issues aligned pairs of instructions together,
    but it stalls the second instruction if it depends on the first.
@@ -17760,6 +17970,8 @@  #define TARGET_MACHINE_DEPENDENT_REORG m
 #undef  TARGET_PREFERRED_RELOAD_CLASS
 #define TARGET_PREFERRED_RELOAD_CLASS mips_preferred_reload_class
 
+#undef TARGET_EXPAND_TO_RTL_HOOK
+#define TARGET_EXPAND_TO_RTL_HOOK mips_expand_to_rtl_hook
 #undef TARGET_ASM_FILE_START
 #define TARGET_ASM_FILE_START mips_file_start
 #undef TARGET_ASM_FILE_START_FILE_DIRECTIVE
Index: gcc/config/mips/mips.md
===================================================================
--- gcc/config/mips/mips.md	2012-10-06 21:08:44.902431555 +0100
+++ gcc/config/mips/mips.md	2012-10-06 22:24:24.854713317 +0100
@@ -204,7 +204,7 @@  (define_attr "jal_macro" "no,yes"
 ;; the split instructions; in some cases, it is more appropriate for the
 ;; scheduling type to be "multi" instead.
 (define_attr "move_type"
-  "unknown,load,fpload,store,fpstore,mtc,mfc,mtlo,mflo,move,fmove,
+  "unknown,load,fpload,store,fpstore,mtc,mfc,mtlo,mflo,imul,move,fmove,
    const,constN,signext,ext_ins,logical,arith,sll0,andi,loadpool,
    shift_shift"
   (const_string "unknown"))
@@ -369,6 +369,7 @@  (define_attr "type"
 	 (eq_attr "move_type" "mflo") (const_string "mflo")
 
 	 ;; These types of move are always single insns.
+	 (eq_attr "move_type" "imul") (const_string "imul")
 	 (eq_attr "move_type" "fmove") (const_string "fmove")
 	 (eq_attr "move_type" "loadpool") (const_string "load")
 	 (eq_attr "move_type" "signext") (const_string "signext")
@@ -4242,14 +4243,17 @@  (define_insn "*mov<mode>_ra"
    (set_attr "mode" "<MODE>")])
 
 (define_insn "*movdi_32bit"
-  [(set (match_operand:DI 0 "nonimmediate_operand" "=d,d,d,m,*a,*d,*f,*f,*d,*m,*B*C*D,*B*C*D,*d,*m")
-	(match_operand:DI 1 "move_operand" "d,i,m,d,*J*d,*a,*J*d,*m,*f,*f,*d,*m,*B*C*D,*B*C*D"))]
+  [(set (match_operand:DI 0 "nonimmediate_operand" "=d,d,d,m,*a,*a,*d,*f,*f,*d,*m,*B*C*D,*B*C*D,*d,*m")
+	(match_operand:DI 1 "move_operand" "d,i,m,d,*J,*d,*a,*J*d,*m,*f,*f,*d,*m,*B*C*D,*B*C*D"))]
   "!TARGET_64BIT && !TARGET_MIPS16
    && (register_operand (operands[0], DImode)
        || reg_or_0_operand (operands[1], DImode))"
   { return mips_output_move (operands[0], operands[1]); }
-  [(set_attr "move_type" "move,const,load,store,mtlo,mflo,mtc,fpload,mfc,fpstore,mtc,fpload,mfc,fpstore")
-   (set_attr "mode" "DI")])
+  [(set_attr "move_type" "move,const,load,store,imul,mtlo,mflo,mtc,fpload,mfc,fpstore,mtc,fpload,mfc,fpstore")
+   (set (attr "mode")
+   	(if_then_else (eq_attr "move_type" "imul")
+		      (const_string "SI")
+		      (const_string "DI")))])
 
 (define_insn "*movdi_32bit_mips16"
   [(set (match_operand:DI 0 "nonimmediate_operand" "=d,y,d,d,d,d,m,*d")
@@ -4695,15 +4699,18 @@  (define_expand "movti"
 })
 
 (define_insn "*movti"
-  [(set (match_operand:TI 0 "nonimmediate_operand" "=d,d,d,m,*a,*d")
-	(match_operand:TI 1 "move_operand" "d,i,m,dJ,*d*J,*a"))]
+  [(set (match_operand:TI 0 "nonimmediate_operand" "=d,d,d,m,*a,*a,*d")
+	(match_operand:TI 1 "move_operand" "d,i,m,dJ,*J,*d,*a"))]
   "TARGET_64BIT
    && !TARGET_MIPS16
    && (register_operand (operands[0], TImode)
        || reg_or_0_operand (operands[1], TImode))"
-  "#"
-  [(set_attr "move_type" "move,const,load,store,mtlo,mflo")
-   (set_attr "mode" "TI")])
+  { return mips_output_move (operands[0], operands[1]); }
+  [(set_attr "move_type" "move,const,load,store,imul,mtlo,mflo")
+   (set (attr "mode")
+   	(if_then_else (eq_attr "move_type" "imul")
+		      (const_string "SI")
+		      (const_string "TI")))])
 
 (define_insn "*movti_mips16"
   [(set (match_operand:TI 0 "nonimmediate_operand" "=d,y,d,d,d,d,m,*d")
@@ -4753,21 +4760,20 @@  (define_insn "*movtf_mips16"
 (define_split
   [(set (match_operand:MOVE64 0 "nonimmediate_operand")
 	(match_operand:MOVE64 1 "move_operand"))]
-  "reload_completed && !TARGET_64BIT
-   && mips_split_64bit_move_p (operands[0], operands[1])"
+  "reload_completed && mips_split_move_insn_p (operands[0], operands[1], insn)"
   [(const_int 0)]
 {
-  mips_split_doubleword_move (operands[0], operands[1]);
+  mips_split_move_insn (operands[0], operands[1], curr_insn);
   DONE;
 })
 
 (define_split
   [(set (match_operand:MOVE128 0 "nonimmediate_operand")
 	(match_operand:MOVE128 1 "move_operand"))]
-  "TARGET_64BIT && reload_completed"
+  "reload_completed && mips_split_move_insn_p (operands[0], operands[1], insn)"
   [(const_int 0)]
 {
-  mips_split_doubleword_move (operands[0], operands[1]);
+  mips_split_move_insn (operands[0], operands[1], curr_insn);
   DONE;
 })
 
Index: gcc/testsuite/gcc.target/mips/madd-9.c
===================================================================
--- gcc/testsuite/gcc.target/mips/madd-9.c	2012-10-06 17:26:43.407618293 +0100
+++ gcc/testsuite/gcc.target/mips/madd-9.c	2012-10-06 22:24:24.855713317 +0100
@@ -1,7 +1,13 @@ 
 /* { dg-do compile } */
-/* { dg-options "isa_rev>=1 -mgp32" } */
-/* { dg-skip-if "code quality test" { *-*-* } { "-O0" } { "" } } */
+/* { dg-options "isa_rev>=1 -mgp32 -mtune=4kc" } */
+/* References to X within the loop need to have a higher frequency than
+   references to X outside the loop, otherwise there is no reason
+   to prefer multiply/accumulator registers over GPRs.  */
+/* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
 /* { dg-final { scan-assembler-not "\tmul\t" } } */
+/* { dg-final { scan-assembler-not "\tmthi" } } */
+/* { dg-final { scan-assembler-not "\tmtlo" } } */
+/* { dg-final { scan-assembler "\tmult\t" } } */
 /* { dg-final { scan-assembler "\tmadd\t" } } */
 
 NOMIPS16 long long
Index: gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-1.c
===================================================================
--- /dev/null	2012-10-06 09:26:21.400998949 +0100
+++ gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-1.c	2012-10-06 22:24:24.856713316 +0100
@@ -0,0 +1,22 @@ 
+/* { dg-options "-mdspr2 -mgp32 -mtune=74kc" } */
+/* References to RESULT within the loop need to have a higher frequency than
+   references to RESULT outside the loop, otherwise there is no reason
+   to prefer multiply/accumulator registers over GPRs.  */
+/* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
+
+/* Check that the zero-initialization of the accumulator feeding into
+   the madd is done by means of a mult instruction instead of mthi/mtlo.  */
+
+NOMIPS16 long long f (int n, int *v, int m)
+{
+  long long result = 0;
+  int i;
+
+  for (i = 0; i < n; i++)
+    result = __builtin_mips_madd (result, v[i], m);
+  return result;
+}
+
+/* { dg-final { scan-assembler "\tmult\t\\\$ac.,\\\$0,\\\$0" } } */
+/* { dg-final { scan-assembler-not "mthi\t" } } */
+/* { dg-final { scan-assembler-not "mtlo\t" } } */
Index: gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c
===================================================================
--- /dev/null	2012-10-06 09:26:21.400998949 +0100
+++ gcc/testsuite/gcc.target/mips/mips32-dsp-accinit-2.c	2012-10-06 22:24:24.857713316 +0100
@@ -0,0 +1,22 @@ 
+/* { dg-options "-mdspr2 -mgp32 -mtune=4kp" } */
+/* References to RESULT within the loop need to have a higher frequency than
+   references to RESULT outside the loop, otherwise there is no reason
+   to prefer multiply/accumulator registers over GPRs.  */
+/* { dg-skip-if "requires register frequencies" { *-*-* } { "-O0" "-Os" } { "" } } */
+
+/* Check that the zero-initialization of the accumulator feeding into
+   the madd is done by means of a mult instruction instead of mthi/mtlo.  */
+
+NOMIPS16 long long f (int n, int *v, int m)
+{
+  long long result = 0;
+  int i;
+
+  for (i = 0; i < n; i++)
+    result = __builtin_mips_madd (result, v[i], m);
+  return result;
+}
+
+/* { dg-final { scan-assembler-not "mult\t\[^\n\]*\\\$0" } } */
+/* { dg-final { scan-assembler "\tmthi\t" } } */
+/* { dg-final { scan-assembler "\tmtlo\t" } } */