Patchwork [ARM] Thumb2 replicated constants

Submitter Andrew Stubbs
Date April 21, 2011, 11:23 a.m.
Message ID <4DB013B7.4090504@codesourcery.com>
Permalink /patch/92396/
State New

Comments

Andrew Stubbs - April 21, 2011, 11:23 a.m.
This patch is a repost of the one I previously posted here:

   http://gcc.gnu.org/ml/gcc-patches/2010-12/msg00652.html

As requested, I've broken out the other parts of the original patch; 
those were reposted yesterday, and one has already been committed.

This (final) part is support for using Thumb2's replicated constants and 
addw/subw instructions as part of split constant loads. Previously the 
compiler could use these constants, but only where they would be loaded 
in a single instruction.
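
For readers unfamiliar with the encoding, here is a minimal sketch of which 32-bit values fit in a single Thumb-2 data-processing immediate, assuming the usual description of the encoding: an 8-bit value shifted to any bit position, or one byte replicated as 0x00XY00XY, 0xXY00XY00 or 0xXYXYXYXY. The function names are hypothetical and are not the ones used in arm.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if V is one of the byte-replicated forms.  */
static bool
is_replicated_imm (uint32_t v)
{
  uint32_t lo = v & 0x00ffu;   /* candidate byte in bits 7..0 */
  uint32_t hi = v & 0xff00u;   /* candidate byte in bits 15..8 */

  return v == lo                      /* 0x000000XY */
	 || v == (lo | (lo << 16))     /* 0x00XY00XY */
	 || v == (hi | (hi << 16))     /* 0xXY00XY00 */
	 || v == lo * 0x01010101u;     /* 0xXYXYXYXY */
}

/* Hypothetical helper: true if V is an 8-bit value shifted left by
   0..24 bits (no wrap-around, unlike ARM-mode rotated immediates).  */
static bool
is_shifted_8bit_imm (uint32_t v)
{
  for (int shift = 0; shift <= 24; shift++)
    if ((v & ~(0xffu << shift)) == 0)
      return true;
  return false;
}

static bool
is_thumb2_modified_imm (uint32_t v)
{
  return is_replicated_imm (v) || is_shifted_8bit_imm (v);
}

/* Constants taken from the examples in this mail.  */
static void
check_mail_constants (void)
{
  assert (is_thumb2_modified_imm (0x44004400u));   /* fits in one insn */
  assert (is_thumb2_modified_imm (0xaaaaaaaau));
  assert (!is_thumb2_modified_imm (0x44004401u));  /* must be split */
  assert (!is_thumb2_modified_imm (0xaaaaaa00u));
}
```

The constants that fail the check are the ones that need a split load, which is the case this patch improves.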

This patch must be applied on top of the addw/subw patch I posted yesterday.

The patch also optimizes the use of inverted or negated constants as a 
short-cut to the final value. The previous code did this in some cases, 
but could not be easily adapted to replicated constants.
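
As a rough illustration of that selection, the sketch below costs the plain, negated and inverted forms of a constant and keeps the cheapest, using the same comparisons as the patch. The cost function count_chunks is a deliberately simplified stand-in (it only counts 8-bit shifted windows and ignores replicated constants and addw/subw), and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum form { FORM_PLAIN, FORM_NEGATED, FORM_INVERTED };

/* Simplified cost model: number of 8-bit windows needed to cover all
   set bits of V, scanning down from bit 31.  */
static int
count_chunks (uint32_t v)
{
  int n = 0;

  for (int i = 31; i >= 0 && v != 0; i--)
    if (v & (1u << i))
      {
	uint32_t window = i >= 7 ? 0xffu << (i - 7) : 0xffu >> (7 - i);
	v &= ~window;
	n++;
      }
  return n;
}

/* Pick the cheapest form.  99 marks a form that is not allowed, as in
   the patch.  For XOR the inverted form needs one extra final NOT.  */
static enum form
choose_form (uint32_t val, bool can_negate, bool can_invert, bool is_xor,
	     int *insns_out)
{
  int insns = count_chunks (val);
  int neg_insns = can_negate ? count_chunks ((uint32_t) (0u - val)) : 99;
  int inv_insns = can_invert ? count_chunks (~val) : 99;
  enum form best = FORM_PLAIN;

  if (neg_insns < insns && neg_insns <= inv_insns)
    {
      insns = neg_insns;
      best = FORM_NEGATED;
    }
  if (inv_insns < insns && (!is_xor || inv_insns + 1 < insns))
    {
      insns = inv_insns;
      best = FORM_INVERTED;
    }

  assert (insns >= 0 && insns <= 4);
  *insns_out = insns;
  return best;
}
```

Under this model, adding 0xffffff00 costs three instructions in plain form but one when negated (a subtraction of 0x100), so the negated form is chosen.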

The previous code also had a bug that prevented optimal use of shifted 
constants in Thumb code by imposing the same restrictions as ARM code. 
This has been fixed.

Example 1: addw as part of a split constant load

a + 0xfffff

    Before:
          movw    r3, #65535       ; 0x0ffff
          movt    r3, 15           ; 0xf0000
          adds    r3, r0, r3
    After:
          add     r0, r0, #1044480 ; 0xff000
          addw    r0, r0, #4095    ; 0x00fff
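
The split in Example 1 can be reproduced with the following sketch. It assumes only what the patch does for PLUS in Thumb-2: take 8-bit windows from the most significant set bit downwards, and once everything left lies in the low 12 bits, emit it as a single addw immediate. The function name is hypothetical, and replicated constants are left out to keep the sketch short.

```c
#include <assert.h>
#include <stdint.h>

/* Split VAL into add immediates: 8-bit shifted windows from the top,
   then one 12-bit addw immediate for the tail.  Returns the number of
   immediates written to OUT.  */
static int
split_for_add (uint32_t val, uint32_t out[4])
{
  uint32_t remainder = val;
  int n = 0;
  int i = 32;

  while (i > 0 && remainder != 0)
    {
      if (!(remainder & (1u << (i - 1))))
	{
	  i--;			/* Thumb-2: any shift amount is allowed */
	  continue;
	}

      assert (n < 4);
      if (i <= 12)
	out[n] = remainder;	/* addw takes any value in 0..4095 */
      else
	out[n] = remainder & (0xffu << (i - 8));

      remainder &= ~out[n];
      n++;
      i -= 8;
    }
  return n;
}
```

For 0xfffff this yields 0xff000 followed by 0xfff, matching the add/addw pair shown above.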

Example 2: arbitrary shifts bug fix

a - 0xfff1

    Before:
          sub     r0, r0, #65024   ; 0xfe00
          sub     r0, r0, #496     ; 0x01f0
          sub     r0, r0, #1       ; 0x0001
    After:
          sub     r0, r0, #65280   ; 0xff00
          sub     r0, r0, #241     ; 0x00f1
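
The difference between the two sequences in Example 2 can be modelled with the sketch below. The old Thumb code stepped one bit at a time but still tested two bits at each position, as ARM mode does; the new code tests a single bit. The function is a simplified model with a hypothetical name (no wrap-around, no replicated constants), where PROBE_WIDTH selects the old (2) or new (1) test.

```c
#include <assert.h>
#include <stdint.h>

/* Greedily split VAL into 8-bit windows, scanning down from bit 31 one
   bit at a time.  A window is opened at position I when any of the
   PROBE_WIDTH bits just below I is set.  Returns the window count.  */
static int
split_chunks (uint32_t val, int probe_width, uint32_t out[4])
{
  uint32_t remainder = val;
  uint32_t probe_bits;
  int n = 0;
  int i = 32;

  assert (probe_width == 1 || probe_width == 2);
  probe_bits = (1u << probe_width) - 1;

  while (remainder != 0 && i > 0 && n < 4)
    {
      uint32_t probe = i >= probe_width
		       ? probe_bits << (i - probe_width)
		       : probe_bits >> (probe_width - i);

      if (remainder & probe)
	{
	  uint32_t window = i >= 8 ? 0xffu << (i - 8) : 0xffu >> (8 - i);
	  out[n] = remainder & window;
	  remainder &= ~out[n];
	  n++;
	  i -= 8;
	}
      else
	i -= 1;
    }
  return n;
}
```

With a probe width of 2, the constant 0xfff1 opens its first window one bit too high and splits into 0xfe00, 0x1f0 and 0x1. With a probe width of 1 it splits into 0xff00 and 0xf1.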

Example 3: 16-bit replicated patterns

a + 0x44004401

    Before:
          movw    r3, #17409          ; 0x00004401
          movt    r3, 17408           ; 0x44000000
          adds    r3, r0, r3
    After:
          add     r0, r0, #1140868096 ; 0x44004400
          adds    r0, r0, #1          ; 0x00000001

Example 4: 32-bit replicated patterns

a & 0xaaaaaa00

    Before:
          mov     r3, #43520           ; 0x0000aa00
          movt    r3, 43690            ; 0xaaaa0000
          and     r3, r0, r3
    After:
          and     r0, r0, #-1431655766 ; 0xaaaaaaaa
          bic     r0, r0, #170         ; 0x000000aa
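
As a quick sanity check of Example 4, the sketch below confirms that the and/bic pair computes the same result as a single AND with 0xaaaaaa00. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* What the source expression asks for.  */
static uint32_t
and_direct (uint32_t a)
{
  return a & 0xaaaaaa00u;
}

/* What the new sequence computes.  */
static uint32_t
and_then_bic (uint32_t a)
{
  a &= 0xaaaaaaaau;	/* and r0, r0, #0xaaaaaaaa (replicated byte) */
  a &= ~0xaau;		/* bic r0, r0, #0xaa */
  return a;
}

/* Compare the two on a spread of inputs.  */
static void
check_and_split (void)
{
  for (uint32_t x = 0; x < 0x10000u; x++)
    {
      uint32_t a = x * 0x10001u;	/* same pattern in both halfwords */
      assert (and_direct (a) == and_then_bic (a));
    }
}
```

The identity holds for every input because 0xaaaaaaaa & ~0xaa equals 0xaaaaaa00.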

The constant splitting code was duplicated in two places, and I would 
have needed to modify both quite heavily, so I have taken the 
opportunity to unify the two, and hopefully reduce the future 
maintenance burden.

Let me respond to a point Richard Earnshaw raised following the original 
posting:

 > A final note is that you may have missed some cases.  Now that we have
 > movw,
 > 	reg&  ~(16-bit const)
 > can now be done in at most 2 insns:
 > 	movw t1, #16-bit const
 > 	bic  Rd, reg, t1

Actually, I think we can do better than that for a 16-bit constant.

Given:

    a & ~(0xabcd)

Before my changes, GCC gave:

         bic     r0, r0, #43520
         bic     r0, r0, #460
         bic     r0, r0, #1

and after applying my patch:

         bic     r0, r0, #43776
         bic     r0, r0, #205

Two instructions and no temporary register.

 > On thumb-2 you can also use ORN that way as well.

It turns out that my previous patch was broken for ORN. I traced the 
problem to some confusing code already in arm.c that set can_invert for 
IOR, but then explicitly ignored it later (I had removed the second 
part, but not the first). I posted and committed a patch to fix this 
yesterday.

In fact ORN is only of limited use for this kind of thing. Like AND, you 
can't use multiple ORNs to build a constant. The compiler already does 
use ORN in some circumstances, and this patch has not changed that.
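
To make the chaining argument concrete, the sketch below models BIC and ORN as plain C functions (hypothetical names) and checks what happens when the constant 0xabcd is split into 0xab00 and 0xcd. Chained BICs give the right answer; chained ORNs and chained ANDs do not.

```c
#include <assert.h>
#include <stdint.h>

/* Models of the two instructions with an immediate operand.  */
static uint32_t
bic (uint32_t a, uint32_t imm)
{
  return a & ~imm;
}

static uint32_t
orn (uint32_t a, uint32_t imm)
{
  return a | ~imm;
}

static void
check_chaining (uint32_t a)
{
  const uint32_t c = 0xabcdu, hi = 0xab00u, lo = 0xcdu;

  /* BIC chains: clearing hi, then lo, clears all of c.  */
  assert (bic (bic (a, hi), lo) == bic (a, c));

  /* ORN does not: the two inverted pieces together set every bit.  */
  assert (orn (orn (a, hi), lo) == 0xffffffffu);

  /* Neither does AND: disjoint pieces leave nothing behind.  */
  assert (((a & hi) & lo) == 0);
}
```

This is why the patch inverts an AND constant and builds it from BICs, while ORN remains useful only as a single instruction.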

Is the patch OK?

Andrew
Richard Earnshaw - May 6, 2011, 10:18 a.m.
On Thu, 2011-04-21 at 12:23 +0100, Andrew Stubbs wrote:

+   RETURN_SEQUENCE must be an int[4].

It would be a more robust coding style to define a struct with an int[4]
array as its only member.  Then it wouldn't be possible to pass an
undersized object to these routines.

OK with a change to do that.

R.
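
One way the suggested change could look is sketched below. The struct name four_ints and both functions are assumptions made for illustration, and byte_sequence is only a stand-in for optimal_immediate_sequence, not its algorithm. The point is that the callee can no longer be handed an array that is too short, and that a plain struct assignment can replace the memcpy.

```c
#include <assert.h>

/* Hypothetical wrapper: a sequence is always exactly four ints.  */
struct four_ints
{
  int i[4];
};

/* Stand-in producer: store each non-zero byte of VAL, most significant
   first, and return how many were stored.  */
static int
byte_sequence (unsigned int val, struct four_ints *seq)
{
  int n = 0;

  for (int k = 0; k < 4; k++)
    seq->i[k] = 0;

  for (int shift = 24; shift >= 0; shift -= 8)
    {
      int byte = (int) ((val >> shift) & 0xffu);
      if (byte != 0)
	seq->i[n++] = byte;
    }
  assert (n <= 4);
  return n;
}

/* Keep whichever of two candidate sequences is shorter in BEST.  */
static int
shorter_sequence (unsigned int a, unsigned int b, struct four_ints *best)
{
  struct four_ints tmp = { { 0, 0, 0, 0 } };
  int na = byte_sequence (a, best);
  int nb = byte_sequence (b, &tmp);

  if (nb < na)
    {
      *best = tmp;	/* struct assignment copies all four elements */
      na = nb;
    }
  return na;
}
```

Passing a bare int array to either function is now a type error rather than a latent overrun.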

Patch

2011-04-21  Andrew Stubbs  <ams@codesourcery.com>

	gcc/
	* config/arm/arm.c (count_insns_for_constant): Delete function.
	(find_best_start): Delete function.
	(optimal_immediate_sequence): New function.
	(optimal_immediate_sequence_1): New function.
	(arm_gen_constant): Move constant splitting code to
	optimal_immediate_sequence.
	Rewrite constant negation/inversion code.

	gcc/testsuite/
	* gcc.target/arm/thumb2-replicated-constant1.c: New file.
	* gcc.target/arm/thumb2-replicated-constant2.c: New file.
	* gcc.target/arm/thumb2-replicated-constant3.c: New file.
	* gcc.target/arm/thumb2-replicated-constant4.c: New file.

--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -129,7 +129,12 @@  static void thumb1_output_function_prologue (FILE *, HOST_WIDE_INT);
 static int arm_comp_type_attributes (const_tree, const_tree);
 static void arm_set_default_type_attributes (tree);
 static int arm_adjust_cost (rtx, rtx, rtx, int);
-static int count_insns_for_constant (HOST_WIDE_INT, int);
+static int optimal_immediate_sequence (enum rtx_code code,
+				       unsigned HOST_WIDE_INT val,
+				       int return_sequence[]);
+static int optimal_immediate_sequence_1 (enum rtx_code code,
+					 unsigned HOST_WIDE_INT val,
+					 int return_sequence[], int i);
 static int arm_get_strip_length (int);
 static bool arm_function_ok_for_sibcall (tree, tree);
 static enum machine_mode arm_promote_function_mode (const_tree,
@@ -2436,68 +2441,42 @@  arm_split_constant (enum rtx_code code, enum machine_mode mode, rtx insn,
 			   1);
 }
 
-/* Return the number of instructions required to synthesize the given
-   constant, if we start emitting them from bit-position I.  */
+/* Return a sequence of integers, in RETURN_SEQUENCE, that fit into
+   ARM/THUMB2 immediates, and add up to VAL.
+   RETURN_SEQUENCE must be an int[4].
+   The function return value gives the number of insns required.  */
 static int
-count_insns_for_constant (HOST_WIDE_INT remainder, int i)
-{
-  HOST_WIDE_INT temp1;
-  int step_size = TARGET_ARM ? 2 : 1;
-  int num_insns = 0;
-
-  gcc_assert (TARGET_ARM || i == 0);
-
-  do
-    {
-      int end;
-
-      if (i <= 0)
-	i += 32;
-      if (remainder & (((1 << step_size) - 1) << (i - step_size)))
-	{
-	  end = i - 8;
-	  if (end < 0)
-	    end += 32;
-	  temp1 = remainder & ((0x0ff << end)
-				    | ((i < end) ? (0xff >> (32 - end)) : 0));
-	  remainder &= ~temp1;
-	  num_insns++;
-	  i -= 8 - step_size;
-	}
-      i -= step_size;
-    } while (remainder);
-  return num_insns;
-}
-
-static int
-find_best_start (unsigned HOST_WIDE_INT remainder)
+optimal_immediate_sequence (enum rtx_code code, unsigned HOST_WIDE_INT val,
+			    int return_sequence[])
 {
   int best_consecutive_zeros = 0;
   int i;
   int best_start = 0;
+  int insns1, insns2;
+  int tmp_sequence[4];
 
   /* If we aren't targetting ARM, the best place to start is always at
-     the bottom.  */
-  if (! TARGET_ARM)
-    return 0;
-
-  for (i = 0; i < 32; i += 2)
+     the bottom, otherwise look more closely.  */
+  if (TARGET_ARM)
     {
-      int consecutive_zeros = 0;
-
-      if (!(remainder & (3 << i)))
+      for (i = 0; i < 32; i += 2)
 	{
-	  while ((i < 32) && !(remainder & (3 << i)))
-	    {
-	      consecutive_zeros += 2;
-	      i += 2;
-	    }
-	  if (consecutive_zeros > best_consecutive_zeros)
+	  int consecutive_zeros = 0;
+
+	  if (!(val & (3 << i)))
 	    {
-	      best_consecutive_zeros = consecutive_zeros;
-	      best_start = i - consecutive_zeros;
+	      while ((i < 32) && !(val & (3 << i)))
+		{
+		  consecutive_zeros += 2;
+		  i += 2;
+		}
+	      if (consecutive_zeros > best_consecutive_zeros)
+		{
+		  best_consecutive_zeros = consecutive_zeros;
+		  best_start = i - consecutive_zeros;
+		}
+	      i -= 2;
 	    }
-	  i -= 2;
 	}
     }
 
@@ -2524,13 +2503,161 @@  find_best_start (unsigned HOST_WIDE_INT remainder)
      the constant starting from `best_start', and also starting from
      zero (i.e. with bit 31 first to be output).  If `best_start' doesn't
      yield a shorter sequence, we may as well use zero.  */
+  insns1 = optimal_immediate_sequence_1 (code, val, return_sequence, best_start);
   if (best_start != 0
-      && ((((unsigned HOST_WIDE_INT) 1) << best_start) < remainder)
-      && (count_insns_for_constant (remainder, 0) <=
-	  count_insns_for_constant (remainder, best_start)))
-    best_start = 0;
+      && ((((unsigned HOST_WIDE_INT) 1) << best_start) < val))
+    {
+      insns2 = optimal_immediate_sequence_1 (code, val, tmp_sequence, 0);
+      if (insns2 <= insns1)
+	{
+	  memcpy (return_sequence, tmp_sequence, sizeof (tmp_sequence));
+	  insns1 = insns2;
+	}
+    }
+
+  return insns1;
+}
+
+/* As for optimal_immediate_sequence, but starting at bit-position I.  */
+static int
+optimal_immediate_sequence_1 (enum rtx_code code, unsigned HOST_WIDE_INT val,
+			     int return_sequence[], int i)
+{
+  int remainder = val & 0xffffffff;
+  int insns = 0;
+
+  /* Try and find a way of doing the job in either two or three
+     instructions.
+     
+     In ARM mode we can use 8-bit constants, rotated to any 2-bit aligned
+     location.  We start at position I.  This may be the MSB, or
+     optimal_immediate_sequence may have positioned it at the largest block
+     of zeros that are aligned on a 2-bit boundary. We then fill up the temps,
+     wrapping around to the top of the word when we drop off the bottom.
+     In the worst case this code should produce no more than four insns.
+
+     In Thumb2 mode, we can use 32/16-bit replicated constants, and 8-bit
+     constants, shifted to any arbitrary location.  We should always start
+     at the MSB.  */
+  do
+    {
+      int end;
+      int b1, b2, b3, b4;
+      unsigned HOST_WIDE_INT result;
+      int loc;
+
+      gcc_assert (insns < 4);
+
+      if (i <= 0)
+	i += 32;
+
+      /* First, find the next normal 12/8-bit shifted/rotated immediate.  */
+      if (remainder & ((TARGET_ARM ? (3 << (i - 2)) : (1 << (i - 1)))))
+	{
+	  loc = i;
+	  if (i <= 12 && TARGET_THUMB2 && code == PLUS)
+	    /* We can use addw/subw for the last 12 bits.  */
+	    result = remainder;
+	  else
+	    {
+	      /* Use an 8-bit shifted/rotated immediate.  */
+	      end = i - 8;
+	      if (end < 0)
+		end += 32;
+	      result = remainder & ((0x0ff << end)
+				   | ((i < end) ? (0xff >> (32 - end))
+						: 0));
+	      i -= 8;
+	    }
+	}
+      else
+	{
+	  /* Arm allows rotates by a multiple of two. Thumb-2 allows
+	     arbitrary shifts.  */
+	  i -= TARGET_ARM ? 2 : 1;
+	  continue;
+	}
+
+      /* Next, see if we can do a better job with a thumb2 replicated
+	 constant.
+       
+         We do it this way around to catch the cases like 0x01F001E0 where
+	 two 8-bit immediates would work, but a replicated constant would
+	 make it worse.
+       
+         TODO: 16-bit constants that don't clear all the bits, but still win.
+         TODO: Arithmetic splitting for set/add/sub, rather than bitwise.  */
+      if (TARGET_THUMB2)
+	{
+	  b1 = (remainder & 0xff000000) >> 24;
+	  b2 = (remainder & 0x00ff0000) >> 16;
+	  b3 = (remainder & 0x0000ff00) >> 8;
+	  b4 = remainder & 0xff;
+
+	  if (loc > 24)
+	    {
+	      /* The 8-bit immediate already found clears b1 (and maybe b2),
+		 but must leave b3 and b4 alone.  */
+
+	      /* First try to find a 32-bit replicated constant that clears
+		 almost everything.  We can assume that we can't do it in one,
+		 or else we wouldn't be here.  */
+	      unsigned int tmp = b1 & b2 & b3 & b4;
+	      unsigned int tmp2 = tmp + (tmp << 8) + (tmp << 16)
+				  + (tmp << 24);
+	      unsigned int matching_bytes = (tmp == b1) + (tmp == b2)
+					    + (tmp == b3) + (tmp == b4);
+	      if (tmp
+		  && (matching_bytes >= 3
+		      || (matching_bytes == 2
+			  && const_ok_for_op (remainder & ~tmp2, code))))
+		{
+		  /* At least 3 of the bytes match, and the fourth has at 
+		     least as many bits set, or two of the bytes match
+		     and it will only require one more insn to finish.  */
+		  result = tmp2;
+		  i = tmp != b1 ? 32
+		      : tmp != b2 ? 24
+		      : tmp != b3 ? 16
+		      : 8;
+		}
+
+	      /* Second, try to find a 16-bit replicated constant that can
+		 leave three of the bytes clear.  If b2 or b4 is already
+		 zero, then we can.  If the 8-bit from above would not
+		 clear b2 anyway, then we still win.  */
+	      else if (b1 == b3 && (!b2 || !b4
+			       || (remainder & 0x00ff0000 & ~result)))
+		{
+		  result = remainder & 0xff00ff00;
+		  i = 24;
+		}
+	    }
+	  else if (loc > 16)
+	    {
+	      /* The 8-bit immediate already found clears b2 (and maybe b3)
+		 and we don't get here unless b1 is already clear, but it will
+		 leave b4 unchanged.  */
+
+	      /* If we can clear b2 and b4 at once, then we win, since the
+		 8-bits couldn't possibly reach that far.  */
+	      if (b2 == b4)
+		{
+		  result = remainder & 0x00ff00ff;
+		  i = 16;
+		}
+	    }
+	}
+
+      return_sequence[insns++] = result;
+      remainder &= ~result;
+
+      if (code == SET || code == MINUS)
+	code = PLUS;
+    }
+  while (remainder);
 
-  return best_start;
+  return insns;
 }
 
 /* Emit an instruction with the indicated PATTERN.  If COND is
@@ -2547,7 +2674,6 @@  emit_constant_insn (rtx cond, rtx pattern)
 
 /* As above, but extra parameter GENERATE which, if clear, suppresses
    RTL generation.  */
-/* ??? This needs more work for thumb2.  */
 
 static int
 arm_gen_constant (enum rtx_code code, enum machine_mode mode, rtx cond,
@@ -2559,15 +2685,14 @@  arm_gen_constant (enum rtx_code code, enum machine_mode mode, rtx cond,
   int final_invert = 0;
   int can_negate_initial = 0;
   int i;
-  int num_bits_set = 0;
   int set_sign_bit_copies = 0;
   int clear_sign_bit_copies = 0;
   int clear_zero_bit_copies = 0;
   int set_zero_bit_copies = 0;
-  int insns = 0;
+  int insns = 0, neg_insns, inv_insns;
   unsigned HOST_WIDE_INT temp1, temp2;
   unsigned HOST_WIDE_INT remainder = val & 0xffffffff;
-  int step_size = TARGET_ARM ? 2 : 1;
+  int immediates[4], neg_immediates[4], inv_immediates[4];
 
   /* Find out which operations are safe for a given CODE.  Also do a quick
      check for degenerate cases; these can occur when DImode operations
@@ -3079,120 +3204,100 @@  arm_gen_constant (enum rtx_code code, enum machine_mode mode, rtx cond,
       break;
     }
 
-  for (i = 0; i < 32; i++)
-    if (remainder & (1 << i))
-      num_bits_set++;
+  /* Calculate what the instruction sequences would be if we generated it
+     normally, negated, or inverted.  */
+  if (code == AND)
+    /* AND cannot be split into multiple insns, so invert and use BIC.  */
+    insns = 99;
+  else
+    insns = optimal_immediate_sequence (code, remainder, immediates);
+
+  if (can_negate)
+    neg_insns = optimal_immediate_sequence (code, (-remainder) & 0xffffffff,
+					    neg_immediates);
+  else
+    neg_insns = 99;
 
-  if ((code == AND) || (can_invert && num_bits_set > 16))
-    remainder ^= 0xffffffff;
-  else if (code == PLUS && num_bits_set > 16)
-    remainder = (-remainder) & 0xffffffff;
+  if (can_invert)
+    inv_insns = optimal_immediate_sequence (code, remainder ^ 0xffffffff,
+					    inv_immediates);
+  else
+    inv_insns = 99;
 
-  /* For XOR, if more than half the bits are set and there's a sequence
-     of more than 8 consecutive ones in the pattern then we can XOR by the
-     inverted constant and then invert the final result; this may save an
-     instruction and might also lead to the final mvn being merged with
-     some other operation.  */
-  else if (code == XOR && num_bits_set > 16
-	   && (count_insns_for_constant (remainder ^ 0xffffffff,
-					 find_best_start
-					 (remainder ^ 0xffffffff))
-	       < count_insns_for_constant (remainder,
-					   find_best_start (remainder))))
+  /* Is the negated immediate sequence more efficient?  */
+  if (neg_insns < insns && neg_insns <= inv_insns)
     {
-      remainder ^= 0xffffffff;
-      final_invert = 1;
+      insns = neg_insns;
+      memcpy (immediates, neg_immediates, sizeof (immediates));
     }
   else
+    can_negate = 0;
+
+  /* Is the inverted immediate sequence more efficient?
+     We must allow for an extra NOT instruction for XOR operations, although
+     there is some chance that the final 'mvn' will get optimized later.  */
+  if (inv_insns < insns && (code != XOR || (inv_insns + 1) < insns))
     {
-      can_invert = 0;
-      can_negate = 0;
-    }
+      insns = inv_insns;
+      memcpy (immediates, inv_immediates, sizeof (immediates));
 
-  /* Now try and find a way of doing the job in either two or three
-     instructions.
-     We start by looking for the largest block of zeros that are aligned on
-     a 2-bit boundary, we then fill up the temps, wrapping around to the
-     top of the word when we drop off the bottom.
-     In the worst case this code should produce no more than four insns.
-     Thumb-2 constants are shifted, not rotated, so the MSB is always the
-     best place to start.  */
+      if (code == XOR)
+	final_invert = 1;
+    }
+  else
+    can_invert = 0;
 
-  /* ??? Use thumb2 replicated constants when the high and low halfwords are
-     the same.  */
-  {
-    /* Now start emitting the insns.  */
-    i = find_best_start (remainder);
-    do
-      {
-	int end;
+  /* Now output the chosen sequence as instructions.  */
+  if (generate)
+    {
+      for (i = 0; i < insns; i++)
+	{
+	  rtx new_src, temp1_rtx;
 
-	if (i <= 0)
-	  i += 32;
-	if (remainder & (3 << (i - 2)))
-	  {
-	    end = i - 8;
-	    if (end < 0)
-	      end += 32;
-	    temp1 = remainder & ((0x0ff << end)
-				 | ((i < end) ? (0xff >> (32 - end)) : 0));
-	    remainder &= ~temp1;
-
-	    if (generate)
-	      {
-		rtx new_src, temp1_rtx;
+	  temp1 = immediates[i];
 
-		if (code == SET || code == MINUS)
-		  {
-		    new_src = (subtargets ? gen_reg_rtx (mode) : target);
-		    if (can_invert && code != MINUS)
-		      temp1 = ~temp1;
-		  }
-		else
-		  {
-		    if ((final_invert || remainder) && subtargets)
-		      new_src = gen_reg_rtx (mode);
-		    else
-		      new_src = target;
-		    if (can_invert)
-		      temp1 = ~temp1;
-		    else if (can_negate)
-		      temp1 = -temp1;
-		  }
+	  if (code == SET || code == MINUS)
+	    {
+	      new_src = (subtargets ? gen_reg_rtx (mode) : target);
+	      if (can_invert && code != MINUS)
+		temp1 = ~temp1;
+	    }
+	  else
+	    {
+	      if ((final_invert || i < (insns - 1)) && subtargets)
+		new_src = gen_reg_rtx (mode);
+	      else
+		new_src = target;
+	      if (can_invert)
+		temp1 = ~temp1;
+	      else if (can_negate)
+		temp1 = -temp1;
+	    }
 
-		temp1 = trunc_int_for_mode (temp1, mode);
-		temp1_rtx = GEN_INT (temp1);
+	  temp1 = trunc_int_for_mode (temp1, mode);
+	  temp1_rtx = GEN_INT (temp1);
 
-		if (code == SET)
-		  ;
-		else if (code == MINUS)
-		  temp1_rtx = gen_rtx_MINUS (mode, temp1_rtx, source);
-		else
-		  temp1_rtx = gen_rtx_fmt_ee (code, mode, source, temp1_rtx);
+	  if (code == SET)
+	    ;
+	  else if (code == MINUS)
+	    temp1_rtx = gen_rtx_MINUS (mode, temp1_rtx, source);
+	  else
+	    temp1_rtx = gen_rtx_fmt_ee (code, mode, source, temp1_rtx);
 
-		emit_constant_insn (cond,
-				    gen_rtx_SET (VOIDmode, new_src,
-						 temp1_rtx));
-		source = new_src;
-	      }
+	  emit_constant_insn (cond,
+			      gen_rtx_SET (VOIDmode, new_src,
+					   temp1_rtx));
+	  source = new_src;
 
-	    if (code == SET)
-	      {
-		can_invert = 0;
-		code = PLUS;
-	      }
-	    else if (code == MINUS)
+	  if (code == SET)
+	    {
+	      can_invert = 0;
 	      code = PLUS;
-
-	    insns++;
-	    i -= 8 - step_size;
-	  }
-	/* Arm allows rotates by a multiple of two. Thumb-2 allows arbitrary
-	   shifts.  */
-	i -= step_size;
-      }
-    while (remainder);
-  }
+	    }
+	  else if (code == MINUS)
+	    code = PLUS;
+	}
+    }
 
   if (final_invert)
     {
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/thumb2-replicated-constant1.c
@@ -0,0 +1,27 @@ 
+/* Ensure simple replicated constant immediates work.  */
+/* { dg-options "-mthumb -O2" } */
+/* { dg-require-effective-target arm_thumb2_ok } */
+
+int
+foo1 (int a)
+{
+  return a + 0xfefefefe;
+}
+
+/* { dg-final { scan-assembler "add.*#-16843010" } } */
+
+int
+foo2 (int a)
+{
+  return a - 0xab00ab00;
+}
+
+/* { dg-final { scan-assembler "sub.*#-1426019584" } } */
+
+int
+foo3 (int a)
+{
+  return a & 0x00cd00cd;
+}
+
+/* { dg-final { scan-assembler "and.*#13435085" } } */
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/thumb2-replicated-constant2.c
@@ -0,0 +1,75 @@ 
+/* Ensure split constants can use replicated patterns.  */
+/* { dg-options "-mthumb -O2" } */
+/* { dg-require-effective-target arm_thumb2_ok } */
+
+int
+foo1 (int a)
+{
+  return a + 0xfe00fe01;
+}
+
+/* { dg-final { scan-assembler "add.*#-33489408" } } */
+/* { dg-final { scan-assembler "add.*#1" } } */
+
+int
+foo2 (int a)
+{
+  return a + 0xdd01dd00;
+}
+
+/* { dg-final { scan-assembler "add.*#-587145984" } } */
+/* { dg-final { scan-assembler "add.*#65536" } } */
+
+int
+foo3 (int a)
+{
+  return a + 0x00443344;
+}
+
+/* { dg-final { scan-assembler "add.*#4456516" } } */
+/* { dg-final { scan-assembler "add.*#13056" } } */
+
+int
+foo4 (int a)
+{
+  return a + 0x77330033;
+}
+
+/* { dg-final { scan-assembler "add.*#1996488704" } } */
+/* { dg-final { scan-assembler "add.*#3342387" } } */
+
+int
+foo5 (int a)
+{
+  return a + 0x11221122;
+}
+
+/* { dg-final { scan-assembler "add.*#285217024" } } */
+/* { dg-final { scan-assembler "add.*#2228258" } } */
+
+int
+foo6 (int a)
+{
+  return a + 0x66666677;
+}
+
+/* { dg-final { scan-assembler "add.*#1717986918" } } */
+/* { dg-final { scan-assembler "add.*#17" } } */
+
+int
+foo7 (int a)
+{
+  return a + 0x99888888;
+}
+
+/* { dg-final { scan-assembler "add.*#-2004318072" } } */
+/* { dg-final { scan-assembler "add.*#285212672" } } */
+
+int
+foo8 (int a)
+{
+  return a + 0xdddddfff;
+}
+
+/* { dg-final { scan-assembler "add.*#-572662307" } } */
+/* { dg-final { scan-assembler "addw.*#546" } } */
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/thumb2-replicated-constant3.c
@@ -0,0 +1,28 @@ 
+/* Ensure negated/inverted replicated constant immediates work.  */
+/* { dg-options "-mthumb -O2" } */
+/* { dg-require-effective-target arm_thumb2_ok } */
+
+int
+foo1 (int a)
+{
+  return a | 0xffffff00;
+}
+
+/* { dg-final { scan-assembler "orn.*#255" } } */
+
+int
+foo2 (int a)
+{
+  return a & 0xffeeffee;
+}
+
+/* { dg-final { scan-assembler "bic.*#1114129" } } */
+
+int
+foo3 (int a)
+{
+  return a & 0xaaaaaa00;
+}
+
+/* { dg-final { scan-assembler "and.*#-1431655766" } } */
+/* { dg-final { scan-assembler "bic.*#170" } } */
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/thumb2-replicated-constant4.c
@@ -0,0 +1,22 @@ 
+/* Ensure replicated constants don't make things worse.  */
+/* { dg-options "-mthumb -O2" } */
+/* { dg-require-effective-target arm_thumb2_ok } */
+
+int
+foo1 (int a)
+{
+  /* It might be tempting to use 0x01000100, but it wouldn't help. */
+  return a + 0x01f001e0;
+}
+
+/* { dg-final { scan-assembler "add.*#32505856" } } */
+/* { dg-final { scan-assembler "add.*#480" } } */
+
+int
+foo2 (int a)
+{
+  return a + 0x0f100e10;
+}
+
+/* { dg-final { scan-assembler "add.*#252706816" } } */
+/* { dg-final { scan-assembler "add.*#3600" } } */