Patchwork [ARM] Improve 64-bit shifts (non-NEON)

Submitter Andrew Stubbs
Date Jan. 31, 2012, 2:29 p.m.
Message ID <4F27FAC0.1010107@codesourcery.com>
Permalink /patch/138789/
State New

Comments

Andrew Stubbs - Jan. 31, 2012, 2:29 p.m.
On 30/01/12 15:25, Richard Earnshaw wrote:
> What's the impact of this on -Os?  At present we fall back to the
> libcalls, but I can't immediately see how the new code would do that.
>
> Gut feeling is that shift by a constant is always worth inlining at -Os,
> but shift by a register isn't.

Ah, I hadn't considered that. Good point!

This updated patch causes it to fall back to the old behaviour in 
optimize_size mode. This should do what you want.

OK?

Andrew
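
[For readers outside the thread: the constant-amount expansion being discussed can be modelled outside GCC. The following Python sketch is editorial, with names of its own choosing rather than GCC's; it mirrors what the patch's arm_emit_coreregs_64bit_shift emits for a left shift by a constant 0 < n < 32, which is why inlining beats a libcall for constants even when size matters.]

```python
MASK32 = 0xFFFFFFFF

def shift64_left_const(val64, n):
    """Semantics of the inlined 64-bit left shift by a constant 0 < n < 32:
    three 32-bit operations (lsl, lsl+orr, lsl) instead of a libgcc call."""
    lo, hi = val64 & MASK32, (val64 >> 32) & MASK32
    out_hi = ((hi << n) | (lo >> (32 - n))) & MASK32  # shift, then orr-with-shift
    out_lo = (lo << n) & MASK32                        # a single lsl
    return (out_hi << 32) | out_lo

# Cross-check against the reference semantics of a 64-bit shift.
for x in (0x0000000180000000, 0x123456789ABCDEF0):
    for n in range(1, 32):
        assert shift64_left_const(x, n) == (x << n) & 0xFFFFFFFFFFFFFFFF
```

[Constant amounts of 32 or more reduce to a single shift of one word into the other plus a zero or sign fill, which is also cheap; only the shift-by-register case needs the longer sequence that the -Os fallback now avoids.]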
Ramana Radhakrishnan - Feb. 7, 2012, 10:19 p.m.
Hi Andrew

I find it interesting that cond_exec's in this form survive all the
way till reload and "work".  AFAIK we could never have cond_exec's
before reload.  The documentation doesn't appear to mention this,
therefore I would like to see if the cond_exec's can be recast as
if_then_else forms from expand rather than these forms.  Has this
survived bootstrap and testing?


> +      /* If we're optimizing for size, we prefer the libgcc calls.  */
> +      if (optimize_size)
> +	FAIL;

In addition you want to replace optimize_size with
optimize_function_for_size_p in these cases.

cheers
Ramana
Steven Bosscher - Feb. 7, 2012, 10:33 p.m.
On Tue, Feb 7, 2012 at 11:19 PM, Ramana Radhakrishnan
<ramana.radhakrishnan@linaro.org> wrote:
> Hi Andrew
>
> I find it interesting that cond_exec's in this form survive all the
> way till reload and "work".  AFAIK we could never have cond_exec's
> before reload .

There is nothing wrong per se with cond_execs before reload, as long
as you don't have to reload a predicate pseudo-reg. For ia64, with its
(IIRC) 64 predicate registers, it was not practical, or even possible,
to assign the predicate registers to hard regs during expand.

AFAIU this isn't an issue on ARM because there is just one condition
code register.

Ciao!
Steven
Andrew Stubbs - Feb. 8, 2012, 11:18 a.m.
On 07/02/12 22:19, Ramana Radhakrishnan wrote:
> I find it interesting that cond_exec's in this form survive all the
> way till reload and "work".  AFAIK we could never have cond_exec's
> before reload . The documentation doesn't appear to mention this,
> therefore I would like to see if  the cond_exec's can be recast as
> if_then_else forms from expand rather than these forms. Has this
> survived bootstrap and testing ?

I've tried to do this, but it can't be done by a straight translation 
because we don't have the insns available to do it. As I understand it, 
all "predicable" instructions automatically get a cond_exec variant, but 
the only if_then_else I could find (it's hard to grep because 
if_then_else occurs many times in attributes) is in the conditional move 
instruction. Have I missed something?

Anyway, I've tried to redo it using conditional move, and that works, 
but it's not as efficient.

Hopefully Steven Bosscher is correct, and there's no problem with 
cond_exec on ARM.

>> +      /* If we're optimizing for size, we prefer the libgcc calls.  */
>> +      if (optimize_size)
>> +	FAIL;
>
> In addition you want to replace optimize_size with
> optimize_function_for_size_p in these cases.

Ok, I've fixed this. I'll post an update soon.

Unfortunately, my testing has found a problem in the native bootstrap, 
so I'm going to need to track that down first.

Andrew
Bernd Schmidt - Feb. 8, 2012, 12:02 p.m.
On 02/07/2012 11:33 PM, Steven Bosscher wrote:
> On Tue, Feb 7, 2012 at 11:19 PM, Ramana Radhakrishnan
> <ramana.radhakrishnan@linaro.org> wrote:
>> Hi Andrew
>>
>> I find it interesting that cond_exec's in this form survive all the
>> way till reload and "work".  AFAIK we could never have cond_exec's
>> before reload .
> 
> There is nothing wrong per-se with cond_execs before reload, as long
> as you don't have to reload a predicate pseudo-reg.

I thought the problem was that we'd have to emit conditional reload
insns and inheritance wouldn't work.


Bernd
Richard Guenther - Feb. 8, 2012, 12:12 p.m.
On Wed, Feb 8, 2012 at 1:02 PM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> On 02/07/2012 11:33 PM, Steven Bosscher wrote:
>> On Tue, Feb 7, 2012 at 11:19 PM, Ramana Radhakrishnan
>> <ramana.radhakrishnan@linaro.org> wrote:
>>> Hi Andrew
>>>
>>> I find it interesting that cond_exec's in this form survive all the
>>> way till reload and "work".  AFAIK we could never have cond_exec's
>>> before reload .
>>
>> There is nothing wrong per-se with cond_execs before reload, as long
>> as you don't have to reload a predicate pseudo-reg.
>
> I thought the problem was that we'd have to emit conditional reload
> insns and inheritance wouldn't work.

It probably depends on how DF sees conditional uses / defs.  If they
look like regular uses / defs then I suppose unconditional spills/reloads
are fine - otherwise of course you'd corrupt one of the two register-set
states.  But that also means it's probably safe if the sequence of
conditional insns is of length 1.

Not sure we want to open that possible can of worms though ;)

Richard.

>
> Bernd
Bernd Schmidt - Feb. 8, 2012, 12:41 p.m.
On 02/08/2012 01:12 PM, Richard Guenther wrote:
> On Wed, Feb 8, 2012 at 1:02 PM, Bernd Schmidt <bernds@codesourcery.com> wrote:
>> On 02/07/2012 11:33 PM, Steven Bosscher wrote:
>>> On Tue, Feb 7, 2012 at 11:19 PM, Ramana Radhakrishnan
>>> <ramana.radhakrishnan@linaro.org> wrote:
>>>> Hi Andrew
>>>>
>>>> I find it interesting that cond_exec's in this form survive all the
>>>> way till reload and "work".  AFAIK we could never have cond_exec's
>>>> before reload .
>>>
>>> There is nothing wrong per-se with cond_execs before reload, as long
>>> as you don't have to reload a predicate pseudo-reg.
>>
>> I thought the problem was that we'd have to emit conditional reload
>> insns and inheritance wouldn't work.
> 
> It probably depends on how DF sees conditional uses / defs.  If they
> look like regular uses / defs then I suppose un-conditional spills/reloads
> are fine - otherwise of course you'd corrupt one of the two register set
> states.

I'm pretty sure conditional defs are always RMW, but I'd have to go
look. Can't imagine it working otherwise though.


Bernd
Richard Guenther - Feb. 8, 2012, 1:26 p.m.
On Wed, Feb 8, 2012 at 1:41 PM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> On 02/08/2012 01:12 PM, Richard Guenther wrote:
>> On Wed, Feb 8, 2012 at 1:02 PM, Bernd Schmidt <bernds@codesourcery.com> wrote:
>>> On 02/07/2012 11:33 PM, Steven Bosscher wrote:
>>>> On Tue, Feb 7, 2012 at 11:19 PM, Ramana Radhakrishnan
>>>> <ramana.radhakrishnan@linaro.org> wrote:
>>>>> Hi Andrew
>>>>>
>>>>> I find it interesting that cond_exec's in this form survive all the
>>>>> way till reload and "work".  AFAIK we could never have cond_exec's
>>>>> before reload .
>>>>
>>>> There is nothing wrong per-se with cond_execs before reload, as long
>>>> as you don't have to reload a predicate pseudo-reg.
>>>
>>> I thought the problem was that we'd have to emit conditional reload
>>> insns and inheritance wouldn't work.
>>
>> It probably depends on how DF sees conditional uses / defs.  If they
>> look like regular uses / defs then I suppose un-conditional spills/reloads
>> are fine - otherwise of course you'd corrupt one of the two register set
>> states.
>
> I'm pretty sure conditional defs are always RMW, but I'd have to go
> look. Can't imagine it working otherwise though.

I was thinking about

  (cond_exec 1 (set (reg:SI 30) (...)))
  (cond_exec 0 (set (reg:SI 30) (...)))
  (cond_exec 1 (use (reg:SI 30) (...)))

where if we spill/reload reg 30 with non-cond_exec insns then we might
clobber the shared register set of the conditional execution paths.
The same goes for shared stack space if we happen to spill/reload in
both paths.  If the defs/uses of the two paths conflict and get different
hard registers assigned, the register issue doesn't exist, but the
shared spill stack slot issue still may.

Richard.
Andrew Stubbs - Feb. 8, 2012, 4:28 p.m.
On 08/02/12 11:18, Andrew Stubbs wrote:
> I've tried to do this, but it can't be done by a straight translation
> because we don't have the insns available to do it. As I understand it,
> all "predicable" instructions automatically get a cond_exec variant, but
> the only if_then_else I could find (it's hard to grep because
> if_then_else occurs many times in attributes) is in the conditional move
> instruction. Have I missed something?

Ok, I missed the various patterns like if_arith_move.

Unfortunately, these don't work in Thumb mode (no IT block), and I'd 
have to add arith-shift variants, I think, for ARM mode to work.

Hmmmm ... I'll try again.

Andrew
Richard Henderson - Feb. 11, 2012, 1:11 a.m.
On 02/08/2012 08:28 AM, Andrew Stubbs wrote:
> Unfortunately, these don't work in Thumb mode (no IT block), and I'd have to add arith-shift variants, I think, for ARM mode to work.
> 
> Hmmmm ... I'll try again.

Does it work to simply use branches initially, and rely on post-reload
ifcvt to transform to cond_exec when possible?


r~
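
[Whichever representation is chosen, the reason the arithmetic-right case needs a predicate at all follows from the ARM shift semantics: register-specified LSL/LSR use only the bottom byte of the amount and yield 0 for amounts of 32 or more, but ASR fills with the sign bit instead of cancelling. An editorial Python model of the patch's variable-amount ASHIFTRT path (names are invented here, not GCC's):]

```python
MASK32 = 0xFFFFFFFF

def arm_lsl(x, n):
    """ARM register-specified LSL: bottom byte of n; amounts >= 32 yield 0."""
    n &= 0xFF
    return (x << n) & MASK32 if n < 32 else 0

def arm_lsr(x, n):
    """ARM register-specified LSR: bottom byte of n; amounts >= 32 yield 0."""
    n &= 0xFF
    return (x >> n) & MASK32 if n < 32 else 0

def arm_asr(x, n):
    """ARM register-specified ASR: amounts >= 32 fill the result with the
    sign bit, i.e. they behave like ASR #31, not like a zero."""
    n = min(n & 0xFF, 31)
    sx = x - (1 << 32) if x & 0x80000000 else x  # reinterpret as signed
    return (sx >> n) & MASK32

def ashr64_by_reg(val64, n):
    """Model of the variable-amount ASHIFTRT expansion: the ASR term must
    be predicated (the cond_exec GE on n - 32), because for n < 32 the
    out-of-range ASR would sign-fill rather than self-cancel to zero."""
    lo, hi = val64 & MASK32, (val64 >> 32) & MASK32
    scratch1 = (32 - n) & MASK32        # RSB_32
    scratch2 = (n - 32) & MASK32        # SUB_S_32 (also sets the flags)
    out_lo = arm_lsr(lo, n)
    out_lo |= arm_lsl(hi, scratch1)     # contributes only when n <= 32
    if n - 32 >= 0:                     # the GE predicate
        out_lo |= arm_asr(hi, scratch2)
    out_hi = arm_asr(hi, n)
    return (out_hi << 32) | out_lo

# Cross-check against the reference semantics of a signed 64-bit shift.
for x in (0xF23456789ABCDEF0, 0x123456789ABCDEF0):
    sx = x - (1 << 64) if x >> 63 else x
    for n in range(1, 64):
        assert ashr64_by_reg(x, n) == (sx >> n) & 0xFFFFFFFFFFFFFFFF
```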

Patch

2012-01-31  Andrew Stubbs  <ams@codesourcery.com>

	* config/arm/arm-protos.h (arm_emit_coreregs_64bit_shift): New
	prototype.
	* config/arm/arm.c (arm_emit_coreregs_64bit_shift): New function.
	* config/arm/arm.md (ashldi3): Use arm_emit_coreregs_64bit_shift.
	(ashrdi3,lshrdi3): Likewise.

---
 gcc/config/arm/arm-protos.h |    3 +
 gcc/config/arm/arm.c        |  198 +++++++++++++++++++++++++++++++++++++++++++
 gcc/config/arm/arm.md       |  102 ++++++++++++++++------
 3 files changed, 276 insertions(+), 27 deletions(-)

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 296550a..df8d7a9 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -242,6 +242,9 @@  struct tune_params
 
 extern const struct tune_params *current_tune;
 extern int vfp3_const_double_for_fract_bits (rtx);
+
+extern void arm_emit_coreregs_64bit_shift (enum rtx_code, rtx, rtx, rtx, rtx,
+					   rtx);
 #endif /* RTX_CODE */
 
 #endif /* ! GCC_ARM_PROTOS_H */
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index 0bded8d..eefc45c 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -25139,5 +25139,203 @@  vfp3_const_double_for_fract_bits (rtx operand)
   return 0;
 }
 
+/* The default expansion of general 64-bit shifts in core-regs is suboptimal
+   on ARM, since we know that shifts by negative amounts are no-ops.
+
+   It's safe for the input and output to be the same register, but
+   early-clobber rules apply for the shift amount and scratch registers.
+
+   Shift by register requires both scratch registers.  Shift by a constant
+   less than 32 in Thumb2 mode requires SCRATCH1 only.  In all other cases
+   the scratch registers may be NULL.
+   
+   Additionally, ashiftrt by a register also clobbers the CC register.  */
+void
+arm_emit_coreregs_64bit_shift (enum rtx_code code, rtx out, rtx in,
+			       rtx amount, rtx scratch1, rtx scratch2)
+{
+  rtx out_high = gen_highpart (SImode, out);
+  rtx out_low = gen_lowpart (SImode, out);
+  rtx in_high = gen_highpart (SImode, in);
+  rtx in_low = gen_lowpart (SImode, in);
+
+  /* Bits flow from up-stream to down-stream.  */
+  rtx out_up   = code == ASHIFT ? out_low : out_high;
+  rtx out_down = code == ASHIFT ? out_high : out_low;
+  rtx in_up   = code == ASHIFT ? in_low : in_high;
+  rtx in_down = code == ASHIFT ? in_high : in_low;
+
+  gcc_assert (code == ASHIFT || code == ASHIFTRT || code == LSHIFTRT);
+  gcc_assert (out
+	      && (REG_P (out) || GET_CODE (out) == SUBREG)
+	      && GET_MODE (out) == DImode);
+  gcc_assert (in
+	      && (REG_P (in) || GET_CODE (in) == SUBREG)
+	      && GET_MODE (in) == DImode);
+  gcc_assert (amount
+	      && (((REG_P (amount) || GET_CODE (amount) == SUBREG)
+		   && GET_MODE (amount) == SImode)
+		  || CONST_INT_P (amount)));
+  gcc_assert (scratch1 == NULL
+	      || (GET_CODE (scratch1) == SCRATCH)
+	      || (GET_MODE (scratch1) == SImode
+		  && REG_P (scratch1)));
+  gcc_assert (scratch2 == NULL
+	      || (GET_CODE (scratch2) == SCRATCH)
+	      || (GET_MODE (scratch2) == SImode
+		  && REG_P (scratch2)));
+  gcc_assert (!REG_P (out) || !REG_P (amount)
+	      || !HARD_REGISTER_P (out)
+	      || (REGNO (out) != REGNO (amount)
+		  && REGNO (out) + 1 != REGNO (amount)));
+
+  /* Macros to make following code more readable.  */
+  #define SUB_32(DEST,SRC) \
+	    gen_addsi3 ((DEST), (SRC), gen_rtx_CONST_INT (VOIDmode, -32))
+  #define RSB_32(DEST,SRC) \
+	    gen_subsi3 ((DEST), gen_rtx_CONST_INT (VOIDmode, 32), (SRC))
+  #define SUB_S_32(DEST,SRC) \
+	    gen_addsi3_compare0 ((DEST), (SRC), \
+				 gen_rtx_CONST_INT (VOIDmode, -32))
+  #define SET(DEST,SRC) \
+	    gen_rtx_SET (SImode, (DEST), (SRC))
+  #define SHIFT(CODE,SRC,AMOUNT) \
+	    gen_rtx_fmt_ee ((CODE), SImode, (SRC), (AMOUNT))
+  #define LSHIFT(CODE,SRC,AMOUNT) \
+	    gen_rtx_fmt_ee ((CODE) == ASHIFT ? ASHIFT : LSHIFTRT, \
+			    SImode, (SRC), (AMOUNT))
+  #define REV_LSHIFT(CODE,SRC,AMOUNT) \
+	    gen_rtx_fmt_ee ((CODE) == ASHIFT ? LSHIFTRT : ASHIFT, \
+			    SImode, (SRC), (AMOUNT))
+  #define ORR(A,B) \
+	    gen_rtx_IOR (SImode, (A), (B))
+  #define IF(COND,RTX) \
+	    gen_rtx_COND_EXEC (VOIDmode, \
+			       gen_rtx_ ## COND (CCmode, cc_reg, \
+						 const0_rtx), \
+			       (RTX))
+
+  if (CONST_INT_P (amount))
+    {
+      /* Shifts by a constant amount.  */
+      if (INTVAL (amount) <= 0)
+	/* Match what shift-by-register would do.  */
+	emit_insn (gen_movdi (out, in));
+      else if (INTVAL (amount) >= 64)
+	{
+	  /* Match what shift-by-register would do.  */
+	  if (code == ASHIFTRT)
+	    {
+	      rtx const31_rtx = gen_rtx_CONST_INT (VOIDmode, 31);
+	      emit_insn (SET (out_down, SHIFT (code, in_up, const31_rtx)));
+	      emit_insn (SET (out_up, SHIFT (code, in_up, const31_rtx)));
+	    }
+	  else
+	    emit_insn (gen_movdi (out, const0_rtx));
+	}
+      else if (INTVAL (amount) < 32)
+	{
+	  /* Shifts by a constant less than 32.  */
+	  rtx reverse_amount = gen_rtx_CONST_INT (VOIDmode,
+						  32 - INTVAL (amount));
+
+	  emit_insn (SET (out_down, LSHIFT (code, in_down, amount)));
+	  emit_insn (SET (out_down,
+			  ORR (REV_LSHIFT (code, in_up, reverse_amount),
+			       out_down)));
+	  emit_insn (SET (out_up, SHIFT (code, in_up, amount)));
+	}
+      else
+	{
+	  /* Shifts by a constant greater than 31.  */
+	  rtx adj_amount = gen_rtx_CONST_INT (VOIDmode, INTVAL (amount) - 32);
+
+	  emit_insn (SET (out_down, SHIFT (code, in_up, adj_amount)));
+	  if (code == ASHIFTRT)
+	    emit_insn (gen_ashrsi3 (out_up, in_up,
+				    gen_rtx_CONST_INT (VOIDmode, 31)));
+	  else
+	    emit_insn (SET (out_up, const0_rtx));
+	}
+    }
+  else
+    {
+      /* Shifts by a variable amount.  */
+      rtx cc_reg = gen_rtx_REG (CC_NCVmode, CC_REGNUM);
+
+      gcc_assert (scratch1 && REG_P (scratch1));
+      gcc_assert (scratch2 && REG_P (scratch2));
+
+      switch (code)
+	{
+	case ASHIFT:
+	  emit_insn (SUB_32 (scratch1, amount));
+	  emit_insn (RSB_32 (scratch2, amount));
+	  break;
+	case ASHIFTRT:
+	  emit_insn (RSB_32 (scratch1, amount));
+	  emit_insn (SUB_S_32 (scratch2, amount));
+	  break;
+	case LSHIFTRT:
+	  emit_insn (RSB_32 (scratch1, amount));
+	  emit_insn (SUB_32 (scratch2, amount));
+	  break;
+	default:
+	  gcc_unreachable ();
+	}
+
+      emit_insn (SET (out_down, LSHIFT (code, in_down, amount)));
+
+      if (!TARGET_THUMB2)
+	{
+	  /* If this were only called during expand we could just use the else
+	     case and let combine deal with it, but this can also be called
+	     from post-reload splitters.  */
+	  emit_insn (SET (out_down,
+			  ORR (SHIFT (ASHIFT, in_up, scratch1), out_down)));
+	  if (code == ASHIFTRT)
+	    {
+	      emit_insn (IF (GE,
+			     SET (out_down,
+				  ORR (SHIFT (ASHIFTRT, in_up, scratch2),
+				       out_down))));
+	    }
+	  else
+	    emit_insn (SET (out_down, ORR (SHIFT (LSHIFTRT, in_up, scratch2),
+					   out_down)));
+	}
+      else
+	{
+	  /* Thumb2 can't do shift and or in one insn.  */
+	  emit_insn (SET (scratch1, SHIFT (ASHIFT, in_up, scratch1)));
+	  emit_insn (gen_iorsi3 (out_down, out_down, scratch1));
+
+	  if (code == ASHIFTRT)
+	    {
+	      emit_insn (IF (GE, SET (scratch2,
+				      SHIFT (ASHIFTRT, in_up, scratch2))));
+	      emit_insn (IF (GE, SET (out_down, ORR (out_down, scratch2))));
+	    }
+	  else
+	    {
+	      emit_insn (SET (scratch2, SHIFT (LSHIFTRT, in_up, scratch2)));
+	      emit_insn (gen_iorsi3 (out_down, out_down, scratch2));
+	    }
+	}
+
+      emit_insn (SET (out_up, SHIFT (code, in_up, amount)));
+    }
+
+  #undef SUB_32
+  #undef RSB_32
+  #undef SUB_S_32
+  #undef SET
+  #undef SHIFT
+  #undef LSHIFT
+  #undef REV_LSHIFT
+  #undef ORR
+  #undef IF
+}
+
 #include "gt-arm.h"
 
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 751997f..e814dc4 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -3466,21 +3466,37 @@ 
                    (match_operand:SI 2 "reg_or_int_operand" "")))]
   "TARGET_32BIT"
   "
-  if (GET_CODE (operands[2]) == CONST_INT)
+  if (!CONST_INT_P (operands[2])
+      && (TARGET_REALLY_IWMMXT || (TARGET_HARD_FLOAT && TARGET_MAVERICK)))
+    ; /* No special preparation statements; expand pattern as above.  */
+  else
     {
-      if ((HOST_WIDE_INT) INTVAL (operands[2]) == 1)
+      rtx scratch1, scratch2;
+
+      if (GET_CODE (operands[2]) == CONST_INT
+	  && (HOST_WIDE_INT) INTVAL (operands[2]) == 1)
         {
           emit_insn (gen_arm_ashldi3_1bit (operands[0], operands[1]));
           DONE;
         }
-        /* Ideally we shouldn't fail here if we could know that operands[1] 
-           ends up already living in an iwmmxt register. Otherwise it's
-           cheaper to have the alternate code being generated than moving
-           values to iwmmxt regs and back.  */
-        FAIL;
+
+      /* Ideally we should use iwmmxt here if we could know that operands[1] 
+         ends up already living in an iwmmxt register. Otherwise it's
+         cheaper to have the alternate code being generated than moving
+         values to iwmmxt regs and back.  */
+
+      /* If we're optimizing for size, we prefer the libgcc calls.  */
+      if (optimize_size)
+	FAIL;
+
+      /* Expand operation using core-registers.
+	 'FAIL' would achieve the same thing, but this is a bit smarter.  */
+      scratch1 = gen_reg_rtx (SImode);
+      scratch2 = gen_reg_rtx (SImode);
+      arm_emit_coreregs_64bit_shift (ASHIFT, operands[0], operands[1],
+				     operands[2], scratch1, scratch2);
+      DONE;
     }
-  else if (!TARGET_REALLY_IWMMXT && !(TARGET_HARD_FLOAT && TARGET_MAVERICK))
-    FAIL;
   "
 )
 
@@ -3525,21 +3541,37 @@ 
                      (match_operand:SI 2 "reg_or_int_operand" "")))]
   "TARGET_32BIT"
   "
-  if (GET_CODE (operands[2]) == CONST_INT)
+  if (!CONST_INT_P (operands[2])
+      && (TARGET_REALLY_IWMMXT || (TARGET_HARD_FLOAT && TARGET_MAVERICK)))
+    ; /* No special preparation statements; expand pattern as above.  */
+  else
     {
-      if ((HOST_WIDE_INT) INTVAL (operands[2]) == 1)
+      rtx scratch1, scratch2;
+
+      if (GET_CODE (operands[2]) == CONST_INT
+	  && (HOST_WIDE_INT) INTVAL (operands[2]) == 1)
         {
           emit_insn (gen_arm_ashrdi3_1bit (operands[0], operands[1]));
           DONE;
         }
-        /* Ideally we shouldn't fail here if we could know that operands[1] 
-           ends up already living in an iwmmxt register. Otherwise it's
-           cheaper to have the alternate code being generated than moving
-           values to iwmmxt regs and back.  */
-        FAIL;
+
+      /* Ideally we should use iwmmxt here if we could know that operands[1] 
+         ends up already living in an iwmmxt register. Otherwise it's
+         cheaper to have the alternate code being generated than moving
+         values to iwmmxt regs and back.  */
+
+      /* If we're optimizing for size, we prefer the libgcc calls.  */
+      if (optimize_size)
+	FAIL;
+
+      /* Expand operation using core-registers.
+	 'FAIL' would achieve the same thing, but this is a bit smarter.  */
+      scratch1 = gen_reg_rtx (SImode);
+      scratch2 = gen_reg_rtx (SImode);
+      arm_emit_coreregs_64bit_shift (ASHIFTRT, operands[0], operands[1],
+				     operands[2], scratch1, scratch2);
+      DONE;
     }
-  else if (!TARGET_REALLY_IWMMXT)
-    FAIL;
   "
 )
 
@@ -3582,21 +3614,37 @@ 
                      (match_operand:SI 2 "reg_or_int_operand" "")))]
   "TARGET_32BIT"
   "
-  if (GET_CODE (operands[2]) == CONST_INT)
+  if (!CONST_INT_P (operands[2])
+      && (TARGET_REALLY_IWMMXT || (TARGET_HARD_FLOAT && TARGET_MAVERICK)))
+    ; /* No special preparation statements; expand pattern as above.  */
+  else
     {
-      if ((HOST_WIDE_INT) INTVAL (operands[2]) == 1)
+      rtx scratch1, scratch2;
+
+      if (GET_CODE (operands[2]) == CONST_INT
+	  && (HOST_WIDE_INT) INTVAL (operands[2]) == 1)
         {
           emit_insn (gen_arm_lshrdi3_1bit (operands[0], operands[1]));
           DONE;
         }
-        /* Ideally we shouldn't fail here if we could know that operands[1] 
-           ends up already living in an iwmmxt register. Otherwise it's
-           cheaper to have the alternate code being generated than moving
-           values to iwmmxt regs and back.  */
-        FAIL;
+
+      /* Ideally we should use iwmmxt here if we could know that operands[1] 
+         ends up already living in an iwmmxt register. Otherwise it's
+         cheaper to have the alternate code being generated than moving
+         values to iwmmxt regs and back.  */
+
+      /* If we're optimizing for size, we prefer the libgcc calls.  */
+      if (optimize_size)
+	FAIL;
+
+      /* Expand operation using core-registers.
+	 'FAIL' would achieve the same thing, but this is a bit smarter.  */
+      scratch1 = gen_reg_rtx (SImode);
+      scratch2 = gen_reg_rtx (SImode);
+      arm_emit_coreregs_64bit_shift (LSHIFTRT, operands[0], operands[1],
+				     operands[2], scratch1, scratch2);
+      DONE;
     }
-  else if (!TARGET_REALLY_IWMMXT)
-    FAIL;
   "
 )
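
[A closing note on the variable-amount path in the patch above: for the logical shifts no predication is needed, because ARM register-specified shifts use only the bottom byte of the amount and LSL/LSR by 32 or more yield 0, so the out-of-range terms cancel themselves. An editorial Python model of the LSHIFTRT case (a sketch with invented names, not GCC code):]

```python
MASK32 = 0xFFFFFFFF

def arm_lsl(x, n):
    """ARM register-specified LSL: bottom byte of n; amounts >= 32 yield 0."""
    n &= 0xFF
    return (x << n) & MASK32 if n < 32 else 0

def arm_lsr(x, n):
    """ARM register-specified LSR: bottom byte of n; amounts >= 32 yield 0."""
    n &= 0xFF
    return (x >> n) & MASK32 if n < 32 else 0

def lshr64_by_reg(val64, n):
    """Model of the variable-amount LSHIFTRT expansion: three ORed terms
    cover both n < 32 and n >= 32 branch-free, because the out-of-range
    shift amounts produce 0 and drop out of the OR."""
    lo, hi = val64 & MASK32, (val64 >> 32) & MASK32
    scratch1 = (32 - n) & MASK32      # RSB_32
    scratch2 = (n - 32) & MASK32      # SUB_32
    out_lo = arm_lsr(lo, n)
    out_lo |= arm_lsl(hi, scratch1)   # contributes when n <= 32
    out_lo |= arm_lsr(hi, scratch2)   # contributes when n >= 32
    out_hi = arm_lsr(hi, n)
    return (out_hi << 32) | out_lo

# Cross-check against the reference semantics for every in-range amount.
for n in range(1, 64):
    x = 0x123456789ABCDEF0
    assert lshr64_by_reg(x, n) == x >> n
```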